The title of this talk is
“Representation, Reasoning, Learning”.
These are three of the cornerstones of artificial intelligence, and are
critical components in the design of an intelligent agent. Our natural tendency to form communities has
often resulted in treating these tasks as separate endeavors. One of my goals in this talk is to show that
we have much to gain from viewing them as three parts of a single endeavor,
and to provide one framework in which they all fit together.
In the past 50 years, AI has come a long
way. We have achieved tremendous
successes on a wide variety of tasks, such as scheduling, autonomous
navigation, chess, and many more.
This list of applications was a biased
selection. It was selected to celebrate
the fact that this is 2001, and we have achieved many of the goals for this
year, as set forth by Stanley Kubrick in his famous movie, 2001: A Space Odyssey.
But if this is 2001, where is Hal?
Unfortunately, the Hal project has fallen behind schedule. In this talk, I
would like to argue that by putting together some of the work done over the
last 50 years, we can get one step closer towards building a Hal.
My first building block is
representation, the process by which the agent encodes its knowledge about the world.
There are many design decisions one has
to make when designing a representation language. One of the most important involves the
extent to which the representation encodes the rich structure in the world,
and makes it transparent to the reasoning and learning algorithms. Along this
spectrum, there are three common choices, which I will discuss in turn.
In the most basic representations, the
agent views each possible state of the world as an indivisible atom. Any structure in the state of the world is
hidden inside a black box, and algorithms can only ask certain well-specified
questions, such as “is this state a checkmate?”. In this case, algorithms must do their work
by enumerating individual states.
Unfortunately, the state space in most problems is very large.
A somewhat more transparent representation is based on attributes or
propositions. Here, the representation
makes an explicit commitment
that the world is described by a set of attributes or (in the binary case)
propositional symbols.
The knowledge
base tells us how these different symbols interact with each other.
Semantically, each possible world that the
agent considers is an assignment of values to these attributes.
This more transparent representation gives
us a lot more power.
For example, using
logical inference techniques we can often conclude that certain propositions
must take a specific truth value, without enumerating the entire space of
truth assignments.
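To make this concrete, here is a minimal Python sketch of unit propagation, one such inference technique; the two-clause knowledge base is invented purely for illustration.

```python
# A minimal sketch of unit propagation: deriving forced truth values from a
# propositional knowledge base without enumerating all 2^n assignments.
# The clauses below are hypothetical, purely for illustration.

def unit_propagate(clauses):
    """Each clause is a set of literals; 'p' means p, '-p' means not-p."""
    assignment = {}
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            # A clause is satisfied if some literal is already true;
            # otherwise collect its still-unassigned literals.
            unassigned = []
            satisfied = False
            for lit in clause:
                var, val = lit.lstrip('-'), not lit.startswith('-')
                if var in assignment:
                    if assignment[var] == val:
                        satisfied = True
                        break
                else:
                    unassigned.append((var, val))
            if not satisfied and len(unassigned) == 1:
                var, val = unassigned[0]
                assignment[var] = val   # this literal's value is forced
                changed = True
    return assignment

kb = [{'rains'}, {'-rains', 'wet_grass'}]   # rains; rains -> wet_grass
print(unit_propagate(kb))                   # {'rains': True, 'wet_grass': True}
```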
However, propositional reasoning has its own problems.
In many cases, it requires that we repeat what is basically the same piece of information many, many times.
For example, if we wanted to write down the
rules of chess (just those governing the single-move successor function), we
would have to restate the rules for pawn movement separately for every board
position.
In fact, this process would
require about 10^7 axioms.
I can’t write
even one of these axioms down for you, because it would be too long, so let me
use a simpler example, one which will be my running example for most of this
talk.
The reason for the tremendous representational inefficiency of propositional logic is that it does not contain the concept of an object. When we think about the world, we always think in terms of objects --- people, cars, cities, even talks are all viewed as separate entities.
Each of these has its own properties, or attributes. And, they are all linked to each other via a
complex network of interactions. A yet
more transparent representation is relational logic. In this case, the possible worlds specify a
set of objects, their properties, and the relations between them. By making objects explicit, we can now
generalize over them, specifying universal patterns that apply to all
objects. This allows us to state many
things at once, and provide a much more compact representation. In particular, the rules of chess now
require only about 100 axioms.
Many of the standard methods can be categorized along this axis. But, there is also a second important design
decision --- how do we
represent the epistemic state of the agent, its beliefs about the
world. Two common choices are
categorical, or deterministic, beliefs, and
probabilistic beliefs.
In a categorical representation, each
assertion must fall into one of three categories: necessarily true,
impossible, or “who knows”. This forces us into a very crude approximation of
our knowledge. We can either write down a meaningful universal, which is typically false, or a true universal, which is typically useless. We need something that
expresses the finer shades of possibility and probability.
One particularly successful choice is
provided by the probabilistic framework.
Here, like in the categorical case, our semantics is based on a set of
possible worlds. For example, here our
worlds correspond to the difficulty of the course, the intelligence of the
student, and the grade the student received in the course. Note that one of our worlds has a smart
student getting a C in an easy class.
But, unlike the categorical case, we can
also ascribe likelihoods to each world, by associating each with a
probability, allowing us to make it improbable but not impossible for a smart
student to get a C.
Thus, we can see that it is unlikely for
a smart student to get a C in an easy class.
And the same model can be used to tell us that, although it is possible
for a C student to be smart, it is not very likely.
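To make the semantics concrete, here is a toy Python sketch; all the numbers are invented for illustration. It enumerates the possible worlds, attaches a probability to each, and answers the two queries above.

```python
# A toy possible-worlds model over (difficulty, intelligence, grade); the
# numbers are invented. Every world gets a probability, so a smart student
# getting a C in an easy class is improbable but not impossible.
from itertools import product

P_diff  = {'easy': 0.6, 'hard': 0.4}
P_intel = {'smart': 0.3, 'average': 0.7}
P_grade = {  # P(grade | intelligence, difficulty)
    ('smart',   'easy'): {'A': 0.8, 'B': 0.15, 'C': 0.05},
    ('smart',   'hard'): {'A': 0.5, 'B': 0.3,  'C': 0.2},
    ('average', 'easy'): {'A': 0.5, 'B': 0.3,  'C': 0.2},
    ('average', 'hard'): {'A': 0.1, 'B': 0.4,  'C': 0.5},
}

worlds = {(d, i, g): P_diff[d] * P_intel[i] * P_grade[(i, d)][g]
          for d, i, g in product(P_diff, P_intel, ('A', 'B', 'C'))}

def p_smart(grade=None, diff=None):
    """P(intelligence = smart | observed grade and/or difficulty)."""
    sel = [(w, p) for w, p in worlds.items()
           if (grade is None or w[2] == grade) and (diff is None or w[0] == diff)]
    total = sum(p for _, p in sel)
    return sum(p for w, p in sel if w[1] == 'smart') / total

print(round(p_smart(grade='C'), 2))               # a C student may be smart: ~0.13
print(round(p_smart(grade='C', diff='easy'), 2))  # but in an easy class: ~0.10
```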
One of the key benefits of the
propositional representation was our ability to represent our knowledge
without explicit enumeration of the worlds.
It turns out that we can do the same in the probabilistic framework.
The key idea, introduced by Pearl in the
Bayesian network framework, is to use locality of interaction.
This is an assumption which seems to be a
fairly good approximation of the world in many cases.
This representation supports many interesting reasoning patterns.
For example, if we observe that the student
gets an A, the probability that he is smart goes up.
But if we then find out that the class was
easy, it reduces the impact of this evidence, and our probability goes down,
although not all the way.
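This reasoning pattern, often called explaining away, drops out of the same toy sketch above: its joint was already built from local pieces (a factor per variable), which is exactly the locality of interaction just described.

```python
# Explaining away, continuing the toy sketch above (reuses p_smart):
# observing an A raises our belief that the student is smart; learning that
# the class was easy explains part of the A away, so the belief drops,
# though not all the way back to the prior.
print(round(p_smart(), 2))                        # prior: 0.30
print(round(p_smart(grade='A'), 2))               # saw an A: ~0.46, goes up
print(round(p_smart(grade='A', diff='easy'), 2))  # easy class: ~0.41, back down
```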
However, this representation suffers
from the same problem as other propositional representations. We have to create separate representational
units (propositions) for the different entities in our domain. And the problem is that these instances are
not independent. For example, the
difficulty of CS101 in one network is the same difficulty as in another, so
evidence in one network should influence our beliefs in another.
Several other representations now fall
neatly into our axes. But there seems
to be an intriguing hole.
AI has witnessed a long-standing “conflict”: logic versus probability. I
would like to claim that these are complementary approaches, not conflicting
ones, and that we can gain a lot by putting them together.
Let us consider an imaginary university
called St. Nordaf. St. Nordaf has two
faculty, two students, two courses, and three registrations of students in
courses, each of which is associated with a registration record. These objects are linked to each other:
professors teach classes, students register in classes, etc. Each of the objects in the domain has
properties that we care about.
St. Nordaf would like to do some
reasoning about its university, so it calls Acme Consulting Company. Acme tells them that the first thing they
have to do is to describe their university in some organized form, i.e., a
relational database. The schema of a
database tells us what types of objects we have, what are the attributes of
these objects that are of interest, and how the objects can relate to each
other.
The possible worlds for St. Nordaf
consist of all possible assignments of values to all attributes of all objects
in the domain. A probabilistic model
over these worlds would tell us how likely each one is, allowing us to do some
inferences about how the different objects interact, just as we did in our
simple Bayesian network.
Unfortunately, this is a very large number of worlds, and to specify a
probabilistic model we have to specify a probability for each one.
Furthermore, the resulting distribution would be good only for a limited
time.
If St. Nordaf hired a new faculty
member, or got a new student, or even if the two students registered for
different classes next year,
it would
no longer apply, and St. Nordaf would have to pay Acme consulting all over
again.
Thus, we want a model that holds for the infinitely many potential universities that conform to this very simple schema.
Thus, we are stuck with what
seems to be an impossible problem.
How do we represent an infinite set of possible distributions, each of which is by itself very complex?
The two key ideas that come to our
rescue derive from the two approaches that we are trying to combine.
From relational logic, we have the notion of
universal patterns, which hold for all objects in a class.
From Bayesian networks, we have the notion
of locality of interaction, which in the relational case has a particular
twist:
Links give us precisely a notion
of “interaction”, and thereby provide a roadmap for which objects can interact
with each other.
In this example, we have a template, like a universal quantifier for a
probabilistic statement.
It tells
us:
“For any registration record in my
database, the grade of the student in the course depends on the intelligence
of that student and the difficulty of that course.”
This dependency will be instantiated for
every object (of the right type) in our domain.
It is also associated with a conditional
probability distribution that specifies the nature of that dependence.
We can also have dependencies across several links: e.g., the satisfaction of a student may depend on the teaching ability of the professor who teaches the course.
The semantics of this model is that of a generator: it produces an appropriate probabilistic model for any domain that we might encounter. The instantiation is the embodiment of universals on the one hand, and locality of interaction, as specified by the links, on the other.
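As a minimal illustration of this instantiation process (the registration records and names below are hypothetical), the following sketch stamps out the shared template dependency once per record.

```python
# A minimal sketch of instantiating a probabilistic template over a
# relational skeleton: one shared dependency ("the grade in a registration
# depends on the student's intelligence and the course's difficulty") is
# stamped out once per registration record. Records are invented.
registrations = [
    ('forrest', 'cs101'),
    ('forrest', 'geo101'),
    ('jane',    'cs101'),
]

ground_parents = {}                      # ground variable -> its parents
for student, course in registrations:
    ground_parents[f'Grade({student},{course})'] = [
        f'Intelligence({student})',      # the same node is shared across
        f'Difficulty({course})',         # records, tying the network together
    ]

for var, parents in ground_parents.items():
    print(var, '<-', parents)
```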
Note that the objects in the resulting
network are not independent. The links
between objects induce direct correlations, and indirect correlations continue to propagate, leading to a complex web of influence that ties everything in
the domain together. While this seems complicated,
this is actually a good picture of what life is like. As we live our lives, we interact with each
other, and with the world around us.
These interactions often influence us, causing us to change.
This web of influence has interesting
ramifications from the perspective of the types of reasoning patterns that it
supports. Consider Forrest Gump. A priori, we believe that he is pretty
likely to be smart. Evidence about two
classes that he took changes our probabilities only very slightly. However, we see that most people who took
CS101 got A’s. In fact, even people who
did fairly poorly in other classes got an A in CS101. Therefore, we believe that CS101 is probably
an easy class. To get a C in an easy
class is unlikely for a smart student, so our probability that Forrest Gump is
smart goes down substantially.
But this approach is more than just a
generator for infinitely many Bayesian networks. It allows us to represent distributions that
are entirely outside the scope of propositional representations: ones about
the existence of objects and links in the domain. It also supports interesting
reasoning patterns that are associated with these concepts.
Representation is not a goal in and of
itself. To be useful, a knowledge base
must allow us to get answers to interesting questions, in an effective way. Unfortunately, this is not so easy. The complexity of reasoning tasks has been a
long-standing problem in AI. It was given
a physical manifestation by Henry Kautz in his 1989 Computers and Thought
talk, and reincarnated by Stuart Russell 6 years later. Now, six years after that, I’d like to
reintroduce the complexity monster.
It is well-known that the more
expressive a representation language is, the harder it is to perform inference
in the language. The apparent corollary is that we should use less expressive representations. So why, then, am I trying to sell you a language
that combines relational logic on the one hand, probabilities on the
other? Well, complexity can often be
misleading. Polynomial in the size of a
representation is great, but not if the size is itself exponential. Specifically, many tasks are linear time in
an atomic representation --- we simply enumerate the worlds. But the number of worlds is huge. Thus, more
expressive languages are not necessarily harder. Furthermore, expressive languages often
contain structure that allows us to perform tasks much more efficiently. So, by picking the “right” type of
expressive power, we can often do inference much more efficiently.
Divide and conquer inference allows us
to perform inference separately over different parts of the knowledge base,
and to combine results by sending a “message” involving only a small number of
entities, e.g. (in this simple case), those in the intersection between the parts. In general, this extends to hierarchical
structures.
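Here is a minimal sketch of this idea, with invented factor values: two pieces of a model share only the variable B, so the message passed between them mentions B alone, and we never build the joint over all three variables.

```python
# Divide-and-conquer inference as message passing. Piece 1 covers (A, B),
# piece 2 covers (B, C); their intersection is just {B}. Values invented.
phi1 = {(0, 0): 0.3, (0, 1): 0.7,   # factor over (A, B)
        (1, 0): 0.9, (1, 1): 0.1}
phi2 = {(0, 0): 0.5, (0, 1): 0.5,   # factor over (B, C)
        (1, 0): 0.2, (1, 1): 0.8}

# message over the intersection {B}: sum A out of piece 1 locally
msg_b = {b: sum(phi1[(a, b)] for a in (0, 1)) for b in (0, 1)}

# piece 2 combines the message with its own factor and sums out B
p_c = {c: sum(msg_b[b] * phi2[(b, c)] for b in (0, 1)) for c in (0, 1)}
z = sum(p_c.values())
print({c: round(v / z, 3) for c, v in p_c.items()})   # marginal over C
```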
Why do we think that this is a useful
property? Because the world does
exhibit a large amount of locality, and is often hierarchically
structured. This, in fact, was one of
the key points in Herbert Simon’s seminal essay “The Architecture of Complexity”. In fact, Simon continued
to argue in that essay that, to the extent that the world is not organized in
this way, we simply cannot make sense of it, so that this assumption is the
only way by which we can deal with the real world.
Our representation makes object boundaries explicit, allowing their
structure to be exploited by the algorithm.
Note that it is very important
to make the structure explicit.
Although inference algorithms can try to find decompositions on their
own, this is a hard problem by itself.
The
algorithm can gain tremendously by
having the structure made explicit by the representation.
This turns out to help a lot, but not quite enough.
Many algorithms treat the first-order
statements as a simple macro language, and generate ground propositions from
them. In 1965, Robinson was the first
to propose the idea of performing inference at the level of abstract rather
than concrete entities.
We can extend these ideas to the
probabilistic case, albeit in a more limited way than in theorem proving. Rather than reasoning separately for each
student in our domain, we do probabilistic inference for a “generic” student
x, and then assume we have the right number of copies.
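A minimal sketch of this idea, with invented numbers: we run one inference for the generic student x, and then copy the answer for every exchangeable student, instead of running a separate propositional inference per student.

```python
# Reasoning over a "generic" student: if n students are exchangeable (same
# dependency model, same evidence pattern), one inference suffices for all
# of them. Numbers are invented for illustration.
def posterior_smart_given_A():
    p_smart, p_A_smart, p_A_avg = 0.3, 0.8, 0.3   # toy model
    joint_smart = p_smart * p_A_smart
    joint_avg = (1 - p_smart) * p_A_avg
    return joint_smart / (joint_smart + joint_avg)

generic = posterior_smart_given_A()               # computed once
students_with_A = ['s%d' % k for k in range(1000)]
posteriors = {s: generic for s in students_with_A}  # copied, not recomputed
print(round(generic, 2), len(posteriors))
```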
One remaining problem is that everything
is potentially correlated with everything else via the web of influence. So, potentially our model must include every
single entity in the world. In general,
our models can end up being unreasonably large. Is that necessary? The key idea is that many influences,
although present, are fairly “small”, and we would like to ignore them.
In the deterministic framework, it is
not clear what this means.
Either a
fact changes the answer, in which case it is relevant, or it doesn’t, in which
case it isn’t.
There is no notion of
“relevant, but only a little”.
In the
probabilistic framework, we have a well-defined semantics.
Furthermore, if we assume that the world is really stochastic, then “weak
influence” is a very common state of affairs.
One can show formally that, in many settings, things are only weakly
relevant.
This type of analysis has led
to the development of a wide range of approximate inference algorithms.
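As a toy illustration of “relevant, but only a little” (the conditional distributions below are invented), we can test whether a parent can safely be approximated away: if flipping the parent barely changes the child’s distribution, we drop the influence.

```python
# Prune a parent from a conditional distribution when flipping that parent
# barely changes the distribution. The CPDs below are invented.
def weak_parent(cpd, eps=0.05):
    """cpd: {parent_value: {child_value: prob}}. True if all rows agree to eps."""
    rows = list(cpd.values())
    return all(abs(rows[0][v] - row[v]) <= eps
               for row in rows[1:] for v in rows[0])

cpd_weather = {'sun':  {'A': 0.51, 'C': 0.49},   # grade barely depends on
               'rain': {'A': 0.49, 'C': 0.51}}   # the weather...
cpd_intel   = {'smart':   {'A': 0.8, 'C': 0.2},  # ...but strongly on
               'average': {'A': 0.3, 'C': 0.7}}  # intelligence
print(weak_parent(cpd_weather))  # True  -> safe to approximate away
print(weak_parent(cpd_intel))    # False -> keep this influence
```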
To summarize, we saw that certain key aspects of our representation gave
us a range of tools against the complexity monster.
Perhaps, if we attack the monster with all of these weapons, as well as
some other ones still to be developed, we can finally vanquish the monster.
To have a knowledge base that we can
use, we must somehow address one of the problems that has been a major
obstacle in AI, that of knowledge acquisition.
One approach is to learn a knowledge base from data. Note that this problem is somewhat different
from the learning problems that we think about most of the time. There is no clearly defined classification
or performance task. Our goal is to
learn a knowledge base for general-purpose reasoning.
To check whether we can actually learn a
knowledge base that is useful for general purpose reasoning from a real-world
database, we looked at a database of Tuberculosis patients in San
Francisco. This is a database gathered
at the TB clinic, and is used by the health-care professionals to reason about
transmission of TB and to perform TB control.
Our input was the raw database, and our
goal was to see whether we could automatically reconstruct the commonsense
knowledge that the TB health care professional has. We discovered many useful patterns, mostly
ones that are known but also some new ones.
For example, gender is correlated with HIV status. But more surprising (and more novel), it is
also correlated with the type of tuberculosis.
And there are also interactions between objects. Contacts of foreign-born patients are more
likely to have TB, but are also more likely to obtain treatment, because of
the tightness of the social structure.
Of course, data is not always so nicely
arranged for us as in a relational database.
Let us consider the biggest source of data --- the world wide web. Consider the webpages in a computer science
department. Here is one webpage, which
links to another. This second webpage
links to a third, which links back to the first two. There is also a webpage with a lot of
outgoing links to webpages on this site.
This is not nice clean data.
Nobody labels these webpages for us, or tells us what they are. We would like to learn to understand this data, and conclude from it that we have a “Professor Tom Mitchell”, one of whose interests is a project called “WebKB”.
“Sean Slattery” is one of the students on the project, and Professor
Mitchell is his advisor. Finally, Tom
Mitchell is a member of the CS CMU faculty, which contains many other faculty
members. How do we get from the raw
data to this type of analysis?
Let us consider the more limited task of
simply recognizing which webpage corresponds to which type of entity. The most standard approach is to classify
the webpages into one of several categories: faculty, student, project, etc,
using the words in the webpage as features, e.g., using the naïve Bayes model.
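A minimal sketch of this baseline, using scikit-learn’s naive Bayes; the pages and labels are invented stand-ins for the real labeled corpus (the WebKB data mentioned later).

```python
# Naive Bayes webpage classification from bag-of-words features.
# The toy pages and labels below are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pages = ['professor teaching research publications students',
         'student advisor coursework homepage hobbies',
         'project members publications software download']
labels = ['faculty', 'student', 'project']

vec = CountVectorizer()
X = vec.fit_transform(pages)          # word counts as features
clf = MultinomialNB().fit(X, labels)

new_page = ['research students publications teaching']
print(clf.predict(vec.transform(new_page)))   # -> likely 'faculty'
```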
A more interesting approach is based on
the realization that this domain is
relational.
It has objects that are linked to each
other.
And the links are very
meaningful in and of themselves.
For
example, a student webpage is very likely to point to his advisor’s page.
In the Kansas slide, I showed that by making
links first-class citizens in the model, we can introduce a probabilistic
model over them.
Indeed, we can represent precisely this dependency, by asserting that the existence of a link depends on the categories of the two pages it connects. This allows us to use, for example, a webpage that we are fairly sure is a student page to give evidence about the fact that a page it points to is a faculty webpage.
In fact, they can both give evidence about
each other, giving rise to a form of “collective classification”.
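Here is a toy sketch of collective classification; the link-compatibility numbers are invented, and the update rule is a simplified iterative scheme rather than the exact inference used in this work.

```python
# Collective classification sketch: each page starts with a content-only
# belief; beliefs then flow across links. All numbers are invented
# (a student page often links to a faculty page).
content_belief = {                      # P(category) from words alone
    'page1': {'student': 0.7, 'faculty': 0.3},
    'page2': {'student': 0.5, 'faculty': 0.5},   # ambiguous from content
}
links = [('page1', 'page2')]            # page1 points to page2
compat = {('student', 'faculty'): 0.9,  # P(link | source cat, target cat)
          ('student', 'student'): 0.1,
          ('faculty', 'faculty'): 0.4,
          ('faculty', 'student'): 0.6}

belief = {p: dict(b) for p, b in content_belief.items()}
for _ in range(10):                     # iterate until (roughly) converged
    for src, tgt in links:
        for cat in belief[tgt]:
            belief[tgt][cat] = content_belief[tgt][cat] * sum(
                belief[src][c] * compat[(c, cat)] for c in belief[src])
        z = sum(belief[tgt].values())
        belief[tgt] = {c: v / z for c, v in belief[tgt].items()}

print(belief['page2'])                  # pushed toward 'faculty' by the link
```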
Yet another place where links are useful is if we explicitly model the notion
of a directory page.
If a page is a
faculty directory, it is highly likely to point to faculty webpages.
Thus, we can use evidence about a faculty
webpage that we are fairly certain about to infer that a page pointing to it
is probably a faculty directory, and based on that increase our belief that
other pages that this page points to are also faculty pages.
This is just the web of influence applied to this domain!
Finally, we can also use properties of
the links themselves, e.g., the anchor text around the link. And that also gives us information which we
can use for classification.
The results in the previous slides were obtained by training on a dataset that someone classified for us, i.e., a poor student (or two, or three) sat for weeks and labeled each webpage as faculty, student, etc. We then used that model, unchanged, for our reasoning tasks.
But do we really expect that our agents will get this nicely categorized training data for every task that they would have to consider?
Do we really want our agent
to stop learning once it hits the real world?
Fortunately, our ability to perform general purpose reasoning, as in these
examples, can also be used to bootstrap our learning.
How?
Let’s consider our university domain.
There, no-one tells us how bright students are, or how hard classes
are.
All we have are grades.
Assume that our agent starts out with a
pretty bad model for our university domain, whether elicited by hand, or
randomly selected.
We can use the model
in our standard reasoning algorithm, and figure out how likely each student is
to be intelligent, and how likely each class to be difficult.
These are not good conclusions, but they
give us one possible “filling in” for the data.
We can then use that filled-in data, as if it were real, to train a better model, using the standard learning techniques that I mentioned earlier.
We can then
reiterate this process.
Somewhat surprisingly, this process not only converges, it converges to something that’s actually pretty good. In the end, the
model is pretty close, but even more impressive, the likely values of the
intelligence of the students and the difficulty of the classes are exactly
right!
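Here is a minimal EM sketch in this spirit; for brevity it keeps only one hidden variable (intelligence) and uses invented data, rather than the full student/course model.

```python
# EM bootstrap sketch: student intelligence is never observed, only a row
# of grades (1 = A, 0 = not A). Starting from a deliberately poor model, we
# alternate filling in the hidden intelligence (E-step) and re-estimating
# the model from the filled-in data (M-step). Data and initial parameters
# are invented for illustration.
grades = [[1, 1, 1, 0], [1, 1, 1, 1], [0, 0, 1, 0], [0, 1, 0, 0]]

pi, p_smart_a, p_avg_a = 0.5, 0.6, 0.4    # poor initial model
for _ in range(50):
    # E-step: P(smart | this student's grades) under the current model
    resp = []
    for row in grades:
        like_s, like_a = pi, 1 - pi
        for g in row:
            like_s *= p_smart_a if g else (1 - p_smart_a)
            like_a *= p_avg_a if g else (1 - p_avg_a)
        resp.append(like_s / (like_s + like_a))
    # M-step: re-fit the model as if the filled-in data were real
    pi = sum(resp) / len(resp)
    p_smart_a = sum(r * sum(row) for r, row in zip(resp, grades)) / \
                sum(r * len(row) for r, row in zip(resp, grades))
    p_avg_a = sum((1 - r) * sum(row) for r, row in zip(resp, grades)) / \
              sum((1 - r) * len(row) for r, row in zip(resp, grades))

print([round(r, 2) for r in resp])   # students 1-2 recovered as "smart"
```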
I could show you how EM improves
accuracy of the learned models, but for reasons of time, I will focus on a
more interesting use of the algorithm.
Note that our algorithm learned from data where Intelligence and Difficulty
were never observed. It was essentially
inventing these concepts, i.e., defining for itself a concept of “smart”
versus “weak” students, based on the data.
Of course, in this example, the concepts were there to begin with, in
the data we generated. Does this work
in real data?
The learning algorithm automatically
discovered coherent categories in actors, e.g., “tough guys” and serious drama
actors. We had no attributes for actors
except their gender, so it only discovered these categories by seeing how the
objects interact with other objects in the domain.
I’ve discussed several key tasks in building an intelligent agent, but
they are far from being enough. Let’s
go back to Hal.
In this short video clip, we see Hal perform several important tasks.
It hears the speech, analyzes the language,
and understands it.
It also plans a
course of action in a way that is intended to achieve its goals.
Each of these is a big and important area by
itself.
In both, there has been recent
work using techniques that are similar to the ones I mentioned.
I will talk about perception in a moment.
For planning, some key tasks
include representation, reasoning, and learning.
There has been tremendous progress on all
three of these tasks in the past few years.
Let us consider the perception problem.
In perception, the world generates a percept, which is the input to the
agent. The goal in the perception task
is to figure out which possible world generated the percept. There are many different instantiations of
this problem.
However, this is a very hard problem.
In general, people have a hard time understanding how hard perception
is for an automated agent, because it is so easy for people.
It’s obvious, looking at a chair, that it is
a chair.
Why is it so hard?
To illustrate how hard this is for an agent,
here is a perception task that is also hard for people.
What is this blob?
Even if I tell you that it’s a letter, it’s
still hard to tell.
But what if I tell
you that this letter is part of this word?
This leads us directly into the idea of
probabilistic perception, where the agent has some prior knowledge about what
particular worlds are likely.
In our example above, knowing the entire word didn’t eliminate any of the options for the letter; it simply made “a” much more likely.
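A tiny sketch of this combination, with invented numbers: the posterior over the letter multiplies a pixel likelihood by a prior supplied by the word context.

```python
# Probabilistic perception sketch: the pixels alone leave the letter
# ambiguous, but a prior from the word context tips the balance.
# All numbers are invented for illustration.
p_blob_given_letter = {'a': 0.5, 'o': 0.5}   # pixels can't decide
p_letter_given_word = {'a': 0.9, 'o': 0.1}   # "c_t": "cat" far likelier than "cot"

post = {L: p_blob_given_letter[L] * p_letter_given_word[L]
        for L in p_blob_given_letter}
z = sum(post.values())
print({L: round(p / z, 2) for L, p in post.items()})   # {'a': 0.9, 'o': 0.1}
```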
This approach has had tremendous successes over the past 15 years, in a
variety of very different perception tasks.
This is a great idea. But what are
the current limitations? Current models
used in probabilistic perception are at the surface level.
This idea arises in many perception tasks. In vision, we can have
high-level models of activities, and of how people interact with each other,
giving us much better models for visual tracking. However, I don’t have much time left,
so I’ll spend just a couple of slides
talking
about it in the context of natural language processing.
Let’s look at a simple problem in
language understanding.
What kind of reasoning did we use to
disambiguate this sentence? We are just
using commonsense patterns about objects, their properties, and the links
between them. But it is all
probabilistic. If we had an appropriate
probabilistic relational knowledge base, perhaps we could use web of influence
reasoning to reach similar conclusions.
This idea, of using a relational
probabilistic language for disambiguation, was first proposed by Goldman and
Charniak.
They got stuck in a place where AI often gets stuck --- knowledge acquisition. Where do these models come from?
Well, perhaps learning can
come to our rescue.
Current models are at the syntactic level, e.g., word-word correlations.
These contain a surprising amount of
semantic information.
But that
information is at the syntactic level, which substantially limits its
applicability.
As a simple example, although we can learn that a “teacher” often appears in the same sentence as “train”, and we can use that to disambiguate some sentences, we won’t be able to use it in contexts where a sentence mentions “Sarah”, whom we happen to know, from previous sentences, to be a teacher.
In AI, as in many communities, we have
the tendency to divide a problem into well-defined pieces, and make progress
on each one. But as we make progress,
the problems tend to move away from each other.
Part of our solution to the AI problem must involve building bridges
between the pieces, to create a unified whole.
In this talk I’ve described three bridges, which build on each other.
Perhaps, if we continue making progress and building bridges, we will
eventually be able to build a Hal.