The title of this talk is “Representation, Reasoning, Learning”.  These are three of the cornerstones of artificial intelligence, and are critical components in the design of an intelligent agent.  Our natural tendency to form communities has often resulted in treating these tasks as separate endeavors.  One of my goals in this talk is to show that we have much to gain from viewing them as three parts of a single endeavor, and to provide one framework in which they all fit together.
In the past 50 years, AI has come a long way.  We have achieved tremendous successes on a wide variety of tasks, such as scheduling, autonomous navigation, chess, and many more.
This list of applications was a biased selection.  It was selected to celebrate the fact that this is 2001, and we have achieved many of the goals for this year, as set forth by Stanley Kubrick in his famous movie, 2001: A Space Odyssey.
But if this is 2001, where is Hal? Unfortunately, the Hal project has fallen behind schedule. In this talk, I would like to argue that by putting together some of the work done over the last 50 years, we can get one step closer towards building a Hal.
My first building block is representation, the process by which the agent encodes its  knowledge about the world.
There are many design decisions one has to make when designing a representation language.  One of the most important involves the extent to which the representation encodes the rich structure in the world, and makes it transparent to the reasoning and learning algorithms. Along this spectrum, there are three common choices, which I will discuss in turn.
In the most basic representations, the agent views each possible state of the world as an indivisible atom.  Any structure in the state of the world is hidden inside a black box, and algorithms can only ask certain well-specified questions, such as “is this state a check-mate”.  In this case, algorithms must do their work by enumerating individual states.  Unfortunately, the state space in most problems is very large.
A somewhat more transparent representation is based on attributes or propositions.  Here, the representation makes an explicit commitment that the world is described by a set of attributes or (in the binary case) propositional symbols.  The knowledge base tells us how these different symbols interact with each other.  Semantically, each possible world that the agent considers is an assignment of values to these attributes.  This more transparent representation gives us a lot more power.  For example, using logical inference techniques we can often conclude that certain propositions must take a specific truth value, without enumerating the entire space of truth assignments.
However, propositional reasoning has its own problems.  In many cases, it requires that we repeat what is basically the same piece of information many many times.  For example, if we wanted to write down the rules of chess (just those governing the single-move successor function), we would have to restate the rules for pawn movement separately for every board position.  In fact, this process would require about 10^7 axioms.  I can’t write even one of these axioms down for you, because it would be too long, so let me use a simpler example, one which will be my running example for most of this talk.  
The reason for the tremendous representation inefficiency of propositional logic is that it does not contain the concept of an object.  When we think about the world, we always think about objects --- people, cars, cities, even talks --- are viewed as separate entities.  Each of these has its own properties, or attributes.  And, they are all linked to each other via a complex network of interactions.  A yet more transparent representation is relational logic.  In this case, the possible worlds specify a set of objects, their properties, and the relations between them.  By making objects explicit, we can now generalize over them, specifying universal patterns that apply to all objects.  This allows us to state many things at once, and provide a much more compact representation.  In particular, the rules of chess now require only about 100 axioms.
Many of the standard methods can be categorized along this axis.  But there is also a second important design decision --- how do we represent the epistemic state of the agent, its beliefs about the world.  Two common choices are categorical, or deterministic, beliefs, and probabilistic beliefs.
In a categorical representation, each assertion must fall into one of three categories: necessarily true, impossible, or “who knows”.  This forces us into a very crude approximation of our knowledge.  We can either write down a meaningful universal statement, which is typically false, or a true one, which is typically useless.  We need something that expresses the finer shades of possibility and probability.
One particularly successful choice is provided by the probabilistic framework.  Here, like in the categorical case, our semantics is based on a set of possible worlds.  For example, here our worlds correspond to the difficulty of the course, the intelligence of the student, and the grade the student received in the course.  Note that one of our worlds has a smart student getting a C in an easy class. 
But, unlike the categorical case, we can also ascribe likelihoods to each world, by associating each with a probability, allowing us to make it improbable but not impossible for a smart student to get a C.
Thus, we can see that it is unlikely for a smart student to get a C in an easy class.  And the same model can be used to tell us that, although it is possible for a C student to be smart, it is not very likely.
One of the key benefits of the propositional representation was our ability to represent our knowledge without explicit enumeration of the worlds.  It turns out that we can do the same in the probabilistic framework.  The key idea, introduced by Pearl in the Bayesian network framework, is to use locality of interaction.  This is an assumption which seems to be a fairly good approximation of the world in many cases.
This representation supports many interesting reasoning patterns.  For example, if we observe that the student gets an A, the probability that he is smart goes up.  But if we then find out that the class was easy, it reduces the impact of this evidence, and our probability goes down, although not all the way.
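To make this concrete, here is a minimal sketch of the student network and the “explaining away” pattern just described, written in plain Python.  The structure follows the example above, but all of the probability numbers are illustrative assumptions of my own, not values from the slides.

```python
# Minimal sketch of the Difficulty/Intelligence/Grade network in plain Python.
# All probability numbers below are illustrative assumptions, not values from the talk.
from itertools import product

P_D = {"easy": 0.6, "hard": 0.4}                 # P(Difficulty)
P_I = {"smart": 0.3, "average": 0.7}             # P(Intelligence)
P_G = {                                          # P(Grade | Difficulty, Intelligence)
    ("easy", "smart"):   {"A": 0.9, "B": 0.08, "C": 0.02},
    ("easy", "average"): {"A": 0.5, "B": 0.3,  "C": 0.2},
    ("hard", "smart"):   {"A": 0.6, "B": 0.3,  "C": 0.1},
    ("hard", "average"): {"A": 0.1, "B": 0.4,  "C": 0.5},
}

def joint(d, i, g):
    # Locality of interaction: the joint factorizes as P(D) * P(I) * P(G | D, I).
    return P_D[d] * P_I[i] * P_G[(d, i)][g]

def prob_smart(evidence):
    """P(Intelligence = smart | evidence), by enumerating the twelve worlds."""
    num = den = 0.0
    for d, i, g in product(P_D, P_I, ["A", "B", "C"]):
        world = {"D": d, "I": i, "G": g}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(d, i, g)
            den += p
            if i == "smart":
                num += p
    return num / den

print(prob_smart({}))                        # prior belief that the student is smart
print(prob_smart({"G": "A"}))                # goes up after observing the A
print(prob_smart({"G": "A", "D": "easy"}))   # comes back down once we learn the class was easy
```

Running it shows the qualitative pattern from the slide: observing the A raises the probability that the student is smart, and subsequently learning that the class was easy brings it back down, though not all the way to the prior.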
However, this representation suffers from the same problem as other propositional representations.  We have to create separate representational units (propositions) for the different entities in our domain.  And the problem is that these instances are not independent.  For example, the difficulty of CS101 in one network is the same difficulty as in another, so evidence in one network should influence our beliefs in another.
Several other representations now fall neatly into our axes.  But there seems to be an intriguing hole.
AI has witnessed a long standing “conflict”: logic versus probability.  I would like to claim that these are complementary approaches, not conflicting ones, and that we can gain a lot by putting them together.
Let us consider an imaginary university called St. Nordaf.  St. Nordaf has two faculty, two students, two courses, and three registrations of students in courses, each of which is associated with a registration record.  These objects are linked to each other: professors teach classes, students register in classes, etc.  Each of the objects in the domain has properties that we care about.
St. Nordaf would like to do some reasoning about its university, so it calls Acme Consulting Company.  Acme tells them that the first thing they have to do is to describe their university in some organized form, i.e., a relational database.  The schema of a database tells us what types of objects we have, which of their attributes are of interest, and how the objects can relate to each other.
The possible worlds for St. Nordaf consist of all possible assignments of values to all attributes of all objects in the domain.  A probabilistic model over these worlds would tell us how likely each one is, allowing us to do some inferences about how the different objects interact, just as we did in our simple Bayesian network.
Unfortunately, this is a very large number of worlds, and to specify a probabilistic model we have to specify a probability for each one. 
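As a back-of-the-envelope illustration, here is a count of the worlds for this tiny university.  The object counts come from the example above, but the attribute list and the number of values per attribute are illustrative assumptions of my own, not St. Nordaf’s actual schema.

```python
# Count the possible worlds for the toy St. Nordaf schema.
# Object counts follow the example; attributes and arities are illustrative assumptions.
objects = {
    "professor":    (2, {"teaching-ability": 2}),
    "student":      (2, {"intelligence": 2}),
    "course":       (2, {"difficulty": 2}),
    "registration": (3, {"grade": 3, "satisfaction": 3}),
}

worlds = 1
for count, attrs in objects.values():
    for arity in attrs.values():
        worlds *= arity ** count

print(worlds)  # 46656 worlds -- and a full probabilistic model needs a number for each one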
Furthermore, the resulting distribution would be good only for a limited time.  If St. Nordaf hired a new faculty member, or got a new student, or even if the two students registered for different classes next year, it would no longer apply, and St. Nordaf would have to pay Acme Consulting all over again.  Thus, we want a model that holds for the infinitely many potential universities that fit this very simple schema.  So we seem to be stuck with an impossible problem: how do we represent an infinite set of possible distributions, each of which is by itself very complex?
The two key ideas that come to our rescue derive from the two approaches that we are trying to combine.  From relational logic, we have the notion of universal patterns, which hold for all objects in a class.  From Bayesian networks, we have the notion of locality of interaction, which in the relational case has a particular twist:  Links give us precisely a notion of “interaction”, and thereby provide a roadmap for which objects can interact with each other.
In this example, we have a template, like a universal quantifier for a probabilistic statement.  It tells us:  “For any registration record in my database, the grade of the student in the course depends on the intelligence of that student and the difficulty of that course.”  This dependency will be instantiated for every object (of the right type) in our domain.  It is also associated with a conditional probability distribution that specifies the nature of that dependence.  We can also have dependencies that span several links, e.g., a student’s satisfaction in a course may depend on the teaching ability of the professor who teaches it.
The semantics of this model is that of a generator: for any domain that we might encounter, it produces an appropriate probabilistic model.  The instantiation is the embodiment of universals on the one hand, and of locality of interaction, as specified by the links, on the other.
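Here is a small sketch of what that instantiation step looks like: for every registration record in a relational skeleton, we unroll one copy of the template dependency, and all copies share the same conditional probability table.  The skeleton below is an illustrative stand-in, not St. Nordaf’s actual database.

```python
# Unroll the template "Grade(r) depends on Intelligence(student(r)) and
# Difficulty(course(r))" over a relational skeleton.
# The records below are an illustrative stand-in for the real database.

registrations = [            # (registration id, student, course)
    ("r1", "jane", "cs101"),
    ("r2", "jane", "geo101"),
    ("r3", "john", "cs101"),
]

ground_network = []          # list of (child variable, parent variables)
for rec, student, course in registrations:
    child = f"Grade({rec})"
    parents = [f"Intelligence({student})", f"Difficulty({course})"]
    # Every instantiation reuses the same conditional probability table,
    # which is the probabilistic analogue of a universally quantified rule.
    ground_network.append((child, parents))

for child, parents in ground_network:
    print(child, "<-", ", ".join(parents))
```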
Note that the objects in the resulting network are not independent.  The links between objects induce direct correlations, and indirect correlations continue to propagate, leading to a complex web of influence that ties everything in the domain together.  While this seems complicated, it is actually a good picture of what life is like.  As we live our lives, we interact with each other, and with the world around us.  These interactions often influence us, causing us to change.
This web of influence has interesting ramifications from the perspective of the types of reasoning patterns that it supports.  Consider Forrest Gump.  A priori, we believe that he is pretty likely to be smart.  Evidence about two classes that he took changes our probabilities only very slightly.  However, we see that most people who took CS101 got A’s.  In fact, even people who did fairly poorly in other classes got an A in CS101.  Therefore, we believe that CS101 is probably an easy class.  To get a C in an easy class is unlikely for a smart student, so our probability that Forrest Gump is smart goes down substantially.
But this approach is more than just a generator for infinitely many Bayesian networks.  It allows us to represent distributions that are entirely outside the scope of propositional representations: ones about the existence of objects and links in the domain. It also supports interesting reasoning patterns that are associated with these concepts.
Representation is not a goal in and of itself.  To be useful, a knowledge base must allow us to get answers to interesting questions, in an effective way.  Unfortunately, this is not so easy.  The complexity of reasoning tasks has been a long-standing problem in AI.  It was given a physical manifestation by Henry Kautz in his 1989 Computers and Thought talk, and reincarnated by Stuart Russell 6 years later.  Now, six years after that, I’d like to reintroduce the complexity monster.
It is well-known that the more expressive a representation language is, the harder it is to perform inference in that language.  Clearly, the corollary is that we should use less expressive representations.  So why am I trying to sell you a language that combines relational logic on the one hand and probabilities on the other?  Well, complexity results can often be misleading.  Polynomial time in the size of the representation is great, but not if that size is itself exponential.  Specifically, many tasks are linear time in an atomic representation --- we simply enumerate the worlds.  But the number of worlds is huge.  Thus, more expressive languages are not necessarily harder.  Furthermore, expressive languages often contain structure that allows us to perform tasks much more efficiently.  So, by picking the “right” type of expressive power, we can often do inference much more efficiently.
Divide and conquer inference allows us to perform inference separately over different parts of the knowledge base, and to combine results by sending a “message” involving only a small number of entities, e.g. (in this simple case), those in the intersection between the parts.  In general, this extends to hierarchical structures.
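Here is a tiny sketch of this divide-and-conquer idea on a factored model with two pieces that share a single variable; the factors and numbers are illustrative assumptions.

```python
# Divide-and-conquer inference over two pieces of a factored model that share
# only the variable S.  All factor values are illustrative assumptions.

part1 = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}   # factor over (A, S)
part2 = {(0, 0): 0.6, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.5}   # factor over (S, B)

# The "message" summarizes part 1 using only the shared variable S.
message = {s: sum(part1[(a, s)] for a in (0, 1)) for s in (0, 1)}

# Part 2 never needs to look inside part 1 --- it only sees the message.
marginal_b = {b: sum(message[s] * part2[(s, b)] for s in (0, 1)) for b in (0, 1)}
total = sum(marginal_b.values())
print({b: p / total for b, p in marginal_b.items()})            # P(B), normalized
```

In a hierarchical model the same trick applies recursively: each part summarizes itself to its neighbors through small messages over shared variables.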
Why do we think that this is a useful property?  Because the world does exhibit a large amount of locality, and is often hierarchically structured.  This, in fact, was one of the key points in Herbert Simon’s seminal essay “The Architecture of Complexity”.  Simon went on to argue in that essay that, to the extent that the world is not organized in this way, we simply cannot make sense of it, so that this assumption is the only way by which we can deal with the real world.
Our representation makes object boundaries explicit, allowing their structure to be exploited by the algorithm.  Note that it is very important to make the structure explicit.  Although inference algorithms can try to find decompositions on their own, this is a hard problem by itself.  The algorithm can gain tremendously by having the structure made explicit by the representation.
This turns out to help a lot, but not quite enough.
Many algorithms treat the first-order statements as a simple macro language, and generate ground propositions from them.  In 1965, Robinson was the first to propose the idea of performing inference at the level of abstract rather than concrete entities.
We can extend these ideas to the probabilistic case, albeit in a more limited way than in theorem proving.  Rather than reasoning separately for each student in our domain, we do probabilistic inference for a “generic” student x, and then assume we have the right number of copies.
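As a sketch of this idea: compute the answer once for a generic student x, then reuse it for all students that are interchangeable a priori.  The probabilities below reuse the illustrative CPDs from the earlier sketch; they are assumptions, not values from the talk.

```python
# Lifted reasoning about a "generic" student x, rather than one model per student.
# The probabilities reuse the illustrative CPDs from the earlier sketch (assumptions).

p_smart, p_easy = 0.3, 0.6
p_A = {  # P(Grade = A | Difficulty, Intelligence), illustrative
    ("easy", "smart"): 0.9, ("easy", "average"): 0.5,
    ("hard", "smart"): 0.6, ("hard", "average"): 0.1,
}

# One computation for the generic student x: P(Grade(x) = A).
p_generic_A = sum(
    (p_easy if d == "easy" else 1 - p_easy)
    * (p_smart if i == "smart" else 1 - p_smart)
    * p_A[(d, i)]
    for d in ("easy", "hard")
    for i in ("smart", "average")
)

# Because the students are interchangeable a priori, we never build one
# network per student; we simply reuse the generic answer.
n_students = 1000
print(p_generic_A)                # marginal for any one student
print(n_students * p_generic_A)   # expected number of A's across all students
```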
One remaining problem is that everything is potentially correlated with everything else via the web of influence.  So, potentially our model must include every single entity in the world.  In general, our models can end up being unreasonably large.  Is that necessary?  The key idea is that many influences, although present, are fairly “small”, and we would like to ignore them.
In the deterministic framework, it is not clear what this means.  Either a fact changes the answer, in which case it is relevant, or it doesn’t, in which case it isn’t.  There is no notion of “relevant, but only a little”.  In the probabilistic framework, we have a well-defined semantics.
Furthermore, if we assume that the world is really stochastic, then “weak influence” is a very common state of affairs.  One can show formally that, in many settings, things are only weakly relevant.  This type of analysis has led to the development of a wide range of approximate inference algorithms.
To summarize, we saw that certain key aspects of our representation gave us a range of tools against the complexity monster.
Perhaps, if we attack the monster with all of these weapons, as well as some other ones still to be developed, we can finally vanquish the monster.
To have a knowledge base that we can use, we must somehow address one of the problems that has been a major obstacle in AI, that of knowledge acquisition.  One approach is to learn a knowledge base from data.  Note that this problem is somewhat different from the learning problems that we think about most of the time.  There is no clearly defined classification or performance task.  Our goal is to learn a knowledge base for general-purpose reasoning.
To check whether we can actually learn a knowledge base that is useful for general purpose reasoning from a real-world database, we looked at a database of Tuberculosis patients in San Francisco.  This is a database gathered at the TB clinic, and is used by the health-care professionals to reason about transmission of TB and to perform TB control.
Our input was the raw database, and our goal was to see whether we could automatically reconstruct the commonsense knowledge that the TB health care professional has.  We discovered many useful patterns, mostly ones that are known but also some new ones.  For example, gender is correlated with HIV status.  But more surprising (and more novel), it is also correlated with the type of tuberculosis.  And there are also interactions between objects.  Contacts of foreign-born patients are more likely to have TB, but are also more likely to obtain treatment, because of the tightness of the social structure.
Of course, data is not always so nicely arranged for us as in a relational database.  Let us consider the biggest source of data --- the world wide web.  Consider the webpages in a computer science department.  Here is one webpage, which links to another.  This second webpage links to a third, which links back to the first two.   There is also a webpage with a lot of outgoing links to webpages on this site.  This is not nice clean data.  Nobody labels these webpages for us, and tells us what they are.  We would like to learn to understand this data, and conclude from it that we have a “Professor Tom Mitchell” one of whose interests is a project called “WebKB”.  “Sean Slattery” is one of the students on the project, and Professor Mitchell is his advisor.  Finally, Tom Mitchell is a member of the CS CMU faculty, which contains many other faculty members.  How do we get from the raw data to this type of analysis?
Let us consider the more limited task of simply recognizing which webpage corresponds to which type of entity.  The most standard approach is to classify the webpages into one of several categories: faculty, student, project, etc, using the words in the webpage as features, e.g., using the naïve Bayes model.
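Here is a minimal sketch of this flat classification approach, using a bag-of-words naïve Bayes model (here via scikit-learn); the toy pages and labels are made up for illustration.

```python
# Flat webpage classification with a naive Bayes model over the words on each page.
# The toy pages and labels below are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pages = [
    "professor research publications teaching cs101",
    "phd student advisor research assistant",
    "project webkb machine learning source code",
]
labels = ["faculty", "student", "project"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(pages)          # bag-of-words features
model = MultinomialNB().fit(X, labels)

new_page = ["student working with advisor on the webkb project"]
print(model.predict(vectorizer.transform(new_page)))
```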
A more interesting approach is based on the realization that this domain is relational.  It has objects that are linked to each other.  And the links are very meaningful in and of themselves.  For example, a student webpage is very likely to point to his advisor’s page.  In the Kansas slide, I showed that by making links first-class citizens in the model, we can introduce a probabilistic model over them.  Indeed, we can represent precisely this dependency, by asserting that the existence of a link depends on the categories of the two pages it connects.  This allows us to use, for example, a webpage that we are fairly sure is a student page to provide evidence that a page it points to is a faculty webpage.  In fact, they can both give evidence about each other, giving rise to a form of “collective classification”.
Yet another place where links are useful is if we explicitly model the notion of a directory page.  If a page is a faculty directory, it is highly likely to point to faculty webpages.  Thus, we can use evidence about a faculty webpage that we are fairly certain about to infer that a page pointing to it is probably a faculty directory, and based on that increase our belief that other pages that this page points to are also faculty pages.
This is just the web of influence applied to this domain!
Finally, we can also use properties of the links themselves, e.g., the anchor text around the link.  And that also gives us information which we can use for classification.
The results in the previous slides were obtained by training on a dataset that someone classified for us, i.e., a poor student (or two, or three) sat for weeks and labeled each webpage as faculty, student, etc.  We then used that model, unchanged, for our reasoning task.  But do we really expect that our agent will get this nicely categorized training data for every task it has to consider?  Do we really want our agent to stop learning once it hits the real world?
Fortunately, our ability to perform general purpose reasoning, as in these examples, can also be used to bootstrap our learning.  How?  Let’s consider our university domain.  There, no one tells us how bright students are, or how hard classes are.  All we have are grades.  Assume that our agent starts out with a pretty bad model for our university domain, whether elicited by hand or randomly selected.  We can use the model in our standard reasoning algorithm, and figure out how likely each student is to be intelligent, and how likely each class is to be difficult.  These are not good conclusions, but they give us one possible “filling in” for the data.  We can then use that filled-in data, as if it were real, to train a better model, using the standard learning techniques that I mentioned earlier.  We can then repeat this process.
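Here is a sketch of that bootstrap loop (essentially a “hard EM” procedure) on a tiny made-up dataset in which only the grades are ever observed; the dataset, random initialization, and smoothing are all illustrative assumptions of my own.

```python
# Sketch of the bootstrap loop ("hard EM"): fill in the hidden Intelligence and
# Difficulty values with their most likely joint assignment under the current
# model, retrain on the filled-in data as if it were real, and repeat.
# The dataset, initialization, and smoothing are all illustrative assumptions.
import random
from itertools import product

random.seed(0)
students, courses = ["jane", "john"], ["cs101", "geo101"]
records = [("jane", "cs101", "A"), ("jane", "geo101", "C"), ("john", "cs101", "A")]
GRADES = ["A", "B", "C"]

def random_cpd():
    weights = [random.random() for _ in GRADES]
    total = sum(weights)
    return dict(zip(GRADES, (w / total for w in weights)))

# A "pretty bad" randomly chosen starting model.
p_smart, p_easy = 0.5, 0.5
p_grade = {(d, i): random_cpd() for d in ("easy", "hard") for i in ("smart", "average")}

for _ in range(20):
    # Fill in: most likely joint assignment to all hidden variables given the grades.
    best, best_p = None, -1.0
    for intel in product(("smart", "average"), repeat=len(students)):
        for diff in product(("easy", "hard"), repeat=len(courses)):
            I, D = dict(zip(students, intel)), dict(zip(courses, diff))
            p = 1.0
            for s in students:
                p *= p_smart if I[s] == "smart" else 1 - p_smart
            for c in courses:
                p *= p_easy if D[c] == "easy" else 1 - p_easy
            for s, c, g in records:
                p *= p_grade[(D[c], I[s])][g]
            if p > best_p:
                best, best_p = (I, D), p
    I, D = best

    # Retrain: re-estimate the model from the filled-in data
    # (with add-one smoothing so no probability becomes exactly zero).
    p_smart = sum(v == "smart" for v in I.values()) / len(students)
    p_easy = sum(v == "easy" for v in D.values()) / len(courses)
    counts = {k: {g: 1.0 for g in GRADES} for k in p_grade}
    for s, c, g in records:
        counts[(D[c], I[s])][g] += 1
    p_grade = {k: {g: n / sum(cs.values()) for g, n in cs.items()}
               for k, cs in counts.items()}

print(I, D)      # the invented "smart"/"easy" labels (they may come out flipped)
print(p_grade)   # the learned grade CPD
```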
Somewhat surprisingly, it not only converges, it converges to something that’s actually pretty good.  In the end, the model is pretty close, and even more impressively, the likely values of the intelligence of the students and the difficulty of the classes are exactly right!
I could show you how EM improves accuracy of the learned models, but for reasons of time, I will focus on a more interesting use of the algorithm.  Note that our algorithm learned from data where Intelligence and Difficulty were never observed.  It was essentially inventing these concepts, i.e., defining for itself a concept of “smart” versus “weak” students, based on the data.  Of course, in this example, the concepts were there to begin with, in the data we generated.  Does this work in real data?
The learning algorithm automatically discovered coherent categories of actors, e.g., “tough guys” and serious drama actors.  We had no attributes for actors except their gender, so it discovered these categories only by seeing how the objects interact with other objects in the domain.
I’ve discussed several key tasks in building an intelligent agent, but they are far from being enough.  Let’s go back to Hal.
In this short video clip, we see Hal perform several important tasks.  It hears the speech, analyzes the language, and understands it.  It also plans a course of action in a way that is intended to achieve its goals.  Each of these is a big and important area by itself.  In both areas, there has been recent work using techniques that are similar to the ones I mentioned.  I will talk about perception in a moment.  For planning, some key tasks include representation, reasoning, and learning.  There has been tremendous progress on all three of these tasks in the past few years.
Let us consider the perception problem.  In perception, the world generates a percept, which is the input to the agent.  The goal in the perception task is to figure out which possible world generated the percept.  There are many different instantiations of this problem. 
However, this is a very hard problem.  In general, people have a hard time understanding how hard perception is for an automated agent, because it is so easy for people.  It’s obvious, looking at a chair, that it is a chair.  Why is it so hard?  To illustrate how hard this is for an agent, here is a perception task that is also hard for people.  What is this blob?  Even if I tell you that it’s a letter, it’s still hard to tell.  But what if I tell you that this letter is part of this word?
This leads us directly into the idea of probabilistic perception, where the agent has some prior knowledge about what particular worlds are likely.  In our example above, the entire word didn’t eliminate any of the options for the letter, but it was simply much more likely to be an “a”.
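As a sketch, the letter-in-a-word example is just Bayes’ rule: the shape of the blob gives a likelihood over letters, and the surrounding word supplies the prior.  All of the numbers below are illustrative assumptions.

```python
# Bayes-rule sketch of the letter-in-context example: the blob's shape gives a
# likelihood over letters, and the surrounding word supplies the prior.
# All numbers are illustrative assumptions.

letters = ["a", "o", "u"]
p_blob_given_letter = {"a": 0.4, "o": 0.35, "u": 0.25}   # shape alone is ambiguous

def posterior(prior):
    unnormalized = {x: prior[x] * p_blob_given_letter[x] for x in letters}
    z = sum(unnormalized.values())
    return {x: p / z for x, p in unnormalized.items()}

uniform_prior = {x: 1 / 3 for x in letters}      # no context: near three-way tie
word_prior = {"a": 0.90, "o": 0.07, "u": 0.03}   # the surrounding word favors "a"

print(posterior(uniform_prior))   # still ambiguous
print(posterior(word_prior))      # overwhelmingly "a", though the others are not impossible
```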
This approach has had tremendous successes over the past 15 years, in a variety of very different perception tasks.
This is a great idea.  But what are the current limitations?  Current models used in probabilistic perception are at the surface level.
This idea arises in many perception tasks.  In vision, we can have high-level models of activities, and of how people interact with each other, giving us much better models for visual tracking.  However, I don’t have much time left, so I’ll spend just a couple of slides talking about it in the context of natural language processing.
Let’s look at a simple problem in language understanding.
What kind of reasoning did we use to disambiguate this sentence?  We are just using commonsense patterns about objects, their properties, and the links between them.  But it is all probabilistic.  If we had an appropriate probabilistic relational knowledge base, perhaps we could use web of influence reasoning to reach similar conclusions.
This idea, of using a relational probabilistic language for disambiguation, was first proposed by Goldman and Charniak.  They got stuck in a place where AI often gets stuck --- knowledge acquisition.  Where do these models come from?  Well, perhaps learning can come to our rescue.
Current models are at the syntactic level, e.g., word-word correlations.  These contain a surprising amount of semantic information.  But that information is at the syntactic level, which substantially limits its applicability.  As a simple example, although we can learn that “teacher” often appears in the same sentence as “train”, and we can use that to disambiguate some sentences, we won’t be able to use it in contexts where the sentence mentions “Sarah”, who we happen to know, from previous sentences, is a teacher.
In AI, as in many communities, we have the tendency to divide a problem into well-defined pieces, and make progress on each one.  But as we make progress, the problems tend to move away from each other. 
Part of our solution to the AI problem must involve building bridges between the pieces, to create a unified whole.  In this talk I’ve described three bridges, which build on each other. 
Perhaps, if we continue making progress and building bridges, we will eventually be able to build a Hal.