
A somewhat more transparent representation is based on attributes, or propositions. Here, the representation makes an explicit commitment that the world is described by a set of attributes or (in the binary case) propositional symbols. The knowledge base tells us how these different symbols interact with each other. Semantically, each possible world that the agent considers is an assignment of values to these attributes. This more transparent representation gives us a lot more power. For example, using logical inference techniques we can often conclude that certain propositions must take a specific truth value, without enumerating the entire space of truth assignments.
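To make this concrete, here is a minimal sketch (my illustration, not from the talk) of one such inference technique, unit propagation, which derives forced truth values without enumerating all 2^n assignments:

```python
# A minimal sketch: possible worlds are truth assignments, and unit
# propagation over CNF clauses fixes some symbols' values directly.

def unit_propagate(clauses):
    """Repeatedly apply the unit-clause rule; return forced literals."""
    forced = {}
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unknown = []
            satisfied = False
            for lit in clause:
                var, val = lit.lstrip("~"), not lit.startswith("~")
                if var in forced:
                    if forced[var] == val:  # clause already satisfied
                        satisfied = True
                        break
                else:
                    unknown.append((var, val))
            if not satisfied and len(unknown) == 1:
                var, val = unknown[0]   # only one way left to satisfy it
                forced[var] = val
                changed = True
    return forced

# "smart or lucky" and "not lucky" force "smart" to be true:
print(unit_propagate([["smart", "lucky"], ["~lucky"]]))
# {'lucky': False, 'smart': True}
```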
The reason for the tremendous representational inefficiency of propositional logic is that it does not contain the concept of an object. When we think about the world, we think in terms of objects --- people, cars, cities, even talks --- viewed as separate entities. Each of these has its own properties, or attributes, and they are all linked to each other via a complex network of interactions. A yet more transparent representation is relational logic. In this case, the possible worlds specify a set of objects, their properties, and the relations between them. By making objects explicit, we can now generalize over them, specifying universal patterns that apply to all objects. This allows us to state many things at once, and provides a much more compact representation. In particular, the rules of chess now require only about 100 axioms.

In a categorical representation, each assertion must fall into one of three categories: necessarily true, impossible, or “who knows”. This forces us into a very crude approximation of our knowledge. We can either write down a meaningful universal, which is typically false, or a true universal, which is typically useless. We need something that expresses the finer shades of possibility and probability.

One particularly successful choice is provided by the probabilistic framework. Here, as in the categorical case, our semantics is based on a set of possible worlds. For example, here our worlds correspond to the difficulty of the course, the intelligence of the student, and the grade the student received in the course. Note that one of our worlds has a smart student getting a C in an easy class.

But, unlike in the categorical case, we can also ascribe likelihoods to the worlds, associating each with a probability. This allows us to make it improbable, but not impossible, for a smart student to get a C.
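As a hedged illustration (the numbers below are invented), here is what such a possible-worlds semantics looks like in code:

```python
# Worlds assign values to (difficulty, intelligence, grade); probabilities
# make a smart student getting a C in an easy class improbable, not impossible.

worlds = {
    ("easy", "smart",   "A"): 0.30,
    ("easy", "smart",   "B"): 0.08,
    ("easy", "smart",   "C"): 0.02,   # improbable, but not impossible
    ("easy", "average", "A"): 0.10,
    ("easy", "average", "B"): 0.10,
    ("easy", "average", "C"): 0.05,
    ("hard", "smart",   "A"): 0.10,
    ("hard", "smart",   "B"): 0.07,
    ("hard", "smart",   "C"): 0.03,
    ("hard", "average", "A"): 0.03,
    ("hard", "average", "B"): 0.05,
    ("hard", "average", "C"): 0.07,
}
assert abs(sum(worlds.values()) - 1.0) < 1e-9

def prob(pred):
    """Probability of the event defined by the predicate over worlds."""
    return sum(p for w, p in worlds.items() if pred(w))

# P(grade = C | easy class, smart student): small but nonzero.
num = prob(lambda w: w == ("easy", "smart", "C"))
den = prob(lambda w: w[0] == "easy" and w[1] == "smart")
print(num / den)  # 0.05
```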

Many of the standard methods can be categorized along this axis. But there is also a second important design decision: how do we represent the epistemic state of the agent, its beliefs about the world? Two common choices are categorical, or deterministic, beliefs, and probabilistic beliefs.

We will look at two classes of probabilistic graphical models that represent complex distributions compactly and allow us to reason efficiently. First, we will look at directed models, called Bayesian networks, and their extension to relational data, PRMs. We’ll then consider undirected models, such as Markov nets and relational Markov nets.

One of the key benefits of the propositional representation is our ability to represent our knowledge without explicit enumeration of the worlds. It turns out that we can do the same in the probabilistic framework. The key idea, introduced by Pearl in the Bayesian network framework, is to use locality of interaction: we represent the interactions of the variables using a graph structure, and for each variable we select as parents the variables that directly influence it. This is an assumption which seems to be a fairly good approximation of the world in many cases. Each CPT is independent of the others, so changing values locally does not affect the coherence of the model.
The product of the CPTs represents the full joint distribution over the variables. This distribution specifies answers to queries of the form P(X|E). The encoded distribution supports very subtle reasoning patterns. Suppose we find out that the letter was positive: the probability of a high grade goes up, increasing the probability of high intelligence and decreasing the probability that the class is difficult. However, if we later find out that the student has a high SAT score, the probability of high intelligence goes up further, and now the probability that the class is difficult increases.
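This reasoning pattern can be reproduced by brute-force enumeration over the joint. The sketch below uses illustrative CPT numbers (the classic student-network values, not necessarily the talk's tables), and the variable names are my own:

```python
# Variables: D(ifficulty), I(ntelligence), G(rade), S(AT), L(etter).
from itertools import product

P_D = {"easy": 0.6, "hard": 0.4}
P_I = {"low": 0.7, "high": 0.3}
P_G = {  # P(G | I, D)
    ("low",  "easy"): {"A": 0.30, "B": 0.40, "C": 0.30},
    ("low",  "hard"): {"A": 0.05, "B": 0.25, "C": 0.70},
    ("high", "easy"): {"A": 0.90, "B": 0.08, "C": 0.02},
    ("high", "hard"): {"A": 0.50, "B": 0.30, "C": 0.20},
}
P_S = {"low": 0.05, "high": 0.8}        # P(SAT = high | I)
P_L = {"A": 0.9, "B": 0.6, "C": 0.01}   # P(Letter = positive | G)

def joint(d, i, g, s, l):
    """Product of the CPTs = the full joint distribution."""
    ps = P_S[i] if s else 1 - P_S[i]
    pl = P_L[g] if l else 1 - P_L[g]
    return P_D[d] * P_I[i] * P_G[i, d][g] * ps * pl

def query(target, evidence):
    """P(target | evidence) by summing the joint over all worlds."""
    num = den = 0.0
    for d, i, g, s, l in product(P_D, P_I, "ABC", [True, False], [True, False]):
        w = {"D": d, "I": i, "G": g, "S": s, "L": l}
        p = joint(d, i, g, s, l)
        if all(w[k] == v for k, v in evidence.items()):
            den += p
            if all(w[k] == v for k, v in target.items()):
                num += p
    return num / den

# Positive letter: intelligence goes up, difficulty goes down.
print(query({"I": "high"}, {"L": True}))             # ~0.46 (prior: 0.30)
print(query({"D": "hard"}, {"L": True}))             # ~0.26 (prior: 0.40)
# A high SAT score explains the grade: difficulty goes back up.
print(query({"I": "high"}, {"L": True, "S": True}))  # ~0.93
print(query({"D": "hard"}, {"L": True, "S": True}))  # ~0.32
```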



One of the key benefits of the propositional representation was our ability to represent our knowledge without explicit enumeration of the worlds. It turns out that we can do the same in the probabilistic framework: the key idea, introduced by Pearl in the Bayesian network framework, is locality of interaction, an assumption which seems to be a fairly good approximation of the world in many cases. However, this representation suffers from the same problem as other propositional representations. We have to create separate representational units (propositions) for the different entities in our domain, and the problem is that these instances are not independent. For example, the difficulty of CS101 in one network is the same difficulty as in another, so evidence in one network should influence our beliefs in another.

Let us consider an imaginary university
called St. Nordaf. St. Nordaf has two
faculty, two students, two courses, and three registrations of students in
courses, each of which is associated with a registration record. These objects are linked to each other:
professors teach classes, students register in classes, etc. Each of the objects in the domain has
properties that we care about.

The university can be described in an organized form using a relational database. The schema of the database tells us what types of objects we have, what attributes of these objects are of interest, and how the objects can relate to each other.
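As a hypothetical rendering of such a schema (the class and attribute names below are invented, and the talk shows this as a relational database rather than code):

```python
# A sketch of the St. Nordaf schema: object types, their attributes of
# interest, and the links between them.
from dataclasses import dataclass

@dataclass
class Professor:
    name: str
    teaching_ability: str   # e.g. "high" / "low"

@dataclass
class Course:
    name: str
    difficulty: str
    instructor: Professor   # link: professors teach courses

@dataclass
class Student:
    name: str
    intelligence: str

@dataclass
class Registration:         # link: a student registered in a course
    student: Student
    course: Course
    grade: str
    satisfaction: str
```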

The two key ideas that come to our rescue derive from the two approaches that we are trying to combine. From relational logic, we have the notion of universal patterns, which hold for all objects in a class. From Bayesian networks, we have the notion of locality of interaction, which in the relational case has a particular twist: links give us precisely a notion of “interaction”, and thereby provide a roadmap for which objects can interact with each other. In this example, we have a template, like a universal quantifier for a probabilistic statement. It tells us: “For any registration record in my database, the grade of the student in the course depends on the intelligence of that student and the difficulty of that course.” This dependency will be instantiated for every object (of the right type) in our domain. It is also associated with a conditional probability distribution that specifies the nature of that dependence. We can also have dependencies over several links, e.g., the dependence of a student’s satisfaction on the teaching ability of the professor who teaches the course.

The semantics of this model is that of a template that generates an appropriate probabilistic model for any domain we might encounter. The instantiation is the embodiment of universals on the one hand, and of locality of interaction, as specified by the links, on the other.
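A minimal sketch of what instantiating the template means (the function, node names, and example objects below are invented for illustration):

```python
def instantiate(registrations):
    """Unroll the PRM template into ground Bayesian-network edges.

    Each registration is a hypothetical (student, course, professor) triple.
    """
    edges = []
    for student, course, professor in registrations:
        grade = f"Grade({student},{course})"
        # The universal pattern, applied to every object of the right type:
        edges.append((f"Intelligence({student})", grade))
        edges.append((f"Difficulty({course})", grade))
        # A dependency across several links: satisfaction depends on the
        # teaching ability of the professor who teaches the course.
        edges.append((f"TeachingAbility({professor})",
                      f"Satisfaction({student},{course})"))
    return edges

print(instantiate([("jane", "cs101", "smith"), ("george", "cs101", "smith")]))
```

Note how the two registrations share the ground nodes Difficulty(cs101) and TeachingAbility(smith), which is exactly how evidence about one student can flow to another.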

This web of influence has interesting ramifications from the perspective of the types of reasoning patterns that it supports. Consider George. A priori, we believe that he is pretty likely to be smart. Evidence about two classes that he took changes our probabilities only very slightly. However, we see that most people who took CS101 got A’s. In fact, even people who did fairly poorly in other classes got an A in CS101. Therefore, we believe that CS101 is probably an easy class. To get a C in an easy class is unlikely for a smart student, so our probability that George is smart goes down substantially.



How do we learn such models from a database? If an expert specifies the dependency graph of the PRM, the task of estimating the parameters is fairly easy. For example, to estimate P(A | high, easy) we simply compute the ratio of A grades received by highly intelligent students in easy classes to all grades received by highly intelligent students in easy classes. Note that we can estimate each conditional probability table independently of the others. However, the variables of interest are often not all observed in the database. How can we estimate parameters in such cases?
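A sketch of this counting estimate on hypothetical records (the data and function name are invented):

```python
# Records of the form (intelligence, difficulty, grade), as pulled from
# a relational database of registrations.
records = [
    ("high", "easy", "A"), ("high", "easy", "A"), ("high", "easy", "B"),
    ("high", "hard", "A"), ("low",  "easy", "B"), ("low",  "hard", "C"),
]

def estimate(grade, intelligence, difficulty):
    """Maximum-likelihood estimate of P(grade | intelligence, difficulty)."""
    matching = [r for r in records if r[0] == intelligence and r[1] == difficulty]
    return sum(r[2] == grade for r in matching) / len(matching)

print(estimate("A", "high", "easy"))  # 2 of 3 matching records -> 0.667
```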
Of course, data is not always so nicely arranged for us as in a relational database. Let us consider the biggest source of data --- the world wide web. Consider the webpages in a computer science department. Here is one webpage, which links to another. This second webpage links to a third, which links back to the first two. There is also a webpage with a lot of outgoing links to webpages on this site. This is not nice clean data. Nobody labels these webpages for us and tells us what they are. We would like to learn to understand this data, and conclude from it that we have a “Professor Tom Mitchell”, one of whose interests is a project called “WebKB”. “Sean Slattery” is one of the students on the project, and Professor Mitchell is his advisor. Finally, Tom Mitchell is a member of the CMU CS faculty, which contains many other faculty members. How do we get from the raw data to this type of analysis?

Let us consider the more limited task of simply recognizing which webpage corresponds to which type of entity. The most standard approach is to classify the webpages into one of several categories --- faculty, student, project, etc. --- using the words in the webpage as features, e.g., using the naïve Bayes model. We can also look at the pages that point to a page, such as those pointing to Tom Mitchell’s, and add meta-words: anchor text, titles, headings, and so on.
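A minimal sketch of this standard approach using scikit-learn’s naïve Bayes (the pages and labels below are toy inventions; a real system would use crawled pages and the meta-word features just mentioned):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pages = [
    "professor research publications teaching office hours",
    "advisor homework student thesis courses taken",
    "project members publications source code download",
]
labels = ["faculty", "student", "project"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(pages)   # bag-of-words counts as features
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vectorizer.transform(["student advisor thesis"])))
# likely ['student']
```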

A more interesting approach is based on the realization that this domain is *relational*. It has objects that are linked to each other, and the links are very meaningful in and of themselves. For example, a student webpage is very likely to point to his advisor’s page, but faculty rarely link to each other, and neither do courses and research projects. Note that we cannot have an edge directly between the categories of the pages: this would cause cycles in the Bayes net. However, we can represent these trends by asserting that the existence of a link depends on the categories of the two pages it connects. This allows us to use, for example, a webpage that we are fairly sure is a student page to give evidence about the fact that a page it points to is a faculty webpage. In fact, they can both give evidence about each other, giving rise to a form of “collective classification”.
Fortunately, our ability to perform general-purpose reasoning, as in these examples, can also be used to bootstrap our learning. How? Let’s consider our university domain. There, no one tells us how bright students are, or how hard classes are; all we have are grades. Assume that our agent starts out with a pretty bad model of our university domain, whether elicited by hand or randomly selected. We can use the model in our standard reasoning algorithm and figure out how likely each student is to be intelligent, and how likely each class is to be difficult. These are not good conclusions, but they give us one possible “filling in” for the data. We can then use that filled-in data, as if it were real, to train a better model, using the standard learning techniques mentioned earlier. We can then reiterate this process.

Somewhat surprisingly, this not only converges, but converges to something that’s actually pretty good. In the end, the model is pretty close, and, even more impressive, the likely values of the intelligence of the students and the difficulty of the classes are exactly right!
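A toy sketch of this fill-in-and-retrain loop (a hard-EM-style iteration; the data, names, and model form are invented for illustration, not the talk’s actual algorithm):

```python
import random

grades = {  # observed data: student -> grades in several classes
    "alice": ["A", "A", "B"],
    "bob":   ["C", "B", "C"],
    "carol": ["A", "B", "A"],
}
GRADES = ["A", "B", "C"]
random.seed(0)

def normalize(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# Start from a bad (random) model: P(grade | intelligence).
model = {t: normalize({g: random.random() for g in GRADES})
         for t in ("smart", "average")}

for _ in range(10):
    # "E-step": fill in each student's hidden intelligence with the value
    # most likely under the current model.
    def likelihood(t, gs):
        p = 1.0
        for g in gs:
            p *= model[t][g]
        return p
    filled = {s: max(("smart", "average"), key=lambda t: likelihood(t, gs))
              for s, gs in grades.items()}
    # "M-step": retrain the model on the filled-in data, as if it were real.
    for t in ("smart", "average"):
        counts = {g: 1.0 for g in GRADES}  # add-one smoothing
        for s, gs in grades.items():
            if filled[s] == t:
                for g in gs:
                    counts[g] += 1
        model[t] = normalize(counts)

print(filled)  # with this seed: alice and carol cluster together, bob apart
```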

We first introduce a type variable for movies that will influence all of a movie’s attributes. For actors and directors, IMDB does not provide attributes. However, we would like to cluster actors and directors jointly, so we introduce a hidden type variable for actors and directors and express the intuition that they directly influence the type of movie.

The learning algorithm automatically
discovered coherent categories in actors, e.g., “tough guys” and serious drama
actors. We had no attributes for
directors and actors, so it only discovered these categories by seeing how the
objects interact with other objects in the domain.


One key difference between Bayesian networks and Markov networks is that in Bayesian networks we model the dependence of each variable on others, while in Markov networks we model the interactions of sets of variables. This buys us freedom from the acyclicity constraint, at the expense of parameters that are harder to interpret and, unlike CPTs, not independent of one another. It also handles symmetric interactions, where cause and effect cannot be identified.
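A minimal sketch of an undirected factor (numbers invented): probabilities come from multiplying symmetric factors and normalizing by a partition function, which is also why the parameters are coupled:

```python
from itertools import product

# One pairwise factor over binary (A, B): high value = the pair "prefers"
# to agree. The factor is symmetric; no cause or effect is identified.
phi = {(0, 0): 10.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 10.0}

def p(a, b):
    # Unlike a CPT, changing any single factor entry changes Z, and hence
    # every probability in the model: the parameters are not independent.
    Z = sum(phi[x] for x in product([0, 1], repeat=2))  # partition function
    return phi[a, b] / Z

print(p(1, 1))  # 10/22 ~ 0.45
```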

Let’s get back to the task of classifying web pages.

Now, what we really need is a model that predicts categories from words. We can learn such a model; it can be viewed as a conditional Markov net, more commonly known as generalized logistic regression.
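A sketch of such a conditional model using scikit-learn’s multinomial logistic regression, on the same kind of toy data as before (invented examples):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pages = [
    "professor research publications teaching office hours",
    "advisor homework student thesis courses taken",
    "project members publications source code download",
]
labels = ["faculty", "student", "project"]

# Word counts -> multinomial logistic regression over categories.
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(pages, labels)

print(clf.predict(["research teaching publications"]))  # likely ['faculty']
```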
We would also like to consider patterns
consisting of more than just a single link.
For example, page structure has very strong organizational patterns. Consider a web page of



Aggregate information from several sources

Accumulate (probabilistic) profiles of entities and relations

Build a global model of a corpus to help disambiguate passages

Let’s look at a simple problem in
language understanding.

What kind of reasoning did we use to
disambiguate this sentence? We are just
using commonsense patterns about objects, their properties, and the links
between them. But it is all
probabilistic. If we had an appropriate
probabilistic relational knowledge base, perhaps we could use web of influence
reasoning to reach similar conclusions.

This idea, of using a relational probabilistic language for disambiguation, was first proposed by Goldman and Charniak. They got stuck in a place where AI often gets stuck --- knowledge acquisition. Where do these models come from? Well, perhaps learning can come to our rescue. Current models are at the syntactic level, e.g., word-word correlations. These contain a surprising amount of semantic information, but that information is at the syntactic level, which substantially limits its applicability. As a simple example, although we can learn that a “teacher” is often in the same sentence as “train”, and we can use that to disambiguate some sentences, we won’t be able to use it in contexts where we have “Sarah” in a sentence, who we happen to know is a teacher from previous sentences.