
A somewhat more transparent representation is based on attributes, or propositions. Here, the representation makes an explicit commitment that the world is described by a set of attributes or (in the binary case) propositional symbols. The knowledge base tells us how these different symbols interact with each other. Semantically, each possible world that the agent considers is an assignment of values to these attributes. This more transparent representation gives us a lot more power. For example, using logical inference techniques we can often conclude that certain propositions must take a specific truth value, without enumerating the entire space of truth assignments.
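To make this concrete, here is a minimal sketch (my illustration, not from the talk) of one such inference technique, unit propagation, which derives forced truth values without enumerating all 2^n assignments:

```python
# A minimal sketch: possible worlds are truth assignments, and unit
# propagation over CNF clauses fixes some symbols' values directly.

def unit_propagate(clauses):
    """Repeatedly apply the unit-clause rule; return forced literals."""
    forced = {}
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unknown = []
            satisfied = False
            for lit in clause:
                var, val = lit.lstrip("~"), not lit.startswith("~")
                if var in forced:
                    if forced[var] == val:  # clause already satisfied
                        satisfied = True
                        break
                else:
                    unknown.append((var, val))
            if not satisfied and len(unknown) == 1:
                var, val = unknown[0]   # only one way left to satisfy it
                forced[var] = val
                changed = True
    return forced

# "smart or lucky" and "not lucky" force "smart" to be true:
print(unit_propagate([["smart", "lucky"], ["~lucky"]]))
# {'lucky': False, 'smart': True}
```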
The reason for the tremendous representational inefficiency of propositional logic is that it does not contain the concept of an object. When we think about the world, we think in terms of objects --- people, cars, cities, even talks --- viewed as separate entities. Each of these has its own properties, or attributes, and they are all linked to each other via a complex network of interactions. A yet more transparent representation is relational logic. In this case, the possible worlds specify a set of objects, their properties, and the relations between them. By making objects explicit, we can now generalize over them, specifying universal patterns that apply to all objects. This allows us to state many things at once, and provides a much more compact representation. In particular, the rules of chess now require only about 100 axioms.

In a categorical representation, each assertion must fall into one of three categories: necessarily true, impossible, or “who knows”. This forces us into a very crude approximation of our knowledge. We can either write down a meaningful universal, which is typically false, or a true universal, which is typically useless. We need something that expresses the finer shades of possibility and probability.

One particularly successful choice is provided by the probabilistic framework. Here, as in the categorical case, our semantics is based on a set of possible worlds. For example, here our worlds correspond to the difficulty of the course, the intelligence of the student, and the grade the student received in the course. Note that one of our worlds has a smart student getting a C in an easy class.

But, unlike in the categorical case, we can also ascribe likelihoods to the worlds, associating each with a probability. This allows us to make it improbable, but not impossible, for a smart student to get a C.
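As a hedged illustration (the numbers below are invented), here is what such a possible-worlds semantics looks like in code:

```python
# Worlds assign values to (difficulty, intelligence, grade); probabilities
# make a smart student getting a C in an easy class improbable, not impossible.

worlds = {
    ("easy", "smart",   "A"): 0.30,
    ("easy", "smart",   "B"): 0.08,
    ("easy", "smart",   "C"): 0.02,   # improbable, but not impossible
    ("easy", "average", "A"): 0.10,
    ("easy", "average", "B"): 0.10,
    ("easy", "average", "C"): 0.05,
    ("hard", "smart",   "A"): 0.10,
    ("hard", "smart",   "B"): 0.07,
    ("hard", "smart",   "C"): 0.03,
    ("hard", "average", "A"): 0.03,
    ("hard", "average", "B"): 0.05,
    ("hard", "average", "C"): 0.07,
}
assert abs(sum(worlds.values()) - 1.0) < 1e-9

def prob(pred):
    """Probability of the event defined by the predicate over worlds."""
    return sum(p for w, p in worlds.items() if pred(w))

# P(grade = C | easy class, smart student): small but nonzero.
num = prob(lambda w: w == ("easy", "smart", "C"))
den = prob(lambda w: w[0] == "easy" and w[1] == "smart")
print(num / den)  # 0.05
```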

Many of the standard methods can be categorized along this axis. But there is also a second important design decision: how do we represent the epistemic state of the agent, its beliefs about the world? Two common choices are categorical, or deterministic, beliefs, and probabilistic beliefs.

We will look at two classes of probabilistic graphical models that represent complex distributions compactly and allow us to reason efficiently. First, we will look at directed models, called Bayesian networks, and their extension to relational data, PRMs. We’ll then consider undirected models, such as Markov nets and relational Markov nets.

One of the key benefits of the propositional representation is our ability to represent our knowledge without explicit enumeration of the worlds. It turns out that we can do the same in the probabilistic framework. The key idea, introduced by Pearl in the Bayesian network framework, is to use locality of interaction: we represent the interactions of the variables using a graph structure, and for each variable we select as parents the variables that directly influence it. This is an assumption which seems to be a fairly good approximation of the world in many cases. Each CPT is independent of the others, so changing values locally does not affect the coherence of the model.
The product of the CPTs represents the full joint distribution over the variables. This distribution specifies answers to queries of the form P(X|E). The encoded distribution supports very subtle reasoning patterns. Suppose we find out that the letter was positive: the probability of a high grade goes up, increasing the probability of high intelligence and decreasing the probability that the class is difficult. However, if we later find out that the student has a high SAT score, the probability of high intelligence goes up further, and now the probability that the class is difficult increases.
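This reasoning pattern can be reproduced by brute-force enumeration over the joint. The sketch below uses illustrative CPT numbers (the classic student-network values, not necessarily the talk's tables), and the variable names are my own:

```python
# Variables: D(ifficulty), I(ntelligence), G(rade), S(AT), L(etter).
from itertools import product

P_D = {"easy": 0.6, "hard": 0.4}
P_I = {"low": 0.7, "high": 0.3}
P_G = {  # P(G | I, D)
    ("low",  "easy"): {"A": 0.30, "B": 0.40, "C": 0.30},
    ("low",  "hard"): {"A": 0.05, "B": 0.25, "C": 0.70},
    ("high", "easy"): {"A": 0.90, "B": 0.08, "C": 0.02},
    ("high", "hard"): {"A": 0.50, "B": 0.30, "C": 0.20},
}
P_S = {"low": 0.05, "high": 0.8}        # P(SAT = high | I)
P_L = {"A": 0.9, "B": 0.6, "C": 0.01}   # P(Letter = positive | G)

def joint(d, i, g, s, l):
    """Product of the CPTs = the full joint distribution."""
    ps = P_S[i] if s else 1 - P_S[i]
    pl = P_L[g] if l else 1 - P_L[g]
    return P_D[d] * P_I[i] * P_G[i, d][g] * ps * pl

def query(target, evidence):
    """P(target | evidence) by summing the joint over all worlds."""
    num = den = 0.0
    for d, i, g, s, l in product(P_D, P_I, "ABC", [True, False], [True, False]):
        w = {"D": d, "I": i, "G": g, "S": s, "L": l}
        p = joint(d, i, g, s, l)
        if all(w[k] == v for k, v in evidence.items()):
            den += p
            if all(w[k] == v for k, v in target.items()):
                num += p
    return num / den

# Positive letter: intelligence goes up, difficulty goes down.
print(query({"I": "high"}, {"L": True}))             # ~0.46 (prior: 0.30)
print(query({"D": "hard"}, {"L": True}))             # ~0.26 (prior: 0.40)
# A high SAT score explains the grade: difficulty goes back up.
print(query({"I": "high"}, {"L": True, "S": True}))  # ~0.93
print(query({"D": "hard"}, {"L": True, "S": True}))  # ~0.32
```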



One of the key benefits of the propositional representation was our ability to represent our knowledge without explicit enumeration of the worlds. It turns out that we can do the same in the probabilistic framework: the key idea, introduced by Pearl in the Bayesian network framework, is locality of interaction, an assumption which seems to be a fairly good approximation of the world in many cases. However, this representation suffers from the same problem as other propositional representations. We have to create separate representational units (propositions) for the different entities in our domain, and the problem is that these instances are not independent. For example, the difficulty of CS101 in one network is the same difficulty as in another, so evidence in one network should influence our beliefs in another.

Let us consider an imaginary university
called St. Nordaf. St. Nordaf has two
faculty, two students, two courses, and three registrations of students in
courses, each of which is associated with a registration record. These objects are linked to each other:
professors teach classes, students register in classes, etc. Each of the objects in the domain has
properties that we care about.

The university can be described in an organized form using a relational database. The schema of the database tells us what types of objects we have, what attributes of these objects are of interest, and how the objects can relate to each other.
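As a hypothetical rendering of such a schema (the class and attribute names below are invented, and the talk shows this as a relational database rather than code):

```python
# A sketch of the St. Nordaf schema: object types, their attributes of
# interest, and the links between them.
from dataclasses import dataclass

@dataclass
class Professor:
    name: str
    teaching_ability: str   # e.g. "high" / "low"

@dataclass
class Course:
    name: str
    difficulty: str
    instructor: Professor   # link: professors teach courses

@dataclass
class Student:
    name: str
    intelligence: str

@dataclass
class Registration:         # link: a student registered in a course
    student: Student
    course: Course
    grade: str
    satisfaction: str
```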

The two key ideas that come to our rescue derive from the two approaches that we are trying to combine. From relational logic, we have the notion of universal patterns, which hold for all objects in a class. From Bayesian networks, we have the notion of locality of interaction, which in the relational case has a particular twist: links give us precisely a notion of “interaction”, and thereby provide a roadmap for which objects can interact with each other. In this example, we have a template, like a universal quantifier for a probabilistic statement. It tells us: “For any registration record in my database, the grade of the student in the course depends on the intelligence of that student and the difficulty of that course.” This dependency will be instantiated for every object (of the right type) in our domain. It is also associated with a conditional probability distribution that specifies the nature of that dependence. We can also have dependencies over several links, e.g., the dependence of a student’s satisfaction on the teaching ability of the professor who teaches the course.

The semantics of this model is that of a template that generates an appropriate probabilistic model for any domain we might encounter. The instantiation is the embodiment of universals on the one hand, and of locality of interaction, as specified by the links, on the other.
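A minimal sketch of what instantiating the template means (the function, node names, and example objects below are invented for illustration):

```python
def instantiate(registrations):
    """Unroll the PRM template into ground Bayesian-network edges.

    Each registration is a hypothetical (student, course, professor) triple.
    """
    edges = []
    for student, course, professor in registrations:
        grade = f"Grade({student},{course})"
        # The universal pattern, applied to every object of the right type:
        edges.append((f"Intelligence({student})", grade))
        edges.append((f"Difficulty({course})", grade))
        # A dependency across several links: satisfaction depends on the
        # teaching ability of the professor who teaches the course.
        edges.append((f"TeachingAbility({professor})",
                      f"Satisfaction({student},{course})"))
    return edges

print(instantiate([("jane", "cs101", "smith"), ("george", "cs101", "smith")]))
```

Note how the two registrations share the ground nodes Difficulty(cs101) and TeachingAbility(smith), which is exactly how evidence about one student can flow to another.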

This web of influence has interesting ramifications from the perspective of the types of reasoning patterns that it supports. Consider George. A priori, we believe that he is pretty likely to be smart. Evidence about two classes that he took changes our probabilities only very slightly. However, we see that most people who took CS101 got A’s. In fact, even people who did fairly poorly in other classes got an A in CS101. Therefore, we believe that CS101 is probably an easy class. To get a C in an easy class is unlikely for a smart student, so our probability that George is smart goes down substantially.



How do we learn such models from a database? If an expert specifies the dependency graph of the PRM, the task of estimating the parameters is fairly easy. For example, to estimate P(A | high, easy) we simply compute the ratio of A grades received by highly intelligent students in easy classes to all grades received by highly intelligent students in easy classes. Note that we can estimate each conditional probability table independently of the others. However, the variables of interest are often not all observed in the database. How can we estimate parameters in such cases?
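A sketch of this counting estimate on hypothetical records (the data and function name are invented):

```python
# Records of the form (intelligence, difficulty, grade), as pulled from
# a relational database of registrations.
records = [
    ("high", "easy", "A"), ("high", "easy", "A"), ("high", "easy", "B"),
    ("high", "hard", "A"), ("low",  "easy", "B"), ("low",  "hard", "C"),
]

def estimate(grade, intelligence, difficulty):
    """Maximum-likelihood estimate of P(grade | intelligence, difficulty)."""
    matching = [r for r in records if r[0] == intelligence and r[1] == difficulty]
    return sum(r[2] == grade for r in matching) / len(matching)

print(estimate("A", "high", "easy"))  # 2 of 3 matching records -> 0.667
```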
Of course, data is not always so nicely arranged for us as in a relational database. Let us consider the biggest source of data --- the world wide web. Consider the webpages in a computer science department. Here is one webpage, which links to another. This second webpage links to a third, which links back to the first two. There is also a webpage with a lot of outgoing links to webpages on this site. This is not nice clean data. Nobody labels these webpages for us and tells us what they are. We would like to learn to understand this data, and conclude from it that we have a “Professor Tom Mitchell”, one of whose interests is a project called “WebKB”. “Sean Slattery” is one of the students on the project, and Professor Mitchell is his advisor. Finally, Tom Mitchell is a member of the CMU CS faculty, which contains many other faculty members. How do we get from the raw data to this type of analysis?

Let us consider the more limited task of simply recognizing which webpage corresponds to which type of entity. The most standard approach is to classify the webpages into one of several categories --- faculty, student, project, etc. --- using the words in the webpage as features, e.g., using the naïve Bayes model. We can also look at the pages that point to a page, such as those pointing to Tom Mitchell’s, and add meta-words: anchor text, titles, headings, and so on.
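A minimal sketch of this standard approach using scikit-learn’s naïve Bayes (the pages and labels below are toy inventions; a real system would use crawled pages and the meta-word features just mentioned):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pages = [
    "professor research publications teaching office hours",
    "advisor homework student thesis courses taken",
    "project members publications source code download",
]
labels = ["faculty", "student", "project"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(pages)   # bag-of-words counts as features
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vectorizer.transform(["student advisor thesis"])))
# likely ['student']
```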

A more interesting approach is based on the realization that this domain is *relational*. It has objects that are linked to each other, and the links are very meaningful in and of themselves. For example, a student webpage is very likely to point to his advisor’s page, but faculty rarely link to each other, and neither do courses and research projects. Note that we cannot have an edge directly between the categories of the pages: this would cause cycles in the Bayes net. However, we can represent these trends by asserting that the existence of a link depends on the categories of the two pages it connects. This allows us to use, for example, a webpage that we are fairly sure is a student page to give evidence about the fact that a page it points to is a faculty webpage. In fact, they can both give evidence about each other, giving rise to a form of “collective classification”.
Fortunately, our ability to perform general-purpose reasoning, as in these examples, can also be used to bootstrap our learning. How? Let’s consider our university domain. There, no one tells us how bright students are, or how hard classes are; all we have are grades. Assume that our agent starts out with a pretty bad model of our university domain, whether elicited by hand or randomly selected. We can use the model in our standard reasoning algorithm and figure out how likely each student is to be intelligent, and how likely each class is to be difficult. These are not good conclusions, but they give us one possible “filling in” for the data. We can then use that filled-in data, as if it were real, to train a better model, using the standard learning techniques mentioned earlier. We can then reiterate this process.

Somewhat surprisingly, this not only converges, but converges to something that’s actually pretty good. In the end, the model is pretty close, and, even more impressive, the likely values of the intelligence of the students and the difficulty of the classes are exactly right!
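A toy sketch of this fill-in-and-retrain loop (a hard-EM-style iteration; the data, names, and model form are invented for illustration, not the talk’s actual algorithm):

```python
import random

grades = {  # observed data: student -> grades in several classes
    "alice": ["A", "A", "B"],
    "bob":   ["C", "B", "C"],
    "carol": ["A", "B", "A"],
}
GRADES = ["A", "B", "C"]
random.seed(0)

def normalize(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# Start from a bad (random) model: P(grade | intelligence).
model = {t: normalize({g: random.random() for g in GRADES})
         for t in ("smart", "average")}

for _ in range(10):
    # "E-step": fill in each student's hidden intelligence with the value
    # most likely under the current model.
    def likelihood(t, gs):
        p = 1.0
        for g in gs:
            p *= model[t][g]
        return p
    filled = {s: max(("smart", "average"), key=lambda t: likelihood(t, gs))
              for s, gs in grades.items()}
    # "M-step": retrain the model on the filled-in data, as if it were real.
    for t in ("smart", "average"):
        counts = {g: 1.0 for g in GRADES}  # add-one smoothing
        for s, gs in grades.items():
            if filled[s] == t:
                for g in gs:
                    counts[g] += 1
        model[t] = normalize(counts)

print(filled)  # with this seed: alice and carol cluster together, bob apart
```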

We first introduce a type variable for movies that will influence all of a movie’s attributes. For actors and directors, IMDB does not provide attributes. However, we would like to cluster actors and directors jointly, so we introduce a hidden type variable for actors and directors and express the intuition that they directly influence the type of movie.

The learning algorithm automatically
discovered coherent categories in actors, e.g., “tough guys” and serious drama
actors. We had no attributes for
directors and actors, so it only discovered these categories by seeing how the
objects interact with other objects in the domain.


One key difference between Bayesian networks and Markov networks is that in Bayesian networks we model the dependence of each variable on others, while in Markov networks we model the interactions of sets of variables. This buys us freedom from the acyclicity constraint, at the expense of parameters that are harder to interpret and, unlike CPTs, not independent of one another. It also handles symmetric interactions, where cause and effect cannot be identified.
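A minimal sketch of an undirected factor (numbers invented): probabilities come from multiplying symmetric factors and normalizing by a partition function, which is also why the parameters are coupled:

```python
from itertools import product

# One pairwise factor over binary (A, B): high value = the pair "prefers"
# to agree. The factor is symmetric; no cause or effect is identified.
phi = {(0, 0): 10.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 10.0}

def p(a, b):
    # Unlike a CPT, changing any single factor entry changes Z, and hence
    # every probability in the model: the parameters are not independent.
    Z = sum(phi[x] for x in product([0, 1], repeat=2))  # partition function
    return phi[a, b] / Z

print(p(1, 1))  # 10/22 ~ 0.45
```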

Let’s get back to the task of classifying web pages.

Now, what we really need is a model that predicts categories from words. We can learn such a model; it can be viewed as a conditional Markov net, more commonly known as generalized logistic regression.
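A sketch of such a conditional model using scikit-learn’s multinomial logistic regression, on the same kind of toy data as before (invented examples):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pages = [
    "professor research publications teaching office hours",
    "advisor homework student thesis courses taken",
    "project members publications source code download",
]
labels = ["faculty", "student", "project"]

# Word counts -> multinomial logistic regression over categories.
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(pages, labels)

print(clf.predict(["research teaching publications"]))  # likely ['faculty']
```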
We would also like to consider patterns
consisting of more than just a single link.
For example, page structure has very strong organizational patterns. Consider a web page of



Aggregate information from several sources

Accumulate (probabilistic) profiles of entities and relations

Build a global model of a corpus to help disambiguate passages

Let’s look at a simple problem in
language understanding.

What kind of reasoning did we use to
disambiguate this sentence? We are just
using commonsense patterns about objects, their properties, and the links
between them. But it is all
probabilistic. If we had an appropriate
probabilistic relational knowledge base, perhaps we could use web of influence
reasoning to reach similar conclusions.

This idea, of using a relational probabilistic language for disambiguation, was first proposed by Goldman and Charniak. They got stuck in a place where AI often gets stuck --- knowledge acquisition. Where do these models come from? Well, perhaps learning can come to our rescue. Current models are at the syntactic level, e.g., word-word correlations. These contain a surprising amount of semantic information, but that information is at the syntactic level, which substantially limits its applicability. As a simple example, although we can learn that a “teacher” is often in the same sentence as “train”, and we can use that to disambiguate some sentences, we won’t be able to use it in contexts where we have “Sarah” in a sentence, who we happen to know is a teacher from previous sentences.