To appear, Fifteenth National Conf. on Artificial Intelligence (AAAI), 1998.
A central problem in learning in complex environments is balancing exploration
of untested actions against exploitation of actions that are known to be good.
The benefit of exploration can be estimated using the classical notion of Value of
Information---the expected improvement in future decision quality that might arise
from the information acquired by exploration. Estimating this quantity requires an
assessment of the agent's uncertainty about its current value estimates for states. In
this paper, we adopt a Bayesian approach to maintaining this uncertain information. We
extend Watkins' Q-learning by maintaining and propagating probability distributions over
the Q-values. These distributions are used to compute a myopic approximation to the value
of information for each action and hence to select the action that best balances
exploration and exploitation. We establish the convergence properties of our algorithm and
show experimentally that it can exhibit substantial improvements over other well-known
model-free exploration strategies.
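
As a rough illustration of the action-selection idea described in the abstract, the sketch below keeps an independent Gaussian belief over each Q-value and scores actions by their posterior mean plus a myopic value-of-information term. The Gaussian belief, the closed-form truncated-Gaussian gain, and all names in the code are illustrative assumptions for this sketch, not the paper's exact construction.

```python
import math
from dataclasses import dataclass

@dataclass
class QBelief:
    """Gaussian belief over one Q(s, a) value (an illustrative assumption)."""
    mean: float = 0.0       # current estimate of the Q-value
    var: float = 1.0        # uncertainty about that estimate
    obs_noise: float = 1.0  # assumed variance of each observed target

    def update(self, target: float) -> None:
        # Conjugate Gaussian update given one Q-learning target
        # (reward + discounted value of the next state).
        k = self.var / (self.var + self.obs_noise)
        self.mean += k * (target - self.mean)
        self.var *= (1.0 - k)

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def myopic_voi(beliefs, a):
    # Myopic value of learning the true Q-value of action `a`: the new
    # information only improves the decision if a non-best action turns out
    # better than the current best, or the current best turns out worse than
    # the runner-up.  Uses the expectation of a truncated Gaussian, which
    # matches the Gaussian belief assumed above.
    means = [b.mean for b in beliefs]
    best = max(range(len(beliefs)), key=lambda i: means[i])
    runner_up = max((m for i, m in enumerate(means) if i != best),
                    default=means[best])
    b = beliefs[a]
    sigma = math.sqrt(b.var)
    if a == best:
        # Gain if Q(s, best) actually lies below the runner-up:
        # E[(runner_up - X)^+] for X ~ N(mean, sigma^2).
        t = runner_up
        return (t - b.mean) * normal_cdf(t, b.mean, sigma) \
               + sigma ** 2 * normal_pdf(t, b.mean, sigma)
    else:
        # Gain if Q(s, a) actually lies above the current best:
        # E[(X - best)^+] for X ~ N(mean, sigma^2).
        t = means[best]
        return (b.mean - t) * (1.0 - normal_cdf(t, b.mean, sigma)) \
               + sigma ** 2 * normal_pdf(t, b.mean, sigma)

def select_action(beliefs):
    # Choose the action maximizing expected Q-value plus myopic VOI,
    # trading off exploitation (mean) against exploration (information gain).
    return max(range(len(beliefs)),
               key=lambda a: beliefs[a].mean + myopic_voi(beliefs, a))
```

A typical use would keep one list of `QBelief` objects per state, call `select_action` to act, and call `update` on the chosen action's belief with the observed Q-learning target.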