When the Best Move Isn't Optimal: Q-learning with Exploration

George John
Computer Science Dept., Stanford University
Stanford, CA 94305
gjohn@cs.stanford.edu

The most popular delayed reinforcement learning technique, Q-learning (Watkins 1989), estimates the future reward expected from executing each action in every state. If these estimates are correct, an agent can use them to select the action with maximal expected future reward in each state, and thus perform optimally. Watkins proved that Q-learning produces an optimal policy (the function mapping states to actions) and that these estimates converge to the correct values given the optimal policy. However, the agent often does not follow the optimal policy faithfully -- it must also explore the world, taking suboptimal actions in order to learn more about its environment. The "optimal" policy produced by Q-learning is no longer optimal if its prescriptions are not always followed. In many situations (e.g., dynamic environments), the agent never stops exploring. In such domains Q-learning converges to policies that are suboptimal in the sense that there exists a different policy which would achieve higher reward when combined with exploration. We present a new algorithm, a very small modification of Q-learning, which finds policies that achieve greater reward given a particular exploration strategy.

Citation: John, George (1994). When the Best Move Isn't Optimal: Q-learning with Exploration. Proceedings of the National Conference on Artificial Intelligence (AAAI-94), p. 1464. AAAI/MIT Press.

Note that the paper is a "Student Abstract," and is only one page long.
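
For readers unfamiliar with the setup the abstract describes, the sketch below shows standard tabular Q-learning combined with epsilon-greedy exploration. The environment interface (reset/step), hyperparameters, and helper names are illustrative assumptions, not taken from the paper, and the paper's proposed modification is not reproduced here. The point of the sketch is the mismatch the abstract identifies: the update bootstraps from the greedy (max) action's value, even though the behavior policy sometimes explores.

    # Minimal sketch: tabular Q-learning with epsilon-greedy exploration.
    # The env interface and hyperparameters are illustrative assumptions;
    # this is NOT the modification proposed in the paper.
    import random
    from collections import defaultdict

    def epsilon_greedy(Q, state, actions, epsilon):
        """With probability epsilon explore uniformly; otherwise act greedily on Q."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
        """Standard (off-policy) Q-learning: the target uses the max over actions,
        ignoring that the agent will occasionally take non-greedy, exploratory
        actions in the next state."""
        Q = defaultdict(float)
        for _ in range(episodes):
            state = env.reset()
            done = False
            while not done:
                action = epsilon_greedy(Q, state, actions, epsilon)
                next_state, reward, done = env.step(action)
                best_next = max(Q[(next_state, a)] for a in actions)
                target = reward + (0.0 if done else gamma * best_next)
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state = next_state
        return Q

If exploration never stops, one could instead evaluate actions under the exploration policy itself (for example, bootstrapping from the value of the action the epsilon-greedy policy would actually take); whether this matches the paper's specific modification cannot be determined from the abstract alone.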