When the Best Move Isn't Optimal: Q-learning with Exploration

George H. John
Computer Science Department
Stanford University
Stanford, CA 94305-4110
gjohn@cs.stanford.edu
http://robotics.stanford.edu/~gjohn/

In delayed reinforcement learning, an agent faces the problem of discovering an optimal policy, a function mapping states to actions. The most popular delayed reinforcement learning technique, Q-learning, has been proven to produce an optimal policy under certain conditions. However, the agent often does not follow the optimal policy faithfully -- it must also explore the world. The optimal policy produced by Q-learning is no longer optimal if its prescriptions are followed only occasionally. In many situations (e.g., dynamic environments), the agent never stops exploring. In such domains Q-learning converges to policies that are suboptimal when combined with the exploration policy, and we give several examples of this problem. We present Q̄-learning, a slight modification of Q-learning that yields a policy resulting in higher reward when combined with a particular exploration strategy.

NOT FOR CITATION. Submitted to NIPS*95. The author is preparing a technical note on this work.
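To make the setting concrete, the following is a minimal sketch of standard tabular Q-learning with epsilon-greedy exploration, the situation the abstract describes: the update rule bootstraps from the greedy action, while the executed policy keeps exploring. The environment interface (`reset`, `step`, `actions`) and the hyperparameter values are illustrative assumptions, not taken from the paper.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.

    `env` is assumed to expose `reset() -> state`,
    `step(action) -> (next_state, reward, done)`, and a list `actions`;
    these names are placeholders, not part of the paper's formulation.
    """
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return

    def greedy(state):
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: the agent follows the greedy policy only with
            # probability 1 - epsilon, so the policy actually executed
            # differs from the greedy policy that Q-learning evaluates.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = greedy(state)

            next_state, reward, done = env.step(action)

            # Standard Q-learning backup: bootstraps from the best action in
            # the next state, regardless of what the exploring agent will
            # actually do there.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state

    # Greedy policy extracted from Q; optimal only if exploration stops.
    return {s: greedy(s) for s, _ in Q}
```

If the agent keeps exploring (epsilon stays above zero), the greedy policy returned at the end is not the policy that maximizes the reward actually collected, which is the gap the abstract identifies.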