When the Best Move Isn't Optimal: Q-learning with Exploration

George H. John
Computer Science Department
Stanford University
Stanford, CA 94305-4110
gjohn@cs.stanford.edu
http://robotics.stanford.edu/~gjohn/

In delayed reinforcement learning, an agent faces the problem of discovering an optimal policy, a function mapping states to actions. The most popular delayed reinforcement learning technique, Q-learning, has been proven to produce an optimal policy under certain conditions. However, the agent often does not follow the optimal policy faithfully -- it must also explore the world. The optimal policy produced by Q-learning is no longer optimal if its prescriptions are followed only occasionally. In many situations (e.g., dynamic environments), the agent never stops exploring. In such domains Q-learning converges to policies that are suboptimal when combined with the exploration policy, and we give several examples of this problem. We present Q̄-learning, a slight modification of Q-learning that yields a policy resulting in higher reward when combined with a particular exploration strategy.

NOT FOR CITATION. Submitted to NIPS*95. The author is preparing a technical note on this work.
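To make the setting concrete, the following is a minimal sketch of standard tabular Q-learning with epsilon-greedy exploration, the situation the abstract describes: the update rule bootstraps from the greedy action, while the executed policy keeps exploring. The environment interface (`reset`, `step`, `actions`) and the hyperparameter values are illustrative assumptions, not taken from the paper.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.

    `env` is assumed to expose `reset() -> state`,
    `step(action) -> (next_state, reward, done)`, and a list `actions`;
    these names are placeholders, not part of the paper's formulation.
    """
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return

    def greedy(state):
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: the agent follows the greedy policy only with
            # probability 1 - epsilon, so the policy actually executed
            # differs from the greedy policy that Q-learning evaluates.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = greedy(state)

            next_state, reward, done = env.step(action)

            # Standard Q-learning backup: bootstraps from the best action in
            # the next state, regardless of what the exploring agent will
            # actually do there.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state

    # Greedy policy extracted from Q; optimal only if exploration stops.
    return {s: greedy(s) for s, _ in Q}
```

If the agent keeps exploring (epsilon stays above zero), the greedy policy returned at the end is not the policy that maximizes the reward actually collected, which is the gap the abstract identifies.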