When the Best Move Isn't Optimal: Q-learning with Exploration

George John
Computer Science Dept., Stanford University
Stanford, CA 94305
gjohn@cs.stanford.edu

The most popular delayed reinforcement learning technique, Q-learning (Watkins 1989), estimates the future reward expected from executing each action in every state. If these estimates are correct, an agent can use them to select the action with maximal expected future reward in each state, and thus perform optimally. Watkins proved that Q-learning produces an optimal policy (the function mapping states to actions) and that these estimates converge to the correct values given the optimal policy. However, the agent often does not follow the optimal policy faithfully -- it must also explore the world, taking suboptimal actions in order to learn more about its environment. The "optimal" policy produced by Q-learning is no longer optimal if its prescriptions are not always followed. In many situations (e.g., dynamic environments), the agent never stops exploring. In such domains Q-learning converges to policies that are suboptimal in the sense that there exists a different policy which would achieve higher reward when combined with exploration. We present a new algorithm, a very small modification of Q-learning, which finds policies that achieve greater reward given a particular exploration strategy.

Citation: John, George (1994). When the Best Move Isn't Optimal: Q-learning with Exploration. Proceedings of the National Conference on Artificial Intelligence (AAAI-94), p. 1464. AAAI/MIT Press.

Note that the paper is a "Student Abstract," and is only one page long.
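
For readers unfamiliar with the setup the abstract describes, the sketch below shows standard tabular Q-learning combined with epsilon-greedy exploration. The environment interface (reset/step), hyperparameters, and helper names are illustrative assumptions, not taken from the paper, and the paper's proposed modification is not reproduced here. The point of the sketch is the mismatch the abstract identifies: the update bootstraps from the greedy (max) action's value, even though the behavior policy sometimes explores.

    # Minimal sketch: tabular Q-learning with epsilon-greedy exploration.
    # The env interface and hyperparameters are illustrative assumptions;
    # this is NOT the modification proposed in the paper.
    import random
    from collections import defaultdict

    def epsilon_greedy(Q, state, actions, epsilon):
        """With probability epsilon explore uniformly; otherwise act greedily on Q."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
        """Standard (off-policy) Q-learning: the target uses the max over actions,
        ignoring that the agent will occasionally take non-greedy, exploratory
        actions in the next state."""
        Q = defaultdict(float)
        for _ in range(episodes):
            state = env.reset()
            done = False
            while not done:
                action = epsilon_greedy(Q, state, actions, epsilon)
                next_state, reward, done = env.step(action)
                best_next = max(Q[(next_state, a)] for a in actions)
                target = reward + (0.0 if done else gamma * best_next)
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state = next_state
        return Q

If exploration never stops, one could instead evaluate actions under the exploration policy itself (for example, bootstrapping from the value of the action the epsilon-greedy policy would actually take); whether this matches the paper's specific modification cannot be determined from the abstract alone.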