In a previous post, we have learnt to use the OpenAI Gym library to implement a very simple decision policy (either random or hard-coded). I recommend that you first go through the first post before reading this one. In this post, we will learn how to use a more sophisticated reinforcement learning algorithm : Q-Learning.

This post is inspired from Thomas Simonini’s page.

Q-Learning consists in learning what are the series of actions to take that maximise the expected cumulative reward. We will illustrate the concept of Q-Learning using the Taxi problem from the OpenAI Gym library. Q-learning works better when we have a discrete number of states (such as the Taxi problem), we will explain later how to apply it to problems with continuous states such as the Cart Pole problem.

The Taxi problem

The taxi problem consists of a 5-by-5 grid where a taxi can move. The goal is to pick up a passenger at one of the 4 possible locations and to drop him off in another.

**The rules**

*There are 4 designated locations on the grid that are indicated by a letter: R(ed), B(lue), G(reen), and Y(ellow). When the episode starts, the taxi starts off at a random square and the passenger is at a random location. The taxi drive to the passenger’s location, pick up the passenger, drive to the passenger’s destination (another one of the four specified locations), and then drop off the passenger. Once the passenger is dropped off, the episode ends. You receive +20 points for a successful dropoff, and lose 1 point for every timestep it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions.*

There are 500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger (including the case when the passenger is the taxi), and 4 destination locations. (500 = 25*5*4).

There are 6 discrete deterministic actions:

– 0: move down

– 1: move up

– 2: move to the right

– 3: move to the left

– 4: pick up a passenger

– 5: dropoff a passenger

The color coding is as follows:

– blue: passenger

– magenta: destination

– yellow: empty taxi

– green: full taxi

– other letters: locations

Q-Learning

How to choose the appropriate actions to solve this problem?

Q-Learning is a reinforcement learning algorithm aiming to maximise the expected cumulative reward over an episode.

To be continued…

The code can be found here. The cumulative reward vs the number of episode is shown below.

Next steps

Q-learning is one of the simplest Reinforcement Learning algorithms. The problem with Q-learning however is, once the number of states in the environment are very high, it becomes difficult to implement them with Q table as the size would become very large. State of the art techniques uses Deep neural networks instead of the Q-table (Deep Q Learning DQN).