Part 6 – Q learning for continuous state problems

In a previous post, we used Q learning to solve simple grid-world problems such as a maze or Taxi-v2. For this type of problem, it was possible to use a Q-table with a finite number of rows, one per possible state. However, in most real-life problems the number of possible states is infinite, so it is virtually impossible to define a Q-table as in the previous post. A workaround consists in discretising the state space into buckets and using these buckets as entries in the Q-table. I will illustrate this concept using the Cart-Pole environment from OpenAI Gym.

Understanding the environment

The Cart-Pole environment was described in a previous post. Basically, the agent must learn to balance an inverted pendulum by pushing a cart sliding on a rail towards the right or towards the left.

Let’s have a look at the problem’s characteristics.

The action space is discrete and can only take 2 values: push left (0) or push right (1).

The state space is continuous and consists of 4 values whose boundaries are given below.

Num   Observation            Min                   Max
0     Cart Position          -4.8                  4.8
1     Cart Velocity          -Inf                  Inf
2     Pole Angle             -0.418 rad (≈ -24°)   0.418 rad (≈ 24°)
3     Pole Velocity At Tip   -Inf                  Inf
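
These characteristics can be checked directly from the environment. A minimal sketch, using the classic gym API:

```python
import gym

env = gym.make("CartPole-v0")

print(env.action_space)            # Discrete(2): 0 = push left, 1 = push right
print(env.observation_space.low)   # [-4.8, ~-3.4e38, -0.418, ~-3.4e38]
print(env.observation_space.high)  # [ 4.8,  ~3.4e38,  0.418,  ~3.4e38]
# The velocity bounds are the float32 maximum, i.e. effectively unbounded.
```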

Discretising the state space

In order to define the Q-matrix, it is necessary to discretise each feature of the state space into buckets (or bins).
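
As a minimal illustration of the idea, a single continuous value can be mapped to a bucket index with numpy (the bucket count and bounds below are arbitrary examples):

```python
import numpy as np

# Split the pole angle range (in radians) into 6 buckets
edges = np.linspace(-0.418, 0.418, num=6 + 1)[1:-1]  # the 5 inner bucket edges

angle = 0.05                        # a continuous observation
bucket = np.digitize(angle, edges)  # integer bucket index between 0 and 5
print(bucket)                       # 3
```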

The optimal number of buckets for each state variable (x, x', theta, theta') can be found by trial and error, by trying to maximise the average cumulative reward per training episode (i.e. the sum of the cumulative rewards of all episodes divided by the number of training episodes). Since we assign only one bucket each to the cart position and the cart velocity, these two values are effectively ignored.

Note that the cart velocity and the pole velocity at tip must be given finite, reduced bounds – i.e. [-0.5, +0.5] and [-50, +50], respectively – in order to be discretised.

Let’s now define the hyperparameters and some necessary functions.
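
Something along the following lines would work; the bucket counts, the clipped velocity ranges (here the ±50 is interpreted as degrees per second) and the hyperparameter values are illustrative assumptions rather than the exact values used in the full code:

```python
import math
import numpy as np
import gym

env = gym.make("CartPole-v0")

# Number of buckets per state variable (x, x', theta, theta') and number of actions
N_BUCKETS = (1, 1, 6, 12)
N_ACTIONS = env.action_space.n  # 2

# Finite, reduced bounds so that the unbounded velocities can be discretised
STATE_BOUNDS = [
    (-4.8, 4.8),                            # cart position
    (-0.5, 0.5),                            # cart velocity (clipped)
    (-0.418, 0.418),                        # pole angle (radians)
    (-math.radians(50), math.radians(50)),  # pole velocity at tip (clipped)
]

# Learning hyperparameters (illustrative values)
ALPHA = 0.1        # learning rate
GAMMA = 0.99       # discount factor
EPSILON = 0.1      # exploration rate
N_EPISODES = 1000

# Q-matrix: one cell per (bucketed state, action) pair -> shape (1, 1, 6, 12, 2)
q_table = np.zeros(N_BUCKETS + (N_ACTIONS,))

def discretise(obs):
    """Map a continuous observation to a tuple of bucket indices."""
    indices = []
    for value, (low, high), n in zip(obs, STATE_BOUNDS, N_BUCKETS):
        edges = np.linspace(low, high, n + 1)[1:-1]  # inner bucket edges
        clipped = np.clip(value, low, high)
        indices.append(int(np.digitize(clipped, edges)))
    return tuple(indices)
```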

The Q-matrix now has dimensions (1 x 1 x 6 x 12 x 2). We can train the agent exactly as before; the only difference is that each state received from the environment must be discretised before being used to index the Q-matrix.
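
Continuing from the setup sketched above (and using the classic gym reset/step signatures), the training loop could look like this:

```python
for episode in range(N_EPISODES):
    state = discretise(env.reset())
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < EPSILON:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))

        obs, reward, done, info = env.step(action)
        next_state = discretise(obs)

        # Standard Q-learning update, applied to the bucketed states
        best_next = np.max(q_table[next_state])
        q_table[state + (action,)] += ALPHA * (
            reward + GAMMA * best_next - q_table[state + (action,)]
        )
        state = next_state
```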

The full code can be found here.

Discretising the state and action space

It is possible to discretise both the state and action space. For example, the Pendulum-v0 environment is characterised by continuous state and action spaces.
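
A minimal sketch of how the action axis could be handled, assuming an arbitrary grid of 9 torque levels (classic gym API again):

```python
import numpy as np
import gym

env = gym.make("Pendulum-v0")
obs = env.reset()

# Pendulum-v0 actions are a continuous torque in [-2, 2];
# replace them with a fixed grid of candidate torques.
N_TORQUES = 9
torques = np.linspace(env.action_space.low[0], env.action_space.high[0], N_TORQUES)

# The Q-matrix then has one column per candidate torque, and the agent
# selects an index rather than a raw torque value.
action_index = np.random.randint(N_TORQUES)          # e.g. during exploration
obs, reward, done, info = env.step([torques[action_index]])
```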

The code can be found here.

Limitations of the discretisation

It is relatively easy and efficient to discretise the state and action spaces of simple problems. However, as the complexity increases, more and smaller discretisation buckets are needed to control the agent accurately and to extract precise information from the environment.

The problem with increasing the number of buckets is that the Q-matrix rapidly becomes huge. For example, an environment with 10,000 states and 1,000 actions per state would require a Q-table of 10 million cells! This presents two problems:

  • First, the amount of memory required to store and update the table grows with the number of states and quickly exceeds the memory of the computer.
  • Second, the amount of time required to visit each state often enough to build a useful Q-table would be unrealistic.
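
Both issues follow from the fact that the table size grows multiplicatively with every extra state variable and every extra bucket, as a quick back-of-the-envelope computation shows (the figures are illustrative):

```python
n_states, n_actions = 10_000, 1_000
print(n_states * n_actions)     # 10,000,000 cells (~40 MB as float32)

# A finer discretisation explodes quickly: e.g. 8 state variables
# with 20 buckets each and the same 1,000 actions
print(20 ** 8 * n_actions)      # 25,600,000,000,000 cells, far beyond any memory
```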

This is known as the curse of dimensionality. In a future article, we will look at other strategies to deal with this problem, such as the famous Deep Q Network.