The Gym library by OpenAI provides virtual environments that can be used to compare the performance of different reinforcement learning techniques. In this article, I will show how to use it in Python.

Installation

Follow the instructions on the installation page. You will need Python 3.5+ to follow these tutorials.

The cart pole environment

We will be using the cart pole environment from the OpenAI Gym library.

This environment simulates an inverted pendulum where a pole is attached to a cart which moves along a frictionless track. At each time step, the agent can move the cart to the left or to the right. The pendulum starts upright, and the goal is to prevent it from falling over.

Here’s an example.

Here’s a successful implementation of reinforcement learning for the same problem in a real environment.

Setting up the environment

Let’s start by creating the environment and by retrieving some useful information.

```python
import gym

env = gym.make('CartPole-v0')

print(env.action_space)
print(env.observation_space)
print(env.observation_space.high)
print(env.observation_space.low)
```

There are 2 possible actions that can be performed at each time step: move the cart to the left (0) or to the right (1).

There are 4 states that can be observed at each time step: the position of the cart, its velocity, the angle of the pole and the velocity of the pole at the tip. The value ranges of the states are given below:

| | Min value | Max value |
|---|---|---|
| Cart position | -4.8 | 4.8 |
| Cart velocity | -3e38 | 3e38 |
| Pole angle | -0.4 | 0.4 |
| Pole velocity at tip | -3e38 | 3e38 |

Implementing a random policy

We need to identify which series of actions maximises the total cumulative reward at the end of an episode. For the sake of simplicity, we will start by implementing a random policy, i.e. at each time step, the cart is pushed either to the right or to the left at random.

```python
def policy():
    """Return a random action: either 0 (left) or 1 (right)."""
    action = env.action_space.sample()
    return action
```

Let’s start learning

We let the agent run over 20 episodes of 100 time steps each. At each time step, a random action is chosen by the policy. A reward of +1 is given for every time step that the pole remains upright, and the cumulative reward is calculated at the end of the episode. The episode ends if:

- The pole angle is more than ±12°.
- The cart position is more than ±2.4 (i.e. the center of the cart reaches the edge of the display).
- The episode length is greater than 200.
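The first two conditions can be checked directly against the state vector. Here is a minimal sketch; the constant and function names are my own, and the threshold values come from the conditions above:

```python
import math

# Termination thresholds for CartPole (from the conditions listed above)
MAX_POLE_ANGLE = 12 * math.pi / 180  # +/- 12 degrees, converted to radians
MAX_CART_POSITION = 2.4              # +/- 2.4 units from the centre

def is_terminal(state):
    """Return True if the cart left the track or the pole tilted too far."""
    cart_position, _, pole_angle, _ = state
    return abs(cart_position) > MAX_CART_POSITION or abs(pole_angle) > MAX_POLE_ANGLE
```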

The problem is considered solved when the average reward is greater than or equal to 195 over 100 consecutive trials.
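This criterion can be checked by averaging the total rewards of the last 100 episodes. A short sketch, with the function name being my own:

```python
def is_solved(episode_rewards, window=100, threshold=195.0):
    """Return True if the mean total reward over the last `window`
    episodes meets the threshold (requires at least `window` episodes)."""
    if len(episode_rewards) < window:
        return False
    recent = episode_rewards[-window:]
    return sum(recent) / window >= threshold
```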

Of course, because we are only taking random actions, we can’t expect any improvement over time. The policy is very naive here; we will implement more complex policies later. The code implementing the random policy can be found here.

```python
nb_episodes = 20
nb_timesteps = 100

for episode in range(nb_episodes):  # iterate over the episodes
    state = env.reset()  # initialise the environment
    rewards = []
    for t in range(nb_timesteps):  # iterate over the time steps
        env.render()  # display the environment
        state, reward, done, info = env.step(policy())  # perform the action chosen by the policy
        rewards.append(reward)  # store the +1 reward for this time step
        if done:  # the episode ends if the pole tilts more than 12 deg or the cart moves more than 2.4 units from the centre
            cumulative_reward = sum(rewards)
            print("episode {} finished after {} timesteps. Total reward: {}".format(episode, t + 1, cumulative_reward))
            break

env.close()
```

Example of output:

```
episode 0 finished after 12 timesteps. Total reward: 12.0
episode 1 finished after 12 timesteps. Total reward: 12.0
episode 2 finished after 16 timesteps. Total reward: 16.0
episode 3 finished after 16 timesteps. Total reward: 16.0
episode 4 finished after 21 timesteps. Total reward: 21.0
episode 5 finished after 23 timesteps. Total reward: 23.0
episode 6 finished after 15 timesteps. Total reward: 15.0
episode 7 finished after 16 timesteps. Total reward: 16.0
episode 8 finished after 19 timesteps. Total reward: 19.0
episode 9 finished after 26 timesteps. Total reward: 26.0
episode 10 finished after 17 timesteps. Total reward: 17.0
episode 11 finished after 15 timesteps. Total reward: 15.0
episode 12 finished after 12 timesteps. Total reward: 12.0
episode 13 finished after 20 timesteps. Total reward: 20.0
episode 14 finished after 17 timesteps. Total reward: 17.0
episode 15 finished after 23 timesteps. Total reward: 23.0
episode 16 finished after 28 timesteps. Total reward: 28.0
episode 17 finished after 12 timesteps. Total reward: 12.0
episode 18 finished after 16 timesteps. Total reward: 16.0
episode 19 finished after 24 timesteps. Total reward: 24.0
```

This video shows the cart pole environment taking random actions for 20 episodes (okay admittedly that’s not the most exciting video…).

Implementing a hard-coded policy

In the same effort to understand how to use OpenAI Gym, we can define other simple policies to decide which action to take at each time step. Instead of using a random policy, we can hard-code the actions. For example, we can make the agent push the cart to the left for the first 20 time steps and to the right for the remaining ones.

```python
def policy(t):
    if t < 20:
        action = 0  # go left
    else:
        action = 1  # go right
    return action
```

We can also decide to alternate left and right pushes at each time step.

```python
def policy(t):
    action = 0
    if t % 2 == 1:  # if the time step is odd
        action = 1
    return action
```

This is probably not a very efficient strategy either, but you get the idea. In a later article I will explain how to define a more clever policy.
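As a taste of what a less naive policy could look like, here is a hypothetical reactive sketch (not from the original tutorial, and the function name is mine): push the cart in the direction the pole is leaning, using the pole angle from the state vector.

```python
def reactive_policy(state):
    """Return 1 (right) if the pole leans right (positive angle), else 0 (left)."""
    pole_angle = state[2]  # third observation: the pole angle
    return 1 if pole_angle > 0 else 0
```

Unlike the hard-coded policies above, this one reacts to the observed state rather than to the time step alone.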