Part 4 – Learning to use OpenAI Gym

The Gym library by OpenAI provides virtual environments that can be used to compare the performance of different reinforcement learning techniques. I will show here how to use it in Python.

Installation

Follow the instructions on the installation page. You will need Python 3.5+ to follow these tutorials.
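If pip is available, the installation typically comes down to a single command (a sketch, assuming a standard Python 3.5+ setup; the classic control environments, which include the cart pole, are part of the base package):

    pip install gym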

The cart pole environment

We will be using the cart pole environment from the OpenAI Gym library.

This environment simulates an inverted pendulum where a pole is attached to a cart which moves along a frictionless track. At each time step, the agent can move the cart to the left or to the right. The pendulum starts upright, and the goal is to prevent it from falling over.

An example of the cart pole environment in action.

Here’s a successful implementation of reinforcement learning for the same problem in a real environment.

The cart-pole problem in a real environment.

Setting up the environment

Let’s start by creating the environment and by retrieving some useful information.
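Here is a minimal sketch of that setup (assuming the CartPole-v0 environment id and the classic Gym API, i.e. versions prior to 0.26):

    import gym

    # Create the cart pole environment.
    env = gym.make('CartPole-v0')

    # The action space tells us what the agent can do at each time step.
    print(env.action_space)            # Discrete(2)

    # The observation space describes the states returned by the environment.
    print(env.observation_space)       # Box(4,)
    print(env.observation_space.low)   # lower bound of each state variable
    print(env.observation_space.high)  # upper bound of each state variable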

There are 2 possible actions that can be performed at each time step: move the cart to the left (0) or to the right (1).
The state observed at each time step has 4 components: the position of the cart, its velocity, the angle of the pole and the velocity of the pole tip. Their ranges are given below:

                           Min value    Max value
    Cart position          -4.8         4.8
    Cart velocity          -3e38        3e38
    Pole angle             -0.4         0.4
    Pole velocity at tip   -3e38        3e38

Implementing a random policy

We need to identify the series of actions that maximises the total cumulative reward at the end of an episode. For the sake of simplicity, we will start by implementing a random policy, i.e. at each time step the cart is pushed either to the left or to the right at random.
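With Gym, sampling a random action is a one-liner on the action space (a sketch using the standard action_space.sample() call):

    action = env.action_space.sample()  # returns 0 (push left) or 1 (push right) at random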

Let’s start learning

We let the agent learn over 20 episodes of 100 time steps each. At each time step, a random action is chosen by the policy. A reward of +1 is given for every time step that the pole remains upright, and the cumulative reward is calculated at the end of the episode. The episode ends if:

  1. The pole angle is more than ±12°.
  2. The cart position is more than ±2.4 (i.e. the center of the cart reaches the edge of the display).
  3. The episode length is greater than 200.

The problem is considered solved when the average reward is greater than or equal to 195 over 100 consecutive trials.

Of course, because we are only taking random actions, we can’t expect any improvement over time. The policy here is very naive; we will implement more sophisticated policies later. The code implementing the random policy can be found here.
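A minimal sketch of such a random-policy loop looks like this (assuming CartPole-v0 and the classic Gym API, where step() returns observation, reward, done and info):

    import gym

    env = gym.make('CartPole-v0')

    for episode in range(20):                # 20 episodes
        observation = env.reset()
        cumulative_reward = 0
        for t in range(100):                 # at most 100 time steps per episode
            env.render()
            action = env.action_space.sample()              # random policy
            observation, reward, done, info = env.step(action)
            cumulative_reward += reward      # +1 for every step the pole stays up
            if done:                         # pole fell over or cart left the track
                break
        print('Episode {}: cumulative reward = {}'.format(episode, cumulative_reward))

    env.close()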

Example of output: the cumulative reward obtained at the end of each of the 20 episodes.

This video shows the cart pole environment taking random actions for 20 episodes (okay admittedly that’s not the most exciting video…).

Implementing a hard-coded policy

In the same effort to understand how to use OpenAI Gym, we can define other simple policies to decide which action to take at each time step. For example, instead of using a random policy, we can hard-code the actions to take at each time step: we can make the agent push the cart to the left for the first 20 time steps and to the right for the remaining ones.
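A sketch of such a hard-coded policy (the function name is just for illustration); it can replace the call to action_space.sample() in the loop above:

    # Hypothetical hard-coded policy: push left (0) for the first 20 time steps,
    # then push right (1) for the remaining ones.
    def hard_coded_policy(t):
        return 0 if t < 20 else 1

    # Inside the episode loop:
    # action = hard_coded_policy(t)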

We can also decide to alternate left and right pushes at each time step.
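The alternating version is just as short (again, the function name is illustrative):

    # Hypothetical alternating policy: push left on even time steps, right on odd ones.
    def alternating_policy(t):
        return t % 2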

This is probably not a very efficient strategy either, but you get the idea. In a later article I will explain how to define a smarter policy.
