Part 10 – Challenges in reinforcement learning

Reinforcement learning (RL) has gained enormous popularity in recent years, especially in robotics. It is perhaps the most advanced tool for achieving truly independent machines (although self-learning may get there first). However, it does not really work yet… Here are the main challenges currently faced by reinforcement learning, along with current solutions.

Low sample efficiency

RL algorithms have a notoriously high sample complexity, i.e. they require a very large number of trial-and-error actions before being able to solve a task. Applying such methods to real robot arms would require a very large number of iterations, which is not always possible in practice.

For example, in this work, over 580k grasping attempts were necessary before a successful strategy was learned. A human can learn to drive a car in about 30 hours, whereas an RL agent would need the equivalent of millions of hours of training. Most RL robotics problems so far are limited to single-object interactions.

Current solutions

Solutions that tackle this high sample complexity include:

  • to build a virtual model of the robot using a physics simulator, use it to learn a policy and then deploy the policy on the real robot. See section “Limitation of virtual environments”.
  • to use model-based approaches. See section “Limitation of model-based approaches”.
  • to combine model-free and model-based approaches.
  • to use Soft Actor Critic (SAC), a sample-efficient model-free off-policy algorithm.
  • to parallelise learning across multiple robots.
  • to use imitation learning (i.e. let a human manipulate the robot during a few trials to teach action sequences that solve the task successfully). For example, this and this.
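
As an illustration of the imitation learning idea in the last bullet, here is a minimal behaviour cloning sketch. It is not taken from any of the works mentioned above: the one-dimensional state, the `expert_action` demonstrator and the linear policy are all assumptions chosen to keep the example short. The policy is simply fitted to (state, action) pairs by least squares, without any reward.

```python
import random

def fit_linear_policy(demos):
    """Least-squares fit of action = w * state + b to (state, action) pairs."""
    n = len(demos)
    mean_s = sum(s for s, _ in demos) / n
    mean_a = sum(a for _, a in demos) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in demos)
    var = sum((s - mean_s) ** 2 for s, _ in demos)
    w = cov / var
    b = mean_a - w * mean_s
    return w, b

def expert_action(state):
    """Hypothetical demonstrator: pushes the state back toward zero."""
    return -2.0 * state

# Record a few noisy demonstrations from the (hypothetical) human expert.
rng = random.Random(0)
states = [rng.uniform(-1.0, 1.0) for _ in range(50)]
demos = [(s, expert_action(s) + rng.gauss(0.0, 0.01)) for s in states]

# Clone the expert: the learned policy imitates the demonstrated actions.
w, b = fit_linear_policy(demos)

def policy(state):
    return w * state + b
```

In practice the cloned policy is often used only as a starting point, which RL then refines with far fewer trials than learning from scratch.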

Limitation of virtual environments

In problems where the dynamics can be accurately captured by a simulation, pre-training the agent in a simulated environment is an effective approach. However, such problems are often limited to narrow tasks and usually require extensive manual tuning to work properly. Moreover, it is often challenging to build a virtual model that takes into account all the intricacies and imperfections of a physical robot. This is known as the Sim-to-Real gap. In particular, for robots that rely on visual sensing, it is challenging to simulate the images that the robot will encounter in the real world.

Current solutions

  • to randomise the visual inputs provided by the environment and train the policy to be robust to these changes, so that the real world looks like just another variation of the simulation.
  • to train a conditional GAN to transform the randomised images back to the canonical form of the original simulation, on which the policy was trained.
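
The first bullet (randomising the visual inputs) can be sketched as follows. This is only a toy illustration: the image is a small grid of grayscale values in [0, 1], and the randomisation ranges for brightness, contrast and noise are arbitrary assumptions, not values from any particular paper.

```python
import random

def randomise_image(image, rng):
    """Return a copy of a grayscale image (rows of floats in [0, 1]) with
    random brightness, contrast and per-pixel noise applied."""
    brightness = rng.uniform(-0.2, 0.2)  # assumed range
    contrast = rng.uniform(0.7, 1.3)     # assumed range
    out = []
    for row in image:
        new_row = []
        for pixel in row:
            value = (pixel - 0.5) * contrast + 0.5 + brightness
            value += rng.gauss(0.0, 0.02)              # sensor-like noise
            new_row.append(min(1.0, max(0.0, value)))  # clip to [0, 1]
        out.append(new_row)
    return out

rng = random.Random(0)
sim_image = [[0.0, 0.5, 1.0], [0.2, 0.4, 0.6]]  # toy rendered frame

# Each training step sees a different visual variation of the same scene.
training_batch = [randomise_image(sim_image, rng) for _ in range(4)]
```

A real pipeline would randomise textures, lighting and camera pose inside the renderer rather than post-processing pixels, but the principle is the same: the policy never sees the same rendering twice.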

Limitation of model-based approaches

Model-free RL techniques learn a policy based on rewards obtained by interacting with the environment. Their drawback is a high sample complexity. In contrast, in model-based methods the agent learns a model of the environment and uses it to plan and improve its policy, thus dramatically reducing the number of interactions it needs with the environment. This leads to better sample complexity and more robust policies; see for example this paper. Even though model-based approaches reduce sample complexity, they suffer from a number of limitations of their own.
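
To make the model-based idea concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption made for illustration: the environment is a deterministic 1-D chain of five states, the learned model is a plain lookup table, and planning is done with value iteration. The point is only that, once the model is learned, planning needs no further interaction with the real environment.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)  # move left / move right

def env_step(state, action):
    """Toy 'real' environment: a 1-D chain with a reward at the right end."""
    next_state = min(GOAL, max(0, state + action))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

# 1) Collect a small batch of real interactions with a random policy.
rng = random.Random(0)
model = {}  # learned model: (state, action) -> (next_state, reward)
for _ in range(200):
    s = rng.randrange(GOAL)         # any non-terminal state
    a = rng.choice(ACTIONS)
    model[(s, a)] = env_step(s, a)  # deterministic, so one sample suffices

# 2) Plan inside the learned model with value iteration.
gamma = 0.9
values = [0.0] * N_STATES           # the goal state is terminal (value 0)
for _ in range(50):
    for s in range(GOAL):
        values[s] = max(
            r + gamma * values[s2]
            for (ms, _), (s2, r) in model.items() if ms == s
        )

def greedy_action(s):
    """Pick the action with the best one-step lookahead in the learned model."""
    def lookahead(a):
        s2, r = model[(s, a)]
        return r + gamma * values[s2]
    return max(ACTIONS, key=lookahead)

policy = [greedy_action(s) for s in range(GOAL)]
```

With a real robot the model would be a learned function approximator and would never be exact, which is where the limitations of model-based approaches come from.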

Reward specification

RL problems require a reward to be specified so that the agent can learn a policy for a given task. In robotics, specifying a reward is not always straightforward, as it requires a lot of domain knowledge that may not be available. For example, for the task of inserting a book between two other books, designing a reward function based only on visual inputs can be challenging.

Current solutions

Some solutions that tackle the reward specification limitation include:

  • to provide the agent with several images of the goal state and allow it to query a human to know whether the current state is a goal state. For example, see this paper.
  • to let the agent define its own reward from visual inputs based on pixel recognition without human interaction.

Sparse reward

Many RL problems feature very sparse rewards, i.e. the agent does not receive an intermediate reward at each time step. It gets no feedback on how to improve its performance during the episode, and it must figure out by itself which action sequences lead to the final reward. The Mountain-car is a classic RL benchmark problem with sparse rewards.

Current solutions

  • Reward engineering = Reward shaping = Reward hacking = “Rew-art”. It consists of using domain knowledge to augment the sparse reward and transform it into a dense reward (i.e. the reward is always higher in states that are closer to the end goal). In the Mountain-car example, reward engineering would consist of adding the velocity of the car to the reward to encourage it to gather speed. See section “Limitations of reward engineering”.
  • Hindsight Experience Replay (HER) and Scheduled Auxiliary Control (SAC-X). HER leverages the agent’s failed attempts at reaching the final reward by replaying the failed episodes with a different goal than the one the agent was trying to achieve. This allows the agent to learn even if the episode was unsuccessful. SAC-X instead learns a set of auxiliary tasks alongside the main one to guide exploration.
  • Use Divergent Policy Search methods (novelty, surprise and diversity approaches) to explore the space of observable policies; see for example this paper.
  • Use curiosity-driven exploration, intrinsic motivation or count-based exploration to encourage the agent to explore new states.
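
The goal relabelling at the heart of Hindsight Experience Replay can be sketched in a few lines. This is a simplified illustration, not the reference implementation: states are integers on a line, the reward is 1 when the goal is reached and 0 otherwise, and only the "final" relabelling strategy is shown (the state reached at the end of the episode becomes the substitute goal). The names `sparse_reward` and `her_relabel` are made up for this example.

```python
def sparse_reward(achieved, goal):
    """Reward is only given when the goal is exactly reached."""
    return 1.0 if achieved == goal else 0.0

def her_relabel(episode, original_goal):
    """Return replay transitions for the original goal, plus copies relabelled
    with the state actually reached at the end of the episode."""
    hindsight_goal = episode[-1][2]  # next_state of the last transition
    transitions = []
    for state, action, next_state in episode:
        for goal in (original_goal, hindsight_goal):
            reward = sparse_reward(next_state, goal)
            transitions.append((state, action, next_state, goal, reward))
    return transitions

# A failed episode: the agent wanted to reach 5 but ended up at 2.
# Each transition is (state, action, next_state).
episode = [(0, +1, 1), (1, +1, 2), (2, -1, 1), (1, +1, 2)]
replay_buffer = her_relabel(episode, original_goal=5)

# Transitions for goal 5 all carry zero reward, while the relabelled copies
# for goal 2 contain non-zero rewards the agent can learn from.
```

The policy and value function must take the goal as an input for this to work, since the same transition is stored under two different goals.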

Limitations of reward engineering

Reward engineering seems an attractive solution to the problem of sparse rewards; however, it faces the following limitations:

  • it assumes some a priori knowledge about how to solve the problem, which is not always available, especially for complex tasks.
  • an agent trained on an engineered reward no longer aims at solving the initial task but instead optimises a proxy that will hopefully help with the learning process. This may compromise the performance relative to the true objective, and sometimes even lead to unexpected and unwanted behaviour.
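
The Mountain-car velocity bonus makes the second limitation easy to see. The sketch below uses made-up numbers: the goal position, the shaping weight and the two example states are assumptions, and the functions are not part of any library. With a large enough weight, the shaped reward prefers a fast-moving car at the bottom of the valley over a car that has actually reached the goal.

```python
GOAL_POSITION = 0.5     # assumed position of the flag
VELOCITY_WEIGHT = 20.0  # hypothetical shaping coefficient

def sparse_reward(position, velocity):
    """True objective: reward only when the car reaches the goal."""
    return 1.0 if position >= GOAL_POSITION else 0.0

def shaped_reward(position, velocity):
    """Engineered proxy: the sparse reward plus a bonus for speed."""
    return sparse_reward(position, velocity) + VELOCITY_WEIGHT * abs(velocity)

at_goal = (0.5, 0.0)     # stopped at the flag
swinging = (-0.5, 0.07)  # racing through the bottom of the valley

print("at goal :", sparse_reward(*at_goal), shaped_reward(*at_goal))
print("swinging:", sparse_reward(*swinging), shaped_reward(*swinging))
```

An agent maximising this proxy could learn to swing back and forth indefinitely instead of climbing the hill, which is exactly the kind of unwanted behaviour described above.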

Overfitting / failing to generalise

It is hard to prevent an RL agent from overfitting a problem. It may learn to solve a task with superhuman performance and yet perform very poorly on other, similar tasks.

A solution could be to train an agent on a large distribution of environments, but that’s very computationally expensive.

Limitation in robotics: continuous action and state space

In order to solve specific tasks in an RL context, robots must receive and execute instructions continuously (or at least at very short time intervals). Some RL algorithms, such as tabular Q-learning, can only deal with discrete state and action spaces. In order to control robots with them, it is thus necessary to discretise the continuous state and action spaces.
However, as the number of degrees of freedom increases, the number of discrete bins increases exponentially, which quickly becomes computationally prohibitive. This is informally known as the curse of dimensionality.
A number of RL algorithms have been developed to tackle this problem, such as DDPG, TRPO, PPO, NAF or Branching Dueling Q-Network (BDQ).
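
A quick back-of-the-envelope calculation shows how fast the discretised table grows. The setup is an assumption made for illustration: each joint contributes a position and a velocity to the state and one torque to the action, and every dimension is split into 10 bins.

```python
def table_size(bins_per_dim, n_state_dims, n_action_dims):
    """Number of entries in a Q-table over a fully discretised space."""
    n_states = bins_per_dim ** n_state_dims
    n_actions = bins_per_dim ** n_action_dims
    return n_states * n_actions

for dof in (1, 2, 4, 7):
    # Assumed layout: 2 state dimensions and 1 action dimension per joint.
    size = table_size(10, n_state_dims=2 * dof, n_action_dims=dof)
    print(f"{dof} DoF -> {size:.0e} Q-table entries")
```

Under these assumptions a single joint needs a thousand entries, while a 7-DoF arm needs 10^21, far more than could ever be stored or visited.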