Part 1 – What is reinforcement learning?

Reinforcement learning (RL) is a branch of machine learning that has gained a lot of momentum in the recent years. In this series, we will explain what is reinforcement learning and we will explore some implementation of the different types of RL techniques.


There are 3 main types of machine learning techniques: supervised learning, unsupervised learning and reinforcement learning. You may already be familiar with supervised learning, which consists in training a computer to make predictions give a set of labeled examples. On the other hand, unsupervised learning is attempting to find similarities in the data without using any labels. In reinforcement learning, the goal is to determine the actions that would maximize the total cumulative reward for a given problem.


More formally, Reinforcement Learning (RL) is defined as “a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error using feedback from its own actions and experiences.”[1]. The agent learns to make an informed decision by interrogating an unknown environment. The agent improves its decision-making abilities over time by maximizing a long-term reward through trial-and-error.

Here are some key definitions:
Environment: Physical world in which the agent operates
State: Current situation of the agent
Reward: Feedback from the environment
Policy: Method to map agent’s state to actions
Value: Future reward that an agent would receive by taking an action in a particular state


Model free vs model based

Model free
Model free methods don’t build a model of the environment or reward; it just directly connects observations to actions (or values that are related to actions). The agent build directly a policy and/or a value function based on its observation of the environment.

Model based
The agent is given a model of the environment and uses this model to find the optimal actions. Model-based methods try to predict what the next observation and/or reward will be. Based on this prediction, the agent is trying to choose the best possible action to take, very often making such predictions multiple times to look more and more steps into the future.

Pros and cons
Usually pure model-based methods are used in deterministic environments, such as board games with strict rules. They require less samples so they also find nice applications in robotics. On the other hand, model-free methods are usually easier to train as it’s hard to build good models of complex environments with rich observations.

Value-based vs policy-based vs actor critic

Policy-based method
Policy-based methods are directly approximating the policy of the agent, that is, what actions the agent should carry out at every step. Policy is usually represented by probability distribution over the available actions. At each time step, the policy is updated (but the value function is not stored).
e.g. Policy gradient

Value-based method
The agent calculates the value of every possible action and chooses the action with the best value (the policy is not explicitly defined).
e.g. Q learning

Actor critic method
Actor-critic methods store both the policy and the value function at each time step.

Pros and cons
Policy-based methods have better convergence properties. The problem with value-based methods is that they can have a big oscillation while training (high variability). This is because the choice of action may change dramatically for an arbitrarily small change in the estimated action values.

Episodic vs infinite horizon problems

Episodic problem
There is a starting state and a terminal state which defines an episode (i.e. a list of states, actions, rewards, and new states).
e.g. Mario game, cartpole environment

Infinite horizon
There is no terminal state. The agent has to learn how to choose the best actions and simultaneously interacts with the environment.
e.g. Stock trading

Continuous vs discrete problems

The state space and action space can be either discrete or continuous. Grid-world problems such as mazes have a discrete state and discrete action space. On the other hand, robotic arms have continuous state and action spaces.

Continuous problems are more difficult to solve. One solution could be to discretise a continuous space into bins. However, this may throws away valuable information concerning the geometry of the domain. It may also face the problem of the curse of dimensionality if the number of bins necessary is too large.

Deterministic vs stochastic policy

Deterministic policy
If we’re in that state, this specific action will be executed.
action = policy(state)
e.g. greedy policy

Stochastic policy
If we’re in that state, there’s a probability P that this action will be executed. It allows to make random exploratory decisions.
(action|state) = P(A=action|S=state)
e.g. Softmax policy

Pseudo deterministic policy
If we’re in that state, we either execute a random action (exploration) or we execute the action following the target policy (exploitation).
e.g. epsilon-greedy policy

Monte Carlo vs Temporal Difference

Monte Carlo methods
The agent waits until the end of the episode to look at the total cumulative reward.

Temporal difference (TD)
The agent update the maximum expected future reward at the end of each action / time step.

On-policy vs off-policy methods (TD learning)

Off-policy methods
Off-policy methods learn the optimal policy independently of the agent’s actions. At every time step, 2 policies are updated:
– one that is learnt about and that becomes the optimal policy (target policy: exploitation)
– one that is more exploratory and is used to generate behavior (behaviour policy: exploration).
An off-policy agent does not always follow to target policy.
e.g. Q-learning

On-policy methods
On-policy methods learn the value of the policy being carried out by the agent including the exploration steps. The on-policy agent can take into account the costs associated with exploration.
e.g. SARSA

Pros and cons
Off-policy methods doesn’t depend on “freshness of data”. The agent can use very old data sampled from the environment several million steps ago, and this data will still be useful for learning. For this reason, off-policy methods can train on the previous large history of data (or even on human demonstrations) i.e. they can use a large experience replay buffer to make the data closer to being independent and identically distributed. However, they are usually slower to converge.
On-policy methods depend heavily on the training data to be sampled from to the current policy that is being learnt. Each time the policy is changed, even a little bit, we need to generate new samples. On-policy method require much more fresh data from the environment. However, they are usually faster to converge than off-policy methods.


We have just seen the definition of reinforcement learning and the main types of RL problems and methods. In the next post, I will talk about some common RL algorthms.

Leave a Reply

Your email address will not be published. Required fields are marked *