Part 2 – Reinforcement learning algorithms

In this article, I will list and describe some important RL algorithms. Please refer to the previous article if you have any questions regarding RL vocabulary.

Please note this is a work-in-progress article.

| Algorithm | Model-free or model-based | Agent type | Policy | Policy type | Monte Carlo or Temporal difference (TD) | Action space | State space |
|---|---|---|---|---|---|---|---|
| Tabular Q-learning (= SARSA max), Q-learning(λ) | Model-free | Value-based | Off-policy | Pseudo-deterministic (epsilon-greedy) | TD | Discrete | Discrete |
| SARSA(λ) | Model-free | Value-based | On-policy | Pseudo-deterministic (epsilon-greedy) | TD | Discrete | Discrete |
| N-step DQN, Double DQN, Noisy Network, Prioritized Replay buffer DQN, Dueling DQN, Categorical DQN | Model-free | Value-based | Off-policy | Pseudo-deterministic (epsilon-greedy) | | Discrete | Continuous |
| Cross-entropy method | Model-free | Policy-based | On-policy | | Monte Carlo | | |
| REINFORCE (Vanilla policy gradient) | Model-free | Policy-based | On-policy (?) | Stochastic | Monte Carlo | | |
| Policy gradient softmax | Model-free | | | Stochastic | | | |
| Natural Policy Gradient | Model-free | | | Stochastic | | | |
| TRPO | Model-free | | On-policy (?) | Stochastic | | Continuous | Continuous |
| PPO | Model-free | Policy-based | On-policy (?) | Stochastic | | Continuous | Continuous |
| A2C | Model-free | Policy-based | On-policy | Stochastic | TD | Continuous (?) | |
| DDPG (A2C family) | Model-free | Policy-based | Off-policy | Deterministic | | Continuous | Continuous |
| Curiosity Model | | | | | | | |
| NAF | Model-free | | | | | | |
| Dynamic programming | | | | | | | |
| Model Predictive Control | Model-based | | | | | | |
| PILCO | Model-based | | | | | | |
| Policy search with Gaussian Process | Model-based | | | | | | |
| Policy search with backpropagation | Model-based | | | | | | |

1. Q Learning

Q-Learning is an off-policy, model-free RL algorithm based on the Bellman Equation.
e.g. Q-learning(λ)
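The Bellman-based update can be sketched in a few lines. This is a minimal tabular example with illustrative states, rewards, and hyperparameters; note the `max` over next actions, which is what makes Q-learning off-policy:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Bellman-style Q-learning update.
    Off-policy: bootstraps from the greedy max over next actions,
    regardless of which action the behavior policy will actually take."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy example: 2 states, 2 actions, all Q-values start at zero.
Q = np.zeros((2, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])  # 0.1, i.e. alpha * reward, since Q[s_next] is all zeros
```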

2. State-Action-Reward-State-Action (SARSA)

SARSA very much resembles Q-learning. The key difference between SARSA and Q-learning is that SARSA is an on-policy algorithm.
e.g. SARSA(λ)
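The on-policy difference is visible directly in the update rule. A minimal sketch with illustrative values: SARSA bootstraps from the action the current policy actually chose next (`a_next`), instead of the greedy max used by Q-learning:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy TD update: the bootstrapped value comes from a_next,
    the next action selected by the current (e.g. epsilon-greedy) policy."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy example: pretend the next state-action pair already has value 2.0.
Q = np.zeros((2, 2))
Q[1, 0] = 2.0
Q = sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
print(Q[0, 1])  # 0.1 * (1.0 + 0.99 * 2.0) = 0.298
```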

3. Deep Q Network (DQN)

DQN extends Q-learning with a deep neural network that approximates the Q-function, stabilized by a replay buffer (to decorrelate training samples) and a target network (to keep the TD targets stable).
e.g. N step DQN, Double DQN, Noisy Network, Prioritized Replay buffer DQN, Dueling DQN, Categorical DQN
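A minimal sketch of the two stabilizing ingredients, with a toy constant function standing in for the frozen target network (all names and values here are illustrative, not a full DQN implementation):

```python
import random
from collections import deque
import numpy as np

# Replay buffer: stores transitions so training batches can be sampled i.i.d.
buffer = deque(maxlen=10_000)

def store(transition):
    buffer.append(transition)  # (s, a, r, s_next, done)

def td_targets(batch, q_target_fn, gamma=0.99):
    """TD targets computed with a frozen target network q_target_fn,
    not the online network being trained."""
    return np.array([
        r if done else r + gamma * np.max(q_target_fn(s_next))
        for (_, _, r, s_next, done) in batch
    ])

# Toy "target network": constant Q-values, just for illustration.
q_target = lambda s: np.array([0.5, 1.0])
store((0, 1, 1.0, 1, False))
store((1, 0, 0.0, 2, True))
batch = random.sample(list(buffer), 2)
print(td_targets(sorted(batch), q_target))  # [1.99, 0.] after sorting by state
```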

4. Cross-entropy method (CEM)

CEM takes “educated” random guesses at actions: it selects the top-performing guesses and uses them as seeds for the next round of guessing. CEM is a model-free, policy-based, on-policy method.
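The sample-select-refit loop can be sketched as follows on a toy objective (the function, dimensions, and hyperparameters are illustrative; a real application would score policy parameters by rolling out episodes):

```python
import numpy as np

def cem(score_fn, dim, n_samples=50, n_elite=10, n_iters=20, seed=0):
    """Cross-entropy method: sample candidates from a Gaussian, keep the
    top performers ("elite"), and refit the Gaussian to them."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    for _ in range(n_iters):
        samples = rng.normal(mean, std, size=(n_samples, dim))
        scores = np.array([score_fn(x) for x in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]   # top performers
        mean = elite.mean(axis=0)                        # refit the Gaussian
        std = elite.std(axis=0) + 1e-3                   # small floor avoids collapse
    return mean

# Toy "policy search": maximize -||x - 3||^2, whose optimum is x = [3, 3].
best = cem(lambda x: -np.sum((x - 3.0) ** 2), dim=2)
print(best)  # should land much closer to 3 than the starting mean of 0
```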

5. Policy Gradients (PG)

Policy gradient methods are policy-based, unlike Q-learning, which is value-based: they optimize the policy directly via gradient ascent on the expected return, rather than deriving a policy from learned values.

e.g. REINFORCE, PG softmax, Natural Policy Gradient, Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO)

Policy gradient vs DQN

Policy gradients:

  • have better convergence properties
  • are more effective in high-dimensional (or even continuous) action spaces
  • don’t suffer from perceptual aliasing (so the agent doesn’t get stuck repeating the same actions)
  • can learn stochastic policies (i.e. they output a probability distribution over actions as opposed to deterministic policies), which means:
      ◦ No explicit exploration is needed. In Q-learning, we used an epsilon-greedy strategy to explore the environment and prevent our agent from getting stuck with a non-optimal policy. With probabilities returned by the network, exploration is performed automatically: in the beginning, the network is initialized with random weights and returns a near-uniform probability distribution, which corresponds to random agent behavior.
      ◦ No replay buffer is used. PG methods belong to the class of on-policy methods, which means we can’t train on data obtained from an old policy. This is both good and bad: such methods usually converge faster, but they usually require much more interaction with the environment than off-policy methods such as DQN.
      ◦ No target network is needed. Here we use Q-values, but they’re obtained from our own experience in the environment. In DQN, we used the target network to break the correlation in the Q-value approximation, but we’re not approximating it anymore. [1]
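The stochastic-policy point above can be made concrete with REINFORCE on a toy multi-armed bandit. Everything here (arm rewards, learning rate, baseline) is an illustrative setup, not a canonical implementation; the key line is scaling the log-probability gradient by the (baseline-adjusted) return:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# REINFORCE on a toy 3-armed bandit with a softmax policy.
rng = np.random.default_rng(0)
theta = np.zeros(3)                        # one logit per action
mean_rewards = np.array([0.2, 0.5, 0.9])   # arm 2 is best (assumed for the demo)
baseline = 0.0

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)             # sample from the stochastic policy
    r = mean_rewards[a] + rng.normal(0.0, 0.1)
    baseline += 0.01 * (r - baseline)      # running baseline reduces variance
    grad_log_pi = -probs                   # d log pi(a)/d theta for softmax
    grad_log_pi[a] += 1.0
    theta += 0.1 * (r - baseline) * grad_log_pi  # gradient ascent on return

print(softmax(theta))  # probability mass should concentrate on the best arm
```

Note that exploration happens automatically: early on the logits are near zero, so the policy is close to uniform, exactly as described above.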

6. Actor-critic algorithm

This family of RL algorithms combines policy-based and value-based methods:

  • The critic measures how good the action taken is (value-based)
  • The actor controls how our agent behaves (policy-based)

e.g. Advantage Actor Critic (A2C), Asynchronous Advantage Actor-Critic (A3C), Soft actor-critic (SAC), Deep Deterministic Policy Gradient (DDPG), Distributed Distributional Deep Deterministic Policy Gradients (D4PG)

DDPG is an actor-critic approach for continuous actions. It belongs to the A2C family, but it is off-policy and uses a deterministic policy. Unlike DQN, it can learn policies in high-dimensional, continuous action spaces.
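The actor/critic split above can be sketched with a tabular critic (states, rewards, and learning rates are illustrative). The critic estimates state values; the actor would then scale its policy gradient, as in REINFORCE, by the TD advantage instead of the raw return:

```python
import numpy as np

def td_advantage(V, s, r, s_next, done, gamma=0.99):
    """Critic's judgment of how good the action was:
    A = r + gamma * V(s') - V(s)."""
    target = r if done else r + gamma * V[s_next]
    return target - V[s]

def critic_update(V, s, advantage, lr=0.1):
    V[s] += lr * advantage  # move V(s) toward the TD target
    return V

# Toy example: 3 states, critic initialized to zero.
V = np.zeros(3)
adv = td_advantage(V, s=0, r=1.0, s_next=1, done=False)
V = critic_update(V, 0, adv)
print(adv, V[0])  # 1.0 0.1
```

A positive advantage means the action did better than the critic expected, so the actor increases its probability; a negative one decreases it.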

7. Dataset Aggregation (DAgger)

DAgger improves imitation learning by iteratively collecting expert (human) labels on the states the learned policy actually visits, and aggregating them into the training set.

8. Monte Carlo Tree Search (MCTS)

It searches discrete action spaces using a search tree guided by an exploration tree policy.

e.g. AlphaZero

9. Curiosity Model

10. Normalized Advantage Functions (NAF)

11. Model-based Reinforcement Learning

e.g. Model Predictive Control (MPC), Policy search with backpropagation, PILCO (probabilistic inference for learning control), Policy search with Gaussian Process, Guided Policy Search (GPS), Dyna-Q


We have just seen some of the most widely used RL algorithms. In the next article, we will look at the challenges and applications of RL in robotics.
