In this article, I will list and describe some important RL algorithms. Please refer to the previous article if you have any questions regarding RL vocabulary.
Please note this is a work-in-progress article.
| Algorithm | Model-free or model-based | Agent type | Policy | Policy type | Monte Carlo or Temporal difference (TD) | Action space | State space |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Tabular Q-learning (= SARSA max), Q-learning(λ) | Model-free | Value-based | Off-policy | Pseudo-deterministic (epsilon-greedy) | TD | Discrete | Discrete |
| SARSA, SARSA(λ) | Model-free | Value-based | On-policy | Pseudo-deterministic (epsilon-greedy) | TD | Discrete | Discrete |
| DQN (N-step DQN, Double DQN, Noisy Networks, Prioritized Replay Buffer DQN, Dueling DQN, Categorical DQN) | Model-free | Value-based | Off-policy | Pseudo-deterministic (epsilon-greedy) | | Discrete | Continuous |
| Cross-entropy method | Model-free | Policy-based | On-policy | | Monte Carlo | | |
| REINFORCE (vanilla policy gradient) | Model-free | Policy-based | On-policy | Stochastic policy | Monte Carlo | | |
| Policy gradient softmax | Model-free | | | Stochastic policy | | | |
| Natural Policy Gradient | Model-free | | | Stochastic policy | | | |
| TRPO | Model-free | | On-policy (?) | Stochastic policy | | Continuous | Continuous |
| PPO | Model-free | Policy-based | On-policy (?) | Stochastic policy | | Continuous | Continuous |
| A2C | Model-free | Policy-based | On-policy | Stochastic policy | TD | Continuous (?) | |
| A3C | | Policy-based | On-policy | | | | |
| DDPG (A2C family) | Model-free | Policy-based | Off-policy | Deterministic policy | | Continuous | Continuous |
| D4PG | | | | | | | |
| SAC | Model-free | Policy-based | Off-policy | | | | |
| Dyna-Q | | | | | | | |
| Curiosity Model | | | | | | | |
| NAF | Model-free | | | | | Continuous | |
| DAgger | | | | | | | |
| MCTS | | | | | | | |
| Dynamic programming | | | | | | | |
| GPS | | | | | | | |
| Model Predictive Control | Model-based | | | | | | |
| PILCO | Model-based | | | | | | |
| Policy search with Gaussian Process | Model-based | | | | | | |
| Policy search with backpropagation | Model-based | | | | | | |
1. Q-Learning
Q-learning is an off-policy, model-free RL algorithm based on the Bellman equation.
e.g. Q-learning(λ)
2. State-Action-Reward-State-Action (SARSA)
SARSA very much resembles Q-learning. The key difference is that SARSA is an on-policy algorithm: it updates its Q-values using the action actually taken by the current policy, rather than the greedy maximum.
e.g. SARSA(λ)
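The update rule makes the on-policy difference concrete: the target uses a_next, the action the policy actually chose, where Q-learning would use a max. A minimal sketch (interface and numbers are illustrative):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One SARSA step: bootstraps from Q[s_next, a_next], where a_next
    was selected by the behaviour policy itself (on-policy), instead of
    Q-learning's max over next-state actions."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = np.zeros((2, 2))
Q = sarsa_update(Q, s=0, a=0, r=1.0, s_next=1, a_next=1)
```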
3. Deep Q-Network (DQN)
DQN is a Q-learning method that uses a deep network to estimate Q, stabilized by a replay buffer and a target network.
e.g. N-step DQN, Double DQN, Noisy Networks, Prioritized Replay Buffer DQN, Dueling DQN, Categorical DQN
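The replay buffer is what lets DQN train off-policy on decorrelated minibatches of past transitions. A minimal sketch of the data structure (the class name and interface are my own, not from any particular library):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay: store (s, a, r, s_next, done)
    transitions and sample random minibatches, breaking the temporal
    correlation between consecutive environment steps."""
    def __init__(self, capacity=10000):
        # deque with maxlen silently evicts the oldest transition
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

The target network is a periodically synced copy of the Q-network used to compute the bootstrapped target, so the target does not shift at every gradient step.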
4. Cross-entropy method (CEM)
CEM takes "educated" random guesses at actions. It selects the top performers among the guesses and uses them as seeds for the next round of guessing. CEM is a model-free, policy-based, on-policy method.
5. Policy Gradients (PG)
It's a policy-based method, unlike Q-learning, which is value-based.
e.g. REINFORCE, PG softmax, Natural Policy Gradient, Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO)
Policy gradients vs DQN
Policy gradients:
- have better convergence properties
- are more effective in high-dimensional action spaces (or even continuous action spaces)
- don't have the problem of perceptual aliasing (so the agent doesn't get stuck repeating the same actions)
- can learn stochastic policies (i.e. they output a probability distribution over actions, as opposed to deterministic policies), which means:
  - No explicit exploration is needed. In Q-learning, we used an epsilon-greedy strategy to explore the environment and prevent our agent from getting stuck with a non-optimal policy. Now, with probabilities returned by the network, the exploration is performed automatically. In the beginning, the network is initialized with random weights and returns a uniform probability distribution. This distribution corresponds to random agent behavior.
  - No replay buffer is used. PG methods belong to the on-policy class of methods, which means that we can't train on data obtained from an old policy. This is both good and bad. The good part is that such methods usually converge faster. The bad side is that they usually require much more interaction with the environment than off-policy methods such as DQN.
  - No target network is needed. Here we use Q-values, but they're obtained from our experience in the environment. In DQN, we used the target network to break the correlation in the Q-value approximation, but we're not approximating it anymore. [1]
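The estimator behind REINFORCE is ∇θ log π(a|s) · Gₜ, where Gₜ is the discounted return from step t. A tabular softmax-policy sketch (the episode format and table shape are illustrative, not from a specific library):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def reinforce_grad(theta, episode, gamma=0.99):
    """REINFORCE gradient for a tabular softmax policy theta[s, a]:
    sum over the episode of grad log pi(a|s) * G_t, where 'episode'
    is a list of (state, action, reward) tuples."""
    grad = np.zeros_like(theta)
    G = 0.0
    # Work backwards so the discounted return G_t accumulates in one pass
    for s, a, r in reversed(episode):
        G = r + gamma * G
        probs = softmax(theta[s])
        dlog = -probs
        dlog[a] += 1.0                 # grad log softmax = one_hot(a) - probs
        grad[s] += dlog * G
    return grad

theta = np.zeros((2, 2))               # 2 states, 2 actions, illustrative
g = reinforce_grad(theta, [(0, 1, 1.0)])
```

Gradient ascent on `theta` with this estimate increases the probability of actions that preceded high returns, which is the Monte Carlo, on-policy behavior the list above describes.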
6. Actor-critic algorithms
This family of RL algorithms combines policy-based and value-based methods:
- The critic measures how good the action taken is (value-based)
- The actor controls how our agent behaves (policy-based)
e.g. Advantage Actor-Critic (A2C), Asynchronous Advantage Actor-Critic (A3C), Soft Actor-Critic (SAC), Deep Deterministic Policy Gradient (DDPG), Distributed Distributional Deep Deterministic Policy Gradients (D4PG)
DDPG is an actor-critic approach for continuous actions. It is in the A2C family, but it's off-policy and uses a deterministic policy. Unlike DQN, it can learn policies in high-dimensional, continuous action spaces.
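The actor/critic split can be shown in a single tabular update step (a deliberately simplified sketch of the advantage actor-critic idea, not DDPG itself, which uses deep networks and a deterministic policy; all names and numbers are illustrative):

```python
import numpy as np

def actor_critic_step(V, theta, s, a, r, s_next, done,
                      alpha_v=0.1, alpha_pi=0.01, gamma=0.99):
    """One tabular advantage actor-critic step: the critic V moves
    toward the TD target, and the actor's softmax logits theta move
    along grad log pi(a|s) weighted by the TD-error advantage."""
    td_target = r + (0.0 if done else gamma * V[s_next])
    advantage = td_target - V[s]       # critic's "how good was this action"
    V[s] += alpha_v * advantage        # critic update (value-based)
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    dlog = -probs
    dlog[a] += 1.0                     # grad log softmax = one_hot(a) - probs
    theta[s] += alpha_pi * advantage * dlog   # actor update (policy-based)
    return V, theta

V = np.zeros(2)                        # critic: state values
theta = np.zeros((2, 2))               # actor: softmax logits
V, theta = actor_critic_step(V, theta, s=0, a=0, r=1.0, s_next=1, done=True)
```

Using the advantage rather than the raw return is what reduces the variance of the policy gradient compared with plain REINFORCE.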
7. Dataset Aggregation (DAgger)
It uses human (expert) labels, collected on the states the learned policy actually visits, to improve imitation learning.
8. Monte Carlo Tree Search (MCTS)
It searches discrete action spaces using a search tree with an exploratory tree policy.
e.g. AlphaZero
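The tree policy is commonly UCT: each selection step balances a child's mean value against an exploration bonus for rarely visited children. A sketch of that selection rule (the `children` dict shape is an assumption for illustration, not from a specific library):

```python
import math

def ucb1_select(children, c=1.4):
    """UCT-style child selection: maximize mean value plus an
    exploration bonus that grows for under-visited children.
    children: list of dicts with 'visits' and 'value' keys."""
    total = sum(ch["visits"] for ch in children)

    def score(ch):
        if ch["visits"] == 0:
            return float("inf")        # always expand unvisited children first
        mean = ch["value"] / ch["visits"]
        bonus = c * math.sqrt(math.log(total) / ch["visits"])
        return mean + bonus

    return max(children, key=score)
```

Full MCTS alternates selection (above), expansion, rollout or value estimation, and backpropagation of results up the tree; AlphaZero replaces the rollout with a learned value network.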
9. Curiosity Model
10. Normalized Advantage Functions (NAF)
11. Model-based Reinforcement Learning
e.g. Model Predictive Control (MPC), Policy search with backpropagation, PILCO (probabilistic inference for learning control), Policy search with Gaussian Process, Guided Policy Search (GPS), Dyna-Q
Conclusion
We have just seen some of the most widely used RL algorithms. In the next article, we will look at the challenges and applications of RL in robotics.