Reinforcement Learning A-Z

It’s a list of concepts I found confusing, so it in no way gives you a comprehensive overview of RL. Check out the articles at the bottom.


A.  Q(s,a)

Q(s,a) is equal to the summation of immediate reward after performing action a while in state s and the discounted expected future reward. It represents how good a certain action is in given state.

Screen Shot 2018-05-08 at 8.12.07 PM.png



B. Difference between Value-iteration and Policy-iteration

In short:

Value iteration includes: finding optimal value function + one policy extraction.

Policy iteration includes: policy evaluation + policy improvement, and the two are repeated iteratively until policy converges. Policy iteration generally converges faster than value iteration.

Take a quick glance and move on.

enter image description here

Value Iteration :

We start with a random value function and then find a new (improved) value function in an iterative process, until reaching the optimal value function.

Value iteration computation of Bellman Equation:

Problems with Value Iteration :

  1. It’s slow – O(S^2A) per iteration
  2. The “max” at each state rarely changes
  3. The policy often converges long before the values

Policy Iteration :

Policy iteration computations of Bellman Equation:

  1. Evaluation : For fixed current policy p, find values with policy evaluation
    2.  Improvement : For fixed values, get a better policy using policy extraction (One-step look-ahead)
   3. Repeat step 1 and 2 until convergence


In value iteration:

  • Every iteration updates both the values and (implicitly) the policy
  • We don’t track the policy, but taking the max over actions implicitly recomputes it

In policy iteration:

  • We do several passes that update utilities with fixed policy (each pass is fast because we consider only one action, not all of them)
  • After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
  • The new policy will be better (or we’re done)


  1. Quora answer
  2. stackoverflow answer
  4. Difference between Value iteration and Policy iteration | Reinforced learning | MDP

C. Off-policy vs On-policy

SARSA updates its Q-values using the Q-value of the next state s′ and the action a’ chosen according to the current policy.  SARSA immediately takes action a’.
Q-learning gives a more optimistic Q(s,a) than SARSA since it assumes the future reward is the maximum of all future rewards despite not necessarily taking the action that leads to the maximum reward.

This is how Sarsa and Q-learning choose the next action (epsilon-greedy). Epsilon is usually set to 0.9, meaning that 90% of the time the Q-learning agent chooses the best action according to the Q-table, 10% of the time chooses a random action. It allows the agent to explore.

Our Q-table looks like this


def choose_action(observation):
    if np.random.uniform() < self.epsilon: 

# Choose the best action according to Q-table, find the highest value in row Q-table[observation,:].
# The column index is the action.
        state_action = self.q_table.loc[observation,:]

# For the same state (row), there might be multiple identical Q action value, so we shuffle it.
        state_action = state_action.reindex(np.random.permutation(state_action.index))
        action = state_action.argmax()

# Choose a random action
       action = np.random.choice((self.actions))
 Q-learning agent would choose the shortest (highest Q-value) but potentially more dangerous route because it takes the max(future reward), even if the action it actually takes leads to fire. It is off-policy because it doesn’t use the policy that it is improving on, aka, it doesn’t take the a’ in Q(s,a,s’,a’).
Think of it this way, Q-learning agent assumes that he will be rich for a long time, so he bought 10 Tesla, but he doesn’t necessarily choose the strategy that will sustain his spending habit.  Sarsa listens to the advice of his financial advisor (= uses the action prescribed by the current policy) and spends within the his projected next quarter earning. Q-learning is reckless, Sarsa is conservative.
Screen Shot 2018-05-08 at 8.55.06 PM.png

Morvan: Q-learning vs Sarsa (Chinese)

D. DQN:  Using NN to estimate Q(s,a) + Experience Replay



        Main innovation over Q-learning:  It has two separate networks: an action network and a target network. The action network selects action, its parameters are trainable.The target network is a history version of the action network, its parameters are untrainable. It is updated infrequently– every C steps.The target network computes the loss for every action during training.

       Why not use just use one network for both estimations? The issue is that at every step of training, the Q-network’s values shift, and if we are using a constantly shifting set of values to adjust our network values, then the value estimations can easily spiral out of control.  In order to make training more stable, the target network’s weights are fixed, and only periodically or slowly updated to the primary Q-networks values.

      Continuous observation space (target network) , discrete action space (action network).

Arthur Juliani’s Deep Q-Networks and Beyond


E. Value Function, Q Function, Advantage Function:


Screen Shot 2018-05-15 at 4.25.33 PM.png


Advantage: A = Q(s,a) – V(s)

1. Why are the q-values of different states very close to each other for a given action ?

This should be fairly obvious. Consider two different cases. In the first one, the fruit falls from the top left corner and in the second one, the fruit falls from the top right corner. The agent in both cases is at the center. Then, consider the action of moving left. This action will give a high q-value in the first case and a low q-value in the second case.

2. How to solve the action gap problem ?

We can compare the Q-value for each action with the average of them (V) so that we know how good an action is relative to each other.

View story at

View story at



F. Policy Gradient: continuous action spaces

Screen Shot 2018-05-15 at 4.27.33 PM.png

Read this first.


Main innovation: continuous action spaces.

pi is the policy. log(Policy(s,a)) signifies how surprised state s is at the chosen action a. If the probability of Policy(s,a) is small, -log(Policy(s,a)) is huge, meaning that the state s is very surprised. If we get a big V (reward), it implies that by choosing a rare move a, we get a better-than-expected reward, we need to make substantial changes to the parameters.

Morvan: Policy Gradient (Chinese)


G. Actor- critic

Must read:

Screen Shot 2018-05-15 at 6.40.43 PM.png



The actor produces an action given the current state of the environment, and the critic produces a TD (Temporal-Difference) error signal given the state and resultant reward. If the critic is estimating the action-value function Q(s,a), it will also need the output of the actor. The output of the critic drives learning in both the actor and the critic.

H.What’s the difference between Reinforce and Actor-Critic?

a. Difference between policy gradient and TD

Screen Shot 2018-05-15 at 6.34.16 PM.png

H. Deep Deterministic Policy Gradient = DQN + Actor-Critic



I. Policy Optimization 

1.Trust Region Policy Optimization: Policy gradient objective function (Open AI 2015)

The surrogate loss in TRPO is a lower bound of the original objective – the expected cumulative return of the policy.

Use KL divergence between the old policy and updated policy as a measurement for trust region.

“Trust-region methods define a region around the current iterative within which they trust the model to be an adequate representation of the objective function, and then choose the step to be the approximate minimizer of the model in this region”. Intuitively, during our optimization procedure, after we decided the gradient direction, when doing line search we want to constrain our step length to be within a “trust region” so that the local estimation of the gradient/curvature remains to be “trusted”.

Trust Region Policy Optimization

2. KL Divergence

Screen Shot 2018-05-09 at 12.31.59 AM.png

Screen Shot 2018-05-09 at 12.32.17 AM.png

Kullback-Leibler Divergence Explained

3.  Proximal Policy Optimization (Open AI  2017)

Screen Shot 2018-05-09 at 12.43.03 AM.png

Open AI



J. Asynchronous Actor-Critic Agents (A3C)

Asynchronous Advantage Actor-Critic (A3C) (Tensorflow)Once the worker’s experience history is large enough, we use it to determine discounted return and advantage, and use those to calculate value and policy losses. We also calculate an entropy (H) of the policy. This corresponds to the spread of action probabilities. If the policy outputs actions with relatively similar probabilities, then entropy will be high, but if the policy suggests a single action with a large probability then entropy will be low. We use the entropy as a means of improving exploration, by encouraging the model to be conservative regarding its sureness of the correct action.

Value Loss: L = Σ(R – V(s))²

Policy Loss: L = -log(π(s)) * A(s) – β*H(π)



General References:

1.Introduction to Various Reinforcement Learning Algorithms. Part I (Q-Learning, SARSA, DQN, DDPG)(Recommend!!)

2. Introduction to Various Reinforcement Learning Algorithms. Part II (TRPO, PPO)

3. Probably the most comprehensive RL tutorial on Github

4.Great RL tutorial in Chinese

View story at




View story at

View story at


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s