It’s a list of concepts I found confusing, so it is in no way a comprehensive overview of RL. Check out the articles at the bottom.
Q(s,a) is the sum of the immediate reward for performing action a in state s and the discounted expected future reward. It represents how good a certain action is in a given state.
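This can be sketched as a one-step Bellman backup (a minimal sketch; the states, actions, and numbers below are made up for illustration):

```python
# One-step Bellman backup:
# Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * max_a' Q(s', a')

def q_backup(reward, next_state_probs, q_next, gamma=0.9):
    """reward: immediate reward R(s, a)
    next_state_probs: dict {s': P(s'|s, a)}
    q_next: dict {s': {a': Q(s', a')}}"""
    expected_future = sum(
        p * max(q_next[s2].values()) for s2, p in next_state_probs.items()
    )
    return reward + gamma * expected_future

# Tiny made-up example: action a from state s can lead to s1 or s2.
q_next = {"s1": {"left": 1.0, "right": 0.0}, "s2": {"left": 0.0, "right": 2.0}}
q_sa = q_backup(reward=1.0, next_state_probs={"s1": 0.5, "s2": 0.5}, q_next=q_next)
# 1.0 + 0.9 * (0.5 * 1.0 + 0.5 * 2.0) = 2.35
```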
B. Difference between Value-iteration and Policy-iteration
Value iteration includes: finding optimal value function + one policy extraction.
Policy iteration includes: policy evaluation + policy improvement, and the two are repeated iteratively until policy converges. Policy iteration generally converges faster than value iteration.
Take a quick glance and move on.
Value Iteration :
We start with a random value function and then find a new (improved) value function in an iterative process, until reaching the optimal value function.
Value iteration computation of Bellman Equation:
Problems with Value Iteration :
- It’s slow – O(S^2A) per iteration
- The “max” at each state rarely changes
- The policy often converges long before the values
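The iterative process (plus the one policy extraction at the end) can be sketched on a tiny made-up 2-state MDP; the transition table and rewards are invented for illustration:

```python
# Minimal value-iteration sketch.
# P[s][a] is a list of (prob, next_state, reward) transitions.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
gamma = 0.5
V = {s: 0.0 for s in P}  # start from an arbitrary (here zero) value function
for _ in range(100):  # iterate until (approximate) convergence
    # Bellman update: take the max over ALL actions in every state (the slow part)
    V = {
        s: max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        for s in P
    }
# One policy extraction at the end: act greedily w.r.t. the optimal values.
policy = {
    s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
    for s in P
}
```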
Policy Iteration :
Policy iteration computations of Bellman Equation:
- Evaluation: for the fixed current policy π, find the state values with policy evaluation
- Improvement: for the fixed values, get a better policy with a one-step look-ahead
In value iteration:
- Every iteration updates both the values and (implicitly) the policy
- We don’t track the policy, but taking the max over actions implicitly recomputes it
In policy iteration:
- We do several passes that update utilities with fixed policy (each pass is fast because we consider only one action, not all of them)
- After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
- The new policy will be better (or we’re done)
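The evaluation/improvement loop can be sketched on a made-up 2-state MDP (same transition-table convention as a standard tabular setup; all numbers are invented). Note that each evaluation pass considers only the one action the policy prescribes:

```python
# Minimal policy-iteration sketch.
# P[s][a] is a list of (prob, next_state, reward) transitions.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
gamma = 0.5
policy = {0: "stay", 1: "go"}  # start from an arbitrary policy
while True:
    # Policy evaluation: fixed policy, one action per state (fast passes).
    V = {s: 0.0 for s in P}
    for _ in range(100):
        V = {
            s: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            for s in P
        }
    # Policy improvement: greedy one-step look-ahead over all actions
    # (slow, like one value-iteration pass).
    new_policy = {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        for s in P
    }
    if new_policy == policy:  # no change, so we're done
        break
    policy = new_policy
```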
C. Off-policy vs On-policy
This is how Sarsa and Q-learning choose the next action (epsilon-greedy). Here epsilon is set to 0.9, meaning that 90% of the time the agent chooses the best action according to the Q-table, and 10% of the time it chooses a random action; the random actions let the agent explore. (Note that in the more common convention, epsilon denotes the small exploration probability instead.)
Our Q-table looks like this
```python
def choose_action(self, observation):
    if np.random.uniform() < self.epsilon:
        # Choose the best action according to the Q-table: find the highest
        # value in row q_table[observation, :]. The column index is the action.
        state_action = self.q_table.loc[observation, :]
        # For the same state (row) there might be multiple actions with identical
        # Q-values, so we shuffle before taking the max to break ties randomly.
        state_action = state_action.reindex(np.random.permutation(state_action.index))
        action = state_action.idxmax()
    else:
        # Choose a random action
        action = np.random.choice(self.actions)
    return action
```
D. DQN: Using NN to estimate Q(s,a) + Experience Replay
Main innovation over Q-learning: it has two separate networks, an action network and a target network. The action network selects actions, and its parameters are trainable. The target network is a delayed copy of the action network; its parameters are not trained directly, and it is updated infrequently, every C steps. The target network supplies the Q-value targets used to compute the loss during training.
Why not just use one network for both estimations? The issue is that at every training step the Q-network’s values shift, and if we use a constantly shifting set of values as targets, the value estimates can easily spiral out of control. To make training more stable, the target network’s weights are fixed, and only periodically (or slowly) updated to the primary Q-network’s values.
DQN handles a continuous observation space, but requires a discrete action space (the network outputs one Q-value per action).
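The periodic target update can be sketched framework-agnostically; the weight shapes, learning-rate stand-in, and step counts here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
q_weights = rng.normal(size=(4, 2))   # trainable action-network weights
target_weights = q_weights.copy()     # frozen copy used to compute Q-targets
C = 100                               # sync period (hyperparameter)

for step in range(1, 301):
    # Stand-in for a gradient step on the action network.
    q_weights += 0.01 * rng.normal(size=q_weights.shape)
    if step % C == 0:
        # Hard update: copy the action network into the target network every C steps.
        target_weights = q_weights.copy()

# Between syncs the target stays fixed, so the Q-targets don't chase a moving network.
```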
E. Value Function, Q Function, Advantage Function:
Advantage: A(s,a) = Q(s,a) – V(s)
1. Why are the q-values of different states very close to each other for a given action?
This should be fairly obvious. Consider two different cases. In the first one, the fruit falls from the top left corner and in the second one, the fruit falls from the top right corner. The agent in both cases is at the center. Then, consider the action of moving left. This action will give a high q-value in the first case and a low q-value in the second case.
2. How to solve the action gap problem?
We can compare the Q-value of each action with their average (V), so that we know how good each action is relative to the others.
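As a minimal sketch (the Q-values below are made up): subtracting the per-state average from each action's Q-value leaves only the relative quality of each action, even when the absolute values are all close.

```python
import numpy as np

# Made-up Q-values for the actions in one state; note they are all near 10,
# because the state's value dominates the per-action differences.
q = np.array([10.0, 10.5, 9.8])
v = q.mean()            # V(s) approximated here as the average over actions
advantage = q - v       # A(s, a) = Q(s, a) - V(s)
best = int(advantage.argmax())  # the relative ranking is now easy to read off
```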
F. Policy Gradient: continuous action spaces
Main innovation: continuous action spaces.
pi is the policy.
-log(Policy(s,a)) signifies how surprised the policy is at choosing action a in state s: if Policy(s,a) is small, -log(Policy(s,a)) is huge, meaning the chosen action was very rare. If we then get a big V (reward), it implies that by choosing a rare move a we got a better-than-expected reward, so we need to make substantial changes to the parameters.
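The "surprise × reward" idea can be sketched in a couple of lines (the probabilities and the function name are made up for illustration):

```python
import numpy as np

def pg_loss(prob_of_chosen_action, reward):
    # -log(pi(a|s)) is large when the chosen action was rare ("surprising").
    # Scaling by the reward makes surprising-but-rewarding actions
    # drive large parameter updates.
    return -np.log(prob_of_chosen_action) * reward

common = pg_loss(0.9, 1.0)   # expected action: small gradient signal (~0.105)
rare = pg_loss(0.05, 1.0)    # rare action, same reward: large signal (~3.0)
```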
G. Actor-Critic
The actor produces an action given the current state of the environment, and the critic produces a TD (Temporal-Difference) error signal given the state and resultant reward. If the critic is estimating the action-value function Q(s,a), it will also need the output of the actor. The output of the critic drives learning in both the actor and the critic.
H. What’s the difference between Reinforce and Actor-Critic?
a. Difference between policy gradient and TD
H. Deep Deterministic Policy Gradient = DQN + Actor-Critic
I. Policy Optimization
1. Trust Region Policy Optimization: policy gradient objective function (OpenAI, 2015)
The surrogate loss in TRPO is a lower bound of the original objective – the expected cumulative return of the policy.
Use the KL divergence between the old policy and the updated policy as the measure of the trust region.
“Trust-region methods define a region around the current iterate within which they trust the model to be an adequate representation of the objective function, and then choose the step to be the approximate minimizer of the model in this region”. Intuitively, during the optimization procedure, after we have decided on the gradient direction, when doing the line search we want to constrain the step length to lie within a “trust region”, so that the local estimate of the gradient/curvature remains trustworthy.
2. KL Divergence
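For discrete action distributions, the KL divergence used above is just a weighted log-ratio; a minimal sketch (the two policies' probabilities are made up):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

old_policy = [0.5, 0.3, 0.2]  # made-up action probabilities for one state
new_policy = [0.4, 0.4, 0.2]
d = kl_divergence(old_policy, new_policy)
# KL is non-negative, asymmetric, and zero only when the distributions match,
# which makes it a natural "how far did the policy move" measure.
```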
3. Proximal Policy Optimization (OpenAI, 2017)
J. Asynchronous Actor-Critic Agents (A3C)
Once the worker’s experience history is large enough, we use it to compute the discounted return and advantage, and use those to calculate the value and policy losses. We also calculate the entropy (H) of the policy, which corresponds to the spread of the action probabilities: if the policy outputs actions with relatively similar probabilities, entropy is high, but if it puts a large probability on a single action, entropy is low. We use the entropy term as a means of improving exploration, by discouraging the model from becoming prematurely sure of the correct action.
Value Loss: L = Σ(R – V(s))²
Policy Loss: L = -log(π(s)) * A(s) – β*H(π)
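The two losses above can be sketched directly with numpy (the function name, the two-step trajectory, and all numbers are made up for illustration):

```python
import numpy as np

def a3c_losses(discounted_returns, values, action_probs, chosen, beta=0.01):
    """Sketch of the losses above.
    discounted_returns: R for each step; values: V(s) estimates;
    action_probs: policy distribution per step; chosen: index of action taken."""
    advantages = discounted_returns - values   # A(s); held constant in the policy loss
    value_loss = np.sum((discounted_returns - values) ** 2)
    log_pi = np.log(action_probs[np.arange(len(chosen)), chosen])
    entropy = -np.sum(action_probs * np.log(action_probs), axis=1)  # H(pi) per step
    policy_loss = np.sum(-log_pi * advantages - beta * entropy)
    return value_loss, policy_loss

R = np.array([1.0, 0.5])                  # discounted returns
V = np.array([0.8, 0.6])                  # value estimates
probs = np.array([[0.7, 0.3], [0.5, 0.5]])  # [0.5, 0.5] is max-entropy for 2 actions
chosen = np.array([0, 1])
vl, pl = a3c_losses(R, V, probs, chosen)
```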