Recall that the goal of Reinforcement Learning is:

$$\theta^* = \arg\max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^T r(s_t, a_t)\right] = \arg\max_\theta J(\theta)$$

The objective $J(\theta)$ can be evaluated from samples as follows:

$$J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T r(s_{i,t}, a_{i,t})$$

i.e. we collect $N$ trajectories by running policy $\pi_\theta$ and estimate the value of the policy $J(\theta)$. This is essentially a Monte-Carlo estimate. Letting $r(\tau) = \sum_{t=1}^T r(s_t, a_t)$, we can find the gradient $\nabla_\theta J(\theta)$:

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, r(\tau)\, d\tau = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau$$

We can use the identity $p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau) = \nabla_\theta p_\theta(\tau)$ to simplify the above equation as follows:

$$\nabla_\theta J(\theta) = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\right]$$

We already know that:

$$p_\theta(\tau) = p(s_1)\prod_{t=1}^T \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$$

so $\nabla_\theta \log p_\theta(\tau) = \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t|s_t)$, since the initial-state and transition terms do not depend on $\theta$. Thus the gradient becomes

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t|s_t)\right)\left(\sum_{t=1}^T r(s_t, a_t)\right)\right] \approx \frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\right)$$
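Before moving on, it may help to see the log-derivative trick in isolation. Here is a minimal NumPy sketch (a toy example, not from the derivation above) on a 1-D problem where the true gradient is known in closed form; the "policy" is $\mathcal{N}(\theta, 1)$ and the "reward" is $x^2$:

```python
import numpy as np

# Toy check of the score-function / log-derivative estimator:
# estimate d/dtheta E_{x ~ N(theta, 1)}[x^2].
# Analytically E[x^2] = theta^2 + 1, so the true gradient is 2*theta.
rng = np.random.default_rng(0)
theta = 1.5
N = 200_000

x = rng.normal(loc=theta, scale=1.0, size=N)   # samples from p_theta
reward = x ** 2                                # r(x)
score = x - theta                              # grad_theta log N(x | theta, 1)

grad_estimate = np.mean(score * reward)        # E[ grad log p_theta(x) * r(x) ]
print(grad_estimate, 2 * theta)                # ~3.0 vs 3.0
```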
Looking back at the anatomy of RL, the first part of the equation corresponds to the trajectory collection phase and the last part corresponds to the value estimation part. The policy update part is essentially:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$$

where $\alpha$ is the learning rate.
REINFORCE Algorithm
Goal: Obtain the optimal policy $\pi_{\theta^*}$, where $\theta^* = \arg\max_\theta J(\theta)$
While not converged:
1. Collect trajectories $\{\tau_i\}_{i=1}^N$ by executing policy $\pi_\theta(a_t|s_t)$ in the environment
2. Estimate $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\right)$
3. Update $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$, i.e. take a step along the positive gradient
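To make the loop concrete, here is a rough REINFORCE sketch in PyTorch. It assumes a Gymnasium-style discrete-action environment `env` and the `Policy` class defined here (both hypothetical, not part of the notes), and implements the three steps above by minimizing the surrogate loss $-\frac{1}{N}\sum_i \left(\sum_t \log\pi_\theta(a_{i,t}|s_{i,t})\right) r(\tau_i)$, whose gradient is the negative of the REINFORCE estimator:

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Small categorical policy: obs -> distribution over discrete actions."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def reinforce_update(policy, optimizer, env, n_traj=10, horizon=100):
    """One policy-gradient step: collect trajectories, estimate grad J, ascend."""
    losses = []
    for _ in range(n_traj):                                  # 1. collect trajectories
        obs, _ = env.reset()
        log_probs, rewards = [], []
        for _ in range(horizon):
            dist = policy(torch.as_tensor(obs, dtype=torch.float32))
            action = dist.sample()
            obs, reward, terminated, truncated, _ = env.step(action.item())
            log_probs.append(dist.log_prob(action))
            rewards.append(reward)
            if terminated or truncated:
                break
        r_tau = sum(rewards)                                 # total trajectory reward
        # 2. surrogate whose gradient is -(sum_t grad log pi) * r(tau)
        losses.append(-torch.stack(log_probs).sum() * r_tau)
    optimizer.zero_grad()
    torch.stack(losses).mean().backward()                    # 3. estimate the gradient
    optimizer.step()                                         #    and take a step
```

A typical usage would be something like `policy = Policy(obs_dim, n_actions)`, `opt = torch.optim.Adam(policy.parameters(), lr=1e-2)`, and calling `reinforce_update(policy, opt, env)` in a loop until the return stops improving.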
Comments:
- Recall that the gradient of the maximum likelihood estimate (MLE) objective, i.e. what we optimize with ERM in behavior cloning (BC), is $\nabla_\theta J_{\text{MLE}}(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})$. Comparing with the PG gradient, we can see that PG has an extra term $r(\tau_i)$ which is “weighing” the action log-likelihoods given states. Thus PG can be thought of as a “trial-and-error” algorithm: actions which lead to more rewarding trajectories are given more importance during the gradient update, reinforcing the behavior of taking more rewarding actions conditional on states.
- PG algorithms have very high variance. As a running example, consider a reward of +1 for each time-step survived (fixed horizon length of 100), plus an extra +1 only if the agent reaches the goal. Now consider an alternate reward where each survived step gives 0 instead. PG works better in the latter situation: the former weighs every bad trajectory by 100 and every good trajectory by 101 (nearly indistinguishable), whereas the latter weighs bad trajectories by 0 and good trajectories by 1.
- To reduce variance, we can simply introduce the concept of baselines:

  $$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \nabla_\theta \log p_\theta(\tau_i)\left[r(\tau_i) - b\right]$$

  which is essentially just subtracting a baseline value $b$ to give more importance to good trajectories; in the running example, if we subtract $b = 100$, then good trajectories get weight 1 and everything else gets weight 0 (see the sketch after this list). We can simply set $b = \frac{1}{N}\sum_{i=1}^N r(\tau_i)$ since:

  $$\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, b\right] = \int p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)\, b\, d\tau = b \int \nabla_\theta p_\theta(\tau)\, d\tau = b\, \nabla_\theta \int p_\theta(\tau)\, d\tau = b\, \nabla_\theta 1 = 0$$

  which essentially means that subtracting a baseline leaves the gradient estimate unbiased in expectation. The baseline can also be learned during training (using back-propagation).
- One other way to reduce variance is by noticing causality when summing the rewards, i.e. the policy at time $t$ cannot affect the rewards obtained at time $t'$ where $t' < t$. Thus the gradient estimate can be modified as:

  $$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})\right)$$

  where $\hat{Q}_{i,t} = \sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})$ (the "reward-to-go") is essentially the MC estimate of the state-action value function. Now notice that in this situation we cannot introduce a constant baseline $b$, since for each state the baseline will be different. For our running example, at $t = 0$ the baseline value should be about 100, at $t = 50$ it should be about 50, and finally at $t = 100$ it should be 0. Therefore there is a need to learn state-dependent baselines:

  $$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(\hat{Q}_{i,t} - V(s_{i,t})\right)$$
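The following small NumPy sketch (toy numbers, not from the notes) illustrates the running example. The per-trajectory scores $\nabla_\theta \log p_\theta(\tau_i)$ are stood in for by random unit-variance numbers `g`, which is enough to compare the different weightings:

```python
import numpy as np

# Compare estimator variance for: no baseline, constant baseline, and
# reward-to-go minus a crude state-dependent baseline.
rng = np.random.default_rng(0)
T, N = 100, 10_000
rewards = np.ones((N, T))                 # running example: +1 per step survived
reached_goal = rng.random(N) < 0.5
rewards[reached_goal, -1] += 1.0          # extra +1 for reaching the goal

g = rng.normal(size=N)                    # stand-in for grad log p_theta(tau_i)
r_tau = rewards.sum(axis=1)               # r(tau_i): 100 or 101
b = r_tau.mean()                          # constant (average-reward) baseline

print("Var without baseline:", np.var(g * r_tau))          # ~ 100^2
print("Var with baseline   :", np.var(g * (r_tau - b)))    # ~ 0.25

# Reward-to-go Q_hat_{i,t} and a crude state-dependent baseline V(s_t),
# estimated as the average reward-to-go at each time-step.
q_hat = rewards[:, ::-1].cumsum(axis=1)[:, ::-1]
v_t = q_hat.mean(axis=0)
advantage = q_hat - v_t                   # what multiplies each log-prob term
print("Largest |advantage| :", np.abs(advantage).max())    # ~0.5 instead of ~101
```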
Variance Analysis (for point 3)
Starting from the policy gradient and including a baseline, we get

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\left(r(\tau) - b\right)\right]$$

Now the variance is:

$$\text{Var} = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\nabla_\theta \log p_\theta(\tau)\left(r(\tau) - b\right)\right)^2\right] - \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\left(r(\tau) - b\right)\right]^2$$

The second term equals $\mathbb{E}\left[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\right]^2$ since baselines are unbiased in expectation, so it does not depend on $b$. Writing $g(\tau) = \nabla_\theta \log p_\theta(\tau)$ and setting the derivative of the first term to zero:

$$\frac{d\,\text{Var}}{db} = \frac{d}{db}\,\mathbb{E}\left[g(\tau)^2\left(r(\tau) - b\right)^2\right] = -2\,\mathbb{E}\left[g(\tau)^2 r(\tau)\right] + 2b\,\mathbb{E}\left[g(\tau)^2\right] = 0 \;\;\Rightarrow\;\; b = \frac{\mathbb{E}\left[g(\tau)^2\, r(\tau)\right]}{\mathbb{E}\left[g(\tau)^2\right]}$$

which is essentially just the expected reward, weighted by gradient magnitudes.
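As a quick sanity check, the following NumPy sketch (toy scalar-parameter setting, not from the notes) compares the estimator variance under the average-reward baseline and the optimal baseline $b^* = \mathbb{E}[g^2 r]/\mathbb{E}[g^2]$; the two differ whenever $g^2$ is correlated with $r$:

```python
import numpy as np

# Empirical comparison of the average-reward baseline vs. the optimal baseline.
rng = np.random.default_rng(0)
N = 200_000
g = rng.normal(size=N)                          # stand-in for grad log p_theta(tau)
r = 100.0 + 3.0 * g**2 + rng.normal(size=N)     # rewards correlated with the squared score

b_avg = r.mean()
b_opt = np.mean(g**2 * r) / np.mean(g**2)       # optimal baseline from the derivation

print("Var, average-reward baseline:", np.var(g * (r - b_avg)))
print("Var, optimal baseline       :", np.var(g * (r - b_opt)))
```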
Off-policy Policy Gradients
The problem with the above algorithm is that it is completely on-policy, i.e. once the update happens, all the collected trajectories are discarded and new trajectories must be collected to perform the next policy gradient update. It would be more efficient if we did not have to discard all the previously collected trajectories, since collecting them took effort. Hence, we look at ways to adapt the PG algorithm to the off-policy case.
We start with Importance Sampling. Say we have an expectation of some function $f(x)$ of $x$ w.r.t. a distribution $p(x)$:

$$\mathbb{E}_{x \sim p(x)}\left[f(x)\right] = \int p(x)\, f(x)\, dx = \int q(x)\,\frac{p(x)}{q(x)}\, f(x)\, dx = \mathbb{E}_{x \sim q(x)}\left[\frac{p(x)}{q(x)}\, f(x)\right]$$

We can choose any distribution $q(x)$ (it’s under our control) as long as its support covers that of $p(x)$, i.e. $q(x) > 0$ for all $x$ where $p(x) > 0$.
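A minimal importance-sampling sketch (toy example, not from the notes): estimate $\mathbb{E}_{x \sim p}[x^2]$ with $p = \mathcal{N}(0, 1)$, using samples drawn only from $q = \mathcal{N}(1, 2)$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
f = lambda x: x ** 2                      # E_p[f] = 1 for a standard normal

x = rng.normal(loc=1.0, scale=2.0, size=200_000)                 # samples from q, not p
w = stats.norm.pdf(x, 0.0, 1.0) / stats.norm.pdf(x, 1.0, 2.0)    # weights p(x)/q(x)

print("importance-sampled estimate:", np.mean(w * f(x)))   # ~1.0
print("naive average of q samples :", np.mean(f(x)))       # ~5.0, i.e. badly biased
```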
Assuming we have trajectories from some other distribution $\bar{p}(\tau)$ instead of $p_\theta(\tau)$, we can modify the PG objective as follows:

$$J(\theta) = \mathbb{E}_{\tau \sim \bar{p}(\tau)}\left[\frac{p_\theta(\tau)}{\bar{p}(\tau)}\, r(\tau)\right]$$

Now,

$$\frac{p_\theta(\tau)}{\bar{p}(\tau)} = \frac{p(s_1)\prod_{t=1}^T \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)}{p(s_1)\prod_{t=1}^T \bar{\pi}(a_t|s_t)\, p(s_{t+1}|s_t, a_t)} = \frac{\prod_{t=1}^T \pi_\theta(a_t|s_t)}{\prod_{t=1}^T \bar{\pi}(a_t|s_t)}$$

i.e. the unknown initial-state and transition probabilities cancel, and the importance weight only involves the two policies.
Let’s assume that our current policy (after the $k$-th policy update), which is used for collecting new transitions, is parameterized by $\theta'$, whereas the policies that collected the previous samples (which are stored in a buffer) are parameterized by $\theta$. In doing so we are overloading some notation, so we will simply say that the previously collected dataset was collected by some policy $\pi_\theta$ and the current policy is $\pi_{\theta'}$ (note that by convention objects in the past are denoted by a letter and for corresponding future objects a prime is added, e.g. if the current state is $s$, the next state is often denoted $s'$). Therefore we would like to update $\theta'$ using samples from $\pi_\theta$.
Replacing $\bar{p}$ with $p_\theta$ and expanding the remaining terms we get:

$$\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}\,\nabla_{\theta'} \log p_{\theta'}(\tau)\, r(\tau)\right] = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\prod_{t=1}^T \frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\right)\left(\sum_{t=1}^T \nabla_{\theta'} \log \pi_{\theta'}(a_t|s_t)\right)\left(\sum_{t=1}^T r(s_t, a_t)\right)\right]$$
Now, in order to decrease the variance of the estimates, we can modify the above by again introducing the notion of causality. We have already seen that actions cannot affect rewards obtained before them (the reward-to-go trick). In this case notice further that the importance weights also have a notion of causality involved, i.e. the importance weight at time $t$ is not affected by future actions:

$$\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^T \nabla_{\theta'} \log \pi_{\theta'}(a_t|s_t)\left(\prod_{t'=1}^{t} \frac{\pi_{\theta'}(a_{t'}|s_{t'})}{\pi_\theta(a_{t'}|s_{t'})}\right)\left(\sum_{t'=t}^T r(s_{t'}, a_{t'})\left(\prod_{t''=t}^{t'} \frac{\pi_{\theta'}(a_{t''}|s_{t''})}{\pi_\theta(a_{t''}|s_{t''})}\right)\right)\right]$$
If we ignore the final term (the future importance weights multiplying the rewards-to-go), we can recover an algorithm called policy iteration (cite) which has some nice convergence guarantees. However, even in the reduced expression

$$\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^T \nabla_{\theta'} \log \pi_{\theta'}(a_t|s_t)\left(\prod_{t'=1}^{t} \frac{\pi_{\theta'}(a_{t'}|s_{t'})}{\pi_\theta(a_{t'}|s_{t'})}\right)\hat{Q}_t\right]$$

notice that the remaining product of importance weights is exponential in $t$, so the estimator degenerates (the weights explode or vanish) and it doesn’t work very well for long-horizon tasks. Hence we can look into something like a “first-order” approximation, which is essentially just the on-policy algorithm with per-state-action importance weights:

$$\nabla_{\theta'} J(\theta') \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \frac{\pi_{\theta'}(s_{i,t}, a_{i,t})}{\pi_\theta(s_{i,t}, a_{i,t})}\,\nabla_{\theta'} \log \pi_{\theta'}(a_{i,t}|s_{i,t})\,\hat{Q}_{i,t} = \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \frac{\pi_{\theta'}(s_{i,t})}{\pi_\theta(s_{i,t})}\,\frac{\pi_{\theta'}(a_{i,t}|s_{i,t})}{\pi_\theta(a_{i,t}|s_{i,t})}\,\nabla_{\theta'} \log \pi_{\theta'}(a_{i,t}|s_{i,t})\,\hat{Q}_{i,t}$$

It is reasonable to assume that the state marginals $\frac{\pi_{\theta'}(s_{i,t})}{\pi_\theta(s_{i,t})}$ are similar when $\theta$ and $\theta'$ are close (cite), so this ratio can be ignored, leaving only the per-action importance weight.
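The per-action importance weight is easy to implement from stored behaviour log-probabilities. The following PyTorch snippet (hypothetical tensor shapes and a `policy` interface matching the earlier sketch, not from the notes) builds a surrogate loss whose gradient is exactly the first-order estimator above, since $\nabla_{\theta'} \frac{\pi_{\theta'}}{\pi_\theta} = \frac{\pi_{\theta'}}{\pi_\theta}\,\nabla_{\theta'} \log \pi_{\theta'}$:

```python
import torch

def off_policy_pg_loss(policy, obs, actions, reward_to_go, behavior_log_probs):
    """First-order off-policy PG surrogate.

    obs: [B, obs_dim], actions: [B], reward_to_go: [B],
    behavior_log_probs: [B] values of log pi_theta(a|s) recorded at collection time.
    """
    dist = policy(obs)                                      # current policy pi_theta'
    new_log_probs = dist.log_prob(actions)
    ratio = torch.exp(new_log_probs - behavior_log_probs)   # per-action importance weight
    # Minimizing this surrogate gives a gradient of
    #   -(1/B) sum_t ratio_t * grad log pi_theta'(a_t|s_t) * Q_hat_t
    return -(ratio * reward_to_go).mean()
```

This surrogate-objective view is the same starting point used by algorithms such as PPO, which additionally clip the ratio to keep $\theta'$ from drifting too far from $\theta$.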