Recall that the goal of Reinforcement Learning is:

$$\theta^* = \arg\max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^T r(s_t, a_t)\right] = \arg\max_\theta J(\theta)$$

The objective $J(\theta)$ can be evaluated from samples as follows:

$$J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T r(s_{i,t}, a_{i,t})$$

i.e. we collect $N$ trajectories by running policy $\pi_\theta$ and estimate the value of the policy $J(\theta)$. This is essentially a Monte-Carlo estimate. Letting $r(\tau) = \sum_{t=1}^T r(s_t, a_t)$, we can find the gradient $\nabla_\theta J(\theta)$:

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, r(\tau)\, d\tau = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau$$

We can use the identity $p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau) = \nabla_\theta p_\theta(\tau)$ to simplify the above equation as follows:

$$\nabla_\theta J(\theta) = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\right]$$

We already know that:

$$p_\theta(\tau) = p(s_1)\prod_{t=1}^T \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$$

so $\nabla_\theta \log p_\theta(\tau) = \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t|s_t)$, since the initial-state and transition terms do not depend on $\theta$. Thus the gradient becomes

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t|s_t)\right)\left(\sum_{t=1}^T r(s_t, a_t)\right)\right] \approx \frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\right)$$
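Before moving on, it may help to see the log-derivative trick in isolation. Here is a minimal NumPy sketch (a toy example, not from the derivation above) on a 1-D problem where the true gradient is known in closed form; the "policy" is $\mathcal{N}(\theta, 1)$ and the "reward" is $x^2$:

```python
import numpy as np

# Toy check of the score-function / log-derivative estimator:
# estimate d/dtheta E_{x ~ N(theta, 1)}[x^2].
# Analytically E[x^2] = theta^2 + 1, so the true gradient is 2*theta.
rng = np.random.default_rng(0)
theta = 1.5
N = 200_000

x = rng.normal(loc=theta, scale=1.0, size=N)   # samples from p_theta
reward = x ** 2                                # r(x)
score = x - theta                              # grad_theta log N(x | theta, 1)

grad_estimate = np.mean(score * reward)        # E[ grad log p_theta(x) * r(x) ]
print(grad_estimate, 2 * theta)                # ~3.0 vs 3.0
```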
Looking back at the anatomy of RL, the first part of the equation corresponds to the trajectory collection phase and the last part corresponds to the value estimation part. The policy update part is essentially:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$$

where $\alpha$ is the learning rate.
REINFORCE Algorithm
Goal: Obtain the optimal policy $\pi_{\theta^*}$, where $\theta^* = \arg\max_\theta J(\theta)$
While not converged:
1. Collect trajectories $\{\tau_i\}_{i=1}^N$ by executing policy $\pi_\theta(a_t|s_t)$ in the environment
2. Estimate $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\right)$
3. Update $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$, i.e. take a step along the positive gradient
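To make the loop concrete, here is a rough REINFORCE sketch in PyTorch. It assumes a Gymnasium-style discrete-action environment `env` and the `Policy` class defined here (both hypothetical, not part of the notes), and implements the three steps above by minimizing the surrogate loss $-\frac{1}{N}\sum_i \left(\sum_t \log\pi_\theta(a_{i,t}|s_{i,t})\right) r(\tau_i)$, whose gradient is the negative of the REINFORCE estimator:

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Small categorical policy: obs -> distribution over discrete actions."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def reinforce_update(policy, optimizer, env, n_traj=10, horizon=100):
    """One policy-gradient step: collect trajectories, estimate grad J, ascend."""
    losses = []
    for _ in range(n_traj):                                  # 1. collect trajectories
        obs, _ = env.reset()
        log_probs, rewards = [], []
        for _ in range(horizon):
            dist = policy(torch.as_tensor(obs, dtype=torch.float32))
            action = dist.sample()
            obs, reward, terminated, truncated, _ = env.step(action.item())
            log_probs.append(dist.log_prob(action))
            rewards.append(reward)
            if terminated or truncated:
                break
        r_tau = sum(rewards)                                 # total trajectory reward
        # 2. surrogate whose gradient is -(sum_t grad log pi) * r(tau)
        losses.append(-torch.stack(log_probs).sum() * r_tau)
    optimizer.zero_grad()
    torch.stack(losses).mean().backward()                    # 3. estimate the gradient
    optimizer.step()                                         #    and take a step
```

A typical usage would be something like `policy = Policy(obs_dim, n_actions)`, `opt = torch.optim.Adam(policy.parameters(), lr=1e-2)`, and calling `reinforce_update(policy, opt, env)` in a loop until the return stops improving.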
Comments:
- Recall that the gradient of the maximum likelihood estimate (MLE) objective, i.e. what we optimize with ERM in behavior cloning (BC), is $\nabla_\theta J_{\text{MLE}}(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})$. Comparing with the PG gradient, we can see that PG has an extra term $r(\tau_i)$ which is “weighing” the action log-likelihoods given states. Thus PG can be thought of as a “trial-and-error” algorithm: actions which lead to more rewarding trajectories are given more importance during the gradient update, reinforcing the behavior of taking more rewarding actions conditional on states.
- PG algorithms have very high variance. As a running example, consider a reward of +1 for each time-step survived (fixed horizon length of 100), plus an extra +1 only if the agent reaches the goal. Now consider an alternate reward where each survived step gives 0 instead. PG works better in the latter situation: the former weighs every bad trajectory by 100 and every good trajectory by 101 (nearly indistinguishable), whereas the latter weighs bad trajectories by 0 and good trajectories by 1.
- To reduce variance, we can simply introduce the concept of baselines:

  $$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \nabla_\theta \log p_\theta(\tau_i)\left[r(\tau_i) - b\right]$$

  which is essentially just subtracting a baseline value $b$ to give more importance to good trajectories; in the running example, if we subtract $b = 100$, then good trajectories get weight 1 and everything else gets weight 0 (see the sketch after this list). We can simply set $b = \frac{1}{N}\sum_{i=1}^N r(\tau_i)$ since:

  $$\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, b\right] = \int p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)\, b\, d\tau = b \int \nabla_\theta p_\theta(\tau)\, d\tau = b\, \nabla_\theta \int p_\theta(\tau)\, d\tau = b\, \nabla_\theta 1 = 0$$

  which essentially means that subtracting a baseline leaves the gradient estimate unbiased in expectation. The baseline can also be learned during training (using back-propagation).
- One other way to reduce variance is by noticing causality when summing the rewards, i.e. the policy at time $t$ cannot affect the rewards obtained at time $t'$ where $t' < t$. Thus the gradient estimate can be modified as:

  $$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})\right)$$

  where $\hat{Q}_{i,t} = \sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})$ (the "reward-to-go") is essentially the MC estimate of the state-action value function. Now notice that in this situation we cannot introduce a constant baseline $b$, since for each state the baseline will be different. For our running example, at $t = 0$ the baseline value should be about 100, at $t = 50$ it should be about 50, and finally at $t = 100$ it should be 0. Therefore there is a need to learn state-dependent baselines:

  $$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\left(\hat{Q}_{i,t} - V(s_{i,t})\right)$$
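The following small NumPy sketch (toy numbers, not from the notes) illustrates the running example. The per-trajectory scores $\nabla_\theta \log p_\theta(\tau_i)$ are stood in for by random unit-variance numbers `g`, which is enough to compare the different weightings:

```python
import numpy as np

# Compare estimator variance for: no baseline, constant baseline, and
# reward-to-go minus a crude state-dependent baseline.
rng = np.random.default_rng(0)
T, N = 100, 10_000
rewards = np.ones((N, T))                 # running example: +1 per step survived
reached_goal = rng.random(N) < 0.5
rewards[reached_goal, -1] += 1.0          # extra +1 for reaching the goal

g = rng.normal(size=N)                    # stand-in for grad log p_theta(tau_i)
r_tau = rewards.sum(axis=1)               # r(tau_i): 100 or 101
b = r_tau.mean()                          # constant (average-reward) baseline

print("Var without baseline:", np.var(g * r_tau))          # ~ 100^2
print("Var with baseline   :", np.var(g * (r_tau - b)))    # ~ 0.25

# Reward-to-go Q_hat_{i,t} and a crude state-dependent baseline V(s_t),
# estimated as the average reward-to-go at each time-step.
q_hat = rewards[:, ::-1].cumsum(axis=1)[:, ::-1]
v_t = q_hat.mean(axis=0)
advantage = q_hat - v_t                   # what multiplies each log-prob term
print("Largest |advantage| :", np.abs(advantage).max())    # ~0.5 instead of ~101
```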
Variance Analysis (for point 3)
Starting from the policy gradient and including a baseline, we get

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\left(r(\tau) - b\right)\right]$$

Now the variance is:

$$\text{Var} = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\nabla_\theta \log p_\theta(\tau)\left(r(\tau) - b\right)\right)^2\right] - \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\left(r(\tau) - b\right)\right]^2$$

The second term equals $\mathbb{E}\left[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\right]^2$ since baselines are unbiased in expectation, so it does not depend on $b$. Writing $g(\tau) = \nabla_\theta \log p_\theta(\tau)$ and setting the derivative of the first term to zero:

$$\frac{d\,\text{Var}}{db} = \frac{d}{db}\,\mathbb{E}\left[g(\tau)^2\left(r(\tau) - b\right)^2\right] = -2\,\mathbb{E}\left[g(\tau)^2 r(\tau)\right] + 2b\,\mathbb{E}\left[g(\tau)^2\right] = 0 \;\;\Rightarrow\;\; b = \frac{\mathbb{E}\left[g(\tau)^2\, r(\tau)\right]}{\mathbb{E}\left[g(\tau)^2\right]}$$

which is essentially just the expected reward, weighted by gradient magnitudes.
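As a quick sanity check, the following NumPy sketch (toy scalar-parameter setting, not from the notes) compares the estimator variance under the average-reward baseline and the optimal baseline $b^* = \mathbb{E}[g^2 r]/\mathbb{E}[g^2]$; the two differ whenever $g^2$ is correlated with $r$:

```python
import numpy as np

# Empirical comparison of the average-reward baseline vs. the optimal baseline.
rng = np.random.default_rng(0)
N = 200_000
g = rng.normal(size=N)                          # stand-in for grad log p_theta(tau)
r = 100.0 + 3.0 * g**2 + rng.normal(size=N)     # rewards correlated with the squared score

b_avg = r.mean()
b_opt = np.mean(g**2 * r) / np.mean(g**2)       # optimal baseline from the derivation

print("Var, average-reward baseline:", np.var(g * (r - b_avg)))
print("Var, optimal baseline       :", np.var(g * (r - b_opt)))
```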
Off-policy Policy Gradients
The problem with the above algorithm is that it is completely on-policy, i.e. once the update happens, all the collected trajectories are discarded and new trajectories must be collected to perform the next policy gradient update. It would be more efficient if we did not have to discard all the previously collected trajectories, since collecting them took effort. Hence, we look at ways to adapt the PG algorithm to the off-policy case.
We start with Importance Sampling. Say we have an expectation of some function $f(x)$ of $x$ w.r.t. a distribution $p(x)$:

$$\mathbb{E}_{x \sim p(x)}\left[f(x)\right] = \int p(x)\, f(x)\, dx = \int q(x)\,\frac{p(x)}{q(x)}\, f(x)\, dx = \mathbb{E}_{x \sim q(x)}\left[\frac{p(x)}{q(x)}\, f(x)\right]$$

We can choose any distribution $q(x)$ (it’s under our control) as long as its support covers that of $p(x)$, i.e. $q(x) > 0$ for all $x$ where $p(x) > 0$.
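A minimal importance-sampling sketch (toy example, not from the notes): estimate $\mathbb{E}_{x \sim p}[x^2]$ with $p = \mathcal{N}(0, 1)$, using samples drawn only from $q = \mathcal{N}(1, 2)$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
f = lambda x: x ** 2                      # E_p[f] = 1 for a standard normal

x = rng.normal(loc=1.0, scale=2.0, size=200_000)                 # samples from q, not p
w = stats.norm.pdf(x, 0.0, 1.0) / stats.norm.pdf(x, 1.0, 2.0)    # weights p(x)/q(x)

print("importance-sampled estimate:", np.mean(w * f(x)))   # ~1.0
print("naive average of q samples :", np.mean(f(x)))       # ~5.0, i.e. badly biased
```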
Assuming we have trajectories from some other distribution $\bar{p}(\tau)$ instead of $p_\theta(\tau)$, we can modify the PG objective as follows:

$$J(\theta) = \mathbb{E}_{\tau \sim \bar{p}(\tau)}\left[\frac{p_\theta(\tau)}{\bar{p}(\tau)}\, r(\tau)\right]$$

Now,

$$\frac{p_\theta(\tau)}{\bar{p}(\tau)} = \frac{p(s_1)\prod_{t=1}^T \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)}{p(s_1)\prod_{t=1}^T \bar{\pi}(a_t|s_t)\, p(s_{t+1}|s_t, a_t)} = \frac{\prod_{t=1}^T \pi_\theta(a_t|s_t)}{\prod_{t=1}^T \bar{\pi}(a_t|s_t)}$$

i.e. the unknown initial-state and transition probabilities cancel, and the importance weight only involves the two policies.
Let’s assume that our current policy (after the $k$-th policy update), which is used for collecting new transitions, is parameterized by $\theta'$, whereas the policies that collected the previous samples (which are stored in a buffer) are parameterized by $\theta$. In doing so we are overloading some notation, so we will simply say that the previously collected dataset was collected by some policy $\pi_\theta$ and the current policy is $\pi_{\theta'}$ (note that by convention objects in the past are denoted by a letter and for corresponding future objects a prime is added, e.g. if the current state is $s$, the next state is often denoted $s'$). Therefore we would like to update $\theta'$ using samples from $\pi_\theta$.
Replacing $\bar{p}$ with $p_\theta$ and expanding the remaining terms we get:

$$\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}\,\nabla_{\theta'} \log p_{\theta'}(\tau)\, r(\tau)\right] = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\prod_{t=1}^T \frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\right)\left(\sum_{t=1}^T \nabla_{\theta'} \log \pi_{\theta'}(a_t|s_t)\right)\left(\sum_{t=1}^T r(s_t, a_t)\right)\right]$$
Now, in order to decrease the variance of the estimates, we can modify the above by again introducing the notion of causality. We have already seen that actions cannot affect rewards obtained before them (the reward-to-go trick). In this case notice further that the importance weights also have a notion of causality involved, i.e. the importance weight at time $t$ is not affected by future actions:

$$\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^T \nabla_{\theta'} \log \pi_{\theta'}(a_t|s_t)\left(\prod_{t'=1}^{t} \frac{\pi_{\theta'}(a_{t'}|s_{t'})}{\pi_\theta(a_{t'}|s_{t'})}\right)\left(\sum_{t'=t}^T r(s_{t'}, a_{t'})\left(\prod_{t''=t}^{t'} \frac{\pi_{\theta'}(a_{t''}|s_{t''})}{\pi_\theta(a_{t''}|s_{t''})}\right)\right)\right]$$
If we ignore the final term (the future importance weights multiplying the rewards-to-go), we can recover an algorithm called policy iteration (cite) which has some nice convergence guarantees. However, even in the reduced expression

$$\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^T \nabla_{\theta'} \log \pi_{\theta'}(a_t|s_t)\left(\prod_{t'=1}^{t} \frac{\pi_{\theta'}(a_{t'}|s_{t'})}{\pi_\theta(a_{t'}|s_{t'})}\right)\hat{Q}_t\right]$$

notice that the remaining product of importance weights is exponential in $t$, so the estimator degenerates (the weights explode or vanish) and it doesn’t work very well for long-horizon tasks. Hence we can look into something like a “first-order” approximation, which is essentially just the on-policy algorithm with per-state-action importance weights:

$$\nabla_{\theta'} J(\theta') \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \frac{\pi_{\theta'}(s_{i,t}, a_{i,t})}{\pi_\theta(s_{i,t}, a_{i,t})}\,\nabla_{\theta'} \log \pi_{\theta'}(a_{i,t}|s_{i,t})\,\hat{Q}_{i,t} = \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \frac{\pi_{\theta'}(s_{i,t})}{\pi_\theta(s_{i,t})}\,\frac{\pi_{\theta'}(a_{i,t}|s_{i,t})}{\pi_\theta(a_{i,t}|s_{i,t})}\,\nabla_{\theta'} \log \pi_{\theta'}(a_{i,t}|s_{i,t})\,\hat{Q}_{i,t}$$

It is reasonable to assume that the state marginals $\frac{\pi_{\theta'}(s_{i,t})}{\pi_\theta(s_{i,t})}$ are similar when $\theta$ and $\theta'$ are close (cite), so this ratio can be ignored, leaving only the per-action importance weight.
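The per-action importance weight is easy to implement from stored behaviour log-probabilities. The following PyTorch snippet (hypothetical tensor shapes and a `policy` interface matching the earlier sketch, not from the notes) builds a surrogate loss whose gradient is exactly the first-order estimator above, since $\nabla_{\theta'} \frac{\pi_{\theta'}}{\pi_\theta} = \frac{\pi_{\theta'}}{\pi_\theta}\,\nabla_{\theta'} \log \pi_{\theta'}$:

```python
import torch

def off_policy_pg_loss(policy, obs, actions, reward_to_go, behavior_log_probs):
    """First-order off-policy PG surrogate.

    obs: [B, obs_dim], actions: [B], reward_to_go: [B],
    behavior_log_probs: [B] values of log pi_theta(a|s) recorded at collection time.
    """
    dist = policy(obs)                                      # current policy pi_theta'
    new_log_probs = dist.log_prob(actions)
    ratio = torch.exp(new_log_probs - behavior_log_probs)   # per-action importance weight
    # Minimizing this surrogate gives a gradient of
    #   -(1/B) sum_t ratio_t * grad log pi_theta'(a_t|s_t) * Q_hat_t
    return -(ratio * reward_to_go).mean()
```

This surrogate-objective view is the same starting point used by algorithms such as PPO, which additionally clip the ratio to keep $\theta'$ from drifting too far from $\theta$.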