Let’s say we have a policy $\pi_{\beta}$ (known as the behavior policy). We run $\pi_{\beta}$ on the environment and generate $\mathcal{D}$ (a dataset of experience tuples). Now, since each tuple is of the form $(o_{t}, a_{t}, r_{t}, o_{t+1}, d_{t})$, corresponding to the observation, action, reward, next observation, and done flag respectively at time-step $t$, we can easily extract a new dataset $\mathcal{D}'=\{o_{1},a_{1},\ldots,o_{N},a_{N}\}$ of observation-action pairs. Note that in the analogy of supervised learning, the observations $o_{t}$ are the data features and the actions $a_{t}$ are the targets. Since we consider continuous state/action spaces, it might be helpful to think of the supervised learning task as a regression task. Note that although all the expressions in this part of the notes use observations ($o_{t}$), everything can be interchanged with states ($s_{t}$) without much modification.
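As a concrete (if trivial) illustration, here is a minimal Python sketch of extracting $\mathcal{D}'$ from $\mathcal{D}$, assuming the experience tuples are stored as plain Python tuples (the numeric values are placeholders):

```python
# D: experience tuples (o_t, a_t, r_t, o_{t+1}, d_t) collected by running
# the behavior policy on the environment (toy placeholder values here).
D = [
    ([0.1, -0.3], [0.5], 1.0, [0.2, -0.1], False),
    ([0.2, -0.1], [0.3], 0.5, [0.4, 0.0], True),
]

# D': keep only the (observation, action) pairs; observations play the role
# of the "features" and actions are the regression "targets".
D_prime = [(o, a) for (o, a, r, o_next, d) in D]
```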
Behavior Cloning (BC)
For behavior cloning to work, we need to assume that the behavior policy $\pi_{\beta}$ is also the optimal (expert) policy $\pi^*$, since it doesn’t make sense to imitate sub-optimal policies. So, given $\mathcal{D}'$ we need to learn a mapping from observations to actions, i.e. a policy $\pi_{\theta}(a_{t}|o_{t})$ (which can be modeled as a parametric distribution over actions conditioned on the observation). This can be easily learned using ERM (empirical risk minimization), which is equivalent to maximizing the log-likelihood of the data as follows:

$$\theta^* = \arg\max_{\theta} \sum_{(o_{t}, a_{t}) \in \mathcal{D}'} \log \pi_{\theta}(a_{t} \mid o_{t})$$
where the symbols have their usual meanings. In the regression setting, the ERM objective becomes:

$$\theta^* = \arg\min_{\theta} \sum_{(o_{t}, a_{t}) \in \mathcal{D}'} \left\lVert a_{t} - \pi_{\theta}(o_{t}) \right\rVert^{2}$$
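As a rough sketch of what this looks like in practice (not part of the original notes; it assumes a deterministic policy network trained with the squared-error objective above, which for a fixed-variance Gaussian policy matches the log-likelihood objective up to constants), behavior cloning in PyTorch might look like:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 2-D observations, 1-D continuous actions.
obs_dim, act_dim = 2, 1

# Deterministic policy network pi_theta(o_t) -> a_t.
policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stand-ins for the observations and expert actions in D'.
obs = torch.randn(1000, obs_dim)    # placeholder for real observations
acts = torch.randn(1000, act_dim)   # placeholder for real expert actions

for epoch in range(100):
    pred = policy(obs)
    # ERM with squared error between predicted and expert actions.
    loss = ((pred - acts) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```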
Aside (Cost Function and RL Objective)
We can now analyze the “goodness” of the policy $\pi_{\theta}$ obtained after ERM by inspecting how much it deviates from $\pi^*$. Recall that any MDP can be characterized by either a reward or a cost function. Assume the following cost function:

$$c(o_{t}, a_{t}) = \begin{cases} 0 & \text{if } a_{t} = \pi^*(o_{t}) \\ 1 & \text{otherwise} \end{cases}$$
In that case the goal of any RL agent should be to minimize the total expected cost (rather than the ERM objective):

$$\min_{\theta} \; \mathbb{E}_{o_{t} \sim p_{\pi_{\theta}}(o_{t})} \left[ \sum_{t=1}^{T} c(o_{t}, a_{t}) \right]$$
Note: although ERM trains the policy without using rewards, since the task is sequential decision making we shall use the above cost function to evaluate the performance of any trained policy.
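To make the evaluation protocol concrete, here is a small sketch of rolling out a trained policy and accumulating the 0-1 cost against the expert. The names `env`, `policy`, and `expert` are hypothetical stand-ins for the environment, the learned policy $\pi_{\theta}$, and the expert $\pi^*$:

```python
import numpy as np

def evaluate_cost(env, policy, expert, T):
    """Roll out `policy` for at most T steps and accumulate the 0-1 cost,
    i.e. how often the learned action disagrees with the expert's action."""
    # `env` is assumed (for illustration only) to expose a reset()/step() interface.
    o = env.reset()
    total_cost = 0
    for t in range(T):
        a = policy(o)
        # c(o_t, a_t) = 0 if a_t matches the expert action, 1 otherwise
        # (a tolerance is used here since actions are continuous).
        total_cost += 0 if np.allclose(a, expert(o)) else 1
        o, reward, done, info = env.step(a)
        if done:
            break
    return total_cost
```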
Analysis
Let’s assume that for any $o_{t} \sim p_{train}(o_{t})$ (i.e. an observation from the training distribution), the probability of the trained BC policy $\pi_{\theta}$ incurring the cost (i.e. making an error) is bounded, i.e.:

$$\pi_{\theta}(a_{t} \neq \pi^*(o_{t}) \mid o_{t}) \leq \epsilon \quad \forall\, o_{t} \sim p_{train}(o_{t})$$
where $\epsilon$ is a small constant bound. So, if the observations encountered by $\pi_{\theta}$ were still drawn from $p_{train}(o_{t})$, the upper bound on the total expected error for a trajectory of length $T$ run by taking actions according to the BC-trained policy would be:

$$\mathbb{E}\left[\sum_{t=1}^{T} c(o_{t}, a_{t})\right] \leq \epsilon T$$
However, that is not the case during execution of the policy. Assume that the agent transitions from $o_{t}$ to $o_{t+1}$ after an erroneous decision at time-step $t$. Since $o_{t+1}$ now comes from a shifted distribution that the policy was never trained on, in the worst case all actions at time-steps $t+1$ onward until $T$ will be wrong and will each incur a cost.
Let $p_{\theta}(o_{t})$ be the distribution over observations at the $t^{th}$ time-step that is obtained when the trajectory is collected with the policy $\pi_{\theta}$:

$$p_{\theta}(o_{t}) = (1-\epsilon)^{t} \, p_{train}(o_{t}) + \left(1 - (1-\epsilon)^{t}\right) p_{mistake}(o_{t})$$
where $p_{train}(o_{t})$ can be thought of as the distribution over observations when the trajectory is collected using the behavior policy ($\pi_{\beta} = \pi^*$ in our case), i.e. the same distribution from which $\mathcal{D}'$ was generated, and similarly $p_{mistake}(o_{t})$ is the distribution over observations after the first mistake is made. We are interested in finding out how far off the distribution of observations under the current policy $\pi_{\theta}$ (trained using the dataset $\mathcal{D}'$) is from that under the behavior policy (which collected the training dataset). Manipulating the above expression we get:

$$p_{\theta}(o_{t}) - p_{train}(o_{t}) = \left(1 - (1-\epsilon)^{t}\right) \left( p_{mistake}(o_{t}) - p_{train}(o_{t}) \right)$$
However, since we are only interested in the magnitude of the difference, and not the sign, we can upper-bound the total variation divergence as follows:

$$\left| p_{\theta}(o_{t}) - p_{train}(o_{t}) \right| = \left(1 - (1-\epsilon)^{t}\right) \left| p_{mistake}(o_{t}) - p_{train}(o_{t}) \right| \leq 2 \left(1 - (1-\epsilon)^{t}\right) \leq 2 \epsilon t$$

where the last inequality uses $(1-\epsilon)^{t} \geq 1 - \epsilon t$ for $\epsilon \in [0, 1]$.
Thus the difference between the distributions of observations at time-step $t$ obtained under the optimal (behavior) policy and under the learned policy is upper-bounded by $2\epsilon t$. Hence, the expected cost over the trajectory can be bounded as follows:

$$\sum_{t=1}^{T} \mathbb{E}_{p_{\theta}(o_{t})}\left[ c(o_{t}, a_{t}) \right] \leq \sum_{t=1}^{T} \left( \epsilon + 2 \epsilon t \right) = \epsilon T + \epsilon T (T+1) \in O(\epsilon T^{2})$$
Here $O(\cdot)$ is the Big-O notation, not the observation space.
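To get a feel for how much worse the quadratic bound is than the naive $\epsilon T$ bound, here is a quick numeric illustration (the values of $\epsilon$ and $T$ are arbitrary and chosen only for this example):

```python
eps, T = 0.01, 1000  # illustrative per-step error rate and horizon

# Naive bound, valid only if observations stay on the training distribution.
iid_bound = eps * T

# Compounding-error bound from the analysis above:
# sum_{t=1..T} (eps + 2*eps*t) = eps*T + eps*T*(T+1), i.e. O(eps * T^2).
compounding_bound = sum(eps + 2 * eps * t for t in range(1, T + 1))

print(f"i.i.d. bound (eps * T):         {iid_bound:.1f}")          # 10.0
print(f"compounding bound (eps * T^2):  {compounding_bound:.1f}")  # 10020.0
```

Of course, the realized cost of a length-$T$ trajectory can never exceed $T$; the point is that the bound grows quadratically rather than linearly in the horizon.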
This problem can be addressed in the following ways:
- Be smart about data collection (and augmentation)
- Use very powerful models that make very few mistakes
- Use multi-task learning
- Change the algorithm (DAgger)
DAgger (Dataset Aggregation)
GOAL: Collect training data from $p_{\pi_{\theta}}(o_{t})$ instead of $p_{train}(o_{t})$. While $\mathcal{D}'$ does not yet cover the support of $p_{\pi_{\theta}}(o_{t})$:
1. Train $\pi_{\theta}(\hat{a}_{t}|o_{t})$ from dataset $\mathcal{D}'=\{o_{1},a_{1},...,o_{N},a_{N}\}$
2. Run $\pi_{\theta}(\hat{a}_{t}|o_{t})$ to get dataset $\mathcal{D}_{\pi}=\{o_{1},...,o_{M}\}$
3. Use $\pi^*$ to get $a_{t}$ $\forall o_{t}\in\mathcal{D}_{\pi}$
4. Aggregate $\mathcal{D}'\gets \mathcal{D}'\cup \mathcal{D}_{\pi}$
Since the dataset will eventually get populated with the entire support of $p_{\pi_{\theta}}(o_{t})$, the error will be upper-bounded by $\epsilon T$ as in the analysis above, i.e. all possible $o_{t}$ encountered at execution time will be in the augmented dataset, so there is no longer a distribution shift.
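A minimal sketch of the DAgger loop above. The helpers `train_policy`, `run_policy`, and `expert` are hypothetical placeholders for, respectively, the supervised training step (e.g. the BC regression earlier), an environment rollout that records the visited observations, and the expert labeller $\pi^*$:

```python
def dagger(D_prime, expert, n_iterations, rollout_steps):
    """Iteratively aggregate expert-labelled on-policy observations.

    train_policy, run_policy and expert are hypothetical helpers (see text).
    """
    for _ in range(n_iterations):
        # 1. Train pi_theta on the current aggregated dataset D'.
        policy = train_policy(D_prime)
        # 2. Run pi_theta to collect observations from p_{pi_theta}(o_t).
        observations = run_policy(policy, rollout_steps)
        # 3. Ask the expert pi* to label every visited observation.
        D_pi = [(o, expert(o)) for o in observations]
        # 4. Aggregate: D' <- D' union D_pi.
        D_prime = D_prime + D_pi
    return policy
```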