Let’s say we have a policy $\pi_{\beta}$ (known as the behavior policy). We run $\pi_{\beta}$ on the environment and generate $\mathcal{D}$ (a dataset of experience tuples). Now, since each tuple is of the form $(o_{t}, a_{t}, r_{t}, o_{t+1}, d_{t})$, corresponding to the observation, action, reward, next observation, and done flag respectively at time-step $t$, we can easily extract a new dataset $\mathcal{D}'=\{o_{1},a_{1},\ldots,o_{N},a_{N}\}$ of observation-action pairs. Note that in the analogy of supervised learning, the observations $o_{t}$ are the data features and the actions $a_{t}$ are the targets. Since we consider continuous state/action spaces, it might be helpful to think of the supervised learning task as a regression task. Note that although all the expressions in this part of the notes use observations ($o_{t}$), everything can be interchanged with states ($s_{t}$) without much modification.
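As a concrete (if trivial) illustration, here is a minimal Python sketch of extracting $\mathcal{D}'$ from $\mathcal{D}$, assuming the experience tuples are stored as plain Python tuples (the numeric values are placeholders):

```python
# D: experience tuples (o_t, a_t, r_t, o_{t+1}, d_t) collected by running
# the behavior policy on the environment (toy placeholder values here).
D = [
    ([0.1, -0.3], [0.5], 1.0, [0.2, -0.1], False),
    ([0.2, -0.1], [0.3], 0.5, [0.4, 0.0], True),
]

# D': keep only the (observation, action) pairs; observations play the role
# of the "features" and actions are the regression "targets".
D_prime = [(o, a) for (o, a, r, o_next, d) in D]
```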
Behavior Cloning (BC)
For behavior cloning to work, we need to assume that the behavior policy $\pi_{\beta}$ is also the optimal (expert) policy $\pi^*$, since it doesn’t make sense to imitate sub-optimal policies. So, given $\mathcal{D}'$ we need to learn a mapping from observations to actions, i.e. a policy $\pi_{\theta}(a_{t}|o_{t})$ (which can be modeled as a parametric distribution over actions conditioned on the observation). This can be easily learned using ERM (empirical risk minimization), which is equivalent to maximizing the log-likelihood of the data as follows:

$$\theta^* = \arg\max_{\theta} \sum_{(o_{t}, a_{t}) \in \mathcal{D}'} \log \pi_{\theta}(a_{t} \mid o_{t})$$
where the symbols have their usual meanings. In the regression setting, the ERM objective becomes:

$$\theta^* = \arg\min_{\theta} \sum_{(o_{t}, a_{t}) \in \mathcal{D}'} \left\lVert a_{t} - \pi_{\theta}(o_{t}) \right\rVert^{2}$$
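As a rough sketch of what this looks like in practice (not part of the original notes; it assumes a deterministic policy network trained with the squared-error objective above, which for a fixed-variance Gaussian policy matches the log-likelihood objective up to constants), behavior cloning in PyTorch might look like:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 2-D observations, 1-D continuous actions.
obs_dim, act_dim = 2, 1

# Deterministic policy network pi_theta(o_t) -> a_t.
policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stand-ins for the observations and expert actions in D'.
obs = torch.randn(1000, obs_dim)    # placeholder for real observations
acts = torch.randn(1000, act_dim)   # placeholder for real expert actions

for epoch in range(100):
    pred = policy(obs)
    # ERM with squared error between predicted and expert actions.
    loss = ((pred - acts) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```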
Aside (Cost Function and RL Objective)
We can now analyze the “goodness” of the policy $\pi_{\theta}$ obtained after ERM by inspecting how much it deviates from $\pi^*$. Recall that any MDP can be characterized by either a reward or a cost function. Assume the following cost function:

$$c(o_{t}, a_{t}) = \begin{cases} 0 & \text{if } a_{t} = \pi^*(o_{t}) \\ 1 & \text{otherwise} \end{cases}$$
In that case the goal of any RL agent should be to minimize the total expected cost (rather than the ERM objective):

$$\min_{\theta} \; \mathbb{E}_{o_{t} \sim p_{\pi_{\theta}}(o_{t})} \left[ \sum_{t=1}^{T} c(o_{t}, a_{t}) \right]$$
Note: although ERM trains the policy without using rewards, since the task is sequential decision making we shall use the above cost function to evaluate the performance of any trained policy.
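To make the evaluation protocol concrete, here is a small sketch of rolling out a trained policy and accumulating the 0-1 cost against the expert. The names `env`, `policy`, and `expert` are hypothetical stand-ins for the environment, the learned policy $\pi_{\theta}$, and the expert $\pi^*$:

```python
import numpy as np

def evaluate_cost(env, policy, expert, T):
    """Roll out `policy` for at most T steps and accumulate the 0-1 cost,
    i.e. how often the learned action disagrees with the expert's action."""
    # `env` is assumed (for illustration only) to expose a reset()/step() interface.
    o = env.reset()
    total_cost = 0
    for t in range(T):
        a = policy(o)
        # c(o_t, a_t) = 0 if a_t matches the expert action, 1 otherwise
        # (a tolerance is used here since actions are continuous).
        total_cost += 0 if np.allclose(a, expert(o)) else 1
        o, reward, done, info = env.step(a)
        if done:
            break
    return total_cost
```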
Analysis
Let’s assume that for any $o_{t} \sim p_{train}(o_{t})$ (i.e. an observation from the training distribution), the probability of the trained BC policy $\pi_{\theta}$ incurring the cost (i.e. making an error) is bounded, i.e.:

$$\pi_{\theta}(a_{t} \neq \pi^*(o_{t}) \mid o_{t}) \leq \epsilon \quad \forall\, o_{t} \sim p_{train}(o_{t})$$
where $\epsilon$ is a small constant bound. So, if the observations encountered by $\pi_{\theta}$ were still drawn from $p_{train}(o_{t})$, the upper bound on the total expected error for a trajectory of length $T$ run by taking actions according to the BC-trained policy would be:

$$\mathbb{E}\left[\sum_{t=1}^{T} c(o_{t}, a_{t})\right] \leq \epsilon T$$
However, that is not the case during execution of the policy. Assume that the agent transitions from $o_{t}$ to $o_{t+1}$ after an erroneous decision at time-step $t$. Since $o_{t+1}$ now comes from a shifted distribution that the policy was never trained on, in the worst case all actions at time-steps $t+1$ onward until $T$ will be wrong and will each incur a cost.
Let $p_{\theta}(o_{t})$ be the distribution over observations at the $t^{th}$ time-step that is obtained when the trajectory is collected with the policy $\pi_{\theta}$:

$$p_{\theta}(o_{t}) = (1-\epsilon)^{t} \, p_{train}(o_{t}) + \left(1 - (1-\epsilon)^{t}\right) p_{mistake}(o_{t})$$
where $p_{train}(o_{t})$ can be thought of as the distribution over observations when the trajectory is collected using the behavior policy ($\pi_{\beta} = \pi^*$ in our case), i.e. the same distribution from which $\mathcal{D}'$ was generated, and similarly $p_{mistake}(o_{t})$ is the distribution over observations after the first mistake is made. We are interested in finding out how far off the distribution of observations under the current policy $\pi_{\theta}$ (trained using the dataset $\mathcal{D}'$) is from that under the behavior policy (which collected the training dataset). Manipulating the above expression we get:

$$p_{\theta}(o_{t}) - p_{train}(o_{t}) = \left(1 - (1-\epsilon)^{t}\right) \left( p_{mistake}(o_{t}) - p_{train}(o_{t}) \right)$$
However, since we are only interested in the magnitude of the difference, and not the sign, we can upper-bound the total variation divergence as follows:

$$\left| p_{\theta}(o_{t}) - p_{train}(o_{t}) \right| = \left(1 - (1-\epsilon)^{t}\right) \left| p_{mistake}(o_{t}) - p_{train}(o_{t}) \right| \leq 2 \left(1 - (1-\epsilon)^{t}\right) \leq 2 \epsilon t$$

where the last inequality uses $(1-\epsilon)^{t} \geq 1 - \epsilon t$ for $\epsilon \in [0, 1]$.
Thus the difference between the distributions of observations at time-step $t$ obtained under the optimal (behavior) policy and under the learned policy is upper-bounded by $2\epsilon t$. Hence, the expected cost over the trajectory can be bounded as follows:

$$\sum_{t=1}^{T} \mathbb{E}_{p_{\theta}(o_{t})}\left[ c(o_{t}, a_{t}) \right] \leq \sum_{t=1}^{T} \left( \epsilon + 2 \epsilon t \right) = \epsilon T + \epsilon T (T+1) \in O(\epsilon T^{2})$$
Here $O(\cdot)$ is the Big-O notation, not the observation space.
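To get a feel for how much worse the quadratic bound is than the naive $\epsilon T$ bound, here is a quick numeric illustration (the values of $\epsilon$ and $T$ are arbitrary and chosen only for this example):

```python
eps, T = 0.01, 1000  # illustrative per-step error rate and horizon

# Naive bound, valid only if observations stay on the training distribution.
iid_bound = eps * T

# Compounding-error bound from the analysis above:
# sum_{t=1..T} (eps + 2*eps*t) = eps*T + eps*T*(T+1), i.e. O(eps * T^2).
compounding_bound = sum(eps + 2 * eps * t for t in range(1, T + 1))

print(f"i.i.d. bound (eps * T):         {iid_bound:.1f}")          # 10.0
print(f"compounding bound (eps * T^2):  {compounding_bound:.1f}")  # 10020.0
```

Of course, the realized cost of a length-$T$ trajectory can never exceed $T$; the point is that the bound grows quadratically rather than linearly in the horizon.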
This problem can be addressed in the following ways:
- Be smart about data collection (and augmentation)
- Use very powerful models that make very few mistakes
- Use multi-task learning
- Change the algorithm (DAgger)
DAgger (Dataset Aggregation)
GOAL: Collect training data from $p_{\pi_{\theta}}(o_{t})$ instead of $p_{train}(o_{t})$. While $\mathcal{D}'$ does not yet cover the support of $p_{\pi_{\theta}}(o_{t})$:
1. Train $\pi_{\theta}(\hat{a}_{t}|o_{t})$ from dataset $\mathcal{D}'=\{o_{1},a_{1},...,o_{N},a_{N}\}$
2. Run $\pi_{\theta}(\hat{a}_{t}|o_{t})$ to get dataset $\mathcal{D}_{\pi}=\{o_{1},...,o_{M}\}$
3. Use $\pi^*$ to get $a_{t}$ $\forall o_{t}\in\mathcal{D}_{\pi}$
4. Aggregate $\mathcal{D}'\gets \mathcal{D}'\cup \mathcal{D}_{\pi}$
Since the dataset will eventually get populated with the entire support of $p_{\pi_{\theta}}(o_{t})$, the error will be upper-bounded by $\epsilon T$ as in the analysis above, i.e. all possible $o_{t}$ encountered at execution time will be in the augmented dataset, so there is no longer a distribution shift.
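A minimal sketch of the DAgger loop above. The helpers `train_policy`, `run_policy`, and `expert` are hypothetical placeholders for, respectively, the supervised training step (e.g. the BC regression earlier), an environment rollout that records the visited observations, and the expert labeller $\pi^*$:

```python
def dagger(D_prime, expert, n_iterations, rollout_steps):
    """Iteratively aggregate expert-labelled on-policy observations.

    train_policy, run_policy and expert are hypothetical helpers (see text).
    """
    for _ in range(n_iterations):
        # 1. Train pi_theta on the current aggregated dataset D'.
        policy = train_policy(D_prime)
        # 2. Run pi_theta to collect observations from p_{pi_theta}(o_t).
        observations = run_policy(policy, rollout_steps)
        # 3. Ask the expert pi* to label every visited observation.
        D_pi = [(o, expert(o)) for o in observations]
        # 4. Aggregate: D' <- D' union D_pi.
        D_prime = D_prime + D_pi
    return policy
```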