Reinforcement Learning (RL) is a discipline within machine learning that formalizes how to make sequential decisions under uncertainty. Machine learning techniques include supervised learning (cite) at one end of the spectrum (where, given data $x$ and corresponding labels $y$, one needs to find the mapping from $x$ to $y$ by minimizing empirical risk) and unsupervised learning (cite) at the other end (where only data $x$ is provided and one needs to find patterns within it, e.g. via some clustering algorithm). Neither of these techniques deals with sequential decision making: although in the real world the current decision affects the future, such feedback is not considered in the predictions made by these models, which are used for a single step of decision making.
Why is RL Hard?
Typically, supervised learning problems rest on assumptions that make them “easy”:
- Independent datapoints
- Outputs don’t influence next inputs
- Ground truth labels are provided at training time
Decision-making problems often don’t satisfy these assumptions:
- Current actions influence future states (and hence future inputs)
- Goal is to maximize some utility (reward)
- Optimal actions are not provided
In many cases, real-world deployment of ML has these same feedback issues. For instance, decisions made by a traffic prediction system may affect the route that people take, which in turn changes the traffic.
Markov Decision Process
Pre-requisite: Markov Random Process (cite) and Markov Chains (cite).

We model a sequential decision-making process under uncertainty as a Markov Decision Process (MDP). An MDP is specified by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, r, \rho_0)$, where $\mathcal{S}$ is the ($n$-dimensional) state space, $\mathcal{A}$ is the ($m$-dimensional) action space, $\mathcal{T}$ is the transition tensor (it defines the dynamics governing the transition probabilities over next states given the current state and action, $\mathcal{T}(s' \mid s, a) = p(s_{t+1} = s' \mid s_t = s, a_t = a)$), $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\rho_0$ is a probability distribution over the states according to which the initial states are sampled.
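To make the tuple concrete, here is a minimal sketch of a tiny tabular MDP in Python. The two-state, two-action setup and all of the numbers are made up purely for illustration; they are not taken from any particular benchmark.

```python
import numpy as np

# A toy MDP (S, A, T, r, rho_0) with 2 states and 2 actions.
# All numbers below are made up purely for illustration.
n_states, n_actions = 2, 2

# Transition tensor T[s, a, s'] = p(s_{t+1} = s' | s_t = s, a_t = a)
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])

# Reward function r(s, a)
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])

# Initial state distribution rho_0
rho_0 = np.array([1.0, 0.0])

assert np.allclose(T.sum(axis=-1), 1.0)  # each (s, a) slice is a distribution over s'
assert np.allclose(rho_0.sum(), 1.0)
```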
A policy (the software inside the agent) takes as input a state $s$ and outputs an action $a$. The policy can be learned by interacting with the MDP (the environment). Initially, a state $s_0$ is sampled from $\rho_0$ and provided to the agent. The agent takes some action according to its internal policy (the policy is learnable and parameterized by $\theta$), i.e. $a_0 \sim \pi_\theta(\cdot \mid s_0)$. This action is provided to the environment, which executes it and returns the resultant next state $s_1$, reward $r_0$, and a flag $d_0$ (denoting whether the next state is a terminal state or not) to the agent. Thus, at any given time instant $t$, an experience tuple (of agent-environment interaction) is of the form $(s_t, a_t, r_t, s_{t+1}, d_t)$. When $d_t = 1$, the environment is reset and a new initial state is sampled and returned to the agent. Thus a trajectory (assume it ran for $T$ steps) is of the form $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T)$.
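Below is a minimal sketch of this interaction loop for the toy MDP above (it reuses the hypothetical `n_states`, `n_actions`, `T`, `r`, and `rho_0` defined there). The uniformly random placeholder policy, the `step` helper, and the 50-step cap are illustrative choices, not part of any particular library.

```python
def step(s, a, rng):
    """Environment step: sample s_{t+1} ~ T(. | s, a), return (s', r, done)."""
    s_next = rng.choice(n_states, p=T[s, a])
    reward = r[s, a]
    done = False  # the toy MDP above has no terminal states
    return s_next, reward, done

def policy(s, rng):
    """Placeholder policy pi(a | s): here, uniformly random actions."""
    return rng.integers(n_actions)

def rollout(max_steps=50, seed=0):
    """Collect one trajectory of (s_t, a_t, r_t, s_{t+1}, d_t) tuples."""
    rng = np.random.default_rng(seed)
    s = rng.choice(n_states, p=rho_0)  # s_0 ~ rho_0
    trajectory = []
    for t in range(max_steps):
        a = policy(s, rng)
        s_next, reward, done = step(s, a, rng)
        trajectory.append((s, a, reward, s_next, done))
        if done:
            break
        s = s_next
    return trajectory
```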
The goal of reinforcement learning is to learn the optimal parameters $\theta^*$ that maximize the expected total reward over trajectories:

$$\theta^{*} = \arg\max_{\theta} \; \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\!\left[\sum_{t=0}^{T} r(s_t, a_t)\right], \quad \text{where } p_{\theta}(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_{\theta}(a_t \mid s_t)\, \mathcal{T}(s_{t+1} \mid s_t, a_t).$$
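As a quick sanity check, this objective can be approximated by Monte Carlo: sample many trajectories and average their total rewards. The short sketch below assumes the hypothetical `rollout` helper from the previous block.

```python
def estimate_objective(n_trajectories=1000):
    """Monte Carlo estimate of E_tau [ sum_t r(s_t, a_t) ] under the current policy."""
    returns = []
    for i in range(n_trajectories):
        tau = rollout(seed=i)
        returns.append(sum(reward for (_, _, reward, _, _) in tau))
    return np.mean(returns)
```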
Note:
- An alternative way to define MDPs is via the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, c, \rho_0)$, where $c$ is the cost function. By convention, the objective of the policy for such MDPs is to minimize the cost rather than maximize the reward. By setting $c(s, a) = -r(s, a)$, one can show that the two MDP definitions are equivalent.
- It can be shown that the reward function can also be defined with a different signature (e.g. $r(s_t, a_t, s_{t+1})$, which additionally depends on the next state) without changing the expressive power of $\mathcal{M}$.
- A Partially Observable/Observed Markov Decision Process (POMDP) is defined by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{E}, r, \rho_0)$, where $\mathcal{O}$ is the ($k$-dimensional) observation space and $\mathcal{E}$ defines the emission probabilities $p(o_t \mid s_t)$. All other symbols have their usual meanings. POMDPs are a generalization of MDPs where the agent is not given access to the internal state $s_t$, but rather a (partial) observation $o_t$. So the policy has to take actions based on $o_t$ instead of $s_t$. For instance, $s_t$ can be the positions, velocities and torques of the joints of a quadruped robot, whereas $o_t$ can be an image of the robot taken using an external camera. A small emission sketch is given after this list.
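To illustrate the emission step, here is a hypothetical sketch in which the agent receives a noisy observation $o_t \sim \mathcal{E}(\cdot \mid s_t)$ rather than the state itself; the emission matrix is made up purely for illustration and assumes the toy tabular MDP from earlier.

```python
import numpy as np

# Emission probabilities E[s, o] = p(o_t = o | s_t = s); numbers are illustrative.
n_obs = 2
E = np.array([[0.8, 0.2],
              [0.3, 0.7]])

def observe(s, rng):
    """The agent receives o_t ~ E(. | s_t) instead of the state s_t itself."""
    return rng.choice(n_obs, p=E[s])
```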
Some additional topics
State and action space
Populate
Unless otherwise mentioned, we shall work with the case of continuous state and action spaces, since they are very general and almost all discrete-setting analogs are easy to derive/code.
Model of the environment
In machine learning, we often call the function approximator that maps data to labels the “model”. Similarly, in RL, “model” often refers to an approximation of the environment that can be learned by interacting with the MDP. Recall that during a step of agent-environment interaction: (i) the environment gives a state to the agent, (ii) the agent gives an action to the environment corresponding to that state according to some internal policy, and (iii) the environment performs an internal step and returns the next state, the reward, and whether the next state is terminal or not. Hence, in order to learn a successful “model” of the environment, one can learn three approximators as follows (a code sketch is given after the list):
- $\hat{s}_{t+1} = f_{\phi_1}(s_t, a_t)$ (next-state/dynamics model)
- $\hat{r}_t = f_{\phi_2}(s_t, a_t)$ (reward model)
- $\hat{d}_t = \sigma\!\left(f_{\phi_3}(s_t, a_t)\right)$ (termination model)

where $f_\phi$ is a (non-)linear function approximator parameterized by $\phi$ and $\sigma$ is the sigmoid function.
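Here is a minimal sketch of the three approximators, fit with ordinary least squares on a batch of transitions purely for illustration; in practice $f_\phi$ would typically be a neural network trained by stochastic gradient descent, and the flat state/action encoding below is an assumption, not a prescribed design.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_model(transitions):
    """Fit the three approximators by least squares from (s, a, r, s', d) tuples.
    Assumes s, a, s' are 1-D float arrays and d is a 0/1 termination flag."""
    X = np.stack([np.concatenate([s, a]) for (s, a, _, _, _) in transitions])
    S_next = np.stack([s_next for (_, _, _, s_next, _) in transitions])
    R = np.array([reward for (_, _, reward, _, _) in transitions])
    D = np.array([float(d) for (_, _, _, _, d) in transitions])
    phi_1 = np.linalg.lstsq(X, S_next, rcond=None)[0]  # dynamics:    s_hat_{t+1} = f_{phi_1}(s_t, a_t)
    phi_2 = np.linalg.lstsq(X, R, rcond=None)[0]       # reward:      r_hat_t     = f_{phi_2}(s_t, a_t)
    phi_3 = np.linalg.lstsq(X, D, rcond=None)[0]       # termination: squashed below by a sigmoid
    return phi_1, phi_2, phi_3                          # (least squares on 0/1 labels is a crude
                                                        #  stand-in for logistic regression)

def predict(phi_1, phi_2, phi_3, s, a):
    """One-step prediction of (next state, reward, termination probability)."""
    x = np.concatenate([s, a])
    return x @ phi_1, x @ phi_2, sigmoid(x @ phi_3)
```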
Horizon Length of Trajectories
The length of a trajectory depends on the MDP; based on it, MDPs can be classified into the following three categories: (i) Fixed Horizon, (ii) Finite Horizon, and (iii) Infinite Horizon.
Populate
Note that in these notes, we will be using the finite-horizon case unless otherwise specified.
Value of a state/state-action pair
One can think of the Q-value function as the function that gives the expected total reward obtained by taking action $a_t$ in state $s_t$ and following policy $\pi$ thereafter:

$$Q^\pi(s_t, a_t) = \mathbb{E}_{\pi}\!\left[\sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \,\middle|\, s_t, a_t\right]$$
Similarly, the Value function can be defined as the expected total reward obtained when starting from state $s_t$ and acting under policy $\pi$:

$$V^\pi(s_t) = \mathbb{E}_{\pi}\!\left[\sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \,\middle|\, s_t\right] = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)}\!\left[Q^\pi(s_t, a_t)\right]$$
Note: $\mathbb{E}_{s_0 \sim \rho_0}\!\left[V^{\pi_\theta}(s_0)\right]$ is nothing but the aforementioned RL objective, and the goal of RL is to maximize it.
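In code, a single-trajectory Monte Carlo estimate of $Q^\pi(s_t, a_t)$ is simply the “reward-to-go” from step $t$; a minimal sketch (undiscounted, finite horizon):

```python
import numpy as np

def rewards_to_go(rewards):
    """reward_to_go[t] = sum_{t'=t}^{T} r_{t'} for one sampled trajectory."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

# e.g. rewards_to_go([1.0, 0.0, 2.0]) -> array([3., 2., 2.])
```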
Anatomy of an RL algorithm
              ----------------->  fit a model /
            /                     estimate returns
           |                             |
           |                             |
   generate samples                      |
   (execute policy)                      |
           ^                             |
           |                             v
            \                     improve the
              <-----------------  policy
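In code, this loop is often organized as a skeleton like the one below; the three callables are placeholders for whatever a particular algorithm plugs in (e.g. Monte Carlo return estimation or a learned dynamics model for the middle box, a policy-gradient step for the last one).

```python
def train(n_iterations, collect_samples, estimate_returns, improve_policy, policy_params):
    """Generic skeleton of an RL algorithm: the three stages are injected as callables."""
    for _ in range(n_iterations):
        samples = collect_samples(policy_params)           # generate samples (execute policy)
        estimates = estimate_returns(samples)               # fit a model / estimate returns
        policy_params = improve_policy(policy_params,       # improve the policy
                                       samples, estimates)
    return policy_params
```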
On-policy vs Off-policy Algorithm
On policy:
- Experience is collected using the current policy $\pi_{\theta_{\text{old}}}$; after the policy update, $\pi_{\theta_{\text{new}}}$ is the new policy.
- Since the experiences were collected using $\pi_{\theta_{\text{old}}}$, they are discarded and new trajectories are collected using $\pi_{\theta_{\text{new}}}$ for updating the policy.
- This is quite “sample” inefficient since old trajectories are not re-used.
- In practice however, the “wall-clock” time might actually be less if parallel simulations/real world experiments can be run.
Off policy:
- The update algorithm for $\pi_\theta$ is not restricted to experiences collected using the current $\pi_\theta$.
- This enables sample-efficient learning algorithms, since old experience can be re-used.
- Catastrophic forgetting can be remedied by using previously collected samples.
- Also, since off-policy samples can be used for learning, some of the samples can be collected using a random policy (say with probability $\epsilon$). Thus, the “exploration-exploitation” tradeoff can be incorporated into the algorithm; a sketch of this machinery follows below.
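A minimal sketch of the machinery behind this distinction, with purely illustrative names: an off-policy algorithm keeps reusing the replay buffer across policy updates, an on-policy algorithm would clear it after every update, and epsilon-greedy action selection is one simple way to fold exploration in.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past (s, a, r, s', done) tuples so off-policy updates can reuse them."""
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.storage), min(batch_size, len(self.storage)))

    def clear(self):
        # An on-policy algorithm would call this after every policy update,
        # discarding data collected by the old policy.
        self.storage.clear()

def select_action(s, greedy_action, n_actions, epsilon=0.1):
    """With probability epsilon take a random action, otherwise follow the current policy."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return greedy_action(s)
```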