Reinforcement Learning (RL) is a discipline within machine learning that formalizes how to make sequential decisions under uncertainty. Machine learning techniques include supervised learning (cite) at one end of the spectrum (where, given data $x$ and corresponding labels $y$, one needs to find the mapping from $x$ to $y$ by minimizing empirical risk) and unsupervised learning (cite) at the other end (where only data $x$ is provided and one needs to find patterns within it, e.g. via some clustering algorithm). Neither of these techniques deals with sequential decision making: although in the real world the current decision affects the future, such feedback is not considered in the predictions made by these models, which are used for a single step of decision making.
Why is RL Hard?
Typically, supervised learning problems rest on assumptions that make them “easy”:
- Independent datapoints
- Outputs don’t influence next inputs
- Ground truth labels are provided at training time
Decision-making problems often don’t satisfy these assumptions:
- Current actions influence future states (and hence future inputs)
- Goal is to maximize some utility (reward)
- Optimal actions are not provided
In many cases, real-world deployment of ML has these same feedback issues. For instance, decisions made by a traffic prediction system may affect the route that people take, which in turn changes the traffic.
Markov Decision Process
Pre-requisite: Markov Random Process (cite) and Markov Chains (cite).

We model a sequential decision-making process under uncertainty as a Markov Decision Process (MDP). An MDP is specified by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, r, \rho_0)$, where $\mathcal{S}$ is the ($n$-dimensional) state space, $\mathcal{A}$ is the ($m$-dimensional) action space, $\mathcal{T}$ is the transition tensor (it defines the dynamics governing the transition probabilities over next states given the current state and action, $\mathcal{T}(s' \mid s, a) = p(s_{t+1} = s' \mid s_t = s, a_t = a)$), $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\rho_0$ is a probability distribution over the states according to which the initial states are sampled.
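To make the tuple concrete, here is a minimal sketch of a tiny tabular MDP in Python. The two-state, two-action setup and all of the numbers are made up purely for illustration; they are not taken from any particular benchmark.

```python
import numpy as np

# A toy MDP (S, A, T, r, rho_0) with 2 states and 2 actions.
# All numbers below are made up purely for illustration.
n_states, n_actions = 2, 2

# Transition tensor T[s, a, s'] = p(s_{t+1} = s' | s_t = s, a_t = a)
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])

# Reward function r(s, a)
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])

# Initial state distribution rho_0
rho_0 = np.array([1.0, 0.0])

assert np.allclose(T.sum(axis=-1), 1.0)  # each (s, a) slice is a distribution over s'
assert np.allclose(rho_0.sum(), 1.0)
```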
A policy (the software inside the agent) takes as input a state $s$ and outputs an action $a$. The policy can be learned by interacting with the MDP (the environment). Initially, a state $s_0$ is sampled from $\rho_0$ and provided to the agent. The agent takes some action according to its internal policy (the policy is learnable and parameterized by $\theta$), i.e. $a_0 \sim \pi_\theta(\cdot \mid s_0)$. This action is provided to the environment, which executes it and returns the resultant next state $s_1$, reward $r_0$, and a flag $d_0$ (denoting whether the next state is a terminal state or not) to the agent. Thus, at any given time instant $t$, an experience tuple (of agent-environment interaction) is of the form $(s_t, a_t, r_t, s_{t+1}, d_t)$. When $d_t = 1$, the environment is reset and a new initial state is sampled and returned to the agent. Thus a trajectory (assume it ran for $T$ steps) is of the form $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T)$.
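Below is a minimal sketch of this interaction loop for the toy MDP above (it reuses the hypothetical `n_states`, `n_actions`, `T`, `r`, and `rho_0` defined there). The uniformly random placeholder policy, the `step` helper, and the 50-step cap are illustrative choices, not part of any particular library.

```python
def step(s, a, rng):
    """Environment step: sample s_{t+1} ~ T(. | s, a), return (s', r, done)."""
    s_next = rng.choice(n_states, p=T[s, a])
    reward = r[s, a]
    done = False  # the toy MDP above has no terminal states
    return s_next, reward, done

def policy(s, rng):
    """Placeholder policy pi(a | s): here, uniformly random actions."""
    return rng.integers(n_actions)

def rollout(max_steps=50, seed=0):
    """Collect one trajectory of (s_t, a_t, r_t, s_{t+1}, d_t) tuples."""
    rng = np.random.default_rng(seed)
    s = rng.choice(n_states, p=rho_0)  # s_0 ~ rho_0
    trajectory = []
    for t in range(max_steps):
        a = policy(s, rng)
        s_next, reward, done = step(s, a, rng)
        trajectory.append((s, a, reward, s_next, done))
        if done:
            break
        s = s_next
    return trajectory
```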
The goal of reinforcement learning is to learn the optimal parameters $\theta^*$ that maximize the expected total reward over trajectories:

$$\theta^{*} = \arg\max_{\theta} \; \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\!\left[\sum_{t=0}^{T} r(s_t, a_t)\right], \quad \text{where } p_{\theta}(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_{\theta}(a_t \mid s_t)\, \mathcal{T}(s_{t+1} \mid s_t, a_t).$$
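As a quick sanity check, this objective can be approximated by Monte Carlo: sample many trajectories and average their total rewards. The short sketch below assumes the hypothetical `rollout` helper from the previous block.

```python
def estimate_objective(n_trajectories=1000):
    """Monte Carlo estimate of E_tau [ sum_t r(s_t, a_t) ] under the current policy."""
    returns = []
    for i in range(n_trajectories):
        tau = rollout(seed=i)
        returns.append(sum(reward for (_, _, reward, _, _) in tau))
    return np.mean(returns)
```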
Note:
- An alternative way to define MDPs is via the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, c, \rho_0)$, where $c$ is the cost function. By convention, the objective of the policy for such MDPs is to minimize the cost rather than maximize the reward. By setting $c(s, a) = -r(s, a)$, one can show that the two MDP definitions are equivalent.
- It can be shown that the reward function can also be defined with a different signature (e.g. $r(s_t, a_t, s_{t+1})$, which additionally depends on the next state) without changing the expressive power of $\mathcal{M}$.
- A Partially Observable/Observed Markov Decision Process (POMDP) is defined by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{E}, r, \rho_0)$, where $\mathcal{O}$ is the ($k$-dimensional) observation space and $\mathcal{E}$ defines the emission probabilities $p(o_t \mid s_t)$. All other symbols have their usual meanings. POMDPs are a generalization of MDPs where the agent is not given access to the internal state $s_t$, but rather a (partial) observation $o_t$. So the policy has to take actions based on $o_t$ instead of $s_t$. For instance, $s_t$ can be the positions, velocities and torques of the joints of a quadruped robot, whereas $o_t$ can be an image of the robot taken using an external camera. A small emission sketch is given after this list.
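To illustrate the emission step, here is a hypothetical sketch in which the agent receives a noisy observation $o_t \sim \mathcal{E}(\cdot \mid s_t)$ rather than the state itself; the emission matrix is made up purely for illustration and assumes the toy tabular MDP from earlier.

```python
import numpy as np

# Emission probabilities E[s, o] = p(o_t = o | s_t = s); numbers are illustrative.
n_obs = 2
E = np.array([[0.8, 0.2],
              [0.3, 0.7]])

def observe(s, rng):
    """The agent receives o_t ~ E(. | s_t) instead of the state s_t itself."""
    return rng.choice(n_obs, p=E[s])
```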
Some additional topics
State and action space
Populate
Unless otherwise mentioned, we shall work with the case of continuous state and action spaces, since they are very general and almost all discrete-setting analogs are easy to derive/code.
Model of the environment
In machine learning, we often call the function approximator that maps data to labels the “model”. Similarly, in RL, “model” often refers to an approximation of the environment that can be learned by interacting with the MDP. Recall that during a step of agent-environment interaction: (i) the environment gives a state to the agent, (ii) the agent gives an action to the environment corresponding to that state according to some internal policy, and (iii) the environment performs an internal step and returns the next state, the reward, and whether the next state is terminal or not. Hence, in order to learn a successful “model” of the environment, one can learn three approximators as follows (a code sketch is given after the list):
- $\hat{s}_{t+1} = f_{\phi_1}(s_t, a_t)$ (next-state/dynamics model)
- $\hat{r}_t = f_{\phi_2}(s_t, a_t)$ (reward model)
- $\hat{d}_t = \sigma\!\left(f_{\phi_3}(s_t, a_t)\right)$ (termination model)

where $f_\phi$ is a (non-)linear function approximator parameterized by $\phi$ and $\sigma$ is the sigmoid function.
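Here is a minimal sketch of the three approximators, fit with ordinary least squares on a batch of transitions purely for illustration; in practice $f_\phi$ would typically be a neural network trained by stochastic gradient descent, and the flat state/action encoding below is an assumption, not a prescribed design.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_model(transitions):
    """Fit the three approximators by least squares from (s, a, r, s', d) tuples.
    Assumes s, a, s' are 1-D float arrays and d is a 0/1 termination flag."""
    X = np.stack([np.concatenate([s, a]) for (s, a, _, _, _) in transitions])
    S_next = np.stack([s_next for (_, _, _, s_next, _) in transitions])
    R = np.array([reward for (_, _, reward, _, _) in transitions])
    D = np.array([float(d) for (_, _, _, _, d) in transitions])
    phi_1 = np.linalg.lstsq(X, S_next, rcond=None)[0]  # dynamics:    s_hat_{t+1} = f_{phi_1}(s_t, a_t)
    phi_2 = np.linalg.lstsq(X, R, rcond=None)[0]       # reward:      r_hat_t     = f_{phi_2}(s_t, a_t)
    phi_3 = np.linalg.lstsq(X, D, rcond=None)[0]       # termination: squashed below by a sigmoid
    return phi_1, phi_2, phi_3                          # (least squares on 0/1 labels is a crude
                                                        #  stand-in for logistic regression)

def predict(phi_1, phi_2, phi_3, s, a):
    """One-step prediction of (next state, reward, termination probability)."""
    x = np.concatenate([s, a])
    return x @ phi_1, x @ phi_2, sigmoid(x @ phi_3)
```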
Horizon Length of Trajectories
The length of a trajectory depends on the MDP; based on it, MDPs can be classified into the following three categories: (i) Fixed Horizon, (ii) Finite Horizon, and (iii) Infinite Horizon.
Populate
Note that in these notes, we will be using the finite-horizon case unless otherwise specified.
Value of a state/state-action pair
One can think of the Q-value function as the function that gives the expected total reward obtained by taking action $a_t$ in state $s_t$ and following policy $\pi$ thereafter:

$$Q^\pi(s_t, a_t) = \mathbb{E}_{\pi}\!\left[\sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \,\middle|\, s_t, a_t\right]$$
Similarly, the Value function can be defined as the expected total reward obtained when starting from state $s_t$ and acting under policy $\pi$:

$$V^\pi(s_t) = \mathbb{E}_{\pi}\!\left[\sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \,\middle|\, s_t\right] = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)}\!\left[Q^\pi(s_t, a_t)\right]$$
Note: $\mathbb{E}_{s_0 \sim \rho_0}\!\left[V^{\pi_\theta}(s_0)\right]$ is nothing but the aforementioned RL objective, and the goal of RL is to maximize it.
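In code, a single-trajectory Monte Carlo estimate of $Q^\pi(s_t, a_t)$ is simply the “reward-to-go” from step $t$; a minimal sketch (undiscounted, finite horizon):

```python
import numpy as np

def rewards_to_go(rewards):
    """reward_to_go[t] = sum_{t'=t}^{T} r_{t'} for one sampled trajectory."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

# e.g. rewards_to_go([1.0, 0.0, 2.0]) -> array([3., 2., 2.])
```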
Anatomy of an RL algorithm
              ----------------->  fit a model /
            /                     estimate returns
           |                             |
           |                             |
   generate samples                      |
   (execute policy)                      |
           ^                             |
           |                             v
            \                     improve the
              <-----------------  policy
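In code, this loop is often organized as a skeleton like the one below; the three callables are placeholders for whatever a particular algorithm plugs in (e.g. Monte Carlo return estimation or a learned dynamics model for the middle box, a policy-gradient step for the last one).

```python
def train(n_iterations, collect_samples, estimate_returns, improve_policy, policy_params):
    """Generic skeleton of an RL algorithm: the three stages are injected as callables."""
    for _ in range(n_iterations):
        samples = collect_samples(policy_params)           # generate samples (execute policy)
        estimates = estimate_returns(samples)               # fit a model / estimate returns
        policy_params = improve_policy(policy_params,       # improve the policy
                                       samples, estimates)
    return policy_params
```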
On-policy vs Off-policy Algorithm
On policy:
- Experience is collected using the current policy $\pi_{\theta_{\text{old}}}$; after the policy update, $\pi_{\theta_{\text{new}}}$ is the new policy.
- Since the experiences were collected using $\pi_{\theta_{\text{old}}}$, they are discarded and new trajectories are collected using $\pi_{\theta_{\text{new}}}$ for updating the policy.
- This is quite “sample” inefficient since old trajectories are not re-used.
- In practice however, the “wall-clock” time might actually be less if parallel simulations/real world experiments can be run.
Off policy:
- The update algorithm for $\pi_\theta$ is not restricted to experiences collected using the current $\pi_\theta$.
- This enables sample-efficient learning algorithms, since old experience can be re-used.
- Catastrophic forgetting can be remedied by using previously collected samples.
- Also, since off-policy samples can be used for learning, some of the samples can be collected using a random policy (say with probability $\epsilon$). Thus, the “exploration-exploitation” tradeoff can be incorporated into the algorithm; a sketch of this machinery follows below.
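A minimal sketch of the machinery behind this distinction, with purely illustrative names: an off-policy algorithm keeps reusing the replay buffer across policy updates, an on-policy algorithm would clear it after every update, and epsilon-greedy action selection is one simple way to fold exploration in.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past (s, a, r, s', done) tuples so off-policy updates can reuse them."""
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.storage), min(batch_size, len(self.storage)))

    def clear(self):
        # An on-policy algorithm would call this after every policy update,
        # discarding data collected by the old policy.
        self.storage.clear()

def select_action(s, greedy_action, n_actions, epsilon=0.1):
    """With probability epsilon take a random action, otherwise follow the current policy."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return greedy_action(s)
```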