Lecture 4: Reinforcement Learning Basics

The Markov Decision Process

The Markov Decision Process (MDP) is a way of describing reinforcement learning problems.

Note: for now, we will assume $\mathbf{o}_{t}=\mathbf{s}_{t}$, i.e. our policy is $\pi_{\theta}(\mathbf{a}_{t}\mid \mathbf{s}_{t})$.

The critical difference between imitation learning and reinforcement learning is that there are no demonstrations for the model to learn from. Instead, in an MDP, we are given a reward function $r(\mathbf{s}_{t},\mathbf{a}_{t})$, which scores how desirable each (state, action) pair is. In fact, an MDP may be defined wholly by the following parameters.

Markov Chains

Markov chains originated as a representation for dynamical systems whose states satisfy the Markov property, i.e. $\mathbf{s}_{t+1}\perp \mathbf{s}_{t-1}\mid \mathbf{s}_{t}$. A Markov chain may be represented by $\mathcal{M}=\{ \mathcal{S},\mathcal{T} \}$, where $\mathcal{S}$ is the state space and $\mathcal{T}$ is the transition operator. $\mathcal{T}$ is the linear operator that maps a vector $\mu_{t}$, whose entry $i$ is the probability of being in the $i$th state at time $t$, to $\mu_{t+1}$, i.e. $\mu_{t+1}=\mathcal{T}\mu_{t}$. Note that $\mathcal{T}_{i,j}=p(\mathbf{s}_{t+1}=i\mid \mathbf{s}_{t}=j)$. Markov chains, of course, inspired MDPs.
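To make the update $\mu_{t+1}=\mathcal{T}\mu_{t}$ concrete, here is a minimal sketch on a hypothetical two-state chain. The transition probabilities and the helper name `step` are assumptions chosen for illustration, not part of the lecture.

```python
# Hypothetical two-state chain; the numbers are made up for illustration.
# T[i][j] = p(s_{t+1} = i | s_t = j), so every column sums to 1.
T = [[0.9, 0.5],
     [0.1, 0.5]]


def step(T, mu):
    """Apply the transition operator once: mu_{t+1} = T mu_t."""
    n = len(mu)
    return [sum(T[i][j] * mu[j] for j in range(n)) for i in range(n)]


mu = [1.0, 0.0]  # mu_1: start in state 0 with certainty
history = [mu]
for _ in range(3):
    mu = step(T, mu)
    history.append(mu)

# Print mu_1 through mu_4; each one is a probability vector over the two states.
for t, m in enumerate(history, start=1):
    print(t, [round(x, 4) for x in m])
```

Because the columns of `T` sum to 1, every `mu` in `history` remains a valid probability distribution.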

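More formally, we extend Markov chains to produce MDPs characterized by $\mathcal{M}=\{ \mathcal{S},\mathcal{A},\mathcal{T},r \}$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{T}$ is the transition operator, which now also depends on the action, i.e. $p(\mathbf{s}_{t+1}\mid \mathbf{s}_{t},\mathbf{a}_{t})$, and $r$ is the reward function $r(\mathbf{s}_{t},\mathbf{a}_{t})$.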

MDP $\to$ Markov Chain?

Given a policy $\pi_{\theta}(\mathbf{a}_{t}\mid \mathbf{s}_{t})$, it is possible to turn an MDP into a Markov chain. It suffices to define the states of the Markov chain to be the set of all $(\mathbf{s},\mathbf{a})\in \mathcal{S}\times \mathcal{A}$ pairs and define the transition operator using the policy, i.e. $p(\mathbf{s}_{t+1},\mathbf{a}_{t+1}\mid \mathbf{s}_{t},\mathbf{a}_{t})=p(\mathbf{s}_{t+1}\mid \mathbf{s}_{t},\mathbf{a}_{t})\pi_{\theta}(\mathbf{a}_{t+1}\mid \mathbf{s}_{t+1})$.
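The construction above can be sketched for a small tabular MDP. The two-state, two-action dynamics and policy below are hypothetical numbers, and `induced_operator` is a name introduced here for illustration.

```python
from itertools import product

# Hypothetical 2-state, 2-action MDP; every number is made up for illustration.
states, actions = [0, 1], [0, 1]

# dynamics[(s, a)][s_next] = p(s_next | s, a)
dynamics = {
    (0, 0): [0.8, 0.2],
    (0, 1): [0.1, 0.9],
    (1, 0): [0.5, 0.5],
    (1, 1): [0.0, 1.0],
}

# policy[s][a] = pi(a | s)
policy = [[0.6, 0.4],
          [0.3, 0.7]]

# The states of the induced Markov chain are all (s, a) pairs.
pairs = list(product(states, actions))


def induced_operator(dynamics, policy, pairs):
    """T[i][j] = p(pair_i at t+1 | pair_j at t) = p(s' | s, a) * pi(a' | s')."""
    return [[dynamics[(s, a)][s_next] * policy[s_next][a_next]
             for (s, a) in pairs]
            for (s_next, a_next) in pairs]


T_theta = induced_operator(dynamics, policy, pairs)
for row in T_theta:
    print([round(x, 3) for x in row])
```

Each column of `T_theta` sums to 1, which is a quick sanity check that the product of dynamics and policy defines a valid transition operator over (state, action) pairs.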

We note that for partially observed MDPs, i.e. $\mathbf{o}_{t}\neq \mathbf{s}_{t}$, we must extend this definition slightly, i.e. $\mathcal{M}=\{ \mathcal{S},\mathcal{A},\mathcal{O},\mathcal{T},\mathcal{E},r \}$, where

This can also be written as a Markov chain; for now, though, we will continue with our assumption that $\mathbf{o}_{t}=\mathbf{s}_{t}$, i.e. fully observed MDPs.

Some Useful Concepts

For a finite horizon RL problem with policy $\pi_{\theta}$, we can write that the probability of any sequence of states and actions is

$$p_{\theta}(\mathbf{s}_{1},\mathbf{a}_{1},\dots ,\mathbf{s}_{H},\mathbf{a}_{H}) = p(\mathbf{s}_{1})\prod_{t=1}^{H} \pi_{\theta}(\mathbf{a}_{t}\mid \mathbf{s}_{t})p(\mathbf{s}_{t+1}\mid \mathbf{s}_{t},\mathbf{a}_{t})$$

by the chain rule of probability together with the Markov property. $p_{\theta}(\mathbf{s}_{1},\mathbf{a}_{1},\dots,\mathbf{s}_{H},\mathbf{a}_{H})$ is known as the trajectory distribution. Another notable distribution is the state marginal $p_{\theta}(\mathbf{s}_{t})$, which describes the probability of being in state $\mathbf{s}_{t}$ at time step $t$ under policy $\pi_{\theta}$. Critically, although both distributions have very long, complex formulas, they are easy to sample from: a sample from the trajectory distribution is obtained by simply simulating the MDP (just run the Markov chain), while samples from the state marginal are obtained by running the policy in the real world and recording the state reached at time step $t$.
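As a rough sketch of both ideas, the code below samples trajectories from a hypothetical tabular MDP, evaluates the trajectory probability from the product formula, and estimates a state marginal by counting. All numbers and function names are assumptions made for illustration. Since only $\mathbf{s}_{1},\dots,\mathbf{s}_{H}$ are recorded, the final factor $p(\mathbf{s}_{H+1}\mid \mathbf{s}_{H},\mathbf{a}_{H})$ is summed out and contributes 1.

```python
import random

# Hypothetical 2-state, 2-action MDP; every number is made up for illustration.
p_s1 = [0.5, 0.5]              # p(s_1)
dynamics = {                   # dynamics[(s, a)][s_next] = p(s_next | s, a)
    (0, 0): [0.8, 0.2],
    (0, 1): [0.1, 0.9],
    (1, 0): [0.5, 0.5],
    (1, 1): [0.0, 1.0],
}
policy = [[0.6, 0.4],          # policy[s][a] = pi(a | s)
          [0.3, 0.7]]


def sample_index(probs, rng):
    """Draw an index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def rollout(H, rng):
    """Sample (s_1, a_1, ..., s_H, a_H) by simulating the policy and dynamics."""
    s = sample_index(p_s1, rng)
    traj = []
    for _ in range(H):
        a = sample_index(policy[s], rng)
        traj.append((s, a))
        s = sample_index(dynamics[(s, a)], rng)
    return traj


def trajectory_prob(traj):
    """p(s_1) * prod_t pi(a_t | s_t) * p(s_{t+1} | s_t, a_t) over the recorded pairs."""
    prob = p_s1[traj[0][0]]
    for t, (s, a) in enumerate(traj):
        prob *= policy[s][a]
        if t + 1 < len(traj):  # the transition out of the last pair is summed out
            prob *= dynamics[(s, a)][traj[t + 1][0]]
    return prob


def state_marginal_estimate(t, n_rollouts, rng):
    """Monte Carlo estimate of p_theta(s_t): fraction of rollouts in each state at step t."""
    counts = [0] * len(p_s1)
    for _ in range(n_rollouts):
        s_t = rollout(t, rng)[t - 1][0]
        counts[s_t] += 1
    return [c / n_rollouts for c in counts]


rng = random.Random(0)
traj = rollout(4, rng)
print(traj, trajectory_prob(traj))
print(state_marginal_estimate(3, 5000, rng))
```

The estimate from `state_marginal_estimate` is only approximate; its accuracy improves with the number of rollouts.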

The Objective of RL

Simple: choose a policy that maximizes the total expected reward over the entire horizon.

$$\theta^{*} = \underset{\theta}{\arg\max}\ \mathbb{E}_{\tau\sim p_{\theta}(\tau)}\left[ \sum_{t} r(\mathbf{s}_{t},\mathbf{a}_{t}) \right]$$

where $\tau$ is the sequence/trajectory of states and actions and $p_{\theta}(\tau)=p_{\theta}(\mathbf{s}_{1},\mathbf{a}_{1},\dots,\mathbf{s}_{H},\mathbf{a}_{H})$, the trajectory distribution.

Viewing the MDP in Markov chain form, we can derive that the total expected reward is

$$\mathbb{E}\left[\sum_{t=1}^{H}r(\mathbf{s}_{t},\mathbf{a}_{t})\right]=\left[ \sum_{t=1}^{H} \mathcal{T}_{\theta}^{t-1}\mu_{1} \right]^{\top} \vec{r}$$

where $\vec{r}$ is the vector of rewards $r(\mathbf{s},\mathbf{a})$ for all $(\mathbf{s},\mathbf{a})\in \mathcal{S}\times \mathcal{A}$, $\mu_{1}$ is the distribution over the first (state, action) pair, and $\mathcal{T}_{\theta}$ is the transition operator induced by the policy $\pi_{\theta}$. In practice, of course, the objective is never computed this way, but it does give a nice, concise representation of the total expected reward.
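For a tiny tabular problem the Markov chain expression can be evaluated directly, which makes it easy to check against a brute-force sum over all trajectories. The sketch below assumes a hypothetical two-state, two-action MDP with made-up rewards; the function names are illustrative only.

```python
from itertools import product

# Hypothetical 2-state, 2-action MDP; every number is made up for illustration.
states, actions = [0, 1], [0, 1]
p_s1 = [0.5, 0.5]
dynamics = {
    (0, 0): [0.8, 0.2],
    (0, 1): [0.1, 0.9],
    (1, 0): [0.5, 0.5],
    (1, 1): [0.0, 1.0],
}
policy = [[0.6, 0.4],
          [0.3, 0.7]]
reward = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): -1.0, (1, 1): 2.0}

pairs = list(product(states, actions))
n = len(pairs)

# Transition operator over (s, a) pairs, initial pair distribution, reward vector.
T_theta = [[dynamics[(s, a)][s2] * policy[s2][a2] for (s, a) in pairs]
           for (s2, a2) in pairs]
mu_1 = [p_s1[s] * policy[s][a] for (s, a) in pairs]
r_vec = [reward[p] for p in pairs]


def expected_return(H):
    """[sum_{t=1}^{H} T^{t-1} mu_1]^T r, accumulated one time step at a time."""
    mu, total = mu_1, 0.0
    for _ in range(H):
        total += sum(m * r for m, r in zip(mu, r_vec))
        mu = [sum(T_theta[i][j] * mu[j] for j in range(n)) for i in range(n)]
    return total


def brute_force_return(H):
    """Enumerate every trajectory and weight its total reward by its probability."""
    total = 0.0
    for traj in product(pairs, repeat=H):
        prob = p_s1[traj[0][0]]
        for t, (s, a) in enumerate(traj):
            prob *= policy[s][a]
            if t + 1 < H:
                prob *= dynamics[(s, a)][traj[t + 1][0]]
        total += prob * sum(reward[p] for p in traj)
    return total


print(expected_return(3), brute_force_return(3))
```

The brute-force version enumerates $|\mathcal{S}\times\mathcal{A}|^{H}$ trajectories, so it is only usable as a check on very small problems.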

But what if the horizon $H$ is infinite? Typically, the objective is then written as the average reward $\frac{1}{H}\mathbb{E}[\dots]$ so that it does not diverge as $H\to\infty$. This expression does, in fact, behave well. In particular, provided the underlying Markov chain is

the terms from the infinite sum, past some sufficiently large threshold $T$, are selected from a stationary distribution. This is important because a stationary distribution has the property that

$$\bar{\mu}=\mathcal{T}_{\theta}\bar{\mu}$$

which allows us to solve for $\bar{\mu}$, as it implies

$$(\mathcal{T}_{\theta}-\mathbf{I})\bar{\mu}=0$$

or that $\bar{\mu}$ is an eigenvector of $\mathcal{T}_{\theta}$ with eigenvalue $1$.
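One simple way to find such a $\bar{\mu}$ numerically is to apply the operator repeatedly until the distribution stops changing (power iteration). This is a swapped-in technique for illustration, not the eigenvector solve described above, and it assumes the chain actually converges to a unique stationary distribution. The two-state operator below is hypothetical.

```python
# Hypothetical two-state chain; T[i][j] = p(next = i | current = j).
T = [[0.9, 0.5],
     [0.1, 0.5]]


def step(T, mu):
    """Apply the transition operator once: mu_{t+1} = T mu_t."""
    n = len(mu)
    return [sum(T[i][j] * mu[j] for j in range(n)) for i in range(n)]


def stationary(T, iters=200):
    """Power iteration: start from the uniform distribution and apply T repeatedly."""
    mu = [1.0 / len(T)] * len(T)
    for _ in range(iters):
        mu = step(T, mu)
    return mu


mu_bar = stationary(T)
print(mu_bar)           # approximately [5/6, 1/6] for this chain
print(step(T, mu_bar))  # applying T again leaves mu_bar (numerically) unchanged
```

For this chain the fixed point can also be found by hand: $\bar{\mu}_{0}=0.9\bar{\mu}_{0}+0.5\bar{\mu}_{1}$ together with $\bar{\mu}_{0}+\bar{\mu}_{1}=1$ gives $\bar{\mu}=(5/6,\ 1/6)$.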

Expectations and Stochasticity

One critical underlying principle of RL is that we care about expectations. Even when the reward itself is discontinuous (e.g. $r(\text{go left})=1$ while $r(\text{go right})=-1$), the expectation $\mathbb{E}_{\pi_{\theta}}[r(\mathbf{x})]$ remains differentiable in $\theta$, provided the policy's probabilities vary smoothly with $\theta$, because we differentiate with respect to the policy parameters (which determine the probabilities), not with respect to states. This allows us to use the same gradient-based optimization techniques widely used in machine learning.
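As a small worked example of this point, assume a hypothetical one-parameter policy that goes left with probability $\theta$ and right with probability $1-\theta$ (this parameterization is an assumption made for illustration). The reward jumps between $1$ and $-1$ depending on the action, yet the expected reward is linear in $\theta$:

$$
\mathbb{E}_{\pi_{\theta}}[r] = \theta\cdot 1 + (1-\theta)\cdot(-1) = 2\theta - 1, \qquad \frac{d}{d\theta}\,\mathbb{E}_{\pi_{\theta}}[r] = 2
$$

The gradient is well defined for every $\theta\in[0,1]$ and points toward placing more probability on going left.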

Anatomy of RL Algorithms

The simple cycle is

  1. Generate samples (i.e. run policy)
  2. Fit a model/estimate return
  3. Improve policy

and repeat.

Notably, the relative cost and complexity of these steps differ between algorithms.

  1. Generating samples is cheap for problems that can be simulated, but expensive for those that must be run in real time.
  2. Fitting a model or estimating the return is fast for simple gradient methods, but expensive for large model-based methods.
  3. Improving the policy is, again, fast for simple gradient methods, but expensive for large model-based methods with backpropagation.

Value Functions and $Q$-Functions

To capture the total expected reward associated with each (state, action) pair, we can define a function $Q^{\pi}(\mathbf{s}_{t},\mathbf{a}_{t})$ equal to the reward $r(\mathbf{s}_{t},\mathbf{a}_{t})$ plus the total expected reward from all later time steps, given that the $t$th (state, action) pair is $(\mathbf{s}_{t},\mathbf{a}_{t})$ and the policy is followed afterwards. This should hopefully be intuitive: it packages the notion of accounting for future rewards when maximizing total expected reward into one function, the $Q$-function. Formally,

$$Q^{\pi}(\mathbf{s}_{t},\mathbf{a}_{t})=\sum_{t'=t}^{T} \mathbb{E}_{\pi_{\theta}}[r(\mathbf{s}_{t'},\mathbf{a}_{t'})\mid \mathbf{s}_{t},\mathbf{a}_{t}]$$

A value function is a related concept that is based only on the state, i.e. $V^{\pi}(\mathbf{s}_{t})=\sum_{t'=t}^{T}\mathbb{E}_{\pi_{\theta}}[r(\mathbf{s}_{t'},\mathbf{a}_{t'})\mid \mathbf{s}_{t}]$ represents the total expected reward from time step $t$ onward given that the $t$th state is $\mathbf{s}_{t}$. Notably, $\mathbb{E}_{\mathbf{s}_{1}\sim p(\mathbf{s}_{1})}[V^{\pi}(\mathbf{s}_{1})]$ is the RL objective! Also, note that

$$V^{\pi}(\mathbf{s})=\mathbb{E}_{\mathbf{a}\sim \pi(\mathbf{a}\mid \mathbf{s})}[Q^{\pi}(\mathbf{s},\mathbf{a})]$$
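For a small tabular, finite-horizon problem, both functions can be computed exactly by working backward from the final time step. The sketch below relies on the recursion $Q_{t}(\mathbf{s},\mathbf{a})=r(\mathbf{s},\mathbf{a})+\mathbb{E}_{\mathbf{s}'}[V_{t+1}(\mathbf{s}')]$, which follows from the definitions and the Markov property but is not stated in the text. The MDP numbers and function names are hypothetical.

```python
# Hypothetical 2-state, 2-action MDP; every number is made up for illustration.
states, actions = [0, 1], [0, 1]
p_s1 = [0.5, 0.5]
dynamics = {
    (0, 0): [0.8, 0.2],
    (0, 1): [0.1, 0.9],
    (1, 0): [0.5, 0.5],
    (1, 1): [0.0, 1.0],
}
policy = [[0.6, 0.4],
          [0.3, 0.7]]
reward = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): -1.0, (1, 1): 2.0}


def q_and_v(horizon):
    """Return Q[t][s][a] and V[t][s] for t = 1..horizon (index 0 is unused)."""
    # V[horizon + 1] stays at zero: no reward is collected after the final step.
    V = [[0.0 for _ in states] for _ in range(horizon + 2)]
    Q = [[[0.0 for _ in actions] for _ in states] for _ in range(horizon + 2)]
    for t in range(horizon, 0, -1):
        for s in states:
            for a in actions:
                future = sum(dynamics[(s, a)][s2] * V[t + 1][s2] for s2 in states)
                Q[t][s][a] = reward[(s, a)] + future
            # V is the policy-weighted average of Q over actions.
            V[t][s] = sum(policy[s][a] * Q[t][s][a] for a in actions)
    return Q, V


def objective(horizon):
    """RL objective: expected value of the first state, E_{s_1}[V(s_1)]."""
    _, V = q_and_v(horizon)
    return sum(p_s1[s] * V[1][s] for s in states)


Q, V = q_and_v(3)
print(Q[1], V[1], objective(3))
```

At the final time step `Q` reduces to the immediate reward, and at every step `V[t][s]` equals the expectation of `Q[t][s][a]` under the policy, matching the identity above.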

So, how can we use $Q$-functions and value functions? Here are some important ideas.

  1. If $Q^{\pi}(\mathbf{s},\mathbf{a})$ is known, we can improve the policy by simply setting $\pi'(\mathbf{a}\mid \mathbf{s})=1$ for $\mathbf{a}=\underset{\mathbf{a}}{\arg\max}\ Q^{\pi}(\mathbf{s},\mathbf{a})$.
  2. Alternatively, we can compute a policy gradient: if $Q^{\pi}(\mathbf{s},\mathbf{a})>V^{\pi}(\mathbf{s})$, then $\mathbf{a}$ is a better-than-average action, so modifying $\pi(\mathbf{a}\mid \mathbf{s})$ to increase the probability of selecting $\mathbf{a}$ should improve the policy.
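Both ideas can be sketched on a tabular $Q^{\pi}$. The table values and policy below are hypothetical, and the difference $Q^{\pi}-V^{\pi}$ is labeled an "advantage" here, a common name that the text above does not use.

```python
# Hypothetical Q^pi table and policy for 2 states x 2 actions (illustrative numbers).
Q = [[1.7, 1.5],
     [0.2, 3.1]]
policy = [[0.6, 0.4],
          [0.3, 0.7]]


def greedy_policy(Q):
    """Idea 1: pi'(a | s) = 1 for a = argmax_a Q(s, a), and 0 for every other action."""
    new_policy = []
    for q_s in Q:
        best = max(range(len(q_s)), key=lambda a: q_s[a])
        new_policy.append([1.0 if a == best else 0.0 for a in range(len(q_s))])
    return new_policy


def advantages(Q, policy):
    """Idea 2: Q(s, a) - V(s), where V(s) = sum_a pi(a | s) Q(s, a)."""
    out = []
    for q_s, pi_s in zip(Q, policy):
        v = sum(p * q for p, q in zip(pi_s, q_s))
        out.append([q - v for q in q_s])
    return out


print(greedy_policy(Q))
print(advantages(Q, policy))
```

A positive entry in `advantages` marks a better-than-average action in that state, i.e. one whose probability a policy gradient step would tend to increase.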

Types of RL Algorithms

Model-based RL

The model typically learns $p(\mathbf{s}_{t+1}\mid \mathbf{s}_{t}, \mathbf{a}_{t})$, and has several options for how to use this to improve the policy:

Value-based RL

The model learns $V(\mathbf{s})$ or $Q(\mathbf{s},\mathbf{a})$ and then sets the policy based on its estimate, i.e. $\pi(\mathbf{s})=\underset{\mathbf{a}}{\arg\max}\ Q(\mathbf{s},\mathbf{a})$.

Policy gradient RL

The model evaluates the policy by summing the rewards, and then improves by computing the gradient of the objective.

Actor-Critic

Combination of value-based + policy gradient methods.

Why so many different RL algorithms?

They have different tradeoffs.

They make different assumptions.

Different things are easy in different settings.

Sample Efficiency

The most important consideration is whether the algorithm is off-policy. An off-policy algorithm (e.g. model-based) is able to improve the policy without generating new samples from that policy, while an on-policy algorithm (e.g. policy gradient) must generate new samples each time the policy changes. Typically, the tradeoff is compute cost vs. sample generation cost.

Stability and Ease of Use

To consider stability/ease of use, you should ask

Why does this matter? Because, unlike supervised learning, RL is often not gradient-descent based, and thus does not come with the same guarantees. For instance,