Lecture 4: Reinforcement Learning Basics
The Markov Decision Process
The Markov Decision Process (MDP) is a way of describing reinforcement learning problems.
- Markov: for the Markov property
- Decision: because the model makes decisions
For now, we will assume full observability, $o_t = s_t$, i.e. our policy is $\pi_\theta(a_t \mid s_t)$.
The critical difference between imitation and reinforcement learning is the lack of demonstration training examples for the model to learn from. Instead, in an MDP, we're provided a reward function $r(s, a)$, which indicates which (state, action) pairs are preferred. In fact, an MDP may be defined wholly by the following parameters.
- States $s \in \mathcal{S}$.
- Actions $a \in \mathcal{A}$.
- Rewards $r(s, a)$.
- Probabilities $p(s' \mid s, a)$.
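Markov chains originated as a representation for dynamical systems whose states satisfy the Markov property, i.e. $p(s_{t+1} \mid s_t, s_{t-1}, \dots, s_1) = p(s_{t+1} \mid s_t)$. A Markov chain may be represented by $\mathcal{M} = \{\mathcal{S}, \mathcal{T}\}$, where $\mathcal{S}$ is the state space and $\mathcal{T}$ is the transition operator. $\mathcal{T}$ is the linear operator applied to some vector $\mu_t$, whose entry $\mu_{t,i}$ is the probability of being in the $i$th state at time $t$, to produce $\mu_{t+1}$, i.e. $\mu_{t+1} = \mathcal{T}\mu_t$. Note that $\mathcal{T}_{i,j} = p(s_{t+1} = i \mid s_t = j)$. Markov chains, of course, inspired MDPs.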
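As a minimal sketch of the transition operator acting on a state distribution, the snippet below propagates a distribution through a small chain. The 3-state matrix is made up for illustration, and it assumes the convention that column $j$ holds the probabilities of moving out of state $j$, so each column sums to 1.

```python
import numpy as np

# Hypothetical 3-state chain. Column j holds p(s_{t+1} = i | s_t = j),
# so every column sums to 1 and mu_{t+1} = T @ mu_t.
T = np.array([
    [0.9, 0.2, 0.0],
    [0.1, 0.7, 0.5],
    [0.0, 0.1, 0.5],
])

def propagate(T, mu, steps):
    """Apply the transition operator `steps` times to the distribution mu."""
    for _ in range(steps):
        mu = T @ mu
    return mu

mu_1 = np.array([1.0, 0.0, 0.0])  # start in state 0 with certainty
mu_2 = propagate(T, mu_1, 1)      # equals the first column of T
mu_10 = propagate(T, mu_1, 9)     # distribution at time step 10
```

Because each column of the matrix sums to 1, every propagated vector remains a valid probability distribution.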
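More formally, we extend Markov chains to produce MDPs characterized by $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{T}, r\}$, where
- States $s \in \mathcal{S}$.
- Actions $a \in \mathcal{A}$.
- Transition operator $\mathcal{T}$, now a tensor with $\mathcal{T}_{i,j,k} = p(s_{t+1} = i \mid s_t = j, a_t = k)$.
- Reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$.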
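Given a policy $\pi_\theta$, it is possible to turn an MDP into a Markov chain. It suffices to define the states of the Markov chain to be the set of all pairs $(s, a)$ and define the transition operator using the policy, i.e. $p((s_{t+1}, a_{t+1}) \mid (s_t, a_t)) = p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_{t+1} \mid s_{t+1})$.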
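The construction can be sketched for a tabular MDP as follows. The array layouts (`P[s_next, s, a]`, `pi[a, s]`), the flattening of a pair $(s, a)$ to index `s * A + a`, and the numbers are all assumptions made for this example.

```python
import numpy as np

def induced_chain(P, pi):
    """Build the Markov chain over (state, action) pairs.

    P[s_next, s, a] = p(s_next | s, a)   (hypothetical array layout)
    pi[a, s]        = pi(a | s)
    Returns T of shape (S*A, S*A) with
    T[(s', a'), (s, a)] = p(s' | s, a) * pi(a' | s'),
    where a pair (s, a) is flattened to index s * A + a.
    """
    S, _, A = P.shape
    # T4[s', a', s, a] = P[s', s, a] * pi[a', s']
    T4 = np.einsum("nsa,bn->nbsa", P, pi)
    return T4.reshape(S * A, S * A)

# Made-up MDP with 2 states and 2 actions.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = [0.8, 0.2]
P[:, 0, 1] = [0.1, 0.9]
P[:, 1, 0] = [0.5, 0.5]
P[:, 1, 1] = [0.0, 1.0]
pi = np.array([[0.6, 0.3],   # pi(a=0 | s=0), pi(a=0 | s=1)
               [0.4, 0.7]])  # pi(a=1 | s=0), pi(a=1 | s=1)

T = induced_chain(P, pi)
```

Each column of `T` sums to 1, so the result is an ordinary Markov chain whose states are (state, action) pairs.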
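For partially observed MDPs, i.e. $o_t \neq s_t$, we must extend this definition slightly to $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{E}, r\}$, where
- States $s \in \mathcal{S}$.
- Actions $a \in \mathcal{A}$.
- Observations $o \in \mathcal{O}$.
- Transition operator $\mathcal{T}$.
- Emission operator $\mathcal{E}$, which gives $p(o_t \mid s_t)$.
- Reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$.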
This can also be written as a Markov chain; for now, though, we will continue with our assumption that $o_t = s_t$, i.e. fully observed MDPs.
Some Useful Concepts
For a finite horizon RL problem with policy $\pi_\theta$, we can write that the probability of any sequence of states and actions is
$$p_\theta(s_1, a_1, \dots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
by the chain rule of probability. $p_\theta(\tau)$, with $\tau = (s_1, a_1, \dots, s_T, a_T)$, is known as the trajectory distribution. Another notable distribution is the state marginal $p_\theta(s_t)$, which describes the probability of being in state $s_t$ at time step $t$ for some policy $\pi_\theta$. Critically, both distributions, while having very long, complex formulas, are easy to sample from: the trajectory distribution by simply simulating the MDP (just run the Markov chain), and the state marginal by running the policy in the real world and recording the state at time step $t$.
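A minimal sketch of sampling from the trajectory distribution in a tabular MDP, accumulating the log-probability of the sampled sequence along the way. The array layouts and the numbers are assumptions for this example; the final transition out of $(s_T, a_T)$ is left out of the log-probability because the next state is not part of the recorded trajectory.

```python
import numpy as np

def sample_trajectory(p0, P, pi, horizon, rng):
    """Sample (s_1, a_1, ..., s_T, a_T) and return it with its log-probability.

    p0[s] = p(s_1 = s), P[s_next, s, a] = p(s_next | s, a), pi[a, s] = pi(a | s).
    """
    states, actions = [], []
    s = rng.choice(len(p0), p=p0)
    log_prob = np.log(p0[s])
    for t in range(horizon):
        a = rng.choice(pi.shape[0], p=pi[:, s])
        log_prob += np.log(pi[a, s])
        states.append(int(s))
        actions.append(int(a))
        if t < horizon - 1:
            s_next = rng.choice(P.shape[0], p=P[:, s, a])
            log_prob += np.log(P[s_next, s, a])
            s = s_next
    return states, actions, log_prob

# Made-up MDP with 2 states and 2 actions; the start state is always 0.
rng = np.random.default_rng(0)
p0 = np.array([1.0, 0.0])
P = np.zeros((2, 2, 2))
P[:, 0, 0] = [0.8, 0.2]
P[:, 0, 1] = [0.1, 0.9]
P[:, 1, 0] = [0.5, 0.5]
P[:, 1, 1] = [0.0, 1.0]
pi = np.array([[0.6, 0.3],
               [0.4, 0.7]])

states, actions, log_prob = sample_trajectory(p0, P, pi, horizon=5, rng=rng)
```

Averaging an indicator of the state at step $t$ over many such samples gives a Monte Carlo estimate of the state marginal.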
The Objective of RL
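Simple: choose a policy that maximizes total expected reward over the entire horizon,
$$\theta^\star = \arg\max_\theta\, \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} r(s_t, a_t)\right],$$
where $\tau$ is the sequence/trajectory of states and actions and $p_\theta(\tau)$ is the trajectory distribution.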
Viewing it as an MDP in Markov chain form, we can derive that the total expected reward is
$$\sum_{t=1}^{T} \vec{r}^{\,\top} \mathcal{T}^{\,t-1} \mu_1,$$
where $\vec{r}$ is the vector of rewards for each $(s, a)$ pair, $\mu_1$ is the distribution over the first (state, action) pair, and $\mathcal{T}$ holds the transition probabilities according to the policy $\pi_\theta$. In practice, of course, this is never computed this way, but it does give a nice, concise representation of the total expected reward.
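For a small chain, the matrix form of the total expected reward, i.e. the reward vector dotted with the distribution at each step and summed over the horizon, can be evaluated directly. This is only an illustrative sketch; the 2-state chain and its rewards are made up, and each chain state stands for one (state, action) pair.

```python
import numpy as np

def total_expected_reward(T, r, mu_1, horizon):
    """Return sum_{t=1}^{horizon} r^T T^(t-1) mu_1."""
    total, mu = 0.0, mu_1
    for _ in range(horizon):
        total += r @ mu   # expected reward at the current step
        mu = T @ mu       # advance the distribution one step
    return total

# Hypothetical chain over two (state, action) pairs; columns sum to 1.
T = np.array([[0.5, 0.0],
              [0.5, 1.0]])
r = np.array([1.0, 0.0])
mu_1 = np.array([1.0, 0.0])

J = total_expected_reward(T, r, mu_1, horizon=3)  # 1 + 0.5 + 0.25 = 1.75
```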
But what if the horizon is infinite? Typically, the objective is then written as the average reward $\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\left[r(s_t, a_t)\right]$ so that it does not approach infinity as $T \to \infty$. This expression is, in fact, well behaved. In particular, provided the underlying Markov chain is
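- Ergodic: for every pair of states $s, s'$, it's possible to reach $s'$ from $s$ with nonzero probability.
- Aperiodic: the system does not return to a state only at multiples of some fixed interval, i.e. there is no period $k > 1$ such that every return to a state takes a multiple of $k$ steps.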
the terms of the infinite sum, past some sufficiently large time step $t$, are expectations under a stationary distribution $\mu$. This is important because a stationary distribution has the property that
$$\mu = \mathcal{T}\mu,$$
which allows us to solve for $\mu$, as it implies
$$(\mathcal{T} - I)\mu = 0,$$
or that $\mu$ is an eigenvector of $\mathcal{T}$ with eigenvalue $1$.
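Since a stationary distribution is left unchanged by the transition operator, it can be found numerically as the eigenvector whose eigenvalue is 1. A minimal sketch, assuming a column-stochastic matrix and using a made-up 2-state chain:

```python
import numpy as np

def stationary_distribution(T):
    """Return mu with T @ mu = mu and sum(mu) = 1 for a column-stochastic T."""
    eigvals, eigvecs = np.linalg.eig(T)
    k = np.argmin(np.abs(eigvals - 1.0))   # eigenvalue closest to 1
    mu = np.real(eigvecs[:, k])
    return mu / mu.sum()                   # fix sign and scale

# Hypothetical ergodic, aperiodic chain.
T = np.array([[0.9, 0.5],
              [0.1, 0.5]])

mu = stationary_distribution(T)  # approximately [5/6, 1/6]
```

Normalizing by the sum is needed because an eigenvector is only defined up to scale and sign.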
Expectations and Stochasticity
One critical underlying principle of RL is that we care about expectations. Even with a discontinuous problem description (e.g. a reward of $+1$ while some condition holds and $-1$ otherwise), the expected reward remains differentiable in $\theta$ because it is being differentiated w.r.t. the policy parameters (which induce probabilities), not states. This allows us to use the same gradient-based optimization techniques widely popularized in machine learning.
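As a small worked example of this point, consider a hypothetical one-step problem (not taken from the notes) in which the policy produces a bad outcome with probability $\theta$ and a good outcome otherwise. The reward is a step function of the outcome, yet its expectation is linear in $\theta$:

```latex
% Hypothetical one-step problem: the policy produces the bad outcome
% with probability \theta and the good outcome otherwise.
\begin{align*}
r(\text{good}) &= +1, \qquad r(\text{bad}) = -1, \qquad p_\theta(\text{bad}) = \theta \\
\mathbb{E}_{p_\theta}[r] &= (1 - \theta)(+1) + \theta(-1) = 1 - 2\theta \\
\frac{d}{d\theta}\,\mathbb{E}_{p_\theta}[r] &= -2
\end{align*}
```

The gradient exists everywhere even though the reward itself has no useful derivative.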
Anatomy of RL Algorithms
The simple cycle is
- Generate samples (i.e. run policy)
- Fit a model/estimate return
- Improve policy
and repeat.
Notably, the relative cost and complexity of these steps differ between algorithms.
- Sample generation: cheap for problems that can be simulated, but expensive for those that must be run in real time.
- Fitting a model/estimating the return: fast for simple policy gradients, where it amounts to summing rewards, but expensive for large model-based methods.
- Policy improvement: again, fast for simple gradient steps, but expensive for large model-based methods with backpropagation.
Value Functions and $Q$-Functions
To capture the total expected reward from each (state, action) pair, we can define a function $Q^\pi(s_t, a_t)$: the reward at step $t$ plus the total expected future reward, given that the $t$th (state, action) pair is $(s_t, a_t)$. This should hopefully be intuitive, and it folds the notion of considering future rewards when maximizing total expected reward into one function, the $Q$-function. Formally,
$$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}\left[r(s_{t'}, a_{t'}) \mid s_t, a_t\right].$$
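A value function is a related concept that is based only on the state, i.e. $V^\pi(s_t)$ represents the total expected reward from step $t$ onward given that the $t$th state is $s_t$. Notably, $\mathbb{E}_{s_1 \sim p(s_1)}\left[V^\pi(s_1)\right]$ is the RL objective! Also, note that
$$V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[Q^\pi(s_t, a_t)\right].$$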
So, how can we use $Q$-functions and value functions? Here are some important ideas.
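- If $Q^\pi(s, a)$ is known, we can improve the policy by simply setting $\pi'(a \mid s) = 1$ for $a = \arg\max_a Q^\pi(s, a)$.
- Alternatively, we can compute a policy gradient: if $Q^\pi(s, a) > V^\pi(s)$, then $a$ is a better-than-average action, so modifying $\pi(a \mid s)$ to increase the probability of selecting $a$ should improve the policy.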
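Both ideas can be illustrated on a tabular, finite-horizon problem with known dynamics. The sketch below computes the two functions by backward recursion and then forms the greedy policy and the gap between each action's value and the state's value. The array layouts and the toy MDP (action 0 stays, action 1 switches state, state 1 pays reward 1) are assumptions made for this example.

```python
import numpy as np

def evaluate_policy(P, r, pi, horizon):
    """Finite-horizon Q^pi and V^pi by backward recursion.

    P[s_next, s, a] = p(s_next | s, a), r[s, a] = reward, pi[a, s] = pi(a | s).
    Returns Q of shape (horizon, S, A) and V of shape (horizon, S).
    """
    S, A = r.shape
    Q = np.zeros((horizon, S, A))
    V = np.zeros((horizon, S))
    for t in reversed(range(horizon)):
        future = V[t + 1] if t + 1 < horizon else np.zeros(S)
        # Q_t(s, a) = r(s, a) + sum_{s'} p(s' | s, a) V_{t+1}(s')
        Q[t] = r + np.einsum("nsa,n->sa", P, future)
        # V_t(s) = sum_a pi(a | s) Q_t(s, a)
        V[t] = np.einsum("as,sa->s", pi, Q[t])
    return Q, V

def greedy_policy(Q_t):
    """pi'(a | s) = 1 for a = argmax_a Q(s, a), stored as pi[a, s]."""
    S, A = Q_t.shape
    pi_new = np.zeros((A, S))
    pi_new[Q_t.argmax(axis=1), np.arange(S)] = 1.0
    return pi_new

# Toy MDP: action 0 stays in place, action 1 switches state.
P = np.zeros((2, 2, 2))
for s in range(2):
    P[s, s, 0] = 1.0
    P[1 - s, s, 1] = 1.0
r = np.array([[0.0, 0.0],
              [1.0, 1.0]])
pi = np.full((2, 2), 0.5)  # uniform random policy

Q, V = evaluate_policy(P, r, pi, horizon=2)
advantage = Q[0] - V[0][:, None]   # positive entries are better than average
pi_new = greedy_policy(Q[0])
```

Here the positive entries of `advantage` coincide with the actions chosen by `pi_new`: switch when in state 0, stay when in state 1.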
Types of RL Algorithms
- Policy gradient methods compute the gradient of the objective with respect to the parameters. They tend to be simple, but require extremely large datasets.
- Value-based methods estimate the value function or $Q$-function of the optimal policy. (There is no explicit policy.)
- Actor-critic methods estimate the value function or $Q$-function of the current policy and use it to improve the policy. (They train a critic to improve/train the actor.) They tend to be the most efficient methods.
- Model-based methods estimate the transition model, and are quite varied and often domain-specific.
Model-based RL
The model typically learns $p(s_{t+1} \mid s_t, a_t)$, and there are several options for how to use it to improve the policy:
- Use the model to plan (basically backprop for trajectory optimization/related algorithms).
- Backpropagate gradients into the policy.
- Learn a value function (e.g. dynamic programming).
Value-based RL
The model learns $V(s)$ or $Q(s, a)$ and then sets the policy based on its estimate, i.e. $\pi(s) = \arg\max_a Q(s, a)$.
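A minimal sketch of the value-based idea in the simplest possible setting: a tabular, finite-horizon MDP whose dynamics are known, so the optimal action values can be computed exactly by backward recursion rather than estimated from samples as a practical method would. The array layouts and the toy MDP are assumptions for this example.

```python
import numpy as np

def optimal_q(P, r, horizon):
    """Q*_t(s, a) = r(s, a) + sum_{s'} p(s' | s, a) max_{a'} Q*_{t+1}(s', a')."""
    S, A = r.shape
    Q = np.zeros((horizon, S, A))
    next_v = np.zeros(S)               # value after the final step is zero
    for t in reversed(range(horizon)):
        Q[t] = r + np.einsum("nsa,n->sa", P, next_v)
        next_v = Q[t].max(axis=1)      # act greedily from step t onward
    return Q

# Toy MDP: action 0 stays in place, action 1 switches state; state 1 pays 1.
P = np.zeros((2, 2, 2))
for s in range(2):
    P[s, s, 0] = 1.0
    P[1 - s, s, 1] = 1.0
r = np.array([[0.0, 0.0],
              [1.0, 1.0]])

Q = optimal_q(P, r, horizon=3)
policy = Q[0].argmax(axis=1)  # the policy is implicit: argmax over actions
```

No policy is stored or trained; the action is read off the estimated values whenever it is needed.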
Policy gradient RL
The model evaluates the policy by summing the rewards, and then improves by computing the gradient of the objective.
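A minimal sketch of this loop on a one-step problem with two actions, using the score-function (REINFORCE) gradient estimator with a softmax policy. The problem, the hyperparameters, and the function names are all made up for illustration; the three commented steps follow the generate samples / estimate return / improve policy cycle.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def reinforce_bandit(rewards, iters=200, batch=64, lr=0.5, seed=0):
    """REINFORCE on a one-step problem with a softmax policy over actions."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(rewards))
    for _ in range(iters):
        pi = softmax(theta)
        # 1. Generate samples by running the policy.
        actions = rng.choice(len(rewards), size=batch, p=pi)
        # 2. Estimate the return of each sample (here, a single reward).
        returns = rewards[actions]
        # 3. Improve the policy: grad log pi(a) = onehot(a) - pi for a softmax.
        grad_log_pi = np.eye(len(rewards))[actions] - pi
        grad = (grad_log_pi * returns[:, None]).mean(axis=0)
        theta = theta + lr * grad      # gradient ascent on expected reward
    return softmax(theta)

# Hypothetical rewards: action 1 is better than action 0.
pi_final = reinforce_bandit(np.array([0.0, 1.0]))
```

With these rewards the policy shifts probability toward action 1 over the course of training. Every update needs a fresh batch drawn from the current policy, which is what makes the method on-policy.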
Actor-Critic
Combination of value-based + policy gradient methods.
Why so many different RL algorithms?
They have different tradeoffs.
- Sample efficiency
- Stability and ease of use
They make different assumptions.
- Stochastic or deterministic systems
- Continuous or discrete problem
- Episodic or infinite horizon
Different things are easy in different settings.
- Easier to represent the policy?
- Easier to represent the model?
Sample Efficiency
The most important consideration is whether the algorithm is off-policy. An off-policy algorithm (e.g. model-based) is able to improve the policy without generating new samples from that policy, while an on-policy algorithm (e.g. policy gradient) must generate new samples whenever the policy changes. Typically, the tradeoff is compute cost vs. sample generation cost.
Stability and Ease of Use
To consider stability/ease of use, you should ask
- Does it converge?
- If so, what does it converge to?
- Does it always converge?
Why does this matter? Because, unlike supervised learning, RL is often not gradient-descent based, and thus does not come with the same guarantees. For instance,
- Value-based algorithms frequently have no guarantees of convergence.
- Model-based algorithms will converge, but won't necessarily converge to a good policy.
- Policy gradient algorithms do converge, at least to a local optimum of the objective, because they compute gradients directly on the objective function (gradient ascent).