Lecture 6: Actor-Critic Algorithms
Improving Policy Gradient with Values
Previously, we discussed the application of causality to reduce variance in policy gradient methods. In essence, we multiply the gradient for each time step by the "reward-to-go" $\hat{Q}_{i,t}$, the rewards received starting from this time step until the end of the sample.
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \hat{Q}_{i,t}, \qquad \hat{Q}_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$$
This reward-to-go is actually a random variable, since the rest of the trajectory is random (the policy itself is stochastic). That is, if you were to start from the same state-action pair multiple times, the reward-to-go would vary across rollouts due to the stochasticity of the policy. Naturally, this random variable has some associated variance; in fact, we can produce an estimate of the expected reward-to-go that has lower variance relative to our existing single-sample estimate $\hat{Q}_{i,t}$, i.e.
$$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\left[ r(s_{t'}, a_{t'}) \mid s_t, a_t \right]$$
Ideally, we'd like our estimator to be an unbiased estimate of the true expected reward-to-go.
The current estimate $\hat{Q}_{i,t}$ is unbiased (it is the same formula, after all, just taken over one sampled trajectory instead of the whole space of trajectories) but suffers from high variance.
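To make the "random variable" point concrete, here is a minimal sketch that computes the reward-to-go for a sampled reward sequence and then measures how much it varies across rollouts. The rollout model (`sample_rewards`, a coin flip per step) is a hypothetical stand-in, not something from the lecture.

```python
import random

def reward_to_go(rewards):
    """Return [sum(rewards[t:]) for every t], computed in one backward pass."""
    out, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out

def sample_rewards(rng, horizon=5):
    # Hypothetical stochastic rollout: each step pays 1 with probability 0.5.
    return [float(rng.random() < 0.5) for _ in range(horizon)]

rng = random.Random(0)
# Reward-to-go from the first step, measured over many independent rollouts.
samples = [reward_to_go(sample_rewards(rng))[0] for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
```

Each entry of `samples` is one single-sample estimate; `mean` approximates the expected reward-to-go, and `var` is the variance that a better estimator would try to remove.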
Baseline?
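Recall the baseline that replaced our reward-to-go estimate $Q^\pi(s_{i,t}, a_{i,t})$ with $Q^\pi(s_{i,t}, a_{i,t}) - b$, for some value $b$. What should this value be? Previously, we discussed how the average reward served as a good baseline. However, this was for policy gradient methods without causality; what's a good baseline for our reward-to-go, i.e. $Q^\pi(s_{i,t}, a_{i,t})$?
As it turns out, using the mean is still good; but, instead of averaging across total rewards, we average across the possible $Q$-values for the state at time $t$ instead, i.e. we use the average reward-to-go from that state.
$$V^\pi(s_t) = E_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[ Q^\pi(s_t, a_t) \right]$$
This is known as the value function. As with the average reward baseline, this results in an estimator that retains unbiasedness while lowering variance. Thus,
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( Q^\pi(s_{i,t}, a_{i,t}) - V^\pi(s_{i,t}) \right)$$
The expression $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$ is so important it's denoted the advantage.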
Positive advantage is assigned to actions that are better than average for the current state, and negative advantage is assigned to those that are worse than average.
State/State-Action Value Functions
In short,
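- $Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}[r(s_{t'}, a_{t'}) \mid s_t, a_t]$, the $Q$-function, is the total expected reward from taking $a_t$ in $s_t$.
- $V^\pi(s_t) = E_{a_t \sim \pi_\theta(a_t \mid s_t)}[Q^\pi(s_t, a_t)]$, the value function, is the total expected reward from $s_t$.
- $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$, the advantage function, is the measure of how much better $a_t$ is than the policy's average action in $s_t$.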
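The three definitions can be checked numerically for a single state with a handful of discrete actions. This is a small sketch; the $Q$-values and action probabilities below are made-up numbers chosen only for illustration.

```python
def value_from_q(q_values, action_probs):
    """V(s) = E_{a ~ pi(.|s)}[Q(s, a)] for one state with discrete actions."""
    return sum(p * q for p, q in zip(action_probs, q_values))

def advantages(q_values, action_probs):
    """A(s, a) = Q(s, a) - V(s) for every action in the state."""
    v = value_from_q(q_values, action_probs)
    return [q - v for q in q_values]

# Hypothetical numbers for one state with three actions.
q = [1.0, 3.0, 2.0]
pi = [0.25, 0.25, 0.5]
v = value_from_q(q, pi)               # 0.25 + 0.75 + 1.0 = 2.0
adv = advantages(q, pi)               # [-1.0, 1.0, 0.0]
expected_adv = value_from_q(adv, pi)  # advantages average to zero under pi
```

The second action is better than average (positive advantage), the first is worse, and the advantage averages to zero under the policy, which is why subtracting $V^\pi$ acts as a baseline.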
An actor-critic method will contain two models: one to capture the policy (actor), and one to estimate the advantage function (critic). The latter estimator is typically a $Q$-function or value function.
Policy Evaluation
Policy evaluation is the process of using a policy $\pi$ to estimate $V^\pi$ or $Q^\pi$.
$Q^\pi$ has a nice, recursive expression,
$$Q^\pi(s_t, a_t) = r(s_t, a_t) + E_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\left[ V^\pi(s_{t+1}) \right]$$
which we can estimate with a single sample as
$$Q^\pi(s_t, a_t) \approx r(s_t, a_t) + V^\pi(s_{t+1})$$
This introduces a bit of variance due to replacing an expectation with a single sample, but not too much since it's only over a single step. This results in a nice expression for the advantage, though, that is expressed only in terms of $V^\pi$.
$$A^\pi(s_t, a_t) \approx r(s_t, a_t) + V^\pi(s_{t+1}) - V^\pi(s_t)$$
Thus, we can just try to fit a model $\hat{V}^\pi_\phi$ with parameters $\phi$ to estimate $V^\pi$. (Note, this is possible with just $Q$-functions too, but this is how actor-critic methods were originally done).
This estimator of $Q^\pi$ remains unbiased (given the true $V^\pi$) because the expectation was replaced with a single sample taken from the true data generating distribution.
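As a minimal sketch of the single-sample advantage, the function below evaluates $r + V(s_{t+1}) - V(s_t)$ along one trajectory. The critic outputs in `values` are hypothetical, and padding with a trailing zero for "no state after the last step" is a convention of this sketch.

```python
def one_step_advantage(reward, v_s, v_next):
    """A(s_t, a_t) ~= r(s_t, a_t) + V(s_{t+1}) - V(s_t), from a single transition."""
    return reward + v_next - v_s

rewards = [1.0, 0.0, 2.0]
values = [2.5, 1.5, 2.0, 0.0]   # V(s_1), V(s_2), V(s_3), plus 0 after the final step
advs = [one_step_advantage(r, values[t], values[t + 1]) for t, r in enumerate(rewards)]
```

Only a value function is needed here; no separate $Q$-function model is evaluated.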
The RL objective is just $J(\theta) = E_{s_1 \sim p(s_1)}\left[ V^\pi(s_1) \right]$.
So how do we fit a model $\hat{V}^\pi_\phi$?
Monte Carlo Method
Monte Carlo policy evaluation is the most natural method, and just involves generating some samples and averaging their rewards-to-go together (what our policy gradient method already does!).
$$V^\pi(s_t) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$$
One drawback, however, is that to produce multiple samples for a single state, it requires resetting the simulator/environment back to the exact same state, which isn't necessarily possible; however, we can still use the single sample estimator.
$$V^\pi(s_t) \approx \sum_{t'=t}^{T} r(s_{t'}, a_{t'})$$
Critically, this single sample estimator can still be effective! The key idea is that many states observed from our data will be similar, and a good neural network/model should be able to learn this similarity and effectively "average" their resulting trajectories/rewards-to-go together. Thus, the model we train to predict $V^\pi$ will actually provide better estimates than the single sample estimates themselves.
Actual Monte Carlo policy evaluation, i.e. resetting the environment and measuring multiple trajectories, is better, but fitting a model to single-sample targets works well in practice and is effectively an approximation of true Monte Carlo estimation.
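The model is typically fit with some sort of supervised regression, e.g. MSE regression on training data $\{(s_{i,t}, y_{i,t})\}$,
$$\mathcal{L}(\phi) = \frac{1}{2} \sum_{i} \left\| \hat{V}^\pi_\phi(s_i) - y_i \right\|^2$$
where the target label is
$$y_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$$
and $y_{i,t}$ is the measured reward-to-go for that trajectory, i.e. a single sample estimate of the value function.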
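The regression above can be sketched with the simplest possible "model": a lookup table, for which the least-squares solution is just the mean target per state. This stands in for the neural network only to show the averaging effect; a table cannot generalize across similar-but-distinct states the way a network can. The trajectories are hypothetical.

```python
from collections import defaultdict

def monte_carlo_targets(trajectory):
    """Turn [(state, reward), ...] into [(state, reward_to_go), ...]."""
    targets, running = [], 0.0
    for state, reward in reversed(trajectory):
        running += reward
        targets.append((state, running))
    return targets[::-1]

def fit_tabular_value(dataset):
    """Least-squares fit of a lookup table: the minimizer is the per-state mean."""
    totals, counts = defaultdict(float), defaultdict(int)
    for state, y in dataset:
        totals[state] += y
        counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Two hypothetical trajectories that both pass through states "A" and "B".
traj_1 = [("A", 0.0), ("B", 1.0), ("C", 3.0)]
traj_2 = [("A", 0.0), ("B", 1.0), ("D", 1.0)]
data = monte_carlo_targets(traj_1) + monte_carlo_targets(traj_2)
v_hat = fit_tabular_value(data)
```

State `"B"` receives single-sample targets of 4 and 2 from the two trajectories, and the fitted value averages them, which is better than either sample alone.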
Bootstrapping
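Instead of using the single-sample estimate of $V^\pi(s_{i,t})$ to generate our target labels, however, we can use our previously fitted value function $\hat{V}^\pi_\phi$ to estimate $V^\pi(s_{i,t+1})$, i.e.
$$y_{i,t} = r(s_{i,t}, a_{i,t}) + \hat{V}^\pi_\phi(s_{i,t+1})$$
This is known as bootstrapping.
$y_{i,t}$, as a target value, is treated as a constant when calculating the gradient with respect to $\phi$. This is usually denoted by the stop-gradient operator $\text{sg}[\cdot]$, e.g.
$$\mathcal{L}(\phi) = \frac{1}{2} \sum_{i,t} \left\| \hat{V}^\pi_\phi(s_{i,t}) - \text{sg}\left[ r(s_{i,t}, a_{i,t}) + \hat{V}^\pi_\phi(s_{i,t+1}) \right] \right\|^2$$
This is used for bootstrapping because $\phi$ is a parameter in calculating $y_{i,t}$, but its gradient should not be included when calculating $\nabla_\phi \mathcal{L}(\phi)$ because the target value should be a constant, and should not be "trained" or "learned." (It wouldn't really make sense for the model to learn to adjust the target value it's trying to match, now would it?)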
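The effect of the stop-gradient can be seen with a one-parameter linear critic, where both gradients can be written by hand. This is a sketch under assumed names: `v`, the scalar feature `x`, and the learning rate are all hypothetical, and an autograd library would express the same thing with its own stop-gradient or detach call.

```python
def v(w, x):
    """Linear critic V(s) = w * x(s) with a single scalar feature."""
    return w * x

def semi_gradient(w, x, r, x_next):
    """Gradient of 0.5 * (V(s) - sg[y])^2 with y = r + V(s') held constant."""
    y = r + v(w, x_next)          # target: computed with w, but NOT differentiated
    return (v(w, x) - y) * x      # only V(s) contributes to d/dw

def full_gradient(w, x, r, x_next):
    """Same loss, but (incorrectly) differentiating through the target as well."""
    y = r + v(w, x_next)
    return (v(w, x) - y) * (x - x_next)

w, x, r, x_next = 0.5, 2.0, 1.0, 1.0
g_semi = semi_gradient(w, x, r, x_next)   # (1.0 - 1.5) * 2.0 = -1.0
g_full = full_gradient(w, x, r, x_next)   # (1.0 - 1.5) * 1.0 = -0.5
w_new = w - 0.1 * g_semi                  # one gradient-descent step on the critic
```

Without the stop-gradient, part of the update goes into moving the target toward the prediction, which is the behavior the text warns against.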
Use of bootstrapping creates a biased estimator, due to an imperfect critic $\hat{V}^\pi_\phi$. Despite this, it often performs much better than the standard Monte Carlo estimator because it substantially reduces variance.
Bootstrapped policy evaluation is a form of temporal difference (TD) learning, which we will discuss in further detail in future lectures.
This has some issues, though, for infinite horizon RL problems. Imagine an environment where the reward for every single action (regardless of state) is positive, e.g. $r = 1$. Regardless of what the neural network/model $\hat{V}^\pi_\phi$ is initialized to, its predictions will grow without bound, as the cyclical nature of bootstrapping causes increases in $\hat{V}^\pi_\phi(s_{t+1})$ to increase $y_t$, which increases $\hat{V}^\pi_\phi(s_t)$, which increases $y_{t-1}$, etc.
Discount Factors
We can fix this, though, with discounting! Discounting represents the idea that sooner rewards are better than later rewards with the same value, and is mathematically
$$y_{i,t} \approx r(s_{i,t}, a_{i,t}) + \gamma \hat{V}^\pi_\phi(s_{i,t+1})$$
for some constant discount factor $\gamma \in [0, 1]$ (in practice, $\gamma$ is close to $1$, e.g. $\gamma = 0.99$).
Notably, this changes the MDP. Every state now has a probability $1 - \gamma$ of transitioning to a "death" or absorbing state, in which no more rewards may be received and out of which there are no transitions. Thus, intuitively, discounting represents a belief that there is a probability $1 - \gamma$ that the horizon will end now, and therefore it's more desirable to receive rewards sooner than later.
Discounting also helps reduce variance. By deprioritizing rewards further in the future, we reduce variance because those rewards are more uncertain (higher variance) than more immediate rewards.
With discounting, we practically always use the reward-to-go formulation, i.e. causality. Why? Because this discounts rewards relative to the current time step. Without causality, the discount applied to a later reward would be measured not relative to the current time step, but relative to the start of the trajectory, $t = 1$.
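The difference between the two discounting conventions is easy to see numerically. In this sketch, `discounted_reward_to_go` restarts the discount at every time step (the causal form), while `discounted_from_start` anchors it at the first step; both function names are illustrative.

```python
def discounted_reward_to_go(rewards, gamma):
    """sum_{t' >= t} gamma^(t' - t) * r_t': discounting restarts at every step t."""
    out, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def discounted_from_start(rewards, gamma):
    """Same sums, but every reward is discounted relative to the first step."""
    rtg = discounted_reward_to_go(rewards, gamma)
    return [gamma ** t * g for t, g in enumerate(rtg)]

rewards = [1.0, 1.0, 1.0]
relative = discounted_reward_to_go(rewards, 0.5)   # [1.75, 1.5, 1.0]
anchored = discounted_from_start(rewards, 0.5)     # [1.75, 0.75, 0.25]
```

With the anchored form, later time steps receive a much smaller learning signal simply because they occur later, even though the decisions made there matter just as much from that step's own point of view.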
Time-Varying vs. Time-Invariant
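Oftentimes, we don't actually care about the time step at which data is collected! Instead, we only care about the transitions themselves, i.e. we consider $(s, a, s')$ instead of $(s_t, a_t, s_{t+1})$. This just lets us change notation around, i.e. training data becomes
$$\left\{ \left( s_i, \; r(s_i, a_i) + \gamma \hat{V}^\pi_\phi(s_i') \right) \right\}$$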
and similarly for other expressions. The problem itself doesn't change.
Examples
- TD-Gammon played backgammon, using a value function that just estimated the expected outcome given the board state
- AlphaGo played Go, using the same kind of value function but a bigger, more advanced model
The Actor-Critic Algorithm
Basic Actor-Critic
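- Sample $\{(s_i, a_i, r_i, s_i')\}$ from $\pi_\theta(a \mid s)$ (run policy).
- Evaluate $y_i = r_i + \gamma \hat{V}^\pi_\phi(s_i')$.
- Refit $\hat{V}^\pi_\phi$ to targets $y_i$, minimizing $\frac{1}{2} \sum_i \| \hat{V}^\pi_\phi(s_i) - y_i \|^2$.
- Evaluate $\hat{A}^\pi(s_i, a_i) = r_i + \gamma \hat{V}^\pi_\phi(s_i') - \hat{V}^\pi_\phi(s_i)$.
- Compute $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \nabla_\theta \log \pi_\theta(a_i \mid s_i) \, \hat{A}^\pi(s_i, a_i)$.
- Update $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.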
- Repeat!
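The loop above can be sketched end to end on a toy problem. Everything specific here is an assumption made for illustration: a 4-state corridor where reaching the right end pays 1, a tabular softmax actor, a tabular critic, and the hyperparameters. It is not the lecture's implementation, only one instance of the same steps.

```python
import math
import random

N_STATES, GOAL, GAMMA, HORIZON = 4, 3, 0.9, 10

def policy_probs(theta, s):
    """Softmax over the two logits of state s (0 = left, 1 = right)."""
    m = max(theta[s])
    exps = [math.exp(l - m) for l in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]

def step(s, a):
    """Hypothetical corridor dynamics: reward 1 for reaching GOAL, else 0."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    done = s_next == GOAL
    return s_next, (1.0 if done else 0.0), done

def collect(theta, rng, episodes):
    """Run the current policy and return a list of transitions."""
    batch = []
    for _ in range(episodes):
        s = 0
        for _ in range(HORIZON):
            a = 1 if rng.random() < policy_probs(theta, s)[1] else 0
            s_next, r, done = step(s, a)
            batch.append((s, a, r, s_next, done))
            s = s_next
            if done:
                break
    return batch

def td_target(v, r, s_next, done):
    """Bootstrapped target r + gamma * V(s'), with no bootstrap at termination."""
    return r + (0.0 if done else GAMMA * v[s_next])

def train(iterations=200, episodes=20, lr_v=0.1, lr_pi=0.5, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]   # actor: tabular logits
    v = [0.0] * N_STATES                            # critic: tabular values
    for _ in range(iterations):
        batch = collect(theta, rng, episodes)
        # Refit the critic toward targets built from a frozen copy of itself.
        frozen = list(v)
        for s, a, r, s_next, done in batch:
            v[s] += lr_v * (td_target(frozen, r, s_next, done) - v[s])
        # Advantage-weighted policy gradient, then a gradient-ascent step.
        grad = [[0.0, 0.0] for _ in range(N_STATES)]
        for s, a, r, s_next, done in batch:
            adv = td_target(v, r, s_next, done) - v[s]
            probs = policy_probs(theta, s)
            for b in range(2):
                grad[s][b] += ((1.0 if a == b else 0.0) - probs[b]) * adv / len(batch)
        for s in range(N_STATES):
            for b in range(2):
                theta[s][b] += lr_pi * grad[s][b]
    return theta, v

theta, v = train()
p_right = [policy_probs(theta, s)[1] for s in range(GOAL)]
```

For a softmax policy, the gradient of $\log \pi_\theta(a \mid s)$ with respect to the logit of action $b$ is $\mathbb{1}[a = b] - \pi_\theta(b \mid s)$, which is what the inner loop accumulates. After training, `p_right` holds the probability of moving toward the goal in each non-terminal state.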
Online Actor-Critic
One may realize that, because we're just using transitions as training data, rather than whole trajectories, we don't need to use a whole trajectory or set of trajectories each iteration of model training.
The most extreme example of this is using only one transition to train the model each iteration. This actually allows us to have a fully online actor-critic algorithm that, while acting in the simulation/environment, takes one transition, trains the model on that transition, and then uses the updated model to decide its next action.
This, however, has several issues.
- It is biased, as $\hat{V}^\pi_\phi$ is essentially "out of date" by one time step.
- The data is not i.i.d. anymore! The next time step's data is dependent on the current time step's data.
- Small batch size of $1$ is very volatile.
In practice, this can actually work, but only with multiple parallel workers and lots of hyperparameter tuning.
On-Policy Actor-Critic
Can we improve the basic actor-critic algorithm? Yes!
- Better ways to estimate $\hat{A}^\pi$. (Improve the critic)
- Better ways to estimate $\nabla_\theta J(\theta)$. (Improve the actor)
So far, we've seen two methods of estimating $A^\pi$. Actor-critic methods lower variance, but add some bias due to an imperfect critic, i.e. $\hat{V}^\pi_\phi \neq V^\pi$. Policy-gradient methods are entirely unbiased, but have higher variance due to relying on a single sample estimate.
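The actor-critic policy gradient estimators here all (generally) take the form
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \hat{Q}_{i,t} - b \right)$$
for an estimator $\hat{Q}_{i,t}$ of the true $Q$-function (or the advantage function) and a baseline $b$. The key idea is that $\hat{Q}_{i,t}$ must remain an unbiased estimator of the $Q$-function to keep the estimator unbiased. However, the baseline $b$ need only be independent of the current action to keep the estimator unbiased.
Is it possible to use $\hat{V}^\pi_\phi$ but still produce an unbiased estimate?
In fact, we can, with the following estimator. (Note that we are actually improving the critic here).
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \left( \sum_{t'=t}^{T} \gamma^{t'-t} r(s_{i,t'}, a_{i,t'}) \right) - \hat{V}^\pi_\phi(s_{i,t}) \right)$$
which is essentially a compromise between the two methods, i.e. it's policy-gradient but the constant baseline is replaced by the value function. This produces a lower variance, unbiased estimator. (Sidenote: this is also a discounted, reward-to-go formulation).
As long as the baseline used doesn't depend on $a_t$, it remains unbiased. See Lecture 5 for the proof of an unbiased baseline; it may be reused here.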
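For reference, the argument can be sketched in one line for a state-dependent baseline $b(s_t)$ (here $b(s_t) = \hat{V}^\pi_\phi(s_t)$). This is the standard identity, written under the assumption that differentiation and integration over actions can be exchanged:

$$
\begin{aligned}
E_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, b(s_t) \right]
&= b(s_t) \int \pi_\theta(a_t \mid s_t) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, da_t \\
&= b(s_t) \int \nabla_\theta \pi_\theta(a_t \mid s_t) \, da_t \\
&= b(s_t) \, \nabla_\theta \int \pi_\theta(a_t \mid s_t) \, da_t = b(s_t) \, \nabla_\theta 1 = 0
\end{aligned}
$$

The baseline term therefore contributes nothing in expectation, whatever the quality of $\hat{V}^\pi_\phi$; it can only change the variance.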
However, it'd be nice if we could have a sort of "sliding scale" that determines the degree of mixing between the two methods, rather than only having this one combination option...
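Eligibility Traces and $n$-step Returns
Let $\hat{A}^\pi_C$ denote the standard actor-critic estimator of advantage
$$\hat{A}^\pi_C(s_t, a_t) = r(s_t, a_t) + \gamma \hat{V}^\pi_\phi(s_{t+1}) - \hat{V}^\pi_\phi(s_t)$$
Let $\hat{A}^\pi_{MC}$ denote the standard Monte Carlo estimator of advantage
$$\hat{A}^\pi_{MC}(s_t, a_t) = \sum_{t'=t}^{\infty} \gamma^{t'-t} r(s_{t'}, a_{t'}) - \hat{V}^\pi_\phi(s_t)$$
In RL problems, we'd naturally expect higher variance further in the future. Thus, ideally, we'd like to use the unbiased Monte Carlo estimator for time steps closer to the present, where its variance is still tolerable, and lower the variance (at the cost of some bias) as time moves further into the future.
One way of achieving this is to use an $n$-step return estimator.
$$\hat{A}^\pi_n(s_t, a_t) = \sum_{t'=t}^{t+n-1} \gamma^{t'-t} r(s_{t'}, a_{t'}) + \gamma^n \hat{V}^\pi_\phi(s_{t+n}) - \hat{V}^\pi_\phi(s_t)$$
Essentially, the $n$-step estimator uses the Monte Carlo estimator for $n$ steps (unbiased, but high variance) and then switches to the critic for the remaining steps until the end of the horizon, possibly $\infty$ (biased, but low variance). This provides a discrete cutoff point at which we switch between the two estimators. Choosing $n > 1$ commonly works better than $n = 1$!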
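A minimal sketch of the $n$-step estimator follows. The indexing convention (sum $n$ observed rewards, then bootstrap from the state reached after them) and the padded `values` list are assumptions of this sketch; the numbers are hypothetical.

```python
def n_step_advantage(rewards, values, t, n, gamma):
    """A_n(s_t, a_t) = sum_{k<n} gamma^k r_{t+k} + gamma^n V(s_{t+n}) - V(s_t).

    `values` must have one more entry than `rewards`; the extra entry is the
    value of the state after the last reward (0.0 if the episode terminated).
    The lookahead is truncated at the end of the trajectory.
    """
    n = min(n, len(rewards) - t)
    ret = sum(gamma ** k * rewards[t + k] for k in range(n))
    return ret + gamma ** n * values[t + n] - values[t]

rewards = [1.0, 2.0, 4.0]
values = [3.0, 2.0, 1.0, 0.0]   # hypothetical critic outputs; terminal value 0
one_step = n_step_advantage(rewards, values, t=0, n=1, gamma=0.5)  # 1 + 0.5*2 - 3 = -1.0
full_mc = n_step_advantage(rewards, values, t=0, n=3, gamma=0.5)   # 1 + 1 + 1 + 0 - 3 = 0.0
```

Setting `n=1` recovers the one-step actor-critic estimator, and letting `n` reach the end of the trajectory recovers the Monte Carlo estimator.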
But, can we take a continuous average of these two methods?
Generalized Advantage Estimation (GAE)
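The generalized advantage estimator is a weighted average of $n$-step returns.
$$\hat{A}^\pi_{GAE}(s_t, a_t) = \sum_{n=1}^{\infty} w_n \hat{A}^\pi_n(s_t, a_t)$$
where, typically, $w_n \propto \lambda^{n-1}$ for some $\lambda \in [0, 1]$, so that $n$-step returns with shorter lookahead are weighted more, therefore reducing variance. Moreover, it leads to an elegant simplification of the advantage estimator.
$$\hat{A}^\pi_{GAE}(s_t, a_t) = \sum_{t'=t}^{\infty} (\gamma \lambda)^{t'-t} \delta_{t'}$$
where $\delta_{t'} = r(s_{t'}, a_{t'}) + \gamma \hat{V}^\pi_\phi(s_{t'+1}) - \hat{V}^\pi_\phi(s_{t'})$. In other words, $\lambda$ behaves similarly to the discount factor $\gamma$. This expression may then be rewritten as a vastly more efficient recursive formula
$$\hat{A}^\pi_{GAE}(s_t, a_t) = \delta_t + \gamma \lambda \hat{A}^\pi_{GAE}(s_{t+1}, a_{t+1})$$
This is one of the most widely used advantage estimators for policy gradient methods in modern reinforcement learning.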
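The recursion described above amounts to a single backward pass over a trajectory. This is a sketch with hypothetical critic outputs; the convention that `values` carries one extra bootstrap entry is an assumption of the example.

```python
def gae(rewards, values, gamma, lam):
    """Generalized advantage estimates via A_t = delta_t + gamma * lam * A_{t+1}.

    `values` has len(rewards) + 1 entries; the last is the bootstrap value of
    the state after the final reward (0.0 for a terminated episode).
    """
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

rewards = [1.0, 2.0, 4.0]
values = [3.0, 2.0, 1.0, 0.0]                   # hypothetical critic outputs
td = gae(rewards, values, gamma=0.5, lam=0.0)   # one-step (actor-critic) advantages
mc = gae(rewards, values, gamma=0.5, lam=1.0)   # Monte Carlo return minus V(s_t)
```

The two extremes recover the estimators being interpolated: `lam=0.0` gives the one-step advantages and `lam=1.0` gives the Monte Carlo advantages, with intermediate values trading bias against variance.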
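Since $\lambda$ behaves like a discount factor, can $\gamma$ and $\lambda$ be merged into a single hyperparameter? No. $\gamma$ actually affects the RL objective; smaller $\gamma$ means more short-term prioritization. $\lambda$ only affects the accuracy (bias/variance) of your estimator, not your actual objective. Hence, they should be adjusted as two distinct hyperparameters.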
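In practice, the advantage estimates in each batch are also centered/normalized, i.e. the batch mean is subtracted and the result is divided by the batch standard deviation.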
Off-Policy Actor-Critic
On-Policy vs Off-Policy
On-policy actor-critic methods update the policy using data generated from the current policy, and are
- Computationally cheap
- Sample inefficient
while off-policy actor-critic methods update the policy using past data generated from old policies, and are
- Computationally expensive
- Sample efficient
Replay Buffer
The key idea behind off-policy algorithms is the reuse of data. On-policy algorithms throw out all previous data with every iteration. Off-policy algorithms store previous data in a replay buffer.
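A replay buffer can be as simple as a bounded queue with uniform sampling. The class below is a minimal sketch; the name `ReplayBuffer`, its methods, and the transition layout are illustrative choices rather than any particular library's API.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity, seed=None):
        self._data = deque(maxlen=capacity)   # oldest transitions are evicted first
        self._rng = random.Random(seed)

    def add(self, transition):
        self._data.append(transition)

    def sample(self, batch_size):
        """Uniformly sample a minibatch without replacement."""
        return self._rng.sample(list(self._data), min(batch_size, len(self._data)))

    def __len__(self):
        return len(self._data)

buffer = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buffer.add((i, 0, 1.0, i + 1, False))   # hypothetical transitions
batch = buffer.sample(2)
```

Because the capacity is 3, only the three most recent transitions survive, and the sampled minibatch mixes data that may have come from different past policies.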
Recall the online single-sample on-policy actor-critic method from [[#Online Actor-Critic|before]]. What if, instead of using the currently sampled transition to evaluate our policy, we store the current transition into our replay buffer for later use, and use a minibatch sampled from the replay buffer to evaluate our policy? Well, this would actually break the algorithm: the transitions in the replay buffer were generated by past policies and are no longer representative of the current policy, so the value targets computed from them may be garbage, and our policy evaluation may return garbage results.
Critically, though, if we estimate the -function instead of the value function, this works! How?
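First, consider why the value function method doesn't work. We have
$$y_i = r(s_i, a_i) + \gamma \hat{V}^\pi_\phi(s_i')$$
However, we want $\hat{V}^\pi_\phi(s_i)$ to estimate an expectation where $a_i$ is sampled from the current policy $\pi_\theta$, not from the old policy $\pi_{\text{old}}$ that filled the buffer.
Now, let's look at using a $Q$-function instead. We have
$$y_i = r(s_i, a_i) + \gamma \hat{Q}^\pi_\phi(s_i', a_i'), \qquad a_i' \sim \pi_\theta(a_i' \mid s_i')$$
The key distinction is that, while the data $(s_i, a_i, r_i, s_i')$ is still sampled from an old policy, the expectation is computed over actions $a_i'$ drawn from the current policy $\pi_\theta$. Why? Because we don't have to interact with the environment to sample from the latest policy and evaluate our $Q$-function, i.e. these are both represented by neural networks/models we are training. The stored action $a_i$ is harmless here, because $Q^\pi(s_i, a_i)$ is defined for any action, not only those the current policy would take. In contrast, the value function method must use the actions stored in the old policy data; it cannot instead sample from the current policy because we're estimating $V^\pi(s_i)$, and estimating $V^\pi(s_i)$ under the current policy would require sampling a new transition/interacting with the environment, not reusing old policy data. (The estimate would be based on the next state, so, without state transition probabilities, one would need to simulate taking action $a \sim \pi_\theta(a \mid s_i)$ from state $s_i$ in the environment to produce the next state, and only then may $\hat{V}^\pi_\phi$ of this state be computed).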
This new algorithm needs a few more tweaks to finish it off.
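- Sample $(s, a, r, s')$ using $\pi_\theta(a \mid s)$ with one step and store in the replay buffer $\mathcal{R}$.
- Load a minibatch $\{(s_i, a_i, r_i, s_i')\}$ from $\mathcal{R}$ and evaluate $y_i = r_i + \gamma \hat{Q}^\pi_\phi(s_i', a_i')$, with $a_i' \sim \pi_\theta(a' \mid s_i')$, for each transition.
- Update $\phi$ using $\nabla_\phi \frac{1}{N} \sum_i \| \hat{Q}^\pi_\phi(s_i, a_i) - y_i \|^2$, where $N$ is the minibatch size.
- Reuse the above trick to now sample $a_i^\pi$ (rather than $a_i$) from the latest policy, i.e. $a_i^\pi \sim \pi_\theta(a \mid s_i)$, and compute $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \nabla_\theta \log \pi_\theta(a_i^\pi \mid s_i) \, \hat{Q}^\pi_\phi(s_i, a_i^\pi)$.
- Update $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.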
- Repeat
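The target computation in the steps above can be sketched with a tabular critic. The tables `q` and `policy`, the transition layout, and the helper names are hypothetical; the point is only that the next action is drawn from the current policy instead of being read from the buffer.

```python
import random

def sample_action(policy, state, rng):
    """Draw an action index from the CURRENT policy's probabilities at `state`."""
    return rng.choices(range(len(policy[state])), weights=policy[state])[0]

def q_targets(batch, q, policy, gamma, rng):
    """y_i = r_i + gamma * Q(s'_i, a'_i) with a'_i ~ pi_theta(. | s'_i).

    The stored action a_i is only used later as the regression input
    Q(s_i, a_i); the next action is resampled, never read from the buffer.
    """
    targets = []
    for s, a, r, s_next in batch:
        a_next = sample_action(policy, s_next, rng)
        targets.append(r + gamma * q[s_next][a_next])
    return targets

# Hypothetical tabular critic and current policy over 2 states x 2 actions.
q = [[0.0, 1.0], [2.0, 4.0]]
policy = [[0.5, 0.5], [0.0, 1.0]]          # in state 1 the current policy always picks action 1
batch = [(0, 0, 1.0, 1), (0, 1, 0.0, 1)]   # (s, a, r, s') collected by some older policy
ys = q_targets(batch, q, policy, gamma=0.5, rng=random.Random(0))
```

Whatever the old policy did after reaching state 1, the targets here reflect what the current policy would do there.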
In practice, $\hat{Q}^\pi_\phi(s_i, a_i^\pi)$ is used in place of the advantage $\hat{A}^\pi(s_i, a_i^\pi)$ because variance only really matters when sampling is expensive. Here, sampling more actions from the policy requires no environment interaction, and using more samples naturally lowers variance, so lowering variance through techniques like baselines is not as necessary for this off-policy actor-critic method. (And excluding variance-lowering methods results in lower implementation complexity).
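Isn't it still a problem that the states $s_i$ in the buffer were visited by old policies rather than the current one? Theoretically, yes, but in practice, with a sufficiently high capacity model, this poses no issues, as it doesn't really hurt the model.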
Reparametrization Trick
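Previously, with direct policy differentiation, we had to use samples from our policy interacting with the environment to compute the gradient $\nabla_\theta J(\theta)$. However, with this off-policy method, to estimate $\nabla_\theta J(\theta)$, we are estimating, essentially,
$$\nabla_\theta J(\theta) \approx \nabla_\theta \, \frac{1}{N} \sum_i E_{a \sim \pi_\theta(a \mid s_i)}\left[ \hat{Q}^\pi_\phi(s_i, a) \right]$$
Crucially, $\pi_\theta$ and $\hat{Q}^\pi_\phi$ are both neural networks/models that we are training, and thus can be evaluated without involving system dynamics/the environment at all! Therefore, we can actually compute this gradient with backprop through $\hat{Q}^\pi_\phi$, which typically produces a lower-variance gradient estimator.
This only works for continuous, parametric action distributions and when $\hat{Q}^\pi_\phi$ is differentiable with respect to the action.
How can we utilize this? For concreteness, assume $\pi_\theta(a \mid s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta(s)^2)$. Let $\epsilon \sim \mathcal{N}(0, 1)$. Then, one may note that $a = \mu_\theta(s) + \sigma_\theta(s) \epsilon$ has exactly this distribution. This allows us to approximate
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \nabla_\theta \hat{Q}^\pi_\phi\left( s_i, \; \mu_\theta(s_i) + \sigma_\theta(s_i) \epsilon_i \right)$$
and the last expression is easy to differentiate with an autograd library. This is known as the reparametrized gradient estimator.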
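To show the mechanics without an autograd library, the sketch below applies the chain rule by hand for a single state. The critic `q_fn` (a quadratic peaked at $a = 2$), the learning rate, and the floor on the standard deviation are all assumptions made for this example.

```python
import random

def q_fn(a):
    """Hypothetical differentiable critic for a single state, peaked at a = 2."""
    return -(a - 2.0) ** 2

def dq_da(a):
    """Derivative of q_fn with respect to the action."""
    return -2.0 * (a - 2.0)

def reparam_grad(mu, sigma, rng, n=64):
    """Estimate d/dmu and d/dsigma of E[Q(mu + sigma * eps)], eps ~ N(0, 1).

    Chain rule through a = mu + sigma * eps: da/dmu = 1 and da/dsigma = eps.
    """
    g_mu = g_sigma = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        slope = dq_da(mu + sigma * eps)
        g_mu += slope / n
        g_sigma += slope * eps / n
    return g_mu, g_sigma

rng = random.Random(0)
mu, sigma = -1.0, 1.0
for _ in range(300):
    g_mu, g_sigma = reparam_grad(mu, sigma, rng)
    mu += 0.05 * g_mu
    sigma = max(sigma + 0.05 * g_sigma, 0.05)   # keep the std positive
```

Gradient ascent moves the policy mean toward the critic's peak and shrinks the standard deviation, using only evaluations of the critic's derivative and no $\log \pi_\theta$ term.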
See the last part of this section of Lecture 12 that I wrote after I did some more research into the reparametrization trick. Note that it may help to know what variational inference is, i.e. Lecture 11.
Summary
A collection of key formulas.
On-policy Actor/Advantage Estimation
Note that $(1 - \lambda)$ in the GAE expression is the normalizing factor that makes the weights $w_n = (1 - \lambda) \lambda^{n-1}$ sum to one.