Lecture 8: Q-Learning in Practice
Target Networks
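So, why doesn't Q-learning work if we just implement what we discussed in Lecture 7? Well, it's not really gradient descent. Recall that the target values are associated with a stop-gradient operator $\mathrm{sg}[\cdot]$, i.e.
$$\phi \leftarrow \phi - \alpha \frac{dQ_\phi}{d\phi}(s_i, a_i)\Big(Q_\phi(s_i, a_i) - \mathrm{sg}\big[r(s_i, a_i) + \gamma \max_{a'} Q_\phi(s_i', a')\big]\Big).$$
Therefore, Q-learning is not gradient descent: no gradient flows through the target, so the update is not the gradient of any well-defined objective, and thus it does not necessarily converge!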
Recall that Q-learning is a type of fitted Q-iteration algorithm. The primary difference, however, is that Q-learning collects only one transition between gradient updates. In contrast, fitted Q-iteration collects a large, static dataset of transitions, and then performs a full regression, not just a single gradient step, on the entire dataset. In essence, Q-learning chases a moving target during optimization, while fitted Q-iteration "freezes" the target values and performs essentially standard supervised learning each step to find the new $\phi$.
In some sense, one can think of this aspect of fitted Q-iteration as storing a "frozen copy" of the model for use when learning the next value of $\phi$. We can attempt to achieve this same "static" property in Q-learning with this frozen-copy idea: we separately store a model $Q_{\phi'}$, where we set $\phi' \leftarrow \phi$ every $N$ iterations. Then, we use this frozen Q-function to calculate the targets $y_i$, i.e.
$$y_i = r(s_i, a_i) + \gamma \max_{a'} Q_{\phi'}(s_i', a').$$
This is known as Q-learning with target networks, and allows us to stabilize learning a bit and "slow the moving target."
This has some drawbacks, though. In particular, $\phi'$ lags behind $\phi$, so $Q_{\phi'}$ may produce a fairly inaccurate representation of the true Q-function.
Deep Q-Network (DQN)
DQN is a famous foundational reinforcement learning algorithm: it's just generalized Q-learning with a replay buffer and target network.
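1. Save target network parameters: $\phi' \leftarrow \phi$.
    2. Collect dataset $\{(s_i, a_i, s_i', r_i)\}$ with some policy, add to buffer $\mathcal{D}$.
        3. Sample batch $(s_i, a_i, s_i', r_i)$ from $\mathcal{D}$.
        4. Update $\phi \leftarrow \phi - \alpha \sum_i \frac{dQ_\phi}{d\phi}(s_i, a_i)\big(Q_\phi(s_i, a_i) - [r(s_i, a_i) + \gamma \max_{a'} Q_{\phi'}(s_i', a')]\big)$.

where the indentation denotes nested loops. For DQN, the innermost loop (steps 3 and 4) iterates $K = 1$ time and the loop starting on step 2 iterates $N$ times.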
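The nested loops above can be sketched in plain Python. This is a minimal illustration under loud assumptions, none of which come from the lecture: a tabular Q stands in for the network $Q_\phi$, the environment is a hypothetical four-state chain (`chain_step`), the data-collection policy is uniformly random, and each transition carries a `done` flag so that terminal states are not bootstrapped.

```python
import random

def chain_step(s, a):
    """Hypothetical 4-state chain: action 1 moves right, action 0 moves left.
    Reaching state 3 yields reward 1 and ends the episode."""
    s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    done = s_next == 3
    return s_next, (1.0 if done else 0.0), done

def generalized_q_learning(env_step, n_states, n_actions, outer_iters=100,
                           N=20, K=1, batch_size=8, alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]  # tabular stand-in for Q_phi
    buffer = []                                       # replay buffer (never evicts here)
    s = 0
    for _ in range(outer_iters):
        q_target = [row[:] for row in q]              # step 1: phi' <- phi
        for _ in range(N):                            # step 2: collect one transition, N times
            a = rng.randrange(n_actions)              # assumption: uniform random policy
            s_next, r, done = env_step(s, a)
            buffer.append((s, a, r, s_next, done))
            s = 0 if done else s_next
            for _ in range(K):                        # steps 3-4: K sampled updates
                for bs, ba, br, bs2, bdone in rng.choices(buffer, k=batch_size):
                    # Target uses the frozen copy; no bootstrap at terminal states.
                    y = br if bdone else br + gamma * max(q_target[bs2])
                    q[bs][ba] -= alpha * (q[bs][ba] - y)
    return q

q = generalized_q_learning(chain_step, n_states=4, n_actions=2)
greedy = [max(range(2), key=lambda a: q[s][a]) for s in range(3)]
```

With `K=1` this mirrors the DQN setting; raising `K` or lowering `N` changes how fast the regression and target-update processes run relative to data collection.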
Alternatives to Target Networks
Target networks are notably updated discretely, in that $\phi'$ changes only every $N$ steps. This produces bursty changes in the learning algorithm. In practice, this isn't usually an issue; but, if necessary, one solution is to use Polyak averaging, in which we update $\phi' \leftarrow \tau \phi' + (1 - \tau)\phi$ at every step, with $\tau$ close to 1.
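As a small sketch, Polyak averaging over a flat list of parameters might look as follows. The function name `polyak_update` and the default `tau` are illustrative assumptions, not values from the lecture.

```python
def polyak_update(target_params, params, tau=0.99):
    """Return tau * phi' + (1 - tau) * phi, elementwise.
    tau=0.99 is only an illustrative default."""
    return [tau * tp + (1.0 - tau) * p for tp, p in zip(target_params, params)]

# The target parameters drift smoothly toward the current parameters.
phi = [1.0, -2.0]
phi_target = [0.0, 0.0]
for _ in range(1000):
    phi_target = polyak_update(phi_target, phi)
```

Because every step moves $\phi'$ by only a $(1 - \tau)$ fraction of the gap, the target changes continuously rather than in bursts.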
More Efficient Q-Learning
Generalized Q-learning with a replay buffer and target network can be represented as three different processes that essentially run at their own rates.
1. Data collection (also evicting old data).
2. Target update: $\phi' \leftarrow \phi$.
3. Q-function regression.
Here are the relative speeds of these processes for some different types of Q-learning.
- In online Q-learning, data is evicted immediately, and all three processes run at the same speed.
- In DQN, process 1 and process 3 run at the same speed, while process 2 is slow.
- In fitted Q-iteration, process 3 runs faster than process 2, which runs faster than process 1.
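Also, here's how we'll describe the generalized Q-learning algorithm with a replay buffer and target network.
1. Save target network parameters: $\phi' \leftarrow \phi$.
    2. Collect dataset $\{(s_i, a_i, s_i', r_i)\}$ with some policy, add to buffer $\mathcal{D}$. (Loop $N$ times.)
        3. Sample batch $(s_i, a_i, s_i', r_i)$ from $\mathcal{D}$. (Loop $K$ times.)
        4. Update $\phi \leftarrow \phi - \alpha \sum_i \frac{dQ_\phi}{d\phi}(s_i, a_i)\big(Q_\phi(s_i, a_i) - [r(s_i, a_i) + \gamma \max_{a'} Q_{\phi'}(s_i', a')]\big)$.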
So, which choices of $N$ and $K$ are most efficient?
Let's first focus on $K$. This is also known as the update-to-data (UTD) ratio. A higher $K$ increases learning speed, as more updates are made for every data collection step, but this is more computationally expensive. Moreover, an excessively high $K$ will cause overfitting to the current set of data. In fact, even a small $K$ is generally already a risky choice!
Now, consider $N$, the ratio of dataset collection steps to target network update steps. A lower $N$ increases learning speed, as fewer steps between target network updates keep the target network more up-to-date. Generally speaking, though, $N$ is kept fairly large, as lower values create a lot of instability.
$n$-step Returns?
In essence, we can apply $n$-step returns to the target values themselves!
$$y_{j,t} = \sum_{t'=t}^{t+n-1} \gamma^{t'-t} r_{j,t'} + \gamma^{n} \max_{a} Q_{\phi'}(s_{j,t+n}, a)$$
This increases learning speed because the target values are now calculated from a mix of the up-to-date reward values from a trajectory and the out-of-date target network. However, this has a major limitation: because we're learning with a trajectory that originates from older data from a worse policy, the target values are lower than they should be! Thus, the off-policy nature of Q-learning makes this $n$-step returns modification biased. Hence, people will typically use small values of $n$ if applying this technique.
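A minimal sketch of an $n$-step target computation is below. The function name and argument layout are assumptions for illustration: `rewards` holds the $n$ observed rewards starting at time $t$, and `bootstrap_value` is the target network's maximum Q-value at the state reached after those $n$ steps.

```python
def n_step_target(rewards, bootstrap_value, gamma):
    """Discounted sum of n observed rewards plus a discounted bootstrap term."""
    n = len(rewards)
    observed = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return observed + (gamma ** n) * bootstrap_value

# Two observed rewards of 1 with gamma = 0.5 and a bootstrap value of 10:
# 1 + 0.5 * 1 + 0.25 * 10 = 4.0
y = n_step_target([1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
```

With a single reward the function reduces to the ordinary one-step target, so larger `n` only shifts weight from the bootstrap term to the observed rewards.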
Overestimation in Q-Learning
For naive DQN, the Q-functions tend to overestimate the actual returns. The problem is the $\max$ function in calculating the target value for target networks, i.e.
$$y_j = r_j + \gamma \max_{a_j'} Q_{\phi'}(s_j', a_j').$$
Why? Because it essentially selects the Q-values with the largest positive error.
To form some intuition, consider two random variables $X_1, X_2$. Then, $E[\max(X_1, X_2)] \geq \max(E[X_1], E[X_2])$. In other words, taking the $\max$ of the noisy Q-function estimates systematically overestimates the maximum of the expected values of the actions.
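This inequality is easy to check numerically. The sketch below assumes, purely for illustration, two actions whose true values are both 0 and whose estimates are corrupted by independent standard normal noise, so the true maximum is 0.

```python
import random

def estimate_max_bias(num_trials=100_000, seed=0):
    """Monte Carlo estimate of E[max(X1, X2)] for X1, X2 ~ N(0, 1) independent."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_trials):
        x1 = rng.gauss(0.0, 1.0)   # noisy estimate of a true value of 0
        x2 = rng.gauss(0.0, 1.0)   # noisy estimate of a true value of 0
        total += max(x1, x2)
    return total / num_trials

bias = estimate_max_bias()
```

For this particular noise model the exact value is $1/\sqrt{\pi} \approx 0.56$, so the $\max$ reports a clearly positive value even though neither action is actually worth more than 0.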
Double Q-Learning
We note that
$$\max_{a'} Q_{\phi'}(s', a') = Q_{\phi'}\big(s', \arg\max_{a'} Q_{\phi'}(s', a')\big).$$
The key idea of double Q-learning is to use two distinct models in the RHS expression: one to choose the action (inner $Q$) and one to evaluate the value (outer $Q$). This helps eliminate the overestimation error because, if the two models have different noise patterns, then the action chosen to maximize the inner $Q$ (which thus possesses the largest positive error under that model) does not necessarily correspond to the value with the largest positive error in the outer $Q$.
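Classic double Q-learning was thus modeled as
$$Q_{\phi_A}(s, a) \leftarrow r + \gamma Q_{\phi_B}\big(s', \arg\max_{a'} Q_{\phi_A}(s', a')\big),$$
$$Q_{\phi_B}(s, a) \leftarrow r + \gamma Q_{\phi_A}\big(s', \arg\max_{a'} Q_{\phi_B}(s', a')\big).$$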
In practice, though, there is an easier implementation for double Q-learning. One may note that DQN itself already has two Q-functions: the target network $Q_{\phi'}$ and the current Q-function $Q_\phi$. Thus, double Q-learning is modeled as
$$y = r + \gamma Q_{\phi'}\big(s', \arg\max_{a'} Q_\phi(s', a')\big).$$
Aren't $\phi'$ and $\phi$ correlated? Yes, but they are distinct enough that this proves very effective in practice. And this method is just much easier to implement so :)
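To make the difference concrete, here is a small sketch comparing the two targets for a single transition. The function names are hypothetical; `q_next` and `q_target_next` are assumed to be lists of Q-values over the discrete actions at $s'$ under the current and target networks, respectively.

```python
def dqn_target(r, q_target_next, gamma):
    """Standard target: the target network both selects and evaluates the action."""
    return r + gamma * max(q_target_next)

def double_dqn_target(r, q_next, q_target_next, gamma):
    """Double Q-learning target: current network selects, target network evaluates."""
    a_star = max(range(len(q_next)), key=lambda a: q_next[a])
    return r + gamma * q_target_next[a_star]

q_next = [1.0, 3.0, 2.0]          # current network prefers action 1
q_target_next = [5.0, 0.5, 4.0]   # target network's (possibly noisy) values
y_dqn = dqn_target(1.0, q_target_next, gamma=0.5)                  # 1 + 0.5 * 5.0
y_double = double_dqn_target(1.0, q_next, q_target_next, gamma=0.5)  # 1 + 0.5 * 0.5
```

In this toy case the large value 5.0 in the target network is ignored by the double Q-learning target because the current network does not select that action.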
Generalized Double Q-Learning
Note that $\arg\max_{a'} Q_\phi(s', a')$ is essentially equivalent to taking the action according to a greedy policy, i.e. $\pi(s') = \arg\max_{a'} Q_\phi(s', a')$. In other words,
$$y = r + \gamma Q_{\phi'}\big(s', \pi(s')\big).$$
Occasionally, if overestimation is a problem even after double Q-learning, some practitioners implement clipped double Q-learning,
$$y = r + \gamma \min\big(Q_{\phi_1'}(s', \pi(s')),\; Q_{\phi_2'}(s', \pi(s'))\big),$$
where we now train an ensemble of two different target networks $Q_{\phi_1'}$ and $Q_{\phi_2'}$. (You may have any number of target networks, but it gets increasingly expensive.) This can, however, cause underestimates for Q-learning, though it's often still practical for actor-critic!
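A sketch of the clipped target for an ensemble of any size follows. The function name is hypothetical, and `ensemble_q_next` is assumed to hold each target network's value at the policy's action $\pi(s')$.

```python
def clipped_double_q_target(r, ensemble_q_next, gamma):
    """Bootstrap from the most pessimistic member of the target ensemble."""
    return r + gamma * min(ensemble_q_next)

# Two target networks disagree about Q(s', pi(s')); the smaller value is used.
y_clipped = clipped_double_q_target(1.0, [4.0, 2.0], gamma=0.5)  # 1 + 0.5 * 2.0
```

Taking the minimum trades overestimation for possible underestimation, which is the behavior the paragraph above warns about.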
Tips for Q-Learning
Practical Tips
- Q-learning takes time to stabilize
- Test on easier tasks first to validate implementation
- Be patient when training
- Large replay buffers improve stability
- Start with high exploration (epsilon) and gradually reduce
Advanced Tips
- Bellman error gradients can be big; clip gradients or use Huber loss
- Double Q-learning is very effective in practice
- $n$-step returns can help a lot, but be careful
- Schedule exploration/learning rate or just use Adam
- Run multiple times with different seeds
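The Huber loss and gradient clipping mentioned in the advanced tips can be sketched in a few lines. The function names and the threshold `delta=1.0` are illustrative assumptions; deep learning libraries ship their own versions of both.

```python
import math

def huber_loss(error, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    a = abs(error)
    if a <= delta:
        return 0.5 * error * error
    return delta * (a - 0.5 * delta)

def huber_grad(error, delta=1.0):
    """Derivative of the Huber loss w.r.t. the error: the error clipped to [-delta, delta]."""
    return max(-delta, min(delta, error))

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm 5 rescaled to norm 1
```

Both tricks bound the size of a single update, so one transition with a huge Bellman error cannot dominate a gradient step.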
Back to Continuous Actions
Q-learning with continuous actions
With continuous actions, choosing the value-maximizing action is hard and inefficient. This is particularly problematic when we're evaluating the target values, as the $\max$ occurs in the innermost training loop. There are a couple of ways to solve this.
Solution 1: Stochastic Optimization
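The simplest solution is to approximate $\max_a Q(s, a)$ with some samples, i.e.
$$\max_a Q(s, a) \approx \max\{Q(s, a_1), \dots, Q(s, a_M)\},$$
where $a_1, \dots, a_M$ are sampled from some distribution (e.g. uniform). This is efficiently parallelizable and easy to implement, but unfortunately lacking in accuracy. We can do better with more sophisticated stochastic optimizers: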
- Cross-Entropy Method (CEM)
- CMA-ES
These stochastic optimizers work decently up to about 40 dimensions.
This sampling approach is also a very popular improvement to offline actor-critic methods. Such methods are also known as sample and rank or rejection sampling.
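Both the plain sampling approach and the cross-entropy method can be sketched in pure Python. Everything named here is an illustrative assumption: `toy_q` is a made-up critic whose maximum is at $a = (0.5, 0.5)$, actions are sampled uniformly from a box, and the CEM variant refits a diagonal Gaussian to the elite samples.

```python
import random

def toy_q(state, action):
    """Hypothetical critic, maximized (value 0) at action = (0.5, ..., 0.5)."""
    return -sum((x - 0.5) ** 2 for x in action)

def random_shooting_max(q_fn, state, num_samples, low, high, dim, rng):
    """Approximate max_a Q(s, a) by the best of uniformly sampled actions."""
    best_a, best_q = None, float("-inf")
    for _ in range(num_samples):
        a = [rng.uniform(low, high) for _ in range(dim)]
        q = q_fn(state, a)
        if q > best_q:
            best_a, best_q = a, q
    return best_a, best_q

def cem_max(q_fn, state, dim, rng, iters=10, pop=64, n_elite=8, init_std=1.0):
    """Cross-entropy method: repeatedly refit a diagonal Gaussian to the elites."""
    mean, std = [0.0] * dim, [init_std] * dim
    for _ in range(iters):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)]
        samples.sort(key=lambda a: q_fn(state, a), reverse=True)
        elites = samples[:n_elite]
        mean = [sum(e[i] for e in elites) / n_elite for i in range(dim)]
        # Floor the std so the search distribution never collapses to a point.
        std = [max(1e-3, (sum((e[i] - mean[i]) ** 2 for e in elites) / n_elite) ** 0.5)
               for i in range(dim)]
    return mean, q_fn(state, mean)

rng = random.Random(0)
_, q_shoot = random_shooting_max(toy_q, None, 500, -1.0, 1.0, 2, rng)
_, q_cem = cem_max(toy_q, None, 2, rng)
```

Random shooting spends all its samples up front, while CEM concentrates later samples near the best actions found so far, which is why it tends to be more accurate for the same budget.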
Solution 2: Learn Approximate Maximizer
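The classic example is deep deterministic policy gradient (DDPG), which is essentially a deterministic actor-critic method. The idea is to train another model $\mu_\theta(s)$ to approximate $\arg\max_a Q_\phi(s, a)$, where we find $\theta \leftarrow \arg\max_\theta Q_\phi(s, \mu_\theta(s))$. This is found by using the chain rule,
$$\frac{dQ_\phi}{d\theta} = \frac{da}{d\theta} \frac{dQ_\phi}{da},$$
which is essentially the same as the reparametrization trick from Lecture 6.
The new target value is thus
$$y_j = r_j + \gamma Q_{\phi'}\big(s_j', \mu_\theta(s_j')\big).$$
Here $\mu_\theta$ is the actor and $Q_\phi$ is the critic. The method is deterministic because there is no expectation over actions sampled from the policy.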
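The chain-rule update for the actor can be illustrated without any autodiff library. The setup below is entirely hypothetical: a scalar actor $\mu_\theta(s) = \theta s$ and a fixed toy critic $Q(s, a) = -(a - s)^2$, whose maximizing action is $a = s$, so the best actor has $\theta = 1$.

```python
def actor_gradient_step(theta, states, lr=0.1):
    """One ascent step on the mean of Q(s, mu_theta(s)) via dQ/dtheta = da/dtheta * dQ/da."""
    grad = 0.0
    for s in states:
        a = theta * s               # actor output mu_theta(s)
        dq_da = -2.0 * (a - s)      # derivative of the toy critic w.r.t. the action
        da_dtheta = s               # derivative of the actor w.r.t. theta
        grad += da_dtheta * dq_da
    return theta + lr * grad / len(states)

theta = 0.0
for _ in range(50):
    theta = actor_gradient_step(theta, states=[-1.0, 0.5, 2.0])
```

After these steps `theta` is very close to 1, i.e. the actor has learned to output the critic's maximizing action without ever solving an explicit $\arg\max$.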
Some Q-Learning Theory
Value Iteration
Let's return to value iteration.
1. Set $Q(s, a) \leftarrow r(s, a) + \gamma E_{s'}[V(s')]$.
2. Set $V(s) \leftarrow \max_a Q(s, a)$.
3. Repeat.
Does value iteration converge?
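Let $\mathcal{B}$ be an operator such that, for a vector $V$ composed of the values of the different states,
$$\mathcal{B}V = \max_a \left(r_a + \gamma \mathcal{T}_a V\right),$$
where the $\max$ is taken elementwise, $r_a$ is the stacked vector of rewards at all states for action $a$, and $\mathcal{T}_a$ is the matrix of transitions for action $a$, i.e. $\mathcal{T}_{a,i,j} = p(s' = j \mid s = i, a)$. This is essentially value iteration written out with just an operator $\mathcal{B}$ and a value vector $V$, i.e.
1. $V \leftarrow \mathcal{B}V$.
2. Repeat.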
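We note that $V^\star$, the vector representing the optimal value function, is a fixed point of $\mathcal{B}$, i.e. $V^\star = \mathcal{B}V^\star$. We can prove that $V^\star$ always exists, is always unique, and is always optimal. The way we prove that value iteration converges to $V^\star$ is by showing that $\mathcal{B}$ is a contraction. That is, for any vectors $V, \bar{V}$, $\|\mathcal{B}V - \mathcal{B}\bar{V}\|_\infty \leq \gamma \|V - \bar{V}\|_\infty$. In other words, the gap always gets smaller by at least a factor of $\gamma$. We will not prove this now, just take my word :)
If we choose $\bar{V} = V^\star$, we have that $\|\mathcal{B}V - V^\star\|_\infty \leq \gamma \|V - V^\star\|_\infty$, or that the operator always contracts the gap to $V^\star$. Thus, value iteration converges.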
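The contraction property can be checked numerically on a small example. The two-state, two-action MDP below (`REWARDS`, `TRANSITIONS`, `GAMMA`) is made up for illustration; `TRANSITIONS[s][a][s2]` is assumed to hold $p(s' = s_2 \mid s, a)$.

```python
REWARDS = [[0.0, 1.0], [2.0, 0.0]]                   # REWARDS[s][a]
TRANSITIONS = [[[1.0, 0.0], [0.0, 1.0]],             # TRANSITIONS[s][a][s2]
               [[0.5, 0.5], [1.0, 0.0]]]
GAMMA = 0.9

def bellman_backup(V, rewards=REWARDS, transitions=TRANSITIONS, gamma=GAMMA):
    """(B V)(s) = max_a [ r(s, a) + gamma * sum_s2 p(s2 | s, a) V(s2) ]."""
    n = len(V)
    return [
        max(rewards[s][a] + gamma * sum(transitions[s][a][s2] * V[s2] for s2 in range(n))
            for a in range(len(rewards[s])))
        for s in range(n)
    ]

def sup_norm_gap(V1, V2):
    """Infinity-norm distance between two value vectors."""
    return max(abs(x - y) for x, y in zip(V1, V2))

V, V_bar = [0.0, 0.0], [10.0, -3.0]
gap_before = sup_norm_gap(V, V_bar)
gap_after = sup_norm_gap(bellman_backup(V), bellman_backup(V_bar))

# Repeatedly applying the backup drives V to the fixed point V* = B V*.
V_star = [0.0, 0.0]
for _ in range(500):
    V_star = bellman_backup(V_star)
```

Here `gap_after` is at most `GAMMA * gap_before`, and after enough iterations `V_star` is unchanged by a further backup up to floating-point error.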
Non-Tabular Value Function Learning
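Non-tabular value function learning, e.g. fitted value iteration, is a bit trickier, because we train a model that has at least some error. However, we can interpret the regression performed by fitted value iteration as a projection (in the $\ell_2$ norm, due to MSE) onto the space of functions representable by the model,
$$V' \leftarrow \arg\min_{V' \in \Omega} \frac{1}{2} \sum_s \left\| V'(s) - (\mathcal{B}V)(s) \right\|^2,$$
where $\Omega$ represents the space of representable value functions. In essence, each time we update $V$, we first apply $\mathcal{B}$ and then project $\mathcal{B}V$ back into $\Omega$. To simplify this, we define a new operator
$$\Pi V = \arg\min_{V' \in \Omega} \frac{1}{2} \sum_s \left\| V'(s) - V(s) \right\|^2,$$
so we can write fitted value iteration as
1. $V \leftarrow \Pi \mathcal{B} V$.
2. Repeat.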
Now, $\mathcal{B}$ is a contraction w.r.t. the $\ell_\infty$ norm, and $\Pi$ is a contraction w.r.t. the $\ell_2$ norm. Yet, because they are contractions only w.r.t. different norms, $\Pi\mathcal{B}$ is not a contraction! Thus, while value iteration does converge, fitted value iteration is not guaranteed to converge. (Though, with deep RL, $\Omega$ is much larger in practice, and therefore it often works well.)
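A tiny worked example shows how $\Pi\mathcal{B}$ can fail to contract. This example is not from the lecture; it is a standard-style two-state construction with assumed details: a linear value function $V(s) = \theta\,\phi(s)$ with features $\phi = (1, 2)$, zero rewards, a single action, and both states transitioning to the second state. The true value function is $V = 0$, i.e. $\theta = 0$.

```python
def fitted_vi_step(theta, gamma):
    """One step of V <- Pi B V for the two-state linear example."""
    phi = [1.0, 2.0]          # features of the two states
    next_state = [1, 1]       # both states transition to the second state
    # (B V)(s) = 0 + gamma * V(s') with V(s') = theta * phi[s'].
    targets = [gamma * theta * phi[next_state[s]] for s in range(2)]
    # Least-squares projection of the targets onto the span of phi.
    return sum(p * y for p, y in zip(phi, targets)) / sum(p * p for p in phi)

theta_diverge = 1.0
for _ in range(50):
    theta_diverge = fitted_vi_step(theta_diverge, gamma=0.9)   # multiplies by 1.08 each step

theta_converge = 1.0
for _ in range(50):
    theta_converge = fitted_vi_step(theta_converge, gamma=0.5)  # multiplies by 0.6 each step
```

Each step multiplies $\theta$ by $6\gamma/5$, so for $\gamma > 5/6$ the iterates move away from the true solution $\theta = 0$ even though $\mathcal{B}$ and $\Pi$ are each well behaved on their own.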
Fitted Q-Iteration, Actor-Critic
We can do the same for fitted Q-iteration (and online Q-learning) and actor-critic, and find that they too are not guaranteed to converge, for essentially the same reason.
Conclusions
- Fitted value-based methods (Q-learning, Q-function actor-critic) are frequently unstable!
- In practice, not guaranteed convergence
- Requires more hyperparameter tuning
- Some tricks are very helpful though...
- Replay buffers
- Target networks