Lecture 8: $Q$-Learning in Practice

Target Networks

So, why doesn't $Q$-learning work if we just implement what we discussed in Lecture 7? Well, it's not really gradient descent. Recall that the target values are associated with a stop-gradient operator, i.e.

$$\theta\leftarrow \theta-\alpha \frac{\mathrm{d} Q_{\theta}}{\mathrm{d} \theta}(\mathbf{s}_{i},\mathbf{a}_{i})\left(Q_{\theta}(\mathbf{s}_{i},\mathbf{a}_{i})-\left[r(\mathbf{s}_{i},\mathbf{a}_{i})+\gamma \max_{\mathbf{a}'}Q_{\theta}(\mathbf{s}_{i}',\mathbf{a}')\right]_{\times}\right)$$

where $[\cdot]_{\times}$ marks the term that no gradient flows through. Therefore, $Q$-learning is not gradient descent because it doesn't have a well-defined objective, and thus it does not necessarily converge!

Recall that $Q$-learning is a type of fitted $Q$-iteration algorithm. The primary difference, however, is that $Q$-learning takes only one environment step between each gradient update. In contrast, $Q$-iteration collects a large, static dataset of transitions, and then performs a full regression, not just a single gradient step, on the entire dataset. In essence, $Q$-learning chases a moving target during optimization, while $Q$-iteration "freezes" the target $y_{i}$ values and performs essentially standard supervised learning each step to find the new $\theta$.

In some sense, one can think of this aspect of $Q$-iteration as storing a "frozen copy" of the model for $Q_{\theta}$ when learning the next value of $\theta$. We can attempt to achieve this same "static" property of $Q$-iteration in $Q$-learning with this frozen copy idea: we separately store a model $Q_{\bar{\theta}}$ where $\bar{\theta}\leftarrow\theta$ every $X$ iterations, e.g. $X=10^{4}$. Then, we use this frozen $Q$-function to calculate the targets $y_{i}$, i.e.

$$y_{i}=r(\mathbf{s}_{i},\mathbf{a}_{i})+\gamma \max_{\mathbf{a}'}Q_{\bar{\theta}}(\mathbf{s}_{i}',\mathbf{a}')$$

this is known as $Q$-learning with target networks, and allows us to stabilize learning a bit and "slow the moving target."

This has some drawbacks, though. In particular, $Q_{\bar{\theta}}$ lags behind $Q_{\theta}$, and may produce a fairly inaccurate representation of the true $Q$-function.

Deep $Q$-Network (DQN)

A famous foundational reinforcement learning algorithm: it's just generalized $Q$-learning with a replay buffer and target network.

  1. Save target network parameters: $\bar{\theta}\leftarrow\theta$.
    2. Collect dataset $\{ (\mathbf{s}_{i},\mathbf{a}_{i},\mathbf{s}_{i}') \}$ with some policy, add to buffer $\mathcal{R}$.
    3. Sample batch $(\mathbf{s}_{i},\mathbf{a}_{i},\mathbf{s}_{i}')\sim \mathcal{R}$.
    4. Update $\theta\leftarrow\theta-\alpha \sum_{i}\frac{\mathrm{d} Q_{\theta}}{\mathrm{d} \theta}(\mathbf{s}_{i},\mathbf{a}_{i})\left(Q_{\theta}(\mathbf{s}_{i},\mathbf{a}_{i})-\left[r(\mathbf{s}_{i},\mathbf{a}_{i})+\gamma \max_{\mathbf{a}'}Q_{\bar{\theta}}(\mathbf{s}_{i}',\mathbf{a}')\right]\right)$.

where the indentation denotes nested loops. For DQN, the innermost loop iterates $1$ time and the loop starting on step 2 iterates $1$ time.

Alternatives to Target Networks

Target networks are, notably, updated discretely, in that they change only every $X$ steps. This produces bursty changes in the learning algorithm. In practice, this isn't usually an issue; but, if necessary, one solution is to use Polyak averaging, in which we update $\bar{\theta}\leftarrow \tau\bar{\theta}+(1-\tau)\theta$ at every step (e.g. $\tau=0.999$).
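As a minimal sketch of the Polyak update, assuming the parameters are stored as flat lists of floats (the helper name `polyak_update` and the example vectors are hypothetical, chosen only for illustration):

```python
def polyak_update(target_params, online_params, tau=0.999):
    """Return tau * target + (1 - tau) * online, element by element."""
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]


# Hypothetical parameter vectors; in practice theta would change every step.
theta = [1.0, -2.0, 0.5]
theta_bar = [0.0, 0.0, 0.0]

# With theta held fixed, theta_bar drifts smoothly toward theta
# instead of jumping to it every X steps.
for _ in range(5000):
    theta_bar = polyak_update(theta_bar, theta, tau=0.999)
```

With $\tau$ close to $1$ the target moves slowly; after $k$ steps with a fixed $\theta$, the remaining gap is $\tau^{k}$ times the initial gap.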

More Efficient $Q$-Learning

Generalized $Q$-learning with a replay buffer and target network can be represented as three different processes that essentially run at their own rates.

  1. Data collection (also evicting old data).
  2. Update target $\bar{\theta}$.
  3. $Q$-function regression.

Here are the speeds for some different types of $Q$-learning.

Also, here's how we'll describe the generalized $Q$-learning algorithm with a replay buffer and target network.

  1. Save target network parameters: $\bar{\theta}\leftarrow\theta$.

    1. Collect dataset $\{ (\mathbf{s}_{i},\mathbf{a}_{i},\mathbf{s}_{i}') \}$ with some policy, add to buffer $\mathcal{R}$. (Loop $N$ times).

      1. Sample batch $(\mathbf{s}_{i},\mathbf{a}_{i},\mathbf{s}_{i}')\sim \mathcal{R}$. (Loop $K$ times).
      2. Update $\theta\leftarrow\theta-\alpha \sum_{i}\frac{\mathrm{d} Q_{\theta}}{\mathrm{d} \theta}(\mathbf{s}_{i},\mathbf{a}_{i})\left(Q_{\theta}(\mathbf{s}_{i},\mathbf{a}_{i})-\left[r(\mathbf{s}_{i},\mathbf{a}_{i})+\gamma \max_{\mathbf{a}'}Q_{\bar{\theta}}(\mathbf{s}_{i}',\mathbf{a}')\right]\right)$.
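The nested loops above can be sketched on a toy problem. This is only an illustration under several assumptions that are not in the notes: a hypothetical 4-state chain environment, a tabular $Q$ (so $\mathrm{d}Q_{\theta}/\mathrm{d}\theta$ is an indicator of the visited entry), a uniformly random behaviour policy, rewards stored in the buffer, and no eviction of old data.

```python
import random

GAMMA, ALPHA = 0.9, 0.1
N_STATES, N_ACTIONS = 4, 2  # hypothetical chain: action 1 = right, 0 = left


def step(s, a):
    """Deterministic chain; reward 1 whenever the next state is the last one."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    return r, s_next


def train(outer_iters=200, N=20, K=4, batch_size=8, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    buffer, s = [], 0
    for _ in range(outer_iters):
        q_bar = [row[:] for row in q]            # save target parameters
        for _ in range(N):                       # data collection loop
            a = rng.randrange(N_ACTIONS)         # random behaviour policy
            r, s_next = step(s, a)
            buffer.append((s, a, r, s_next))
            s = s_next
            for _ in range(K):                   # regression loop
                batch = [rng.choice(buffer) for _ in range(batch_size)]
                grad = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
                for si, ai, ri, sn in batch:
                    y = ri + GAMMA * max(q_bar[sn])   # target from frozen copy
                    grad[si][ai] += q[si][ai] - y     # summed over the batch
                for si in range(N_STATES):
                    for ai in range(N_ACTIONS):
                        q[si][ai] -= ALPHA * grad[si][ai]
    return q


q_learned = train()
greedy = [max(range(N_ACTIONS), key=lambda a: q_learned[s][a])
          for s in range(N_STATES)]
```

On this toy chain the greedy policy should end up moving right in every state. Changing `N` and `K` in `train` gives a feel for how the three processes interact.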

So, which choices of $N,K$ are most efficient?

$K$

Let's first focus on $K$. This is also known as the update-to-data (UTD) ratio. A higher $K$ increases learning speed, as more updates are made for every data collection step, but this is more computationally expensive. Moreover, an excessively high $K$ will cause overfitting to the current set of data. In fact, $K\geq 10$ is generally already a risky choice!

$N$

Now, consider $N$, the ratio of dataset collection steps to target network update steps. A lower $N$ increases learning speed, as fewer steps between target network updates keep the target network more up-to-date. Generally speaking, though, $N\sim[10^{3},10^{4}]$, as lower values create a lot of instability.

$n$-step Returns?

In essence, we can apply $n$-step returns to the target values themselves!

$$y_{t}^{(i)}=\sum_{t'=t}^{t+n-1} \gamma^{t'-t}r(\mathbf{s}_{t'}^{(i)},\mathbf{a}_{t'}^{(i)}) + \gamma^{n}\max_{\mathbf{a}_{t+n}^{(i)}} Q_{\bar{\theta}}(\mathbf{s}_{t+n}^{(i)},\mathbf{a}_{t+n}^{(i)})$$

This increases learning speed because the target values are now calculated from a mix of the up-to-date reward values from a trajectory and the out-of-date target network. However, this has a major limitation: because we're learning with a trajectory that originates from older data from a worse policy, the target values are lower than they should be! Thus, the off-policy nature of $Q$-learning makes this $n$-step returns modification biased. Hence, people will typically use small $n$ values if applying this technique.
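A small sketch of the $n$-step target, assuming the sum covers exactly $n$ rewards $r_{t},\dots,r_{t+n-1}$ followed by a bootstrap from the target network at step $t+n$. The function name `n_step_target` and the numbers are hypothetical.

```python
def n_step_target(rewards, bootstrap_value, gamma):
    """n-step target from n observed rewards and max_a Q_bar(s_{t+n}, a).

    rewards: [r_t, ..., r_{t+n-1}] taken from a stored trajectory.
    bootstrap_value: the target network's max Q-value at s_{t+n}.
    """
    n = len(rewards)
    discounted = sum(gamma ** k * r for k, r in enumerate(rewards))
    return discounted + gamma ** n * bootstrap_value


# Hypothetical 3-step example: 1 + 0.5*0 + 0.25*2 + 0.125*10
y = n_step_target([1.0, 0.0, 2.0], bootstrap_value=10.0, gamma=0.5)
```

Note that the rewards come from whatever policy collected the trajectory, which is where the off-policy bias described above enters.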

Overestimation in $Q$-Learning

For naive DQN, the $Q$-functions tend to overestimate the actual returns. The problem is the $\max$ function in calculating the target value for target networks, i.e.

$$y_{i}=r(\mathbf{s}_{i},\mathbf{a}_{i})+\gamma\, \boxed{\max}_{\mathbf{a}'}\,Q_{\bar{\theta}}(\mathbf{s}_{i}',\mathbf{a}')$$

Why? Because it essentially selects the $Q$-value with the largest positive error.

To form some intuition, consider two random variables $X_{1},X_{2}$. Then, $\mathbb{E}[\max(X_{1},X_{2})]\geq \max(\mathbb{E}[X_{1}],\mathbb{E}[X_{2}])$. In other words, taking the $\max$ of the noisy $Q_{\bar{\theta}}$ values systematically overestimates the maximum of the expected values of the actions.
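The inequality can be checked numerically. The sketch below assumes two independent standard normal variables (a choice made here only for illustration), so both means are $0$ while the mean of the maximum is strictly positive.

```python
import random


def estimate_mean_of_max(num_samples=100_000, seed=0):
    """Monte Carlo estimate of E[max(X1, X2)] for independent N(0, 1) draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rng.gauss(0.0, 1.0)
        total += max(x1, x2)
    return total / num_samples


# max(E[X1], E[X2]) = 0 here, yet the estimate below is clearly positive
# (the exact value for this case is 1 / sqrt(pi), roughly 0.56).
mean_of_max = estimate_mean_of_max()
```

Reading $X_{1},X_{2}$ as the noisy $Q$-values of two equally good actions, the positive gap is exactly the overestimation that enters the target.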

Double $Q$-Learning

We note that

$$\max_{\mathbf{a}'}Q_{\bar{\theta}}(\mathbf{s}',\mathbf{a}') = Q_{\bar{\theta}}\left(\mathbf{s}',\underset{\mathbf{a}'}{\arg\max}\ Q_{\bar{\theta}}(\mathbf{s}',\mathbf{a}')\right)$$

The key idea of double $Q$-learning is to use two distinct models in the RHS expression: one to choose the action (inner $Q$) and one to evaluate the value (outer $Q$). This helps reduce the overestimation error because, if both models have different noise patterns, then the action chosen to maximize the inner $Q$ (which thus tends to carry the largest positive error) does not necessarily correspond to the value with the largest positive error in the outer $Q$.

Classic double $Q$-learning was thus modeled as

$$\begin{align*} Q_{\theta_{A}}&\leftarrow r+\gamma Q_{\theta_{B}}(\mathbf{s}',\underset{\mathbf{a}'}{\arg\max}\ Q_{\theta_{A}}(\mathbf{s}',\mathbf{a}')) \\ Q_{\theta_{B}}&\leftarrow r+\gamma Q_{\theta_{A}}(\mathbf{s}',\underset{\mathbf{a}'}{\arg\max}\ Q_{\theta_{B}}(\mathbf{s}',\mathbf{a}')) \end{align*}$$

In practice, though, there is an easier implementation for double $Q$-learning. One may note that DQN itself already has two $Q$-functions: the target network and the current $Q$-function. Thus, double $Q$-learning is modeled as

$$y=r+\gamma Q_{\bar{\theta}}(\mathbf{s}',\underset{\mathbf{a}'}{\arg\max}\ Q_{\theta}(\mathbf{s}',\mathbf{a}'))$$

$\theta \sim\bar{\theta}$??

Aren't $\theta$ and $\bar{\theta}$ correlated? Yes, but they are distinct enough that this proves very effective in practice. And this method is just much easier to implement so :)
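To make the difference concrete, here is a hedged sketch that compares the standard target with the double $Q$-learning target for a single transition with discrete actions. The $Q$-values are made-up numbers and the function names are hypothetical.

```python
def dqn_target(r, q_target_next, gamma):
    """Standard target: the target network both selects and evaluates."""
    return r + gamma * max(q_target_next)


def double_dqn_target(r, q_online_next, q_target_next, gamma):
    """Double Q target: the current network selects, the target evaluates."""
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[a_star]


# Hypothetical Q-values at s' for three actions.
q_online_next = [1.0, 3.0, 2.0]   # Q_theta(s', .): argmax is action 1
q_target_next = [2.5, 1.5, 4.0]   # Q_theta_bar(s', .): max is action 2

y_standard = dqn_target(1.0, q_target_next, gamma=0.9)
y_double = double_dqn_target(1.0, q_online_next, q_target_next, gamma=0.9)
```

Because the evaluated action is not the target network's own maximizer, the double target can never exceed the standard one for the same inputs.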

Generalized Double $Q$-Learning

Note that $\arg\max_{\mathbf{a}'}Q_{\theta}(\mathbf{s}',\mathbf{a}')$ is essentially equivalent to taking the action according to a greedy policy $\pi_{\theta}$, i.e. $Q_{\bar{\theta}}(\mathbf{s}',\arg\max_{\mathbf{a}'}Q_{\theta}(\mathbf{s}',\mathbf{a}'))=\mathbb{E}_{\mathbf{a}'\sim \pi_{\theta}(\mathbf{a}'\mid \mathbf{s}')}[Q_{\bar{\theta}}(\mathbf{s}',\mathbf{a}')]$. In other words,

$$y=r+\gamma\, \mathbb{E}_{\mathbf{a}'\sim \pi_{\theta}(\mathbf{a}'\mid \mathbf{s}')}[Q_{\bar{\theta}}(\mathbf{s}',\mathbf{a}')]$$

Occasionally, if overestimation is a problem even after double $Q$-learning, some practitioners implement clipped double $Q$-learning.

$$y=r+\gamma\, \mathbb{E}_{\mathbf{a}'\sim \pi_{\theta}(\mathbf{a}'\mid \mathbf{s}')}\left[\min_{j\in \{ 1,2 \}}Q_{\bar{\theta}_{j}}(\mathbf{s}',\mathbf{a}')\right]$$

where we now train an ensemble of two different target networks $Q_{\bar{\theta}_{1}}$ and $Q_{\bar{\theta}_{2}}$. (You may have any number of target networks, but it gets increasingly expensive.) This can, however, cause underestimates for $Q$-learning, though it's often still practical for actor-critic!
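A minimal sketch of the clipped target, assuming a discrete action set so that the expectation over $\pi_{\theta}$ becomes a weighted sum. The function name, the policy probabilities and the $Q$-values are all hypothetical.

```python
def clipped_double_q_target(r, policy_probs, q_bar_1, q_bar_2, gamma):
    """r + gamma * E_{a' ~ pi}[ min(Q_bar_1(s', a'), Q_bar_2(s', a')) ].

    policy_probs[a]: probability of action a under the policy at s'.
    q_bar_1[a], q_bar_2[a]: the two target networks' values at (s', a).
    """
    expected_min = sum(p * min(q1, q2)
                       for p, q1, q2 in zip(policy_probs, q_bar_1, q_bar_2))
    return r + gamma * expected_min


# Hypothetical values for two actions: 0.5 * min(2, 3) + 0.5 * min(4, 1)
y_clipped = clipped_double_q_target(
    r=0.0, policy_probs=[0.5, 0.5],
    q_bar_1=[2.0, 4.0], q_bar_2=[3.0, 1.0], gamma=1.0,
)
```

Taking the minimum action by action is what makes the estimate pessimistic, which is the source of the underestimation mentioned above.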

Tips for $Q$-Learning

Practical Tips

Advanced Tips

Back to Continuous Actions

$Q$-learning with continuous actions

With continuous actions, choosing the value-maximizing action is hard and inefficient. This is particularly problematic when we're evaluating the target values, as this occurs in the innermost training loop. There are a couple of ways to solve this.

Solution 1: Stochastic Optimization

The simplest solution is to approximate $\max_{\mathbf{a}}Q(\mathbf{s},\mathbf{a})$ with some samples, i.e.

$$\max_{\mathbf{a}}Q(\mathbf{s},\mathbf{a})\approx \max \{ Q(\mathbf{s},\mathbf{a}_{1}),\dots ,Q(\mathbf{s},\mathbf{a}_{N}) \}$$

this is efficiently parallelizable and easy to implement, but unfortunately lacking in accuracy. We can do better.

These stochastic optimizers work decently up to about 40 dimensions.

Very popular improvement to offline actor-critic methods.

Such methods are also known as sample and rank or rejection sampling.
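The sampling approximation from Solution 1 might look like the following sketch. It assumes a one-dimensional action drawn uniformly from a box, and `toy_q` is a hypothetical stand-in for a learned $Q$-function.

```python
import random


def sampled_max_q(q_fn, state, num_samples, low, high, rng):
    """Approximate max_a Q(s, a) by the best of num_samples random actions."""
    best_a, best_q = None, float("-inf")
    for _ in range(num_samples):
        a = rng.uniform(low, high)
        q = q_fn(state, a)
        if q > best_q:
            best_a, best_q = a, q
    return best_a, best_q


def toy_q(s, a):
    """Hypothetical Q-function whose true maximizer is a = 0.5 * s."""
    return -(a - 0.5 * s) ** 2


rng = random.Random(0)
a_hat, q_hat = sampled_max_q(toy_q, state=1.0, num_samples=1000,
                             low=-1.0, high=1.0, rng=rng)
```

The sampled maximum can only undershoot the true maximum, and the number of samples needed to get close grows quickly with the action dimension.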

Solution 2: Learn Approximate Maximizer

The classic example is deep deterministic policy gradient (DDPG), which is essentially deterministic actor-critic. The idea is to train another model $\mu_{\theta}(\mathbf{s})$ to approximate $\arg\max_{\mathbf{a}}Q_{\phi}(\mathbf{s},\mathbf{a})$, where we find $\theta\leftarrow\arg\max_{\theta}Q_{\phi}(\mathbf{s},\mu_{\theta}(\mathbf{s}))$. This is found by following the gradient given by the chain rule, with $\mathbf{a}=\mu_{\theta}(\mathbf{s})$:

$$\frac{\mathrm{d} Q_{\phi}}{\mathrm{d} \theta} = \frac{\mathrm{d} \mathbf{a}}{\mathrm{d} \theta} \frac{\mathrm{d} Q_{\phi}}{\mathrm{d} \mathbf{a}}$$

which is essentially the same as the reparametrization trick from Lecture 6.
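The chain rule above can be illustrated with a toy case where both factors are available in closed form. This sketch assumes a hypothetical critic $Q(\mathbf{s},\mathbf{a})=-(a-2s)^{2}$ and a one-parameter linear actor $\mu_{\theta}(s)=\theta s$; it is not DDPG itself, only the actor's gradient step.

```python
def critic_grad_wrt_action(s, a):
    """dQ/da for the hypothetical critic Q(s, a) = -(a - 2 s)^2."""
    return -2.0 * (a - 2.0 * s)


def actor(theta, s):
    """Hypothetical linear actor mu_theta(s) = theta * s, so da/dtheta = s."""
    return theta * s


def actor_step(theta, states, lr):
    """One gradient ascent step on the mean of Q(s, mu_theta(s))."""
    grad = 0.0
    for s in states:
        a = actor(theta, s)
        grad += s * critic_grad_wrt_action(s, a)   # da/dtheta * dQ/da
    return theta + lr * grad / len(states)


theta = 0.0
states = [0.5, 1.0, 1.5]
for _ in range(200):
    theta = actor_step(theta, states, lr=0.1)
```

For this critic the maximizing action is $a=2s$, so the actor parameter should approach $2$. With neural networks, automatic differentiation computes the same product of derivatives.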

The new target value is thus

$$y_{j}= r_{j}+\gamma Q_{\bar{\phi}}(\mathbf{s}_{j}',\mu_{\bar{\theta}}(\mathbf{s}_{j}'))$$

Deterministic Actor-Critic?

$\mu_{\theta}$ is the actor, $Q_{\phi}$ is the critic. Deterministic because there is no expected value on $Q_{\phi}$ over sampled actions from the policy.

Some $Q$-Learning Theory

Value Iteration

Let's return to value iteration.

  1. $Q(\mathbf{s},\mathbf{a})\leftarrow r(\mathbf{s},\mathbf{a})+\gamma\,\mathbb{E}[V(\mathbf{s}')]$.
  2. $V(\mathbf{s})\leftarrow \max_{\mathbf{a}}Q(\mathbf{s},\mathbf{a})$.
  3. Repeat.

Does value iteration converge?

Let $\mathcal{B}$ be an operator such that, for a vector $V$ composed of the values of the different states,

$$\mathcal{B}V=\max_{\mathbf{a}}\left(r_{\mathbf{a}} + \gamma \mathcal{T}_{\mathbf{a}}V\right)$$

where the $\max$ is taken element-wise, $r_{\mathbf{a}}$ is the stacked vector of rewards at all states for action $\mathbf{a}$, and $\mathcal{T}_{\mathbf{a}}$ is the matrix of transitions for action $\mathbf{a}$, i.e. $\mathcal{T}_{\mathbf{a},i,j}=p(\mathbf{s}'=j\mid \mathbf{s}=i,\mathbf{a})$, so that entry $i$ of $\mathcal{T}_{\mathbf{a}}V$ is the expected next-state value from state $i$. This is essentially value iteration written out with just an operator $\mathcal{B}$ and a value vector $V$, i.e.

  1. $V\leftarrow \mathcal{B}V$
  2. Repeat

We note that $V^{*}$, the vector representing the optimal value function, is a fixed point of $\mathcal{B}$, i.e. $V^{*}=\mathcal{B}V^{*}$. We can prove that $V^{*}$ always exists, is always unique, and is always optimal. The way we prove that repeatedly applying $\mathcal{B}$ converges to $V^{*}$ is by showing that $\mathcal{B}$ is a contraction. That is, for any vectors $V,\bar{V}$, $\lVert \mathcal{B}V-\mathcal{B}\bar{V} \rVert_{\infty}\leq\gamma \lVert V-\bar{V} \rVert_{\infty}$. In other words, the gap always gets smaller by at least a factor of $\gamma$. We will not prove this now, just take my word :)

If we choose $\bar{V}=V^{*}$, we have that $\lVert \mathcal{B}V-\mathcal{B}V^{*} \rVert_{\infty}=\lVert \mathcal{B}V-V^{*} \rVert_{\infty}\leq\gamma \lVert V-V^{*} \rVert_{\infty}$, or that the operator always contracts the gap to $V^{*}$. Thus, value iteration converges.

$\square$
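The contraction property can be observed numerically on a small tabular problem. The sketch below assumes a made-up MDP with two states and two actions, stored with the convention `P[a][s][s2]` $=p(\mathbf{s}'=s_2\mid \mathbf{s}=s,\mathbf{a}=a)$; all numbers are hypothetical.

```python
GAMMA = 0.9

# Hypothetical MDP: P[a][s][s2] is a transition probability, R[a][s] a reward.
P = [
    [[0.8, 0.2], [0.3, 0.7]],   # action 0
    [[0.1, 0.9], [0.6, 0.4]],   # action 1
]
R = [
    [1.0, 0.0],                 # action 0
    [0.0, 2.0],                 # action 1
]


def bellman_backup(v):
    """Apply B: element-wise max over actions of r_a + gamma * T_a v."""
    n = len(v)
    return [
        max(R[a][s] + GAMMA * sum(P[a][s][s2] * v[s2] for s2 in range(n))
            for a in range(len(P)))
        for s in range(n)
    ]


def sup_norm(u, v):
    return max(abs(x - y) for x, y in zip(u, v))


# Track the gap between two arbitrary value vectors under repeated backups.
v, v_bar = [0.0, 0.0], [10.0, -5.0]
gaps = []
for _ in range(50):
    gaps.append(sup_norm(v, v_bar))
    v, v_bar = bellman_backup(v), bellman_backup(v_bar)
```

Each entry of `gaps` should be at most `GAMMA` times the previous one, which is the inequality stated above, and both vectors approach the same fixed point.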

Non-Tabular Value Function Learning

Non-tabular value function learning, e.g. fitted value iteration, is a bit trickier, because we train a model that has at least some error. However, we can interpret the regression performed by fitted value iteration as a projection (in the $\ell_{2}$ norm, due to MSE) onto the space of functions representable by the model.

$$V'\leftarrow \underset{V'\in\Omega}{\arg\min}\ \frac{1}{2}\sum \lVert V'(\mathbf{s})-(\mathcal{B}V)(\mathbf{s}) \rVert^{2}$$

where $\Omega$ represents the space of representable value functions. In essence, each time we update $V\leftarrow \mathcal{B}V$, we first project $\mathcal{B}V$ back into $\Omega$. To simplify this, we define a new operator

$$\Pi:\ \Pi V=\underset{V'\in\Omega}{\arg\min}\ \frac{1}{2}\sum \lVert V'(\mathbf{s})-V(\mathbf{s}) \rVert^{2}$$

so we can write fitted value iteration as

  1. $V\leftarrow \Pi \mathcal{B}V$
  2. Repeat

Now, $\mathcal{B}$ is a contraction w.r.t. the $\infty$-norm, and $\Pi$ is a contraction w.r.t. the $\ell_{2}$-norm. Yet, because they are contractions only w.r.t. different norms, $\Pi \mathcal{B}$ is not necessarily a contraction of any kind! Thus, while value iteration does converge, fitted value iteration is not guaranteed to converge. (Though, with deep RL, $\Omega$ is much larger in practice, and therefore it often works well.)

$\square$

Fitted $Q$-Iteration, Actor-Critic

We can do the same for $Q$-iteration (and online $Q$-learning) and actor-critic, and find that they too are not guaranteed convergence for essentially the same reason.

Conclusions