
Lecture 20: Reinforcement Learning Theory

Why study RL theory?

First, there are some interesting questions asked in RL theory.

In analyzing RL theory, it's also necessary to make some strong assumptions to achieve any interesting results, while not deviating too far from reality. For...

The point of this is not to prove that an algorithm works perfectly every time (in fact, in deep RL, typically not even convergence is guaranteed). However, it does yield interesting conclusions about the effect of problem parameters and hyperparameters on error: under strong but reasonable assumptions, RL theory produces qualitative conclusions about various factors that may guide decisions in real RL problems.

Some Useful Identities

$$\begin{align*} Q^{\pi}(s,a) &= r(s,a)+\gamma \mathbb{E}_{s'\sim p(s'\mid s,a)}[V^{\pi}(s')]=r(s,a)+\gamma \sum_{s'}p(s'\mid s,a)V^{\pi}(s') \implies \\ Q^{\pi} &= r + \gamma PV^{\pi} \\ \\ p^{\pi}(s',a'\mid s,a) &= \pi(a'\mid s')p(s'\mid s,a) \implies \\ Q^{\pi} &= r + \gamma P^{\pi}Q^{\pi} \\ \\ Q^{\pi}-\gamma P^{\pi}Q^{\pi} &= r \\ (I-\gamma P^{\pi})Q^{\pi} &= r \\ Q^{\pi} &= (I-\gamma P^{\pi})^{-1}r \end{align*}$$

Note that $r$, $Q^{\pi}$, and $V^{\pi}$ are vectors (indexed by $(s,a)$, $(s,a)$, and $s$ respectively), while $P$ and $P^{\pi}$ are matrices.

That final equation actually gives us a method of recovering the $Q$-function, but it is impractical, not just because of the typically high dimensionality of the state and action spaces but also because we typically do not assume knowledge of $P^{\pi}$ (which involves state transition probabilities). Notably, though, that last equation tells us that policy evaluation is essentially just a linear operation.
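As a rough illustration of the closed form above, the following sketch evaluates a fixed policy on a small random tabular MDP by solving the linear system $(I-\gamma P^{\pi})Q^{\pi}=r$. The helper names `random_mdp` and `evaluate_policy`, the MDP sizes, and the random seed are assumptions made for this example only.

```python
import numpy as np

def random_mdp(n_states, n_actions, seed=0):
    """Build a small random tabular MDP (hypothetical helper for illustration)."""
    rng = np.random.default_rng(seed)
    # P[s, a, s'] = p(s' | s, a); each distribution over s' sums to 1.
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)
    r = rng.random((n_states, n_actions))      # rewards in [0, 1)
    pi = rng.random((n_states, n_actions))
    pi /= pi.sum(axis=1, keepdims=True)        # pi[s, a] = pi(a | s)
    return P, r, pi

def evaluate_policy(P, r, pi, gamma):
    """Solve (I - gamma * P_pi) Q = r, with Q flattened over (s, a)."""
    S, A = r.shape
    # P_pi[(s, a), (s', a')] = p(s' | s, a) * pi(a' | s')
    P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    Q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(S * A))
    return Q.reshape(S, A)

P, r, pi = random_mdp(n_states=4, n_actions=2)
gamma = 0.9
Q = evaluate_policy(P, r, pi, gamma)

# Sanity check: Q should satisfy Q = r + gamma * P V, with V(s) = sum_a pi(a|s) Q(s,a).
V = (pi * Q).sum(axis=1)
bellman_residual = np.abs(Q - (r + gamma * P @ V)).max()
```

The residual should be at floating-point precision, and because rewards lie in $[0,1)$ every entry of `Q` stays below $\frac{1}{1-\gamma}$.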

Now, say $r\approx 1$. From this, we can note that $\sum_{t=0}^{\infty}\gamma^{t}r\approx \sum_{t=0}^{\infty}\gamma^{t}=\frac{1}{1-\gamma}$. Thus, the total reward of a trajectory is approximately $\frac{1}{1-\gamma}$, and the entries of $Q^{\pi}$ are roughly $\frac{1}{1-\gamma}$ in magnitude. Alternatively, for a finite-horizon problem (with no discounting), the total reward of a trajectory is approximately $\sum_{t=1}^{H}r\approx H$, and the entries of $Q^{\pi}$ would be roughly $H$ in magnitude. Therefore, for infinite-horizon tasks with discounting, we can roughly treat $\frac{1}{1-\gamma}$ as an "effective horizon" for the problem.

Convergence of Value Iteration

In linear algebra notation, value iteration is $V\leftarrow \max_{a}[r+\gamma PV]$. Let's define an operator $T$ such that

$$TV=\max _{a}[r+\gamma PV]$$

This is known as the Bellman optimality operator. Also, let us note the following useful inequality:

$$\lvert \max _{x}f(x)-\max _{x}g(x) \rvert \leq \max _{x}\lvert f(x)-g(x) \rvert$$

Consider, now, two value functions $V(s)$ and $U(s)$. We'll apply the Bellman optimality operator to both, and consider the difference.

$$\begin{align*} \lvert TV(s)-TU(s) \rvert &= \left\lvert \max _{a}\left[ r(s,a)+\gamma \sum_{s'}p(s'\mid s,a)V(s') \right]-\max _{a}\left[ r(s,a)+\gamma \sum_{s'}p(s'\mid s,a)U(s') \right] \right\rvert \\ &\leq \gamma\max _{a}\left\lvert \sum_{s'}p(s'\mid s,a)(V(s')-U(s')) \right\rvert \\ &\leq \gamma\max _{s'}\lvert V(s')-U(s') \rvert \\ &= \gamma \lVert V-U \rVert _{\infty} \end{align*}$$

Thus, applying $T$ to any two value functions $V$ and $U$ will "bring them closer" by a factor of at least $\gamma$ with respect to the infinity norm; that is, $T$ is a $\gamma$-contraction.

Also, we will note that $TV^{*}=V^{*}$, or that the optimal value function does not change by application of the Bellman optimality operator. We will not prove this fact for brevity.

Therefore,

$$\begin{align*} \lVert TV-V^{*} \rVert _{\infty} &= \lVert TV-TV^{*} \rVert _{\infty} \leq \gamma\lVert V-V^{*} \rVert _{\infty} \\ \lVert T^{k}V-V^{*} \rVert _{\infty} &\leq \gamma^{k}\lVert V-V^{*} \rVert _{\infty} \\ \lim_{ k \to \infty } \lVert T^{k}V-V^{*} \rVert _{\infty} &= 0 \end{align*}$$

And thus value iteration converges!
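The contraction argument can be checked numerically. The sketch below, on a randomly generated MDP (sizes, seed, and the name `bellman_optimality` are assumptions for illustration), verifies the one-step contraction on two arbitrary value functions and then runs value iteration while recording the gap between successive iterates.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)       # P[s, a, s'] = p(s' | s, a)
r = rng.random((S, A))

def bellman_optimality(V):
    """(TV)(s) = max_a [ r(s, a) + gamma * sum_s' p(s' | s, a) V(s') ]."""
    return (r + gamma * P @ V).max(axis=1)

# One-step contraction check on two arbitrary value functions.
V, U = rng.normal(size=S), rng.normal(size=S)
lhs = np.abs(bellman_optimality(V) - bellman_optimality(U)).max()
rhs = gamma * np.abs(V - U).max()

# Value iteration: successive gaps should shrink by at least a factor gamma.
V_k = np.zeros(S)
gaps = []
for _ in range(200):
    V_next = bellman_optimality(V_k)
    gaps.append(np.abs(V_next - V_k).max())
    V_k = V_next

fixed_point_residual = np.abs(bellman_optimality(V_k) - V_k).max()
```

Here `lhs <= rhs` is the inequality $\lVert TV-TU \rVert_{\infty}\leq\gamma\lVert V-U \rVert_{\infty}$, and the shrinking entries of `gaps` reflect the geometric convergence rate $\gamma^{k}$.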

Sample Complexity without Exploration

First, let's define our algorithm and assumptions. We will assume "oracle exploration," i.e. we can sample $s'\sim p(s'\mid s,a)$ for each $(s,a)$ up to $N$ times. Our algorithm is a simple "model based" algorithm.

  1. $\hat{p}(s'\mid s,a)=\frac{\#(s,a,s')}{N}$.
  2. Given $\pi$, use $\hat{p}$ to estimate $\hat{Q}^{\pi}$.
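The two steps above can be sketched as follows. This is only a minimal illustration under stated assumptions: a random ground-truth MDP, a uniform policy, $N=5000$ samples per $(s,a)$, and exact linear-system policy evaluation under the estimated model (the name `evaluate` is hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, N = 4, 2, 0.9, 5000
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)       # true model, P[s, a, s'] = p(s' | s, a)
r = rng.random((S, A))                  # rewards in [0, 1), so R_max <= 1
pi = np.full((S, A), 1.0 / A)           # uniform policy, chosen for illustration

def evaluate(P_model):
    """Policy evaluation under a given model: solve (I - gamma * P_pi) Q = r."""
    P_pi = (P_model[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    Q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(-1))
    return Q.reshape(S, A)

# Step 1: p_hat(s' | s, a) = #(s, a, s') / N from N oracle samples per (s, a).
P_hat = np.zeros_like(P)
for s in range(S):
    for a in range(A):
        counts = rng.multinomial(N, P[s, a])
        P_hat[s, a] = counts / N

# Step 2: evaluate the policy under the estimated model.
Q_true, Q_hat = evaluate(P), evaluate(P_hat)
error = np.abs(Q_true - Q_hat).max()    # infinity-norm error of Q_hat
```

The quantity `error` is exactly the $\lVert Q^{\pi}-\hat{Q}^{\pi} \rVert_{\infty}$ that the rest of this section bounds in terms of $N$.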

Then, we'd like to answer the following questions.

  1. How close is $\hat{Q}^{\pi}$ to $Q^{\pi}$?
  2. How close is $\hat{Q}^{*}$ to $Q^{*}$ if we learn it using $\hat{p}$?
  3. How close is the $Q$-function of the final policy, $Q^{\hat{\pi}^{*}}$, to $Q^{*}$?

We'll first address question 1, where we define $(\epsilon,\delta)$-closeness as $\lVert Q^{\pi}(s,a)-\hat{Q}^{\pi}(s,a) \rVert_{\infty}\leq\epsilon$ with probability at least $1-\delta$ if $N\geq f(\epsilon,\delta)$. Note that we use $\lVert \cdot \rVert_{\infty}$ because it provides worst-case performance bounds.

Concentration Inequalities

Now, relating samples to errors is highly nontrivial. We will need a concentration inequality; perhaps the simplest such inequality is Hoeffding's inequality.

[Figure: hoeffdings-inequality.png (statement of Hoeffding's inequality)]
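Since the figure is not reproduced here, the standard two-sided form of Hoeffding's inequality is written out below for reference. The notation is an assumption chosen to match the surrounding text: $X_{1},\dots,X_{n}$ are i.i.d. samples with mean $\mu$, each bounded in $[b_{-},b_{+}]$, and $\bar{X}_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i}$ is the sample mean.

$$\mathbb{P}\left( \lvert \bar{X}_{n}-\mu \rvert \geq \epsilon \right) \leq 2\exp\left( -\frac{2n\epsilon^{2}}{(b_{+}-b_{-})^{2}} \right) \quad \text{for all } \epsilon>0$$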

Intuitively, the interpretation is that if we estimate $\mu$ with $n$ samples, the probability we're off by more than $\epsilon$ is at most $2e^{ -2n\epsilon^{2}/(b_{+}-b_{-})^{2} }$.

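Equivalently, if we want this probability bound to be $\delta$, we can set $\delta=2e^{ -2n\epsilon^{2}/(b_{+}-b_{-})^{2} }$ and solve for $\epsilon$. With probability at least $1-\delta$, the estimation error is then at most

$$\epsilon = \frac{b_{+}-b_{-}}{\sqrt{ 2n }}\sqrt{ \log \frac{2}{\delta} }$$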

Importantly, $\epsilon \propto \frac{1}{\sqrt{ n }}$.
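To make the $\frac{1}{\sqrt{n}}$ scaling concrete, here is a small sketch that converts between sample count and error tolerance using the Hoeffding bound. The function names `hoeffding_epsilon` and `hoeffding_samples` and the default range $[0,1]$ are assumptions for illustration.

```python
import math

def hoeffding_epsilon(n, delta, b_minus=0.0, b_plus=1.0):
    """Error tolerance guaranteed with probability >= 1 - delta after n samples."""
    return (b_plus - b_minus) / math.sqrt(2 * n) * math.sqrt(math.log(2 / delta))

def hoeffding_samples(epsilon, delta, b_minus=0.0, b_plus=1.0):
    """Smallest n for which the Hoeffding bound guarantees error <= epsilon."""
    n = (b_plus - b_minus) ** 2 * math.log(2 / delta) / (2 * epsilon ** 2)
    return math.ceil(n)

n_needed = hoeffding_samples(epsilon=0.05, delta=0.05)
eps_at_n = hoeffding_epsilon(n_needed, delta=0.05)

# Quadrupling the number of samples halves the guaranteed error.
ratio = hoeffding_epsilon(400, 0.05) / hoeffding_epsilon(100, 0.05)
```

Note the asymmetry in the two parameters: the sample count grows as $\frac{1}{\epsilon^{2}}$ but only logarithmically in $\frac{1}{\delta}$.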

However, we are estimating sample probabilities, not sample averages. For that, we need a different concentration inequality.

[Figure: discrete-concentration.png (concentration inequality for discrete distributions)]
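The figure is not reproduced here. A form of the inequality consistent with how it is applied in this section is given below, with assumed notation: $z$ is a discrete random variable taking one of $d$ values with distribution $q$, and $\hat{q}$ is the vector of empirical frequencies from $N$ i.i.d. samples.

$$\mathbb{P}\left( \lVert \hat{q}-q \rVert_{1} \geq \sqrt{ d }\left( \frac{1}{\sqrt{ N }}+\epsilon \right) \right) \leq e^{ -N\epsilon^{2} } \quad \text{for all } \epsilon>0$$

Setting the right-hand side equal to $\delta$ is what links $\epsilon$, $\delta$, and $N$ in the bound on $\hat{p}$.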

From this, we derive

$$\epsilon \leq \frac{1}{\sqrt{ N }}\sqrt{ \log \frac{1}{\delta} }$$

and we may apply this to our problem and write

$$\begin{align*} \lVert \hat{p}(\cdot\mid s,a)-p(\cdot\mid s,a) \rVert_{1} &\leq \sqrt{ \lvert S \rvert }\left( \frac{1}{\sqrt{ N }}+\epsilon \right),\text{ with probability } 1-\delta \\ &\leq \sqrt{ \frac{\lvert S \rvert }{N} } + \sqrt{ \frac{\lvert S \rvert \log \frac{1}{\delta}}{N} } \\ &\leq c\sqrt{ \frac{\lvert S \rvert \log \frac{1}{\delta}}{N} } \end{align*}$$

for some constant $c$.

Useful Lemmas

We'd like to now relate the error of $\hat{p}$ to the error of $\hat{Q}^{\pi}$. We can note that

$$\begin{align*} Q^{\pi} &= r+\gamma PV^{\pi} \\ V^{\pi} &= \Pi Q^{\pi} \\ P^{\pi} &= P\Pi \end{align*}$$

Note that $\Pi$ represents the matrix of $\pi(a\mid s)$ for all $(s,a)$, and $P$ represents the matrix of $p(s'\mid s,a)$ for all $(s,a,s')$. These directly imply that

$$\begin{align*} Q^{\pi} &= r+\gamma P^{\pi}Q^{\pi} \\ Q^{\pi} &= (I-\gamma P^{\pi})^{-1}r \end{align*}$$

The same identity holds for $\hat{Q}^{\pi}$ under the estimated model $\hat{P}$, so we note the similarity

$$\begin{align*} Q^{\pi} &= (I-\gamma P^{\pi})^{-1}r \\ \hat{Q}^{\pi} &= (I-\gamma \hat{P}^{\pi})^{-1}r \end{align*}$$

Then, we claim the following simulation lemma is true.

$$Q^{\pi}-\hat{Q}^{\pi} = \gamma(I-\gamma \hat{P}^{\pi})^{-1}(P-\hat{P})V^{\pi}$$

Proof.

$$\begin{align*} Q^{\pi} - \hat{Q}^{\pi} &= Q^{\pi}-(I-\gamma \hat{P}^{\pi})^{-1}r \\ &= (I-\gamma \hat{P}^{\pi})^{-1}(I-\gamma \hat{P}^{\pi})Q^{\pi}-(I-\gamma \hat{P}^{\pi})^{-1}r \\ &= (I-\gamma \hat{P}^{\pi})^{-1}(I-\gamma \hat{P}^{\pi})Q^{\pi}-(I-\gamma \hat{P}^{\pi})^{-1}(I-\gamma P^{\pi})Q^{\pi} \\ &= (I-\gamma \hat{P}^{\pi})^{-1}((I-\gamma \hat{P}^{\pi})Q^{\pi}-(I-\gamma P^{\pi})Q^{\pi}) \\ &= \gamma(I-\gamma \hat{P}^{\pi})^{-1}(P^{\pi}-\hat{P}^{\pi})Q^{\pi} \\ &= \gamma(I-\gamma \hat{P}^{\pi})^{-1}(P\Pi-\hat{P}\Pi)Q^{\pi} \\ &= \gamma(I-\gamma \hat{P}^{\pi})^{-1}(P-\hat{P})\Pi Q^{\pi} \\ &= \gamma(I-\gamma \hat{P}^{\pi})^{-1}(P-\hat{P})V^{\pi} \end{align*}$$
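The simulation lemma is an exact identity, so it can be sanity-checked numerically. The sketch below draws two unrelated random models to play the roles of $P$ and $\hat{P}$ (an assumption for illustration; in the algorithm $\hat{P}$ would come from samples) and builds $\Pi$ explicitly as an $\lvert S \rvert \times \lvert S \rvert\lvert A \rvert$ matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 3, 0.95

def random_model():
    """Random |S||A| x |S| matrix whose rows are distributions p(. | s, a)."""
    M = rng.random((S * A, S))
    return M / M.sum(axis=1, keepdims=True)

P, P_hat = random_model(), random_model()   # stand-ins for true and estimated models
r = rng.random(S * A)

# Pi[s, (s, a)] = pi(a | s), so that V = Pi Q and P_pi = P Pi.
pi = rng.random((S, A))
pi /= pi.sum(axis=1, keepdims=True)
Pi = np.zeros((S, S * A))
for s in range(S):
    Pi[s, s * A:(s + 1) * A] = pi[s]

I = np.eye(S * A)
Q = np.linalg.solve(I - gamma * P @ Pi, r)          # Q^pi under P
Q_hat = np.linalg.solve(I - gamma * P_hat @ Pi, r)  # Q_hat^pi under P_hat
V = Pi @ Q                                          # V^pi under P

lhs = Q - Q_hat
rhs = gamma * np.linalg.solve(I - gamma * P_hat @ Pi, (P - P_hat) @ V)
max_gap = np.abs(lhs - rhs).max()
```

The two sides should agree up to floating-point error regardless of how far $\hat{P}$ is from $P$, since the lemma involves no approximation.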

Another useful lemma is that, for any $P^{\pi}$ and any vector $v\in \mathbb{R}^{\lvert S \rvert \lvert A \rvert}$, we have that

$$\lVert (I-\gamma P^{\pi})^{-1}v \rVert _{\infty} \leq \lVert v \rVert _{\infty}/(1-\gamma)$$

This is basically just a formalization of the geometric series argument with $\gamma$ made previously. It's really just stating that, in policy evaluation, the maximum $Q$-function value is bounded by the maximum reward seen across all time steps multiplied by the effective horizon $\frac{1}{1-\gamma}$.

Proof.
Let $w=(I-\gamma P^{\pi})^{-1}v$.

$$\begin{align*} \lVert v \rVert _{\infty} &= \lVert (I-\gamma P^{\pi})w \rVert _{\infty} \\ &\geq \lVert w \rVert _{\infty}-\gamma \lVert P^{\pi}w \rVert _{\infty} && \text{(reverse triangle inequality)} \\ &\geq \lVert w \rVert _{\infty} - \gamma \lVert w \rVert _{\infty} && (\lVert P^{\pi} \rVert _{\infty}\leq 1) \\ &= (1-\gamma)\lVert w \rVert _{\infty} \implies \\ \lVert v \rVert _{\infty} /(1-\gamma) &\geq \lVert (I-\gamma P^{\pi})^{-1}v \rVert_{\infty} \end{align*}$$
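As a quick numerical illustration of this lemma, the sketch below samples random vectors $v$ for a random row-stochastic matrix standing in for $P^{\pi}$ (sizes and seed are assumptions) and records the largest observed amplification $\lVert w \rVert_{\infty}/\lVert v \rVert_{\infty}$. It also checks the all-ones vector, for which the bound holds with equality because every row of $P^{\pi}$ sums to one.

```python
import numpy as np

rng = np.random.default_rng(2)
n, gamma = 12, 0.9
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)     # row-stochastic stand-in for P^pi
M = np.eye(n) - gamma * P_pi

ratios = []
for _ in range(1000):
    v = rng.normal(size=n)
    w = np.linalg.solve(M, v)               # w = (I - gamma P_pi)^{-1} v
    ratios.append(np.abs(w).max() / np.abs(v).max())
worst = max(ratios)

# The bound is tight: for v = 1 (all ones), w = 1 / (1 - gamma) in every entry.
w_ones = np.linalg.solve(M, np.ones(n))
```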

Putting it Together

Now, let's apply both lemmas and finish off the error bound.

$$\begin{align*} Q^{\pi}-\hat{Q}^{\pi} &= \gamma(I-\gamma \hat{P}^{\pi})^{-1}(P-\hat{P})V^{\pi} && \text{(Simulation Lemma)} \\ \lVert Q^{\pi}-\hat{Q}^{\pi} \rVert _{\infty} &= \lVert \gamma(I-\gamma \hat{P}^{\pi})^{-1}(P-\hat{P})V^{\pi} \rVert_{\infty} \\ &\leq \frac{\gamma}{1-\gamma} \lVert (P-\hat{P})V^{\pi} \rVert _{\infty} && \text{(Lemma 2)} \\ &\leq \frac{\gamma}{1-\gamma}\left(\max _{s,a}\lVert p(\cdot \mid s,a) -\hat{p}(\cdot \mid s,a) \rVert_{1}\right)\lVert V^{\pi} \rVert _{\infty} \end{align*}$$

Note that $\lVert V^{\pi} \rVert_{\infty}\leq \frac{1}{1-\gamma}R_{\text{max}}$ because of the same geometric series trick, and we can actually assume WLOG $R_{\text{max}}=1$ because, if it isn't, we can simply rescale the rewards appropriately (if we multiply the rewards by any positive constant, the optimal policy doesn't change). Thus, $\lVert V^{\pi} \rVert_{\infty} \leq \frac{1}{1-\gamma}$, and so

$$\begin{align*} \lVert Q^{\pi}-\hat{Q}^{\pi} \rVert _{\infty} &\leq \frac{\gamma}{(1-\gamma)^{2}}\left(\max _{s,a}\lVert p(\cdot \mid s,a) -\hat{p}(\cdot \mid s,a) \rVert_{1}\right) \end{align*}$$

Now recall the concentration inequality.

$$\lVert \hat{p}(\cdot\mid s,a)-p(\cdot\mid s,a) \rVert_{1} \leq c\sqrt{ \frac{\lvert S \rvert \log \frac{1}{\delta}}{N} }$$

This allows us to conclude that our error bound is

$$\lVert Q^{\pi}-\hat{Q}^{\pi} \rVert _{\infty} \leq \frac{\gamma}{(1-\gamma)^{2}}c_{2}\sqrt{ \frac{\lvert S \rvert \log \frac{1}{\delta}}{N} }=\epsilon$$

Note that $c_{2}$ exists because there is some constant factor involved when substituting for the lemma.
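As a worked rearrangement of this bound (not an additional result), solving the expression for $\epsilon$ for $N$ gives the number of samples per $(s,a)$ that suffices for a target accuracy, i.e. one possible form of the function $f(\epsilon,\delta)$ from the definition of $(\epsilon,\delta)$-closeness:

$$\epsilon = \frac{\gamma}{(1-\gamma)^{2}}c_{2}\sqrt{ \frac{\lvert S \rvert \log \frac{1}{\delta}}{N} } \iff N = \frac{c_{2}^{2}\gamma^{2}\lvert S \rvert \log \frac{1}{\delta}}{(1-\gamma)^{4}\epsilon^{2}}$$

So, under this bound, halving the target error quadruples the required samples, while doubling the effective horizon $\frac{1}{1-\gamma}$ multiplies them by roughly sixteen.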

Interpretation

So, what can we take away from our error bound $\epsilon$?

First, the error grows quadratically in the horizon, and each backup "accumulates" error.

$$\epsilon\propto \frac{1}{(1-\gamma)^{2}}$$

Second, more samples translate to lower error.

$$\epsilon\propto \sqrt{ \frac{1}{N} }$$

Implications about Other Questions

What about the other questions we asked previously?

Let's first consider the second, i.e. bounding $\lVert Q^{*}-\hat{Q}^{*} \rVert_{\infty}$. Note the following lemma

$$\lvert \sup_{x}f(x)-\sup_{x}g(x) \rvert \leq \sup_{x}\lvert f(x)-g(x) \rvert$$

(This is essentially the same as the previous inequality, just with $\sup$, the supremum, instead of $\max$.)

Then,

$$\begin{align*} \lVert Q^{*}-\hat{Q}^{*} \rVert _{\infty} &= \lVert \sup_{\pi}Q^{\pi}-\sup_{\pi}\hat{Q}^{\pi} \rVert _{\infty} \leq \sup_{\pi}\lVert Q^{\pi}-\hat{Q}^{\pi} \rVert _{\infty} \leq \epsilon \end{align*}$$

Now, let's consider the third, i.e. bounding $\lVert Q^{*}-Q^{\hat{\pi}^{*}} \rVert_{\infty}$.

$$\begin{align*} \lVert Q^{*}-Q^{\hat{\pi}^{*}} \rVert _{\infty} &= \lVert Q^{*}-\hat{Q}^{\hat{\pi}^{*}}+\hat{Q}^{\hat{\pi}^{*}}-Q^{\hat{\pi}^{*}} \rVert _{\infty} \\ &\leq \lVert Q^{*}-\hat{Q}^{\hat{\pi}^{*}} \rVert _{\infty} + \lVert Q^{\hat{\pi}^{*}}-\hat{Q}^{\hat{\pi}^{*}} \rVert_{\infty} \end{align*}$$

We note that $\hat{Q}^{\hat{\pi}^{*}}$ is just $\hat{Q}^{*}$ (since $\hat{\pi}^{*}$ is the policy that is optimal under $\hat{p}$), and so the first term is bounded by $\epsilon$ according to our second derived bound, $\lVert Q^{*}-\hat{Q}^{*} \rVert_{\infty}\leq\epsilon$. Meanwhile, the second term is bounded by $\epsilon$ according to our first derived bound, $\lVert Q^{\pi}-\hat{Q}^{\pi} \rVert_{\infty}\leq\epsilon$, where $\pi=\hat{\pi}^{*}$ here. Thus,

$$\lVert Q^{*}-Q^{\hat{\pi}^{*}} \rVert _{\infty} \leq 2\epsilon$$

Analysis of Model-Free RL

We'll now analyze fitted $Q$-iteration.

First, let $T$ be the Bellman operator such that

$$TQ = r+\gamma P\max _{a}Q$$

Then, exact fitted $Q$-iteration may be defined as $Q_{k+1}\leftarrow TQ_{k}$. For approximate fitted $Q$-iteration that uses samples, it changes slightly to be $\hat{Q}_{k+1}\leftarrow \arg\min_{\hat{Q}}\lVert \hat{Q}-\hat{T}\hat{Q}_{k} \rVert$. Here $\hat{T}$ is the Bellman operator, except it accounts for sampling by using the empirical frequency of observations of each $s'$. In particular, we define $\hat{T}$ as

$$\hat{T}Q = \hat{r}+\gamma \hat{P}\max _{a}Q$$

where

$$\begin{align*} \hat{r}(s,a) &= \frac{1}{N(s,a)}\sum_{i}\delta((s_{i},a_{i})=(s,a))r_{i} \\ \hat{p}(s'\mid s,a) &= \frac{N(s,a,s')}{N(s,a)} \end{align*}$$
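The empirical operator $\hat{T}$ can be sketched directly from these count-based definitions. In the code below, the function name `empirical_backup`, the `(s, a, r, s_next)` tuple layout, and the convention of backing up unvisited $(s,a)$ pairs to zero are all assumptions for illustration; the tiny dataset is made up.

```python
import numpy as np

def empirical_backup(Q, transitions, n_states, n_actions, gamma):
    """Apply T_hat Q = r_hat + gamma * P_hat max_a Q, estimated from data.

    transitions: iterable of (s, a, r, s_next) tuples with integer s, a, s_next.
    """
    counts = np.zeros((n_states, n_actions, n_states))   # N(s, a, s')
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, rew, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += rew
    n_sa = counts.sum(axis=2)                            # N(s, a)
    seen = n_sa > 0
    # Unvisited (s, a) pairs keep r_hat = 0 and p_hat = 0 (arbitrary convention).
    r_hat = np.zeros_like(reward_sum)
    r_hat[seen] = reward_sum[seen] / n_sa[seen]
    p_hat = np.zeros_like(counts)
    p_hat[seen] = counts[seen] / n_sa[seen][:, None]
    return r_hat + gamma * p_hat @ Q.max(axis=1)

# Tiny made-up dataset: 2 states, 1 action.
data = [(0, 0, 1.0, 1), (0, 0, 0.0, 0), (1, 0, 2.0, 1)]
Q0 = np.array([[1.0], [3.0]])
TQ = empirical_backup(Q0, data, n_states=2, n_actions=1, gamma=0.5)
```

For this dataset, $\hat{r}(0,0)=0.5$ and $\hat{p}(\cdot\mid 0,0)=(0.5,0.5)$, so the first entry is $0.5+0.5\cdot(0.5\cdot 1+0.5\cdot 3)=1.5$; the second is $2+0.5\cdot 3=3.5$.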

However, our update rule now involves a norm, so which norm should we use? Unfortunately, convergence is not guaranteed if $\lVert \cdot \rVert_{2}$ is used; therefore, we will assume $\lVert \cdot \rVert_{\infty}$. Notably, no practical learning algorithm trains with the infinity norm, and mean squared error is of course used in practice. Some interesting properties of the MSE-based learning algorithm are provable; however, they are much more difficult. Thus, we proceed with a theoretical learning algorithm that works with the infinity norm.

Now, we'd like to answer the following question: as $k\to \infty$, $\hat{Q}_{k}\to\; ?$, or equivalently $\lim_{ k \to \infty }\lVert \hat{Q}_{k}-Q^{*} \rVert_{\infty}\leq\; ?$. Where does our error come from, though? For approximate fitted $Q$-iteration, it comes from sampling error, $T\neq \hat{T}$, and approximation error, $\hat{Q}_{k+1}\neq \hat{T}\hat{Q}_{k}$.

Sampling Error

Let's first understand the effect of sampling error.

$$\begin{align*} \lvert \hat{T}Q(s,a)-TQ(s,a) \rvert &= \lvert \hat{r}(s,a)-r(s,a) +\gamma(\mathbb{E}_{\hat{p}(s'\mid s,a)}[\max _{a'}Q(s',a')]-\mathbb{E}_{p(s'\mid s,a)}[\max _{a'}Q(s',a')]) \rvert \\ &\leq \lvert \hat{r}(s,a)-r(s,a) \rvert+\gamma \left\lvert \sum_{s'}(\hat{p}(s'\mid s,a)-p(s'\mid s,a))\max _{a'}Q(s',a') \right\rvert \end{align*}$$

The inequality applied here is the triangle inequality. For the first term, the estimation of a continuous random variable, we can use Hoeffding's inequality.

$$\lvert \hat{r}(s,a)-r(s,a) \rvert \leq 2R_{\text{max}}\sqrt{ \frac{\log \frac{1}{\delta}}{2N} }$$

Meanwhile, for the second term, we can use the other concentration inequality:

$$\begin{align*} \left\lvert \sum_{s'}(\hat{p}(s'\mid s,a)-p(s'\mid s,a))\max _{a'}Q(s',a') \right\rvert &\leq \sum_{s'}\lvert \hat{p}(s'\mid s,a)-p(s'\mid s,a) \rvert \max _{s',a'}\lvert Q(s',a') \rvert \\ &= \lVert \hat{p}(\cdot \mid s,a)-p(\cdot \mid s,a) \rVert_{1}\lVert Q \rVert _{\infty} \\ &\leq c\lVert Q \rVert _{\infty}\sqrt{ \frac{\log \frac{1}{\delta}}{N} } \end{align*}$$

Thus, we can bound our error as

$$\begin{align*} \lvert \hat{T}Q(s,a)-TQ(s,a) \rvert &\leq 2R_{\text{max}} \sqrt{ \frac{\log \frac{1}{\delta}}{2N} } + c\lVert Q \rVert _{\infty}\sqrt{ \frac{\log \frac{1}{\delta}}{N} } \end{align*}$$

Using the union bound, we can derive

$$\begin{align*} \lvert \hat{T}Q(s,a)-TQ(s,a) \rvert &\leq 2R_{\text{max}}c_{1} \sqrt{ \frac{\log \frac{\lvert S \rvert \lvert A \rvert }{\delta}}{2N} } + c_{2}\lVert Q \rVert _{\infty}\sqrt{ \frac{\log \frac{\lvert S \rvert }{\delta}}{N} } \end{align*}$$

Approximation Error

We'll first analyze the exact backup operator. We will assume $\lVert \hat{Q}_{k+1}-T\hat{Q}_{k} \rVert_{\infty}\leq\epsilon_{k}$ for some $\epsilon_{k}$. Note that this is a pretty strong assumption!

$$\begin{align*} \lVert \hat{Q}_{k}-Q^{*} \rVert _{\infty} &= \lVert \hat{Q}_{k}-T\hat{Q}_{k-1} +T\hat{Q}_{k-1}-Q^{*} \rVert_{\infty} \\ &= \lVert (\hat{Q}_{k}-T\hat{Q}_{k-1})+(T\hat{Q}_{k-1}-TQ^{*}) \rVert_{\infty} \\ &\leq \lVert \hat{Q}_{k}-T\hat{Q}_{k-1} \rVert_{\infty} +\lVert T\hat{Q}_{k-1}-TQ^{*} \rVert_{\infty} \\ &\leq \epsilon_{k-1}+\lVert T\hat{Q}_{k-1}-TQ^{*} \rVert _{\infty} \\ &\leq \epsilon_{k-1} + \gamma\lVert \hat{Q}_{k-1}-Q^{*} \rVert _{\infty} \end{align*}$$

Recall the substitution of $Q^{*}$ with $TQ^{*}$ in step 2 is possible because $Q^{*}$ is a fixed point of the operator $T$. The last step leverages the fact that $T$ is a $\gamma$-contraction. We can recurse through this last inequality to produce

$$\begin{align*} \lVert \hat{Q}_{k}-Q^{*} \rVert _{\infty} &\leq \sum_{i=0}^{k-1} \gamma^{i}\epsilon_{k-i-1}+\gamma^{k}\lVert \hat{Q}_{0}-Q^{*} \rVert _{\infty} \\ \lim_{ k \to \infty } \lVert \hat{Q}_{k}-Q^{*} \rVert _{\infty} &\leq \sum_{i=0}^{\infty} \gamma^{i}\max _{k}\epsilon_{k} \\ &= \frac{1}{1-\gamma}\lVert \epsilon \rVert _{\infty} \\ &= \frac{1}{1-\gamma}\max _{k} \lVert \hat{Q}_{k}-T\hat{Q}_{k-1} \rVert _{\infty} \end{align*}$$

In other words, the approximation error scales with horizon.
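The recursion can be simulated by running exact backups and then perturbing each iterate by bounded noise, which stands in for the per-step error $\epsilon_{k}$. This is only a sketch under assumptions: a random MDP, uniform noise with $\lVert \cdot \rVert_{\infty}\leq$ `eps`, and $Q^{*}$ approximated by running exact iteration for many steps.

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma, eps = 5, 3, 0.9, 0.05
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)       # P[s, a, s'] = p(s' | s, a)
r = rng.random((S, A))

def T(Q):
    """Exact Bellman backup: TQ = r + gamma * P max_a Q."""
    return r + gamma * P @ Q.max(axis=1)

# Approximate Q* by iterating the exact operator until it has converged.
Q_star = np.zeros((S, A))
for _ in range(2000):
    Q_star = T(Q_star)

# Approximate iteration: Q_{k+1} = T Q_k + noise, with ||noise||_inf <= eps.
Q_k = np.zeros((S, A))
Q0_gap = np.abs(Q_k - Q_star).max()
violations = 0
for k in range(1, 301):
    Q_k = T(Q_k) + rng.uniform(-eps, eps, size=(S, A))
    # Finite-k bound: sum_{i<k} gamma^i * eps + gamma^k * ||Q_0 - Q*||_inf.
    bound = eps * (1 - gamma ** k) / (1 - gamma) + gamma ** k * Q0_gap
    if np.abs(Q_k - Q_star).max() > bound + 1e-9:
        violations += 1
final_error = np.abs(Q_k - Q_star).max()
```

The iterates never settle exactly on $Q^{*}$, but `final_error` stays within $\frac{\epsilon}{1-\gamma}$, matching the limit bound above.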

Putting it Together

Let's now combine our bounds on the approximation error and sampling error to produce a bound on total error.

$$\begin{align*} \lVert \hat{Q}_{k}-T\hat{Q}_{k-1} \rVert_{\infty} &= \lVert \hat{Q}_{k}-\hat{T}\hat{Q}_{k-1}+\hat{T}\hat{Q}_{k-1}-T\hat{Q}_{k-1} \rVert _{\infty} \\ &\leq \underbrace{ \lVert \hat{Q}_{k}-\hat{T}\hat{Q}_{k-1} \rVert _{\infty} }_{ \text{approximation error} }+\underbrace{ \lVert \hat{T}\hat{Q}_{k-1}-T\hat{Q}_{k-1} \rVert _{\infty} }_{ \text{sampling error} } \end{align*}$$

Here are the bounds again, for convenience.

$$\begin{align*} \lvert \hat{T}Q(s,a)-TQ(s,a) \rvert &\leq 2R_{\text{max}}c_{1} \sqrt{ \frac{\log \frac{\lvert S \rvert \lvert A \rvert }{\delta}}{2N} } + c_{2}\lVert Q \rVert _{\infty}\sqrt{ \frac{\log \frac{\lvert S \rvert }{\delta}}{N} } \\ \lim_{ k \to \infty } \lVert \hat{Q}_{k}-Q^{*} \rVert_{\infty} &\leq \frac{1}{1-\gamma}\max _{k} \lVert \hat{Q}_{k}-T\hat{Q}_{k-1} \rVert _{\infty} \end{align*}$$

Notably, the $\lVert Q \rVert_{\infty}$ in the sampling error bound is in $O\left( R_{\text{max}} \frac{1}{1-\gamma} \right)$. (Recall from previously that the entries of the $Q$-function are approximately the size of the horizon, assuming $r\approx 1$.) Meanwhile, there is a $\frac{1}{1-\gamma}$ factor in the approximation error bound. Thus, error compounds quadratically with the horizon, since the error for each step $\lVert \hat{Q}_{k}-T\hat{Q}_{k-1} \rVert_{\infty}$ scales with the horizon and there are $H$ steps for a horizon $H$.