Logo

Chapter 2: Multi-armed Bandits

The key difference between reinforcement learning and supervised learning is that RL uses training information to evaluate actions taken rather than instruct (i.e. supervised learning instructs the model on the correct output). In short, "evaluative feedback depends entirely on the action taken, whereas instructive feedback is independent of the action taken.

This chapter discusses a more simplified setting of RL—nonassociative RL, in which the model learns actions corresponding to only one situation. In particular, we explore the kk-armed bandit problem.

2.1 A kk-armed Bandit Problem

Consider the following learning problem. You are repeatedly given kk different actions. Each action provides a numerical reward drawn randomly from a stationary probability distribution corresponding to that action. The goal is to maximize the expected total reward over some time period. This is the original kk-armed bandit problem!

Fun Fact: Nomenclature

The kk-armed bandit problem is named as such because it is much like a slot machine with kk levers, and a regular slot machine (with one lever) is known colloquially as a "one-armed bandit."

For each action, it has associated with it some expected reward—we denote this as the value of the action. Let AtA_{t} denote the action taken at time step tt and RtR_{t} denote the corresponding reward. The value of an arbitrary action aa is

q(a)=E[RtAt=a]q_{*}(a)=\mathbb{E}[R_{t}\mid A_{t}=a]

Of course, we assume the value of each action is unknown (otherwise, its trivial). Therefore, we let Qt(a)Q_{t}(a) denote the estimated value of action aa at time step tt.

At any time step tt, there is always a greedy action aa' such that

a=argmaxa Qt(a)a'=\underset{ a }{ \arg\max }\ Q_{t}(a)

When selecting a greedy action, we say the model is exploiting current knowledge of QQ. Otherwise, we say the model is exploring, because selecting nongreedy actions helps improve the estimates of such actions. Exploitation is short term maximization, but exploration has the potential to translate to long term maximization. Of course, these principles are directly in conflict with each other, and a model must balance these properly to maximize reward.

2.2 Action-Value Methods

We now discuss methods for estimating the values of actions and using such estimates to select actions; these are aptly named action-value methods.

One natural way of estimating the value is by keeping a running average, i.e.

Qt(a)=i=1t1Ri1Ai=ai=1t11Ai=aQ_{t}(a)= \frac{\sum_{i=1}^{t-1} R_{i}\cdot \mathbb{1}_{A_{i}=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_{i}=a}}

Where 1\mathbb{1} is an indicator variable for Ai=aA_{i}=a. This is known as the sample-average estimation method.

A similarly natural action selection method is greedy selection,

At=argmaxa Qt(a)A_{t}=\underset{a}{\arg\max}\ Q_{t}(a)

which relies entirely on exploitation.

A simple alternative is to behave greedily most of the time, but behave non-greedily some ϵ\epsilon fraction of the time, choosing randomly from the non-greedy action. Such methods are known as ϵ\epsilon-greedy methods. The primary advantage of such algorithms is that all Qt(a)Q_{t}(a) will converge to q(a)q_{*}(a) as tt\to \infty, unlike the greedy method in which only the greedy choice converges.

2.3 The 10-Armed Testbed

Generally speaking, in the 10-armed bandit situation, ϵ=0.1\epsilon=0.1 performed better than ϵ=0.01\epsilon=0.01 and ϵ=0\epsilon=0 (pure greedy), for some random reward distribution. Generally speaking, the optimal choice of ϵ\epsilon varies depending on the variance of the reward distributions.

2.4 Incremental Implementation

Notably, the estimate Qa(t)Q_{a}(t) of the value function is an average over the past selections; recomputing the whole average each time is thus O(t)O(t). We can do better!

Let RiR_{i} denote the reward corresponding to the iith selection of an action aa, and QnQ_{n} denote the estimate after n1n-1 selections of aa. Then,

Qn=R1++Rn1n1Q_{n}= \frac{R_{1}+\dots +R_{n-1}}{n-1}

We can then derive the following for Qn+1Q_{n+1}.

Qn+1=1ni=1nRi=1n(Rn+n1n1i=1n1Ri)=1n(Rn+(n1)Qn)=Qn+1n[RnQn]\begin{align*} Q_{n+1} &= \frac{1}{n}\sum_{i=1}^{n} R_{i} \\ &= \frac{1}{n}\left( R_{n}+ \frac{n-1}{n-1} \sum_{i=1}^{n-1} R_{i} \right) \\ &= \frac{1}{n}(R_{n}+(n-1)Q_{n}) \\ &= Q_{n}+\frac{1}{n}[R_{n}-Q_{n}] \end{align*}

Thus, this enables O(1)O(1) updates to the estimate.

tip

The update rule formula

NewEstimateOldEstimate+StepSize[TargetOldEstimate]\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}[\text{Target}-\text{OldEstimate}]

will appear frequently throughout the book. Here, it's observed in Qn+1=Qn+1n[RnQn]Q_{n+1}=Q_{n}+ \frac{1}{n}[R_{n}-Q_{n}]. Notably, the step-size parameter is frequently denoted as α\alpha or αt(a)\alpha_{t}(a).

2.5 Tracking a Nonstationary Problem

The aforementioned bandit problem is stationary, in that the reward distributions stay constant. Nonstationary RL problems are much more common, and, as one might guess, should ideally weight more recent information (rewards) more heavily int the estimate.

One popular method is to choose α(0,1]\alpha \in(0,1] to be constant for the update rule

Qn+1=Qn+α[RnQn]Q_{n+1}=Q_{n}+\alpha[R_{n}-Q_{n}]

which, upon expansion, provides

Qn+1=(1α)nQ1+i=1nα(1α)niRiQ_{n+1}=(1-\alpha)^{n}Q_{1}+\sum_{i=1}^{n} \alpha(1-\alpha)^{n-i}R_{i}

This is known as an exponential recency-weighted average.

Sometimes, α\alpha may be varied from step to step. However, it is not necessarily guaranteed that a sequence of α\alpha, i.e. {α1(a),,αn(a)}\{ \alpha_{1}(a),\dots,\alpha_{n}(a) \}, converges to the true action values. The required conditions to assure convergence are

n=1αn(a)=, n=1αn2(a)<\sum_{n=1}^{\infty} \alpha_{n}(a)=\infty,\ \sum_{n=1}^{\infty} \alpha^{2}_{n}(a)<\infty

The first condition guarantees the steps are large enough to overcome initial conditions or noise and converge to the right value, while the second guarantees the steps become small enough to converge.

Notably, αn(a)=1n\alpha_{n}(a)=\frac{1}{n} (sample average) guarantees convergence, while αn(a)=α\alpha_{n}(a)=\alpha does not. However, this is actually desirable for a nonstationary environment, since reward distributions can change. Thus, convergent step-size parameters are largely relegated to theoretical work.

2.6 Optimistic Initial Values

One must note that all methods so far are dependent on the initial action-value estimates, Q1(a)Q_{1}(a), which express a prior over each action that biases its estimates, perhaps permanently.

Clever choice of initial action values can impact results. For instance, setting all initial estimates to a large number encourages exploring all actions at least once—this is known as optimistic initial values, and is useful, at times, for stationary problems.

2.7 Upper-Confidence-Bound Action Selection

We now return to the idea behind ϵ\epsilon-greedy methods. Previously, we simply chose randomly from the nongreedy actions; however, it would be better to also consider the potential of each nongreedy action when selecting.

One way of doing this is to select according to

At=argmaxa [Qt(a)+clntNt(a)]A_{t}=\underset{a}{\arg\max}\ \left[ Q_{t}(a)+c\sqrt{ \frac{\ln t}{N_{t}(a)} } \right]

where Nt(a)N_{t}(a) denotes the number of times that action aa has been selected prior to time tt and c>0c>0 is some constant that controls exploration strength (larger cc corresponds to more exploration). This is known as upper confidence bound (UCB) action selection.

The idea is that the square-root term measures the uncertainty of the model's estimate for the value of action aa. Thus, actions with less selections (and therefore more uncertainty) may be selected over actions with more estimated value at the current time step. Notably, the lnt\ln t in the numerator also ensures that, when some action is selected, the UCB value of all other actions increases.

Generally, UCB performs well on the kk-bandit problem; however, it's difficult to extend to general RL because (1)(1) it doesn't fit nonstationary problems and (2)(2) it is computationally intractable for large state spaces.

2.8 Gradient Bandit Algorithms

An alternative to action-value methods are methods that learn a numerical preference for each of the actions, denoted Ht(a)H_{t}(a), and then select an action according to the preferences using a softmax. Note that πt(a)\pi_{t}(a) denotes the probability of taking action aa at time tt.

Softmax action preferences, meanwhile, lend naturally to stochastic gradient ascent methods. (Ascent, not descent). In essence, on each step we update

Ht+1(At)=Ht(At)+α(RtRt)(1πt(At))Ht+1(a)=Ht(a)α(RtRt)πt(a), aAt\begin{align*} H_{t+1}(A_{t}) &= H_{t}(A_{t})+\alpha(R_{t}-\overline{R}_{t})(1-\pi_{t}(A_{t})) \\ H_{t+1}(a) &= H_{t}(a)-\alpha(R_{t}-\overline{R}_{t})\pi_{t}(a),\ \forall a\neq A_{t} \end{align*}

where α>0\alpha>0 is a step-size parameter and RtR\overline{R}_{t}\in \mathbb{R} denotes the average of the rewards up to, but not including, time tt. (Note: Rt\overline{R}_{t} may be chosen to be a different metric, but the running average is simple and works well in practice). This essentially changes the probability of choosing AtA_{t} proportionally to the distance between the received reward and the average reward seen across previous time steps.

Why Stochastic?

The reason this method is known as stochastic gradient ascent and not exact gradient ascent is because we cannot compute the exact gradient without knowledge of q(x)q_{*}(x). However, the expected update of the gradient bandit algorithm is equivalent to the gradient of expected reward, and thus is expected to converge to the same values as the exact gradient.

2.9 Associative Search (Contextual Bandits)

Throughout this chapter, we've discussed the nonassociative kk-armed bandits problem. An example of an associative bandits setting would be, at each time step, being randomly presented one of a set of nn different kk-armed bandit tasks, with some data that may indicate which task is currently present (data which must be learned). This is known as contextual bandits, and is an associative search task because it require trial-and-error search for optimal actions and the association of such actions with their corresponding situations. They are a step above kk-armed bandits, but not quite at the level of the full RL problem, in which actions are allowed to affect the next situation/state and reward. This is the topic for the next chapter! :)