Chapter 2: Multi-armed Bandits

The key difference between reinforcement learning and supervised learning is that RL uses training information to evaluate actions taken rather than instruct (i.e. supervised learning instructs the model on the correct output). In short, "evaluative feedback depends entirely on the action taken, whereas instructive feedback is independent of the action taken.

This chapter discusses a more simplified setting of RL—nonassociative RL, in which the model learns actions corresponding to only one situation. In particular, we explore the $k$ -armed bandit problem.

2.1 A $k$ -armed Bandit Problem

Consider the following learning problem. You are repeatedly given $k$ different actions. Each action provides a numerical reward drawn randomly from a stationary probability distribution corresponding to that action. The goal is to maximize the expected total reward over some time period. This is the original $k$ -armed bandit problem!

Fun Fact: Nomenclature

The $k$ -armed bandit problem is named as such because it is much like a slot machine with $k$ levers, and a regular slot machine (with one lever) is known colloquially as a "one-armed bandit."

For each action, it has associated with it some expected reward—we denote this as the value of the action. Let $A_{t}$ denote the action taken at time step $t$ and $R_{t}$ denote the corresponding reward. The value of an arbitrary action $a$ is

q_{*}(a)=\mathbb{E}[R_{t}\mid A_{t}=a]

Of course, we assume the value of each action is unknown (otherwise, its trivial). Therefore, we let $Q_{t}(a)$ denote the estimated value of action $a$ at time step $t$ .

At any time step $t$ , there is always a greedy action $a'$ such that

a'=\underset{ a }{ \arg\max }\ Q_{t}(a)

When selecting a greedy action, we say the model is exploiting current knowledge of $Q$ . Otherwise, we say the model is exploring, because selecting nongreedy actions helps improve the estimates of such actions. Exploitation is short term maximization, but exploration has the potential to translate to long term maximization. Of course, these principles are directly in conflict with each other, and a model must balance these properly to maximize reward.

2.2 Action-Value Methods

We now discuss methods for estimating the values of actions and using such estimates to select actions; these are aptly named action-value methods.

One natural way of estimating the value is by keeping a running average, i.e.

Q_{t}(a)= \frac{\sum_{i=1}^{t-1} R_{i}\cdot \mathbb{1}_{A_{i}=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_{i}=a}}

Where $\mathbb{1}$ is an indicator variable for $A_{i}=a$ . This is known as the sample-average estimation method.

A similarly natural action selection method is greedy selection,

A_{t}=\underset{a}{\arg\max}\ Q_{t}(a)

which relies entirely on exploitation.

A simple alternative is to behave greedily most of the time, but behave non-greedily some $\epsilon$ fraction of the time, choosing randomly from the non-greedy action. Such methods are known as $\epsilon$ -greedy methods. The primary advantage of such algorithms is that all $Q_{t}(a)$ will converge to $q_{*}(a)$ as $t\to \infty$ , unlike the greedy method in which only the greedy choice converges.

2.3 The 10-Armed Testbed

Generally speaking, in the 10-armed bandit situation, $\epsilon=0.1$ performed better than $\epsilon=0.01$ and $\epsilon=0$ (pure greedy), for some random reward distribution. Generally speaking, the optimal choice of $\epsilon$ varies depending on the variance of the reward distributions.

2.4 Incremental Implementation

Notably, the estimate $Q_{a}(t)$ of the value function is an average over the past selections; recomputing the whole average each time is thus $O(t)$ . We can do better!

Let $R_{i}$ denote the reward corresponding to the $i$ th selection of an action $a$ , and $Q_{n}$ denote the estimate after $n-1$ selections of $a$ . Then,

Q_{n}= \frac{R_{1}+\dots +R_{n-1}}{n-1}

We can then derive the following for $Q_{n+1}$ .

\begin{align*} Q_{n+1} &= \frac{1}{n}\sum_{i=1}^{n} R_{i} \\ &= \frac{1}{n}\left( R_{n}+ \frac{n-1}{n-1} \sum_{i=1}^{n-1} R_{i} \right) \\ &= \frac{1}{n}(R_{n}+(n-1)Q_{n}) \\ &= Q_{n}+\frac{1}{n}[R_{n}-Q_{n}] \end{align*}

Thus, this enables $O(1)$ updates to the estimate.

tip

The update rule formula

\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}[\text{Target}-\text{OldEstimate}]

will appear frequently throughout the book. Here, it's observed in $Q_{n+1}=Q_{n}+ \frac{1}{n}[R_{n}-Q_{n}]$ . Notably, the step-size parameter is frequently denoted as $\alpha$ or $\alpha_{t}(a)$ .

2.5 Tracking a Nonstationary Problem

The aforementioned bandit problem is stationary, in that the reward distributions stay constant. Nonstationary RL problems are much more common, and, as one might guess, should ideally weight more recent information (rewards) more heavily int the estimate.

One popular method is to choose $\alpha \in(0,1]$ to be constant for the update rule

Q_{n+1}=Q_{n}+\alpha[R_{n}-Q_{n}]

which, upon expansion, provides

Q_{n+1}=(1-\alpha)^{n}Q_{1}+\sum_{i=1}^{n} \alpha(1-\alpha)^{n-i}R_{i}

This is known as an exponential recency-weighted average.

Sometimes, $\alpha$ may be varied from step to step. However, it is not necessarily guaranteed that a sequence of $\alpha$ , i.e. $\{ \alpha_{1}(a),\dots,\alpha_{n}(a) \}$ , converges to the true action values. The required conditions to assure convergence are

\sum_{n=1}^{\infty} \alpha_{n}(a)=\infty,\ \sum_{n=1}^{\infty} \alpha^{2}_{n}(a)<\infty

The first condition guarantees the steps are large enough to overcome initial conditions or noise and converge to the right value, while the second guarantees the steps become small enough to converge.

Notably, $\alpha_{n}(a)=\frac{1}{n}$ (sample average) guarantees convergence, while $\alpha_{n}(a)=\alpha$ does not. However, this is actually desirable for a nonstationary environment, since reward distributions can change. Thus, convergent step-size parameters are largely relegated to theoretical work.

2.6 Optimistic Initial Values

One must note that all methods so far are dependent on the initial action-value estimates, $Q_{1}(a)$ , which express a prior over each action that biases its estimates, perhaps permanently.

Clever choice of initial action values can impact results. For instance, setting all initial estimates to a large number encourages exploring all actions at least once—this is known as optimistic initial values, and is useful, at times, for stationary problems.

2.7 Upper-Confidence-Bound Action Selection

We now return to the idea behind $\epsilon$ -greedy methods. Previously, we simply chose randomly from the nongreedy actions; however, it would be better to also consider the potential of each nongreedy action when selecting.

One way of doing this is to select according to

A_{t}=\underset{a}{\arg\max}\ \left[ Q_{t}(a)+c\sqrt{ \frac{\ln t}{N_{t}(a)} } \right]

where $N_{t}(a)$ denotes the number of times that action $a$ has been selected prior to time $t$ and $c>0$ is some constant that controls exploration strength (larger $c$ corresponds to more exploration). This is known as upper confidence bound (UCB) action selection.

The idea is that the square-root term measures the uncertainty of the model's estimate for the value of action $a$ . Thus, actions with less selections (and therefore more uncertainty) may be selected over actions with more estimated value at the current time step. Notably, the $\ln t$ in the numerator also ensures that, when some action is selected, the UCB value of all other actions increases.

Generally, UCB performs well on the $k$ -bandit problem; however, it's difficult to extend to general RL because $(1)$ it doesn't fit nonstationary problems and $(2)$ it is computationally intractable for large state spaces.

2.8 Gradient Bandit Algorithms

An alternative to action-value methods are methods that learn a numerical preference for each of the actions, denoted $H_{t}(a)$ , and then select an action according to the preferences using a softmax. Note that $\pi_{t}(a)$ denotes the probability of taking action $a$ at time $t$ .

Softmax action preferences, meanwhile, lend naturally to stochastic gradient ascent methods. (Ascent, not descent). In essence, on each step we update

\begin{align*} H_{t+1}(A_{t}) &= H_{t}(A_{t})+\alpha(R_{t}-\overline{R}_{t})(1-\pi_{t}(A_{t})) \\ H_{t+1}(a) &= H_{t}(a)-\alpha(R_{t}-\overline{R}_{t})\pi_{t}(a),\ \forall a\neq A_{t} \end{align*}

where $\alpha>0$ is a step-size parameter and $\overline{R}_{t}\in \mathbb{R}$ denotes the average of the rewards up to, but not including, time $t$ . (Note: $\overline{R}_{t}$ may be chosen to be a different metric, but the running average is simple and works well in practice). This essentially changes the probability of choosing $A_{t}$ proportionally to the distance between the received reward and the average reward seen across previous time steps.

Why Stochastic?

The reason this method is known as stochastic gradient ascent and not exact gradient ascent is because we cannot compute the exact gradient without knowledge of $q_{*}(x)$ . However, the expected update of the gradient bandit algorithm is equivalent to the gradient of expected reward, and thus is expected to converge to the same values as the exact gradient.

2.9 Associative Search (Contextual Bandits)

Throughout this chapter, we've discussed the nonassociative $k$ -armed bandits problem. An example of an associative bandits setting would be, at each time step, being randomly presented one of a set of $n$ different $k$ -armed bandit tasks, with some data that may indicate which task is currently present (data which must be learned). This is known as contextual bandits, and is an associative search task because it require trial-and-error search for optimal actions and the association of such actions with their corresponding situations. They are a step above $k$ -armed bandits, but not quite at the level of the full RL problem, in which actions are allowed to affect the next situation/state and reward. This is the topic for the next chapter! :)

Chapter 2: Multi-armed Bandits

2.1 A kkk-armed Bandit Problem

2.2 Action-Value Methods

2.3 The 10-Armed Testbed

2.4 Incremental Implementation

2.5 Tracking a Nonstationary Problem

2.6 Optimistic Initial Values

2.7 Upper-Confidence-Bound Action Selection

2.8 Gradient Bandit Algorithms

2.9 Associative Search (Contextual Bandits)

2.1 A $k$ -armed Bandit Problem