Chapter 2: Multi-armed Bandits
The key difference between reinforcement learning and supervised learning is that RL uses training information to evaluate actions taken rather than instruct (i.e. supervised learning instructs the model on the correct output). In short, "evaluative feedback depends entirely on the action taken, whereas instructive feedback is independent of the action taken.
This chapter discusses a more simplified setting of RL—nonassociative RL, in which the model learns actions corresponding to only one situation. In particular, we explore the -armed bandit problem.
2.1 A -armed Bandit Problem
Consider the following learning problem. You are repeatedly given different actions. Each action provides a numerical reward drawn randomly from a stationary probability distribution corresponding to that action. The goal is to maximize the expected total reward over some time period. This is the original -armed bandit problem!
The -armed bandit problem is named as such because it is much like a slot machine with levers, and a regular slot machine (with one lever) is known colloquially as a "one-armed bandit."
For each action, it has associated with it some expected reward—we denote this as the value of the action. Let denote the action taken at time step and denote the corresponding reward. The value of an arbitrary action is
Of course, we assume the value of each action is unknown (otherwise, its trivial). Therefore, we let denote the estimated value of action at time step .
At any time step , there is always a greedy action such that
When selecting a greedy action, we say the model is exploiting current knowledge of . Otherwise, we say the model is exploring, because selecting nongreedy actions helps improve the estimates of such actions. Exploitation is short term maximization, but exploration has the potential to translate to long term maximization. Of course, these principles are directly in conflict with each other, and a model must balance these properly to maximize reward.
2.2 Action-Value Methods
We now discuss methods for estimating the values of actions and using such estimates to select actions; these are aptly named action-value methods.
One natural way of estimating the value is by keeping a running average, i.e.
Where is an indicator variable for . This is known as the sample-average estimation method.
A similarly natural action selection method is greedy selection,
which relies entirely on exploitation.
A simple alternative is to behave greedily most of the time, but behave non-greedily some fraction of the time, choosing randomly from the non-greedy action. Such methods are known as -greedy methods. The primary advantage of such algorithms is that all will converge to as , unlike the greedy method in which only the greedy choice converges.
2.3 The 10-Armed Testbed
Generally speaking, in the 10-armed bandit situation, performed better than and (pure greedy), for some random reward distribution. Generally speaking, the optimal choice of varies depending on the variance of the reward distributions.
2.4 Incremental Implementation
Notably, the estimate of the value function is an average over the past selections; recomputing the whole average each time is thus . We can do better!
Let denote the reward corresponding to the th selection of an action , and denote the estimate after selections of . Then,
We can then derive the following for .
Thus, this enables updates to the estimate.
The update rule formula
will appear frequently throughout the book. Here, it's observed in . Notably, the step-size parameter is frequently denoted as or .
2.5 Tracking a Nonstationary Problem
The aforementioned bandit problem is stationary, in that the reward distributions stay constant. Nonstationary RL problems are much more common, and, as one might guess, should ideally weight more recent information (rewards) more heavily int the estimate.
One popular method is to choose to be constant for the update rule
which, upon expansion, provides
This is known as an exponential recency-weighted average.
Sometimes, may be varied from step to step. However, it is not necessarily guaranteed that a sequence of , i.e. , converges to the true action values. The required conditions to assure convergence are
The first condition guarantees the steps are large enough to overcome initial conditions or noise and converge to the right value, while the second guarantees the steps become small enough to converge.
Notably, (sample average) guarantees convergence, while does not. However, this is actually desirable for a nonstationary environment, since reward distributions can change. Thus, convergent step-size parameters are largely relegated to theoretical work.
2.6 Optimistic Initial Values
One must note that all methods so far are dependent on the initial action-value estimates, , which express a prior over each action that biases its estimates, perhaps permanently.
Clever choice of initial action values can impact results. For instance, setting all initial estimates to a large number encourages exploring all actions at least once—this is known as optimistic initial values, and is useful, at times, for stationary problems.
2.7 Upper-Confidence-Bound Action Selection
We now return to the idea behind -greedy methods. Previously, we simply chose randomly from the nongreedy actions; however, it would be better to also consider the potential of each nongreedy action when selecting.
One way of doing this is to select according to
where denotes the number of times that action has been selected prior to time and is some constant that controls exploration strength (larger corresponds to more exploration). This is known as upper confidence bound (UCB) action selection.
The idea is that the square-root term measures the uncertainty of the model's estimate for the value of action . Thus, actions with less selections (and therefore more uncertainty) may be selected over actions with more estimated value at the current time step. Notably, the in the numerator also ensures that, when some action is selected, the UCB value of all other actions increases.
Generally, UCB performs well on the -bandit problem; however, it's difficult to extend to general RL because it doesn't fit nonstationary problems and it is computationally intractable for large state spaces.
2.8 Gradient Bandit Algorithms
An alternative to action-value methods are methods that learn a numerical preference for each of the actions, denoted , and then select an action according to the preferences using a softmax. Note that denotes the probability of taking action at time .
Softmax action preferences, meanwhile, lend naturally to stochastic gradient ascent methods. (Ascent, not descent). In essence, on each step we update
where is a step-size parameter and denotes the average of the rewards up to, but not including, time . (Note: may be chosen to be a different metric, but the running average is simple and works well in practice). This essentially changes the probability of choosing proportionally to the distance between the received reward and the average reward seen across previous time steps.
The reason this method is known as stochastic gradient ascent and not exact gradient ascent is because we cannot compute the exact gradient without knowledge of . However, the expected update of the gradient bandit algorithm is equivalent to the gradient of expected reward, and thus is expected to converge to the same values as the exact gradient.
2.9 Associative Search (Contextual Bandits)
Throughout this chapter, we've discussed the nonassociative -armed bandits problem. An example of an associative bandits setting would be, at each time step, being randomly presented one of a set of different -armed bandit tasks, with some data that may indicate which task is currently present (data which must be learned). This is known as contextual bandits, and is an associative search task because it require trial-and-error search for optimal actions and the association of such actions with their corresponding situations. They are a step above -armed bandits, but not quite at the level of the full RL problem, in which actions are allowed to affect the next situation/state and reward. This is the topic for the next chapter! :)