12. Online Algorithms
An online algorithm is one that can process data one-by-one and make more informed decisions based on past data. Essentially, an adaptive learning algorithm.
Experts
Consider a set of experts that each make a prediction about a certain event for time steps. Let be the set of possible predictions. The problem proceeds as follows.
- At time step , each expert makes a prediction; let be the vector of predictions.
- The algorithm makes a prediction , and then the actual outcome is revealed.
The goal is to minimize the number of mistakes, i.e. minimize , where .
Perfect Expert
Let us first consider the situation in which there exists a perfect expert, i.e. an expert that makes zero mistakes.
Claim: There exists an algorithm that mistakes.
Proof.
At each time step , the algorithm consider all experts who have yet to make a mistake, and predict the majority prediction amongst their predictions. Note that, for every mistake made, the set of experts that have yet to make a mistake reduces in size by at least a factor of . Since there exists a perfect expert, this algorithm can make at most mistakes before it the set of experts who have yet to make a mistake consists solely of the perfect expert; at which point no more mistakes will be made. Thus, at most mistakes will be made by this algorithm.
Bounded Mistakes
Now, we generalize the situation a bit. There no longer exists a perfect expert, but there exists an expert that makes at most mistakes.
Claim: There exists an algorithm that makes at most .
We can essentially just extend the previous algorithm. We essentially divide the time duration into epochs: each epoch, we run the algorithm from the perfect expert scenario; once the set of experts is empty, we start a new epoch with all experts.
In each expert, each expert makes at least one mistake. Thus, the number of completed epochs is at most . In each completed epoch, we make at most mistakes, and in the last epoch (which does not complete, since at least one expert will no longer make mistakes), at most (to determine the now-perfect expert). Thus, the algorithm makes at most mistakes.
Weighted Majority Algorithm
But, we can do better—with an algorithm known as the weighted majority algorithm.
First, we describe the structure of the algorithm. The intuition behind the algorithm is to bias our predictions towards experts that appear to perform better. We assign a weight to each expert . We denote the weight of an expert at the beginning of round as . Initially, all weights are . Also, let denote the prediction of expert in round .
- In round , predict the outcome that maximizes the sum of the weights of experts that predicted it. Formally, .
- Then, after the outcome is revealed, we update the weights for the next round. If , . Otherwise, if the expert was wrong, , where is a parameter chosen earlier. We explain the choice of in more detail later.
Claim: For this algorithm, if , the number of mistakes made by the weighted majority algorithm (WM) is at most
This is also commonly expressed as
Proof.
Let represent the aggregate weight of the set of experts at time step . That is, . Note that
- , since initially.
- , since experts' weights are only maintained or decreased.
Also, consider the event where this algorithm makes a mistake in round . Then, the sum of weights of the wrong experts is higher than the sum of the weights of the correct experts. That is,
Considering that
We can derive
Now, consider the sum of the weights of the next round, .
Where the last step follows from the inequality we previously derived.
In other words, the sum of all the weights decreases by at least a factor of if the algorithm makes a mistake. Now consider any expert , such that it has made mistakes after rounds. Let the algorithm have made mistakes. Then,
Rearranging gives
Since we are guaranteed an expert makes at most mistakes, we note that
as desired.
Arbitrary
Now, we generalize our claim and proof to arbitrary* . As in, .
Claim: For the weighted majority algorithm, if , we guarantee .
Proof.
We leave out most of the analytical details for brevity, due to similarities with the previous proof. In short, we derive
Subsequently, we derive
Rearranging gives
Then, using the inequality
gives us
Which simplifies nicely to
Given that one expert will make at most mistakes, this naturally gives us our tightest upper bound.
Approximation Bound
Note that the above weighted majority algorithm can have a mistakes bound as close as possible to as desired (by making smaller). In fact, this is as close as we can get; at least, with a deterministic algorithm.
Claim: No deterministic algorithm can do better than a factor of , compared to the best expert.
Consider a scenario with two experts and . Let the set of choices be . Let always predict and always predict . Given that is deterministic, an adversarial set of outcomes may be prepared such that 's predictions are wrong. For instance, if is the weighted majority algorithm, if we always tiebreak by choosing , a sequence of outcomes will result in never being right. Regardless, for all deterministic algorithms , at least one of and will have an error rate of . (More precisely, their error rates should add to , since predicts always and always predicts ; therefore, at least one has an error rate ). Yet, due to the adversarially chosen outcomes, 's error rate is always . Therefore, will make, at best, twice as many mistakes as the best expert. Thus, it's not possible to design a deterministic algorithm that, for all possible scenarios of experts and outcomes, makes less than twice as many mistakes as the best expert.
Randomized Weighted Majority
Of course, the above proof does not apply to randomized algorithms. So, can we do better?
The randomized weighted majority algorithm (RWM) was designed precisely for this reason. In short, the weights evolve in the exact same way; however, the prediction at each time step is randomly chosen according to the distribution of the weights of the experts. You may imagine this as randomly choosing a single expert's opinion, weighted by the distribution. Succinctly,
Claim: Let . For any fixed sequence of predictions, the expected number of mistakes made by the randomized weighted majority algorithm is at most
The gap between the algorithm's expected performance and that of the best expert is known as the regret of the algorithm. We will see soon that this algorithm has effectively negligible regret.
Proof.
Let be the fraction of total weight assigned to incorrect experts at time , i.e.
Also, note that , i.e. the expected number of mistakes made by the algorithm is the sum of the fractions of weight for the incorrect experts at each time step . This derives itself from linearity of expectation; the expected number of mistakes at time step is simply , the probability of making a mistake at time step . Thus, the expected number of mistakes in total is the sum of all such .
By the weight adjustment procedures, we derive
Once again, consider an expert that has made mistakes after time steps.
Taking logarithms of both sides, we get
And using the approximation and rearranging gives
And, of course, given that one expert makes at most mistakes, we derive that this algorithm is expected to make at most mistakes.
Finally, with careful choice of , we can reach "negligible regret." Let . Then,
And since , this algorithm, with this choice of , has effectively negligible regret for each time step.
Generalized Experts
Finally, we conclude with the generalization of the expert's problem, and a similar "negligible regret" algorithm that achieves .
In the general setting, there exists a set of actions you can take at each step (i.e., like experts). On day , one action must be chosen; let this action be denoted . Afterwards, the cost is revealed (i.e. like means a mistaken prediction, means correct; only, in this generalized problem, the cost can be etc.). The goal is to minimize the cost function. Or, more specifically, the goal is to minimize the regret—the difference between the cost of the most optimal action (i.e., total cost of choosing this action across all days) and the actual cost.
Multiplicative Weights
The Multiplicative Weights algorithm is similar to the Randomized Weighted Majority algorithm. We initialize weights similarly, i.e. . For each time step ,
- Let .
- Select action with probability .
- Costs are then revealed.
- Set .
We won't prove this (mainly because the lecture notes don't have a proof...), but we can show that selecting , like before, results in , or . (In fact, it results in the same upper bound on the number of mistakes made).
Multiplicative Weights: Applications
The generalization of the experts problem, and the corresponding "negligible regret" multiplicative weights algorithm, is powerful and can be applied to many different problems. We briefly discuss a few of these applications.
Minimax Theorem (Zero Sum Games)
Consider a zero sum game with the payoff matrix
Where .
Provided mixed strategies (row) and (column), we calculate
The Minimax Theorem states that
In other words, the score of the game is the same, regardless of whether the column player plays first (LHS) or the row player plays first (RHS).
We briefly introduce some notation. Let be known as the gain-floor—the minimum payoff the row player will receive given the column player's attempt to minimize their payoff. Let be known as the loss-ceiling—the maximum loss the column player will experience given the row player's attempt to maximize their payoff.
Proof.
It's easy to show the direction. In natural language, this direction implies that moving first is, at the very least, not an advantage (i.e. the situation of the column player playing first will not produce a higher score than the situation of the row player playing first). This should make sense, intuitively, but we will formalize this intuition.
Consider that, for any pair of strategies , we have
Hopefully this makes sense, intuitively. The LHS is choosing a vector such that it minimizes the score, based on the value of . This is clearly the score of the pair of strategies , since is chosen to be score-minimizing. Meanwhile, the score of is the RHS, since the RHS is choosing a score-maximizing vector, based on the value of .
From the RHS of the inequality, we note that
This inequality holds for all , since the previous inequality held for all pairs . Therefore, it clearly holds if is the score-maximizing value, i.e.
However, it is comparatively difficult to show the direction. To do so, we must either use the strong duality of LPs, or use the multiplicative weights algorithm.
Intuitively, we can use online learning to allow these players to repeatedly play the game and converge on the optimal solution. Consider the following procedure. For time steps,
- The column player selects picks a strategy using multiplicative weights.
- The row player chooses the "best response" strategy .
- The cost vector for the column player is then revealed.
- The column player's expected loss is then .
Let us define and , i.e. the averages of the chosen strategies over all time steps. We aim to show that and are "approximate minimax strategies." In other words, we aim to show that
Claim: The following inequality is true.
Proof.
(Explanation of certain steps included below.)
- derives from the definition of .
- derives from convexity.
- derives from the fact that the row player, at each time step , chooses the optimal response, and therefore is the score-maximizing value for .
- derives from the multiplicative weights algorithm, via an analysis similar to that of the expected number of mistakes analysis.
- derives from the definition of .
Finally, we can conclude the proof of the minimax theorem, knowing this inequality is true.
- derives from the definition of the function.
- derives directly from the inequality we previously proved.
- derives from the definition of the function.
Finally, we note, as before, that , and therefore
as desired.
Approximate Solutions to LPs
I... got too lazy to take notes on this part. Feel free to take a look at these notes from CMU's 15859 offering from Fall 2011, though, if you're interested.
Sources
- Dartmouth CS31
- CMU 15850
- Princeton COS511
- Berkeley ECON 201B