12. Online Algorithms

An online algorithm processes its input one piece at a time and must commit to a decision at each step, using only the data seen so far to inform it. Essentially, it is an adaptive learning algorithm.

Experts

Consider a set of $N$ experts that each make a prediction about a certain event for $T$ time steps. Let $\mathcal{U}$ be the set of possible predictions. The problem proceeds as follows.

At time step $t$ , each expert makes a prediction; let $\mathcal{E}^{t}\in \mathcal{U}^{N}$ be the vector of predictions.
The algorithm makes a prediction $a^{t}$ , and then the actual outcome $o^{t}$ is revealed.

The goal is to minimize the number of mistakes, i.e. minimize $\lvert M \rvert$ , where $M=\{ t\mid a^{t}\neq o^{t} \}$ .

Perfect Expert

Let us first consider the situation in which there exists a perfect expert, i.e. an expert that makes zero mistakes.

Claim: There exists an algorithm that makes at most $\lceil \log_{2}N \rceil$ mistakes.

Proof.
At each time step $t$ , the algorithm considers all experts who have yet to make a mistake, and predicts the majority prediction amongst their predictions. Note that, for every mistake made, the set of experts that have yet to make a mistake reduces in size by at least a factor of $2$ . Since there exists a perfect expert, this set never becomes empty, so this algorithm can make at most $\lceil \log_{2}N \rceil$ mistakes before the set of experts who have yet to make a mistake consists solely of the perfect expert, at which point no more mistakes will be made. Thus, at most $\lceil \log_{2}N \rceil$ mistakes will be made by this algorithm.

$\square$
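The procedure in the proof above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `run_halving` and the input layout (`expert_predictions[t][i]` is expert `i`'s prediction in round `t`) are assumptions made for this example, and predictions are assumed to be binary so that "majority" is well defined.

```python
from collections import Counter


def run_halving(expert_predictions, outcomes):
    """Count the mistakes of the majority-of-survivors rule.

    expert_predictions[t][i] is expert i's prediction in round t (assumed layout).
    Assumes at least one expert is never wrong, so `alive` never becomes empty.
    """
    n = len(expert_predictions[0])
    alive = set(range(n))  # experts that have not made a mistake yet
    mistakes = 0
    for preds, outcome in zip(expert_predictions, outcomes):
        # Predict the most common prediction among the surviving experts.
        votes = Counter(preds[i] for i in alive)
        guess = votes.most_common(1)[0][0]
        if guess != outcome:
            mistakes += 1
        # Discard every surviving expert that was wrong in this round.
        alive = {i for i in alive if preds[i] == outcome}
    return mistakes
```

With $N=8$ experts, one of which is perfect, the returned count should never exceed $\lceil \log_{2}8 \rceil=3$ , whatever the other experts predict.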

Bounded Mistakes

Now, we generalize the situation a bit. There no longer exists a perfect expert, but there exists an expert that makes at most $m^{*}$ mistakes.

Claim: There exists an algorithm whose number of mistakes $M$ satisfies $M\leq m^{*}(\lceil \log_{2}N \rceil+1)+\lceil \log_{2}N \rceil$ .

We can extend the previous algorithm by dividing the time duration $T$ into epochs: in each epoch, we run the algorithm from the perfect expert scenario; once the set of experts that have yet to make a mistake in the current epoch is empty, we start a new epoch with all $N$ experts.

In each completed epoch, every expert makes at least one mistake. Thus, the number of completed epochs is at most $m^{*}$ . In each completed epoch, we make at most $\lceil \log_{2}N \rceil+1$ mistakes, and in the last epoch (which does not complete, since at least one expert will no longer make mistakes), at most $\lceil \log_{2}N \rceil$ (to determine the now-perfect expert). Thus, the algorithm makes at most $m^{*}(\lceil \log_{2}N \rceil+1)+\lceil \log_{2} N\rceil$ mistakes.

$\square$
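The epoch-based extension differs from the perfect-expert procedure only in the reset step. The sketch below illustrates this under the same assumptions as before (binary predictions, `expert_predictions[t][i]` layout); the name `run_epoch_halving` is hypothetical.

```python
from collections import Counter


def run_epoch_halving(expert_predictions, outcomes):
    """Majority-of-survivors rule, restarted whenever no survivor is left."""
    n = len(expert_predictions[0])
    alive = set(range(n))  # experts with no mistake in the current epoch
    mistakes = 0
    for preds, outcome in zip(expert_predictions, outcomes):
        votes = Counter(preds[i] for i in alive)
        guess = votes.most_common(1)[0][0]
        if guess != outcome:
            mistakes += 1
        alive = {i for i in alive if preds[i] == outcome}
        if not alive:
            # Every expert has now erred at least once in this epoch,
            # so the epoch is complete and a new one starts with all experts.
            alive = set(range(n))
    return mistakes
```

For example, with $N=8$ and a best expert that makes $m^{*}=2$ mistakes, the claim bounds the returned count by $2\cdot(3+1)+3=11$ .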

Weighted Majority Algorithm

However, we can do better with an algorithm known as the weighted majority algorithm.

First, we describe the structure of the algorithm. The intuition behind the algorithm is to bias our predictions towards experts that appear to perform better. We assign a weight $w_{i}$ to each expert $i\in \{ 1,\dots,N \}$ . We denote the weight of an expert $i$ at the beginning of round $t$ as $w_{i}^{(t)}$ . Initially, all weights are $1$ . Also, let $e_{i}^{(t)}$ denote the prediction of expert $i$ in round $t$ .

In round $t$ , predict the outcome that maximizes the sum of the weights of experts that predicted it. Formally, $a^{t}=\mathrm{arg}\ \underset{ u\in \mathcal{U} }{ \mathrm{max} }\underset{ i:\ e_{i}^{(t)}=u }{ \sum } w_{i}^{(t)}$ .
Then, after the outcome $o^{t}$ is revealed, we update the weights for the next round. If $e_{i}^{(t)}=o^{t}$ , $w_{i}^{(t+1)}=w_{i}^{(t)}$ . Otherwise, if the expert was wrong, $w_{i}^{(t+1)}=w_{i}^{(t)}\cdot(1-\epsilon)$ , where $\epsilon$ is a parameter chosen earlier. We explain the choice of $\epsilon$ in more detail later.
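The prediction and update rules above translate almost directly into code. The following is a minimal sketch, assuming the same hypothetical input layout as earlier (`expert_predictions[t][i]` is $e_{i}^{(t)}$ ); the function name `weighted_majority` and the choice to break ties in favor of the first prediction encountered are assumptions of this example.

```python
from collections import defaultdict


def weighted_majority(expert_predictions, outcomes, eps=0.5):
    """Run weighted majority and return (number of mistakes, final weights)."""
    n = len(expert_predictions[0])
    weights = [1.0] * n  # all weights start at 1
    mistakes = 0
    for preds, outcome in zip(expert_predictions, outcomes):
        # Total weight behind each distinct prediction u.
        totals = defaultdict(float)
        for w, p in zip(weights, preds):
            totals[p] += w
        # Predict the outcome backed by the largest total weight.
        guess = max(totals, key=totals.get)
        if guess != outcome:
            mistakes += 1
        # Experts that were wrong are scaled by (1 - eps); the rest keep their weight.
        weights = [w * (1 - eps) if p != outcome else w
                   for w, p in zip(weights, preds)]
    return mistakes, weights
```

An expert that has made $m_{i}$ mistakes ends with weight $(1-\epsilon)^{m_{i}}$ , which is the quantity the analysis below relies on.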

$\epsilon=1/2$

Claim: For this algorithm, if $\epsilon=\frac{1}{2}$ , the number of mistakes made by the weighted majority algorithm (WM) is at most
$2.41(m^{*}+\log_{2}N)$
This is also commonly expressed as
$2.41m^{*}+O(\log_{2}N)$

Proof.
Let $Z^{t}$ represent the aggregate weight of the set of experts at time step $t$ . That is, $Z^{t}=\underset{ i }{ \sum }w_{i}^{(t)}$ . Note that

$Z^{1}=N$ , since $w_{i}^{(1)}=1,\forall i$ initially.
$Z^{t+1}\leq Z^{t},\forall t$ , since experts' weights are only maintained or decreased.

Also, consider the event where this algorithm makes a mistake in round $t$ . Then, the sum of the weights of the wrong experts is at least the sum of the weights of the correct experts. That is,

\underset{ i:\ e_{i}^{(t)}\neq o^{t} }{ \sum } w_{i}^{(t)}\geq \underset{ i:\ e_{i}^{(t)}= o^{t} }{ \sum } w_{i}^{(t)}

Considering that

Z^{t}= \underset{ i:\ e_{i}^{(t)}\neq o^{t} }{ \sum } w_{i}^{(t)}+ \underset{ i:\ e_{i}^{(t)}= o^{t} }{ \sum } w_{i}^{(t)}

We can derive

Z^{t}\leq 2\underset{ i:\ e_{i}^{(t)}\neq o^{t} }{ \sum } w_{i}^{(t)}

Now, consider the sum of the weights of the next round, $Z^{t+1}$ .

\begin{align*} Z^{t+1} &= \underset{ i:\ e_{i}^{(t)}\neq o^{t} }{ \sum } w_{i}^{(t+1)}+ \underset{ i:\ e_{i}^{(t)}= o^{t} }{ \sum } w_{i}^{(t+1)} \\ &= \frac{1}{2} \underset{ i:\ e_{i}^{(t)}\neq o^{t} }{ \sum } w_{i}^{(t)}+ \underset{ i:\ e_{i}^{(t)}= o^{t} }{ \sum } w_{i}^{(t)} \\ &= Z^{t}-\frac{1}{2}\underset{ i:\ e_{i}^{(t)}\neq o^{t} }{ \sum } w_{i}^{(t)} \\ & \leq \frac{3}{4}Z^{t} \end{align*}

Where the last step follows from the inequality we previously derived.

In other words, the sum of all the weights decreases by at least $25\%$ whenever the algorithm makes a mistake. Now consider any expert $i$ , such that it has made $m_{i}$ mistakes after $t$ rounds. Let the algorithm have made $M$ mistakes in those rounds. Then,

\left( \frac{1}{2} \right)^{m_{i}}=w_{i}^{(t+1)}\leq Z^{t+1}\leq Z^{1}\left( \frac{3}{4} \right)^{M}=N\left( \frac{3}{4} \right)^{M}

Rearranging gives

M\leq \frac{m_{i}+\log_{2}N}{\log_{2}\left( \frac{4}{3} \right)}\leq 2.41(m_{i}+\log_{2}N)

Since we are guaranteed an expert makes at most $m^{*}$ mistakes, we note that

M\leq 2.41(m^{*}+\log_{2}N)

as desired.

$\square$
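For completeness, the rearrangement step in the proof above can be spelled out by taking base-2 logarithms of the two ends of the chain of inequalities. The following worked version uses only quantities already defined in the proof; the numerical value of the constant is rounded.

```latex
\begin{align*}
\left( \frac{1}{2} \right)^{m_{i}} \leq N\left( \frac{3}{4} \right)^{M}
&\iff -m_{i} \leq \log_{2}N - M\log_{2}\left( \frac{4}{3} \right) \\
&\iff M\log_{2}\left( \frac{4}{3} \right) \leq m_{i}+\log_{2}N \\
&\iff M \leq \frac{m_{i}+\log_{2}N}{\log_{2}(4/3)},
\qquad \frac{1}{\log_{2}(4/3)}\approx 2.409\leq 2.41
\end{align*}
```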

Arbitrary $\epsilon$

Now, we generalize our claim and proof to (almost) arbitrary $\epsilon$ ; specifically, $\epsilon \in(0,\ 1/2)$ .

Claim: For the weighted majority algorithm, if $\epsilon \in(0,\ 1/2)$ , we guarantee $M\leq2(1+\epsilon)m^{*}+O\left( \frac{\log N}{\epsilon} \right)$ .

Proof.
We leave out most of the analytical details for brevity, due to similarities with the previous proof. In short, for every round $t$ in which the algorithm makes a mistake, we derive

Z^{t+1}\leq \left( 1-\frac{\epsilon}{2} \right)Z^{t}

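Subsequently, for an expert $i$ that has made $m_{i}$ mistakes after $t$ rounds, during which the algorithm has made $M$ mistakes, we derive

(1-\epsilon)^{m_{i}}=w_{i}^{(t+1)}\leq Z^{t+1}\leq Z^{1}\left( 1-\frac{\epsilon}{2} \right)^{M}=N\left( 1-\frac{\epsilon}{2} \right)^{M}\leq N\exp(-\epsilon M/2)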

Rearranging gives

M\leq \frac{-m_{i}\ln(1-\epsilon)+\ln N}{\epsilon/2}

Then, using the inequality

-\ln(1-\epsilon)=\epsilon+\frac{\epsilon^{2}}{2}+\frac{\epsilon^{3}}{3}+\dots \leq \epsilon+\epsilon^{2},\ \epsilon \in\left[ 0,\frac{1}{2} \right]

gives us

M\leq 2 \frac{m_{i}(\epsilon+\epsilon^{2})}{\epsilon} + O\left( \frac{\log N}{\epsilon} \right)

Which simplifies nicely to

M\leq 2(1+\epsilon)m_{i}+O\left( \frac{\log N}{\epsilon} \right)

Applying this bound to the best expert, which makes at most $m^{*}$ mistakes, gives the claimed bound.

$\square$
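The step omitted at the start of the proof above can be reconstructed by repeating the $\epsilon=1/2$ argument with a general $\epsilon$ . The following is a sketch of that derivation for a round $t$ in which the algorithm makes a mistake; $W_{\text{wrong}}^{t}$ is shorthand introduced only for this sketch (it does not appear elsewhere in these notes) for the total weight of the experts that erred in round $t$ .

```latex
\begin{align*}
W_{\text{wrong}}^{t} &= \sum_{i:\ e_{i}^{(t)}\neq o^{t}} w_{i}^{(t)} \geq \frac{Z^{t}}{2}
&& \text{(the algorithm erred in round } t\text{)} \\
Z^{t+1} &= (1-\epsilon)\,W_{\text{wrong}}^{t} + \left( Z^{t}-W_{\text{wrong}}^{t} \right) \\
&= Z^{t}-\epsilon\, W_{\text{wrong}}^{t} \\
&\leq \left( 1-\frac{\epsilon}{2} \right)Z^{t}
\end{align*}
```

Setting $\epsilon=\frac{1}{2}$ recovers the factor of $\frac{3}{4}$ from the previous proof.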

Approximation Bound

Note that the leading term of the mistake bound of the above weighted majority algorithm can be brought as close to $2m^{*}$ as desired (by making $\epsilon$ smaller, at the cost of a larger additive $O\left( \frac{\log N}{\epsilon} \right)$ term). In fact, this factor of $2$ is as close as we can get; at least, with a deterministic algorithm.

Claim: No deterministic algorithm $\mathcal{A}$ can do better than a factor of $2$ , compared to the best expert.

Consider a scenario with two experts $E_{1}$ and $E_{2}$ . Let the set of choices be $\mathcal{U}=\{ 0,1 \}$ . Let $E_{1}$ always predict $0$ and $E_{2}$ always predict $1$ . Given that $\mathcal{A}$ is deterministic, an adversary can simulate it and choose each outcome to be the opposite of $\mathcal{A}$ 's prediction, so that $\mathcal{A}$ is wrong in every round. For instance, if $\mathcal{A}$ is the weighted majority algorithm and ties are broken in favor of $E_{1}$ , the sequence of outcomes $1,0,1,0,1,0,\dots$ results in $\mathcal{A}$ never being right. Meanwhile, whatever the outcomes are, the error rates of $E_{1}$ and $E_{2}$ add to $1$ (exactly one of them is correct in each round, since $E_{1}$ always predicts $0$ and $E_{2}$ always predicts $1$ ); therefore, at least one of them has an error rate $\leq \frac{1}{2}$ . Yet, due to the adversarially chosen outcomes, $\mathcal{A}$ 's error rate is $1$ . Therefore, $\mathcal{A}$ makes at least twice as many mistakes as the best expert. Thus, it is not possible to design a deterministic algorithm that, for all possible scenarios of experts and outcomes, makes fewer than twice as many mistakes as the best expert.

$\square$
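The adversarial construction can be simulated directly. In the sketch below, a deterministic algorithm is modeled as a function from the history of past outcomes to a prediction in $\{0,1\}$ ; this interface and the names `adversarial_run` and `follow_the_majority` are assumptions made for illustration only.

```python
def adversarial_run(algorithm, T):
    """Play T rounds in which each outcome is the opposite of the algorithm's guess.

    `algorithm` maps the list of past outcomes to a prediction in {0, 1}
    (a hypothetical interface for a deterministic predictor).
    Returns (algorithm mistakes, mistakes of the better of the two experts).
    """
    history = []
    algorithm_mistakes = 0
    e1_mistakes = 0  # E1 always predicts 0
    e2_mistakes = 0  # E2 always predicts 1
    for _ in range(T):
        guess = algorithm(history)
        outcome = 1 - guess  # the adversary contradicts the deterministic guess
        algorithm_mistakes += 1
        if outcome == 1:
            e1_mistakes += 1
        else:
            e2_mistakes += 1
        history.append(outcome)
    return algorithm_mistakes, min(e1_mistakes, e2_mistakes)


def follow_the_majority(history):
    """Example deterministic predictor: guess the more frequent past outcome."""
    return 1 if 2 * sum(history) > len(history) else 0
```

Calling `adversarial_run(follow_the_majority, 100)` yields an algorithm that errs in all 100 rounds, while the better expert errs in at most 50 of them, since the two experts' mistake counts always sum to $T$ .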

Randomized Weighted Majority

Of course, the above proof does not apply to randomized algorithms. So, can we do better?

The randomized weighted majority algorithm (RWM) was designed precisely for this reason. In short, the weights evolve in the exact same way; however, the prediction at each time step is randomly chosen according to the distribution of the weights of the experts. You may imagine this as randomly choosing a single expert's opinion, weighted by the distribution. Succinctly,

\mathrm{P}[a^{t}=u]= \frac{{ \sum }_{i:\ e_{i}^{(t)}=u}\ w_{i}^{(t)}}{Z^{t}}
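Sampling a single expert in proportion to its weight and repeating its prediction realizes exactly this distribution. The sketch below is one way to write that down; the function name, the `seed` parameter, and the decision to also return the running sum of the per-round mistake probabilities are assumptions of this example, not part of the original description.

```python
import random


def randomized_weighted_majority(expert_predictions, outcomes, eps=0.5, seed=0):
    """Return (realized mistakes, sum over rounds of the mistake probability)."""
    rng = random.Random(seed)
    n = len(expert_predictions[0])
    weights = [1.0] * n
    mistakes = 0
    expected_mistakes = 0.0
    for preds, outcome in zip(expert_predictions, outcomes):
        # Follow one expert drawn with probability proportional to its weight.
        i = rng.choices(range(n), weights=weights, k=1)[0]
        if preds[i] != outcome:
            mistakes += 1
        # Probability of a mistake this round: weight fraction on wrong experts.
        total = sum(weights)
        wrong = sum(w for w, p in zip(weights, preds) if p != outcome)
        expected_mistakes += wrong / total
        # Same multiplicative update as the deterministic algorithm.
        weights = [w * (1 - eps) if p != outcome else w
                   for w, p in zip(weights, preds)]
    return mistakes, expected_mistakes
```

The second return value is the expected number of mistakes for the given sequence, so it can be compared against a bound on $\mathrm{E}[M]$ without averaging over many random runs.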

Claim: Let $\epsilon \in\left( 0,\ \frac{1}{2} \right]$ . For any fixed sequence of expert predictions and outcomes, the expected number of mistakes made by the randomized weighted majority algorithm is at most
$\mathrm{E}[M]\leq (1+\epsilon)m^{*}+O\left( \frac{\log N}{\epsilon} \right)$

Regret

The gap $\epsilon m^{*}+O\left( \frac{\log N}{\epsilon} \right)$ between the algorithm's expected performance and that of the best expert is known as the regret of the algorithm. We will see soon that this algorithm has effectively negligible regret.

Proof.
Let $F^{t}$ be the fraction of total weight assigned to incorrect experts at time $t$ , i.e.

F^{t}= \frac{{ \sum }_{i:\ e_{i}^{(t)}\neq o^{t}}\ w_{i}^{(t)}}{Z^{t}}

Also, note that $\mathrm{E}[M]=\underset{ t\in[T] }{ \sum } F^{t}$ , i.e. the expected number of mistakes made by the algorithm is the sum of the fractions of weight for the incorrect experts at each time step $t$ . This follows from linearity of expectation; the expected number of mistakes at time step $t$ is simply $F^{t}$ , the probability of making a mistake at time step $t$ . Thus, the expected number of mistakes in total is the sum of all such $F^{t}$ .

By the weight adjustment procedures, we derive

Z^{t+1}=Z^{t}((1-F^{t})\cdot1+F^{t}\cdot(1-\epsilon))=Z^{t}(1-\epsilon F^{t})

Once again, consider an expert $i$ that has made $m_{i}$ mistakes after all $T$ time steps. Its final weight is a lower bound on the final total weight, so

(1-\epsilon)^{m_{i}}=w_{i}^{(T+1)}\leq Z^{T+1}=Z^{1}\prod_{t=1}^{T} (1-\epsilon F^{t})\leq Ne^{ -\epsilon \sum_{t=1}^{T} F^{t} }=Ne^{ -\epsilon \mathrm{E}[M] }

Taking logarithms of both sides, we get

m_{i}\ln(1-\epsilon)\leq \ln N-\epsilon E[M]

And using the approximation $-\ln(1-\epsilon)\leq\epsilon+\epsilon^{2}$ and rearranging gives

E[M]\leq m_{i}(1+\epsilon)+\frac{\ln N}{\epsilon}

And, of course, given that one expert makes at most $m^{*}$ mistakes, we derive that this algorithm is expected to make at most $m^{*}(1+\epsilon)+\frac{\ln N}{\epsilon}$ mistakes.

$\square$

Finally, with a careful choice of $\epsilon$ , we can reach "negligible regret." Let $\epsilon= \sqrt{ \frac{\log N}{T} }$ . Then,

\begin{align*} E[M] &\leq (1+\epsilon)m^{*}+ \frac{\log N}{\epsilon} \\ &\leq m^{*}+\epsilon m^{*}+\sqrt{ T\log N } \\ &\leq m^{*} + \epsilon T + \sqrt{ T\log N } \\ &\leq m^{*}+2\sqrt{ T\log N } \\ \frac{E[M]}{T} &\leq \frac{m^{*}}{T} + 2\sqrt{ \frac{\log N}{T} } \\ \end{align*}

And since $\lim_{ T \to \infty }2\sqrt{ \frac{\log N}{T} }=0$ , this algorithm, with this choice of $\epsilon=\sqrt{ \frac{\log N}{T} }$ , has an average regret per time step that becomes negligible as $T$ grows.
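To get a feel for how quickly the per-step regret bound shrinks, the quantities above can be evaluated numerically. This is only an illustrative sketch: the function names are hypothetical, and the natural logarithm is assumed for $\log N$ (the choice of base only changes constants).

```python
import math


def tuned_epsilon(n_experts, horizon):
    """epsilon = sqrt(log N / T), using the natural logarithm (an assumption)."""
    return math.sqrt(math.log(n_experts) / horizon)


def per_step_regret_bound(n_experts, horizon):
    """The bound 2 * sqrt(log N / T) on (E[M] - m*) / T derived above."""
    return 2 * math.sqrt(math.log(n_experts) / horizon)


# Print epsilon and the per-step bound for N = 16 experts and growing horizons.
for horizon in (100, 10_000, 1_000_000):
    print(horizon, tuned_epsilon(16, horizon), per_step_regret_bound(16, horizon))
```

Each hundredfold increase in $T$ shrinks both $\epsilon$ and the per-step bound by a factor of ten, as the $1/\sqrt{T}$ dependence predicts.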

Generalized Experts

Finally, we conclude with the generalization of the experts problem, and a similar "negligible regret" algorithm that achieves $E[\text{Total Regret}]\leq O(\sqrt{ T\log N })$ .

In the general setting, there exists a set $\mathcal{A}$ of $N$ actions you can take at each step (analogous to the $N$ experts). On day $t$ , one action $i\in \mathcal{A}$ must be chosen; let this action be denoted $i^{t}$ . Afterwards, the cost $c_{i}^{(t)}\in[0,1]$ of every action $i$ is revealed (in the experts problem, a cost of $1$ corresponds to a mistaken prediction and $0$ to a correct one; in this generalized problem, the cost can also be $0.5$ , $0.123$ , etc.). The goal is to minimize the total cost incurred. More specifically, the goal is to minimize the regret: the difference between the actual total cost and the total cost of the best single action in hindsight (i.e., the cost of choosing that one action on all days).

\sum_{t=1}^{T} c_{i^{t}}^{(t)} = \underset{ a\in \mathcal{A} }{ \mathrm{min} }\left\{ \sum_{t=1}^{T} c_{a}^{(t)} \right\} +\text{Regret}

Multiplicative Weights

The Multiplicative Weights algorithm is similar to the Randomized Weighted Majority algorithm. We initialize weights similarly, i.e. $w_{i}^{(1)}=1$ . For each time step $t$ ,

Let $Z^{t}=\underset{ i }{ \sum } w_{i}^{(t)}$ .
Select action $i\in \mathcal{A}$ with probability $w_{i}^{(t)}/Z^{t}$ .
Costs $\{ c_{1}^{(t)},\dots,c_{N}^{(t)} \}$ are then revealed.
Set $w_{i}^{(t+1)}=w_{i}^{(t)}\cdot(1-\epsilon c_{i}^{(t)})$ .
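The four steps above can be sketched as follows. The function name `multiplicative_weights`, the `seed` parameter, and the layout `cost_rounds[t][i]` $=c_{i}^{(t)}$ are assumptions made for this example; the expected cost is tracked alongside the sampled cost so that the regret guarantee can be checked without repeated runs.

```python
import random


def multiplicative_weights(cost_rounds, eps, seed=0):
    """Return (sampled total cost, expected total cost, final weights).

    cost_rounds[t][i] is the cost of action i in round t, assumed to lie in [0, 1].
    """
    rng = random.Random(seed)
    n = len(cost_rounds[0])
    weights = [1.0] * n
    sampled_cost = 0.0
    expected_cost = 0.0
    for costs in cost_rounds:
        z = sum(weights)
        probs = [w / z for w in weights]
        # Choose action i with probability w_i / Z.
        i = rng.choices(range(n), weights=probs, k=1)[0]
        sampled_cost += costs[i]
        expected_cost += sum(p * c for p, c in zip(probs, costs))
        # Costs are revealed; each weight shrinks by a factor (1 - eps * cost).
        weights = [w * (1 - eps * c) for w, c in zip(weights, costs)]
    return sampled_cost, expected_cost, weights
```

With costs restricted to $\{0,1\}$ , the update reduces to the randomized weighted majority rule: a cost of $1$ multiplies the weight by $(1-\epsilon)$ and a cost of $0$ leaves it unchanged.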

We won't prove this (mainly because the lecture notes don't have a proof...), but it can be shown that selecting $\epsilon=\sqrt{ \frac{\log N}{T} }$ , like before, results in $E[\text{Regret}]\leq O(\sqrt{ T\log N })$ , or $E[\text{Regret}]/T \leq O(\sqrt{ (\log N)/T })$ . (In fact, it results in the same upper bound as the one on the number of mistakes made.)

Multiplicative Weights: Applications

The generalization of the experts problem, and the corresponding "negligible regret" multiplicative weights algorithm, is powerful and can be applied to many different problems. We briefly discuss a few of these applications.

Minimax Theorem (Zero Sum Games)

Consider a zero sum game with the payoff matrix

M=\begin{bmatrix} M_{11} & \dots & M_{1n} \\ \vdots & \ddots & \vdots \\ M_{n1} & \dots & M_{nn} \end{bmatrix}

Where $-1\leq M_{ij}\leq1$ .

Provided mixed strategies $p=\begin{bmatrix}p_{1}\\\vdots \\p_{n}\end{bmatrix}$ (row) and $q=\begin{bmatrix}q_{1}\\\vdots\\q_{n}\end{bmatrix}$ (column), we calculate

\text{Score}(p,q)=\sum_{i,j}p_{i}M_{ij}q_{j}=p^{T}Mq

The Minimax Theorem states that

\underset{ q }{ \mathrm{min} }\ \underset{p}{\mathrm{max}}\ p^{T}Mq = \underset{p}{\mathrm{max}}\ \underset{q}{\mathrm{min}}\ p^{T}Mq

In other words, the score of the game is the same, regardless of whether the column player plays first (LHS) or the row player plays first (RHS).

We briefly introduce some notation. Let $v_{1}'=\underset{p}{\mathrm{max}}\ \underset{q}{\mathrm{min}}\ p^{T}Mq$ be known as the gain-floor—the minimum payoff the row player will receive given the column player's attempt to minimize their payoff. Let $v_{2}'=\underset{q}{\mathrm{min}}\ \underset{p}{\mathrm{max}}\ p^{T}Mq$ be known as the loss-ceiling—the maximum loss the column player will experience given the row player's attempt to maximize their payoff.

Proof.
It's easy to show the $\geq$ direction. In natural language, this direction says that moving first is, at the very least, not an advantage (i.e. the situation of the column player, who wants a low score, playing first will not produce a lower score than the situation of the row player playing first). This should make sense, intuitively, but we will formalize this intuition.

Consider that, for any pair of strategies $(p',q')$ , we have

\underset{q}{\mathrm{min}}\ (p')^{T}Mq \leq (p')^{T}Mq' \leq \underset{p}{\mathrm{max}}\ p^{T}Mq'

Hopefully this makes sense, intuitively. The LHS is choosing a $q$ vector such that it minimizes the score, based on the value of $p'$ . This is clearly $\leq$ the score of the pair of strategies $(p',q')$ , since $q$ is chosen to be score-minimizing. Meanwhile, the score of $(p',q')$ is $\leq$ the RHS, since the RHS is choosing a score-maximizing $p$ vector, based on the value of $q'$ .

The outer inequality holds for every $q'$ , so it still holds when $q'$ is chosen to minimize the right-hand side:

\begin{align*} \underset{q}{\mathrm{min}}\ (p')^{T}Mq &\leq \underset{p}{\mathrm{max}}\ p^{T}Mq' && \text{for all } q' \\ \underset{q}{\mathrm{min}}\ (p')^{T}Mq &\leq \underset{q}{\mathrm{min}}\ \underset{p}{\mathrm{max}}\ p^{T}Mq \\ \end{align*}

This inequality holds for all $p'$ , since the previous inequality held for all pairs $(p',q')$ . Therefore, it clearly holds when $p'$ is the strategy that maximizes the left-hand side, i.e.

\underset{p}{\mathrm{max}}\ \underset{q}{\mathrm{min}}\ p^{T}Mq \leq \underset{q}{\mathrm{min}}\ \underset{p}{\mathrm{max}}\ p^{T}Mq

However, it is comparatively difficult to show the $\leq$ direction. To do so, we must either use the strong duality of LPs, or use the multiplicative weights algorithm.

Intuitively, we can use online learning to allow these players to repeatedly play the game and converge on the optimal solution. Consider the following procedure. For $t=1,\dots,T$ time steps,

The column player picks a strategy $q^{(t)}$ using multiplicative weights.
The row player chooses the "best response" strategy $p^{(t)}=\underset{ p }{ \mathrm{arg}\ \mathrm{max} }\ \{ p^{T}Mq^{(t)} \}$ .
The cost vector $c^{(t)}=(p^{(t)})^{T}M$ for the column player is then revealed.
The column player's expected loss is then $\ell^{(t)}=(p^{(t)})^{T}Mq^{(t)}$ .
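The procedure above can be simulated for a small game. The sketch below makes several assumptions that are not in the original notes: the function name `solve_zero_sum`, the use of the column player's full distribution $q^{(t)}$ instead of a sampled column, a pure-strategy best response for the row player, and an affine rescaling of the costs from $[-1,1]$ to $[0,1]$ so that the multiplicative weights update from the previous section applies unchanged.

```python
import math


def solve_zero_sum(M, T):
    """Approximate minimax strategies for payoff matrix M (entries in [-1, 1]).

    Returns (p_bar, q_bar, lower, upper), where lower = min_q p_bar^T M q
    and upper = max_p p^T M q_bar.
    """
    n_rows, n_cols = len(M), len(M[0])
    eps = min(0.5, math.sqrt(math.log(n_cols) / T))
    weights = [1.0] * n_cols
    p_bar = [0.0] * n_rows
    q_bar = [0.0] * n_cols
    for _ in range(T):
        z = sum(weights)
        q = [w / z for w in weights]  # column player's mixed strategy
        # Row player's best response; some pure strategy is always optimal.
        payoffs = [sum(M[i][j] * q[j] for j in range(n_cols)) for i in range(n_rows)]
        best_row = max(range(n_rows), key=payoffs.__getitem__)
        # Cost of each column against that row, rescaled from [-1, 1] to [0, 1].
        costs = [(M[best_row][j] + 1) / 2 for j in range(n_cols)]
        weights = [w * (1 - eps * c) for w, c in zip(weights, costs)]
        # Accumulate the running averages of both players' strategies.
        p_bar[best_row] += 1 / T
        for j in range(n_cols):
            q_bar[j] += q[j] / T
    upper = max(sum(M[i][j] * q_bar[j] for j in range(n_cols)) for i in range(n_rows))
    lower = min(sum(p_bar[i] * M[i][j] for i in range(n_rows)) for j in range(n_cols))
    return p_bar, q_bar, lower, upper


# Rock-paper-scissors from the row player's perspective (value of the game is 0).
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
p_bar, q_bar, lower, upper = solve_zero_sum(RPS, 2000)
print(lower, upper)
```

The two printed numbers bracket the value of the game, and the gap between them is what the regret term in the analysis controls.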

Let us define $\overline{p}=\frac{1}{T}\sum_{t=1}^{T}p^{(t)}$ and $\overline{q}=\frac{1}{T}\sum_{t=1}^{T}q^{(t)}$ , i.e. the averages of the chosen strategies over all time steps. We aim to show that $\overline{p}$ and $\overline{q}$ are "approximate minimax strategies." In other words, we aim to show that

\underset{p}{\mathrm{max}}\ p^{T}M\overline{q}\leq \underset{q}{\mathrm{min}}\ \overline{p}^{T}Mq + \text{regret (small)}

Claim: The following inequality is true, where $N=n$ is the number of pure strategies available to the column player.

\underset{p}{\mathrm{max}}\ p^{T}M\overline{q} \leq \underset{q}{\mathrm{min}}\ \overline{p}^{T}Mq+O(\sqrt{ (\log N)/T })

Proof.
(Explanation of certain steps included below.)

\begin{align*} \underset{p}{\mathrm{max}}\ p^{T}M\overline{q} &= \underset{p}{\mathrm{max}}\ \frac{1}{T}\sum_{t=1}^{T} p^{T}Mq^{(t)} && (1) \\ &\leq \frac{1}{T}\sum_{t=1}^{T} \underset{p}{\mathrm{max}}\ p^{T}Mq^{(t)} && (2) \\ &= \frac{1}{T}\sum_{t=1}^{T} (p^{(t)})^{T}Mq^{(t)} && (3) \\ &\leq \underset{q}{\mathrm{min}}\ \frac{1}{T}\sum_{t=1}^{T} (p^{(t)})^{T}Mq +O\left( \sqrt{ \frac{\log N}{T} } \right) && (4) \\ &= \underset{q}{\mathrm{min}}\ \overline{p}^{T}Mq+O\left( \sqrt{ \frac{\log N}{T} } \right) && (5) \\ \end{align*}

$(1)$ derives from the definition of $\overline{q}$ .
$(2)$ derives from convexity: the maximum of an average is at most the average of the maxima.
$(3)$ derives from the fact that the row player, at each time step $t$ , chooses the optimal response, and therefore $p^{(t)}$ is the score-maximizing value for $p$ .
$(4)$ derives from the multiplicative weights algorithm, via an analysis similar to that of the expected number of mistakes analysis.
$(5)$ derives from the definition of $\overline{p}$ .

Finally, we can conclude the proof of the minimax theorem, knowing this inequality is true.

\begin{align*} \underset{q}{\mathrm{min}}\ \underset{p}{\mathrm{max}}\ p^{T}Mq &\leq \underset{p}{\mathrm{max}}\ p^{T}M\overline{q} && (1) \\ &\leq \underset{q}{\mathrm{min}}\ \overline{p}^{T}Mq + O(\sqrt{ (\log N)/T }) && (2) \\ &\leq \underset{p}{\mathrm{max}}\ \underset{q}{\mathrm{min}}\ p^{T}Mq + O(\sqrt{ (\log N)/T }) && (3) \end{align*}

$(1)$ derives from the definition of the $\mathrm{min}$ function.
$(2)$ derives directly from the inequality we previously proved.
$(3)$ derives from the definition of the $\mathrm{max}$ function.

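Finally, we note that neither $\underset{q}{\mathrm{min}}\ \underset{p}{\mathrm{max}}\ p^{T}Mq$ nor $\underset{p}{\mathrm{max}}\ \underset{q}{\mathrm{min}}\ p^{T}Mq$ depends on $T$ , and, as before, that $\lim_{ T \to \infty }\sqrt{ (\log N)/T }=0$ , and therefore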

\underset{q}{\mathrm{min}}\ \underset{p}{\mathrm{max}}\ p^{T}Mq \leq \underset{p}{\mathrm{max}}\ \underset{q}{\mathrm{min}}\ p^{T}Mq

as desired.

$\square$

Approximate Solutions to LPs

I... got too lazy to take notes on this part. Feel free to take a look at these notes from CMU's 15859 offering from Fall 2011, though, if you're interested.

Sources

Dartmouth CS31
CMU 15850
Princeton COS511
Berkeley ECON 201B
