Lecture 7: Value-Based RL
Actor-Critic without the Actor
Recall the off-policy actor-critic algorithm from the previous lecture. Now, imagine we have a small, discrete action space. Instead of training a distinct actor model and critic model, we can keep only the critic and define the actor's policy implicitly as the greedy policy with respect to the critic, i.e.
$$\pi_\theta(a \mid s) = \begin{cases} 1, & a = \arg\max_{a} \hat{Q}^\pi_\phi(s, a) \\ 0, & \text{otherwise} \end{cases}$$
This simplifies our algorithm to the following steps. Note that we'll rename our parameters to Qθ instead of Qϕ to follow standard terminology.
- Get (si,ai,si′) by taking one step, store in replay buffer
- Evaluate $y_i = r(s_i, a_i) + \gamma \mathbb{E}_{a' \sim \pi_\theta(a' \mid s_i')}[\hat{Q}_\theta(s_i', a')] = r(s_i, a_i) + \gamma \max_{a'} \hat{Q}_\theta(s_i', a')$.
- Update $\theta$ using $\nabla_\theta \sum_{i=1}^{B} \lVert \hat{Q}_\theta(s_i, a_i) - y_i \rVert^2$.
This off-policy, critic-only algorithm is known as Q-learning.
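As a small illustration of step 2, the sketch below checks numerically that, for the greedy policy, the expectation over a′ collapses to a max. The Q-table `q`, the batch layout `(s, a, r, s_next)`, and the function names are hypothetical choices made for this example, not something the lecture prescribes.

```python
def greedy_policy(q_row):
    """Return pi(a | s) for the implicit greedy policy: 1 on the argmax, 0 elsewhere."""
    best = max(range(len(q_row)), key=lambda a: q_row[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_row))]


def targets(batch, q, gamma):
    """y_i = r_i + gamma * E_{a' ~ pi}[Q(s'_i, a')], written as an explicit expectation."""
    ys = []
    for (_s, _a, r, s_next) in batch:
        pi = greedy_policy(q[s_next])
        expected_q = sum(p * v for p, v in zip(pi, q[s_next]))
        ys.append(r + gamma * expected_q)
    return ys


# Hypothetical Q-table (state -> list of action values) and batch of transitions.
q = {0: [0.0, 1.0], 1: [2.0, 0.5]}
batch = [(0, 1, 1.0, 1), (1, 0, 0.0, 0)]

ys = targets(batch, q, gamma=0.9)
# The same targets computed with the max shortcut used by Q-learning.
ys_max = [r + 0.9 * max(q[s_next]) for (_s, _a, r, s_next) in batch]
```

Both lists agree because the greedy policy puts all of its probability mass on the maximizing action.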
Policy Iteration and Dynamic Programming
We'll now follow a different line of reasoning that leads to the same conclusion, i.e. Q-learning.
Policy iteration, at a high level, is a two-step algorithm.
- Policy Evaluation: Fit Qπ or Vπ.
- Policy Improvement: Improve the policy π←π′.
Actor-critic and Q-learning are both algorithms within this family.
Value Function DP
First, we assume knowledge of p(s′∣s,a) and that the state and action space are small and discrete. For instance, consider a small grid of states, with an action space of moving up, down, left, or right. We can store the entirety of Vπ(s) in a table, and perform bootstrapped policy evaluation
Vπ(s)←Ea∼π(a∣s)[r(s,a)+γEs′∼p(s′∣s,a)[Vπ(s′)]]
where we calculate over the entire space, i.e. all actions a and next states s′, which provides an exact calculation, not an estimate (no sampling), of the updated value function. Afterwards, we may update our policy, i.e.
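$$\pi'(a \mid s) = \begin{cases} 1, & a = \arg\max_{a} Q^\pi(s, a) \\ 0, & \text{otherwise} \end{cases} \qquad \text{with } Q^\pi(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim p(s' \mid s, a)}[V^\pi(s')]$$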
to produce a deterministic policy. Notably, the deterministic policy actually allows us to simplify the policy evaluation equation to
$$V^\pi(s) \leftarrow r(s, \pi(s)) + \gamma \mathbb{E}_{s' \sim p(s' \mid s, \pi(s))}[V^\pi(s')]$$
Since there is no sampling, and thus no need to run the policy in an environment, this is simply dynamic programming, and not really any reinforcement learning! This is also known as tabular policy learning.
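The two steps can be sketched on a toy problem. Everything concrete here is an assumption made for illustration: the 2-state, 2-action MDP stored in `P` and `R`, the discount `GAMMA`, and the fixed number of evaluation sweeps.

```python
GAMMA = 0.9

# Hypothetical known model: P[s][a] is a list of (probability, next_state),
# R[s][a] is the reward for taking action a in state s.
P = {
    0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
    1: {0: [(1.0, 0)], 1: [(1.0, 1)]},
}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}


def backup(V, s, a):
    """r(s, a) + gamma * E_{s'}[V(s')], computed exactly from the model."""
    return R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])


def evaluate_policy(pi, sweeps=500):
    """Policy evaluation: repeated bootstrapped updates for a deterministic policy."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        V = {s: backup(V, s, pi[s]) for s in P}
    return V


def improve_policy(V):
    """Policy improvement: act greedily with respect to the one-step backup."""
    return {s: max(P[s], key=lambda a: backup(V, s, a)) for s in P}


def policy_iteration():
    pi = {s: 0 for s in P}  # arbitrary initial deterministic policy
    while True:
        V = evaluate_policy(pi)
        new_pi = improve_policy(V)
        if new_pi == pi:  # stop once the greedy policy no longer changes
            return pi, V
        pi = new_pi


pi, V = policy_iteration()
```

No environment interaction happens anywhere in this loop; every quantity is computed from the tables `P` and `R`.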
Value Iteration
We can also use Q functions for DP. In essence, we change our policy evaluation step to
Qπ(s,a)←r(s,a)+γEs′∼p(s′∣s,a)[Vπ(s′)]
and change our policy improvement step to
$$V^\pi(s) \leftarrow \max_a Q^\pi(s, a)$$
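where our policy is implicitly defined by acting greedily with respect to the value function. This is known as value iteration. Substituting the first update into the second collapses the two steps into the single update $V(s) \leftarrow \max_a \left( r(s, a) + \gamma \mathbb{E}_{s' \sim p(s' \mid s, a)}[V(s')] \right)$, which is why it is essentially just a one-step algorithm.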
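A minimal sketch of value iteration, assuming a hypothetical 2-state, 2-action MDP whose transition table `P` and reward table `R` are fully known. The names and the fixed sweep count are illustrative only.

```python
GAMMA = 0.9

# Hypothetical known model: P[s][a] is a list of (probability, next_state),
# R[s][a] is the reward for taking action a in state s.
P = {
    0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
    1: {0: [(1.0, 0)], 1: [(1.0, 1)]},
}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}


def value_iteration(sweeps=500):
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        # Q(s, a) <- r(s, a) + gamma * E_{s'}[V(s')]
        Q = {
            s: {a: R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a]) for a in P[s]}
            for s in P
        }
        # V(s) <- max_a Q(s, a)
        V = {s: max(Q[s].values()) for s in P}
    # The policy is never stored during the loop; it is read off greedily at the end.
    greedy = {s: max(Q[s], key=Q[s].get) for s in P}
    return V, Q, greedy


V, Q, greedy = value_iteration()
```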
Fitted Value Iteration
Now, let's imagine our state space has grown much larger, and thus it is no longer tractable to represent Vπ in a tabular format. How do we represent V(s)?
Of course, the natural idea is to use a neural network that regresses on V(s) with an MSE loss function. This produces the fitted value iteration algorithm.
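- Set $y_i \leftarrow \max_{a_i} \left( r(s_i, a_i) + \gamma \mathbb{E}[V_\theta(s_i')] \right)$.
- Set $\theta \leftarrow \arg\min_\theta \frac{1}{2} \sum_i \lVert V_\theta(s_i) - y_i \rVert^2$.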
Notably, we still require knowledge of the state transition probabilities and the action space must be small and discrete, due to the nature of step 1.
In practice, however, we won't know state transition probabilities, and we need to use samples rather than iterating over the entire action space...
Fitted Q Iteration
The trick is to simply use Q-functions instead of value functions!
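- Set $y_i \leftarrow r(s_i, a_i) + \gamma \mathbb{E}[V_\theta(s_i')]$, where we can approximate $\mathbb{E}[V_\theta(s_i')] \approx \max_{a'} Q_\theta(s_i', a')$.
- Set $\theta \leftarrow \arg\min_\theta \frac{1}{2} \sum_i \lVert Q_\theta(s_i, a_i) - y_i \rVert^2$.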
The key idea is that the Q function, $Q: S \times A \to \mathbb{R}$, also lets us compute the value of the next state, but without knowing the state transition probabilities: the sampled next state $s_i'$ stands in for the expectation over $s'$, and the best action at $s_i'$ is found by taking the max of the estimated $Q_\theta(s_i', a')$ over the actions $a'$. That max does not depend on which action the data-collecting policy actually took.
Why doesn't this work with value functions?
With value functions, if we were to compute $y_i$ using the sampled actions $a_i$, we would be evaluating our current policy π′ using samples from an old policy π... not good! This results in an outdated policy evaluation: the target values aren't representative of the value of the current policy, i.e. the labels $y_i$ are outdated. Hence, fitted value iteration is not an off-policy algorithm; it does not rely on sampled actions, but rather iterates over the entire action space and uses its knowledge of the state transition probabilities to compute its estimate of the value function. Fitted Q iteration, in contrast, is able to act as an off-policy algorithm, and uses samples to estimate Q.
tip
There are two ways to compute $\arg\max_{a'} Q_\theta(s_i', a')$.
- Make a neural network that takes in $(s, a)$ and outputs a single Q-value. This is the intuitive method (model $Q(s, a)$ directly), and is the implied method above. The max is then taken by evaluating the network once per candidate action.
- If the action space is small, make a neural network that takes in $s$ and outputs one Q-value per action. Then, the arg max is computed directly over the outputs of a single forward pass.
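The difference between the two designs can be sketched with tiny hand-written models in place of neural networks. The linear forms, the parameter names `theta` and `W`, and the scalar state are all hypothetical stand-ins chosen to keep the example short.

```python
def q_scalar(theta, s, a):
    """Design 1: (state, action) in, a single Q-value out."""
    return theta[0] * s + theta[1] * a + theta[2] * s * a


def argmax_design_1(theta, s, actions):
    # One model evaluation per candidate action.
    return max(actions, key=lambda a: q_scalar(theta, s, a))


def q_vector(W, s):
    """Design 2: state in, one Q-value per action out."""
    return [w0 + w1 * s for (w0, w1) in W]


def argmax_design_2(W, s):
    # A single model evaluation, then an argmax over its outputs.
    qs = q_vector(W, s)
    return max(range(len(qs)), key=lambda a: qs[a])


theta = [0.5, -1.0, 1.0]            # hypothetical parameters for design 1
W = [(0.0, 1.0), (1.0, -1.0)]       # hypothetical parameters for design 2 (two actions)

best_1 = argmax_design_1(theta, 2.0, actions=[0, 1, 2])
best_2 = argmax_design_2(W, 2.0)
```

Design 1 also extends to action sets that are too large to enumerate as outputs, at the cost of one evaluation per candidate; design 2 is the cheaper choice when the action space is small.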
Q-learning works with off-policy samples and needs only one neural network, so there is no high-variance policy gradient. However, it lacks convergence guarantees for nonlinear function approximators, an issue we'll discuss more later.
Our full fitted Q-iteration algorithm is thus
- Collect B samples (si,ai,si′) to add to the replay buffer.
- Set $y_i = r(s_i, a_i) + \gamma \max_{a'} Q_\theta(s_i', a')$.
- Set $\theta \leftarrow \arg\min_\theta \frac{1}{2} \sum_i \lVert Q_\theta(s_i, a_i) - y_i \rVert^2$.
- Repeat steps 2-3 K times, for some hyperparameter K.
- Go back to step 1 and repeat.
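The loop above can be sketched end to end on a toy problem. Several things here are assumptions for illustration: the 2-state environment `env_step`, the uniformly random behavior policy, storing the reward in each buffer entry, and a one-hot linear model (so `theta` is effectively a table) standing in for a neural network. The argmin in step 3 is only approximated, with a few gradient steps on the mean squared error.

```python
import random

GAMMA = 0.9
N_STATES, N_ACTIONS = 2, 2


def env_step(s, a, rng):
    """Hypothetical environment, only ever sampled from; returns (reward, next_state)."""
    if a == 0:
        return 0.0, 0                                  # action 0 always leads to state 0
    if s == 0:
        return 0.0, (1 if rng.random() < 0.8 else 0)   # action 1 usually reaches state 1
    return 1.0, 1                                      # staying in state 1 pays reward 1


def q_value(theta, s, a):
    # One-hot linear model: one parameter per (s, a) pair.
    return theta[s * N_ACTIONS + a]


def fitted_q_iteration(outer_iters=30, B=10, K=3, grad_steps=10, lr=1.0, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * (N_STATES * N_ACTIONS)
    buffer, s = [], 0
    for _ in range(outer_iters):
        # Step 1: collect B transitions; the behavior policy need not be greedy.
        for _ in range(B):
            a = rng.randrange(N_ACTIONS)
            r, s_next = env_step(s, a, rng)
            buffer.append((s, a, r, s_next))
            s = s_next
        for _ in range(K):
            # Step 2: targets take the max over all next actions a'.
            ys = [
                ri + GAMMA * max(q_value(theta, s2, a2) for a2 in range(N_ACTIONS))
                for (_si, _ai, ri, s2) in buffer
            ]
            # Step 3: regress Q_theta(s_i, a_i) onto the fixed targets y_i.
            for _ in range(grad_steps):
                grad = [0.0] * len(theta)
                for (si, ai, _ri, _s2), y in zip(buffer, ys):
                    grad[si * N_ACTIONS + ai] += (q_value(theta, si, ai) - y) / len(buffer)
                theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


theta = fitted_q_iteration()
greedy = [
    max(range(N_ACTIONS), key=lambda a: q_value(theta, s, a)) for s in range(N_STATES)
]
```

Note that the targets `ys` are held fixed during the inner regression and only recomputed at the start of each of the K rounds.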
Return to Q-Learning
Q-Iteration Optimization Objective
First, what is fitted Q-iteration really optimizing?
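Well, it's performing a regression, attempting to find parameters $\theta$ for our model $Q_\theta$ that minimize the error $\mathcal{E}$
$$\mathcal{E} = \frac{1}{2} \mathbb{E}_{(s,a) \sim \beta}\left[ \left( Q_\theta(s, a) - y \right)^2 \right] = \frac{1}{2} \mathbb{E}_{(s,a) \sim \beta}\left[ \left( Q_\theta(s, a) - \left[ r(s, a) + \gamma \max_{a'} Q_\theta(s', a') \right] \right)^2 \right]$$
where $\beta$ is the dataset/replay buffer we're sampling from.
If $\mathcal{E} = 0$, then $Q_\theta(s, a) = r(s, a) + \gamma \max_{a'} Q_\theta(s', a')$ for every $(s, a)$ in the support of $\beta$. When that support covers all state-action pairs, $Q_\theta$ is an optimal Q-function $Q^*$ that corresponds to the optimal policy $\pi^*$.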
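To make the objective concrete, the sketch below evaluates this error on a small buffer. The deterministic 2-state MDP, the buffer contents, and the two Q-tables are hypothetical values chosen so that the arithmetic is easy to check by hand.

```python
def bellman_error(q, buffer, gamma):
    """0.5 * mean over the buffer of (Q(s, a) - [r + gamma * max_a' Q(s', a')])^2."""
    total = 0.0
    for (s, a, r, s_next) in buffer:
        y = r + gamma * max(q[s_next])
        total += (q[s][a] - y) ** 2
    return 0.5 * total / len(buffer)


# Hypothetical deterministic MDP: action 0 leads to state 0, action 1 leads to state 1,
# and only (s=1, a=1) pays reward 1. The buffer covers every state-action pair once.
buffer = [(0, 0, 0.0, 0), (0, 1, 0.0, 1), (1, 0, 0.0, 0), (1, 1, 1.0, 1)]
gamma = 0.9

q_zero = {0: [0.0, 0.0], 1: [0.0, 0.0]}
q_star = {0: [8.1, 9.0], 1: [8.1, 10.0]}  # satisfies Q = r + gamma * max Q on this MDP

err_zero = bellman_error(q_zero, buffer, gamma)
err_star = bellman_error(q_star, buffer, gamma)
```

With the all-zero table, only the rewarded transition contributes, giving an error of 0.5 · (1/4) · 1² = 0.125; with `q_star` the error vanishes up to floating-point rounding.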
Online Q-Learning
We can produce the online Q-learning algorithm by simply using fitted Q-iteration with $B = 1$ and $K = 1$, taking a single gradient step instead of a full regression.
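- Take some action $a_i$ and observe $(s_i, a_i, s_i')$.
- $y_i = r(s_i, a_i) + \gamma \max_{a'} Q_\theta(s_i', a')$.
- $\theta \leftarrow \theta - \alpha \frac{dQ_\theta}{d\theta}(s_i, a_i) \left( Q_\theta(s_i, a_i) - y_i \right)$.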
Notably, this is still off-policy, unlike the online actor-critic algorithm we covered last lecture. Thus, we have a choice in sampling (si,ai,si′); most importantly, it doesn't have to be the result of taking the best action according to our current policy. This allows us to explore more of the action space.
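A minimal sketch of one such update, assuming a one-hot (tabular) parameterization so that $dQ_\theta/d\theta$ is simply an indicator for the visited $(s_i, a_i)$ entry. The function name, the flat layout of `theta`, and the example numbers are hypothetical.

```python
def online_q_update(theta, s, a, r, s_next, alpha, gamma, n_actions):
    """One online Q-learning step on a single transition (s, a, r, s_next)."""
    # Target: reward plus discounted max over next actions.
    y = r + gamma * max(theta[s_next * n_actions + a2] for a2 in range(n_actions))
    idx = s * n_actions + a
    # With one-hot features the gradient of Q_theta(s, a) is 1 at idx and 0 elsewhere,
    # so only that entry moves.
    new_theta = list(theta)
    new_theta[idx] = theta[idx] - alpha * (theta[idx] - y)
    return new_theta


theta = [0.0, 0.0, 0.0, 2.0]  # two states, two actions, flattened as s * n_actions + a
new_theta = online_q_update(theta, s=0, a=1, r=1.0, s_next=1,
                            alpha=0.5, gamma=0.9, n_actions=2)
```

Here the target is 1 + 0.9 · 2 = 2.8, so the entry for (s=0, a=1) moves halfway from 0 toward it. The action `a` passed in can come from any behavior policy, which is what makes the update off-policy.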
Exploratory Q-Learning
We can extend this idea of exploration with ϵ-greedy methods! In essence, we act according to the policy
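$$\pi(a_t \mid s_t) = \begin{cases} 1 - \epsilon, & \text{if } a_t = \arg\max_{a_t} Q_\theta(s_t, a_t) \\ \epsilon / (|A| - 1), & \text{otherwise} \end{cases}$$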
which chooses the greedy action with probability 1−ϵ and chooses a different action uniformly at random with probability ϵ.
An alternative is to use a softmax over the Q-values, which gives
π(at∣st)∝exp(Qθ(st,at))
This is known as Boltzmann exploration, and basically weights the non-greedy actions according to their estimated values, rather than just choosing from them randomly.
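Both exploration rules can be sketched as functions that turn a list of Q-values into action probabilities. The function names are hypothetical, and the `temperature` parameter is an added assumption: the formula above corresponds to `temperature=1.0`.

```python
import math
import random


def epsilon_greedy_probs(q_values, epsilon):
    """1 - epsilon on the greedy action, epsilon / (|A| - 1) on each other action."""
    n = len(q_values)  # assumes at least two actions
    greedy = max(range(n), key=lambda a: q_values[a])
    return [1.0 - epsilon if a == greedy else epsilon / (n - 1) for a in range(n)]


def boltzmann_probs(q_values, temperature=1.0):
    """pi(a | s) proportional to exp(Q(s, a) / temperature)."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


q_values = [1.0, 3.0, 2.0]
eps_probs = epsilon_greedy_probs(q_values, epsilon=0.2)
soft_probs = boltzmann_probs(q_values)
action = sample_action(soft_probs, random.Random(0))
```

With ϵ-greedy, the two non-greedy actions receive equal probability regardless of their values; under Boltzmann exploration the action with Q-value 2.0 is chosen more often than the one with Q-value 1.0.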
warning
Simply applying the Q-Learning algorithm as it is now is not sufficient; we need some tricks to make it work in practice, which we'll discuss in future lectures!