Markov Property
In RL problems, states follow the Markov property, i.e. the future state depends only on the current state (and the action taken in it), not on the previous states that led up to it. In other words, the state is memoryless.
Partially observed RL
Actions $a_t$, states $s_t$, observations $o_t$.
$$\pi_\theta(a_t \mid o_t)\, p(o_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
Fully observed RL
$o_t = s_t$, so observations are unnecessary.
$$\pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
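The factorization above can be read as a sampling procedure. The following is a minimal sketch, assuming small tabular models with randomly generated probabilities; the names `emission`, `policy`, `transition`, and `rollout` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_obs, n_actions = 3, 2, 2

# Hypothetical tabular models; the last axis of each array sums to 1.
emission = rng.dirichlet(np.ones(n_obs), size=n_states)       # p(o_t | s_t)
policy = rng.dirichlet(np.ones(n_actions), size=n_obs)        # pi(a_t | o_t)
transition = rng.dirichlet(np.ones(n_states),
                           size=(n_states, n_actions))        # p(s_{t+1} | s_t, a_t)

def rollout(s0, horizon):
    """Sample (state, observation, action) triples for `horizon` steps."""
    s, trajectory = s0, []
    for _ in range(horizon):
        o = rng.choice(n_obs, p=emission[s])          # observe
        a = rng.choice(n_actions, p=policy[o])        # act on the observation only
        trajectory.append((int(s), int(o), int(a)))
        s = rng.choice(n_states, p=transition[s, a])  # next state depends on (s, a) only
    return trajectory

trajectory = rollout(s0=0, horizon=5)
```

In the fully observed case the emission step disappears and the policy would be indexed by the state directly.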
Behavioral Cloning
What if we wanted to do supervised learning? When collecting data, we can measure several demonstration trajectories
$$\{(o_1^{(i)}, a_1^{(i)}, \ldots, o_H^{(i)}, a_H^{(i)})\}_{i=1}^{N}$$
Behavioral cloning, a form of imitation learning, involves maximizing the log-probability of the demonstrated actions under the policy. This is effectively just supervised learning/maximum likelihood estimation with a neural network.
$$\arg\max_\theta \sum_{i=1}^{N} \sum_{t=1}^{H} \log \pi_\theta(a_t^{(i)} \mid o_t^{(i)})$$
The policy itself takes the raw observation (e.g. an image) and processes it with an encoder to produce the parameters of the action distribution. For discrete actions, these parameters are logits that are passed through a softmax. For continuous actions, they are the parameters of a continuous distribution, e.g. the mean and covariance of a Gaussian over actions.
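As a minimal sketch of the discrete-action case, the example below fits a linear softmax policy by gradient ascent on the log-likelihood of demonstrated actions. It assumes a linear map in place of a neural network encoder, and the synthetic "expert" rule and the names `bc_fit` and `bc_log_likelihood` are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis of a 2-D array."""
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def bc_log_likelihood(W, obs, actions):
    """Sum of log pi(a | o) over all demonstrated (o, a) pairs."""
    logp = log_softmax(obs @ W)
    return logp[np.arange(len(actions)), actions].sum()

def bc_fit(obs, actions, n_actions, lr=0.5, steps=500):
    """Maximize the mean log-likelihood with plain gradient ascent."""
    W = np.zeros((obs.shape[1], n_actions))
    onehot = np.eye(n_actions)[actions]
    for _ in range(steps):
        probs = np.exp(log_softmax(obs @ W))
        grad = obs.T @ (onehot - probs) / len(actions)  # gradient of mean log-likelihood
        W += lr * grad
    return W

rng = np.random.default_rng(0)
obs = rng.normal(size=(200, 2))              # synthetic observations
actions = (obs[:, 0] > 0).astype(int)        # hypothetical expert: action 1 iff first feature > 0
W = bc_fit(obs, actions, n_actions=2)
accuracy = (np.argmax(obs @ W, axis=1) == actions).mean()
```

For continuous actions, the same objective would apply with the softmax log-probability replaced by, for example, a Gaussian log-density.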
Does Behavioral Cloning Work?
The main issue with behavioral cloning is that, in RL, data is not i.i.d.: earlier actions in a sequence affect later states. As a result, after a slight divergence from the training trajectory (due to a small error from the model), the policy reaches states it was not trained on, where it is more likely to err again, so the error is amplified across the rest of the trajectory. This problem may be addressed in several ways:
Change the algorithm (DAgger)
Use better models that make fewer mistakes
Smarter data collection/augmentation
Multi-task learning
But first, we formalize the notion of this divergence, which is termed distributional shift, i.e. $p_{\text{train}}(x) \neq p_{\text{test}}(x)$. At some point in testing, $p_{\pi_\theta}(o_t) \neq p_{\text{data}}(o_t)$. We'll now try to quantify just how bad this distributional shift can be for behavioral cloning. Note that this theoretical view is not necessarily representative of typical real-world behavior.
Assume the data is produced by a good policy $\pi^*(s_t)$. (Note that we are using the fully observed case for simplicity.) We define a cost function
$$c(s_t, a_t) = \begin{cases} 0, & a_t = \pi^*(s_t) \\ 1, & \text{otherwise} \end{cases}$$
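The expected value of our cost at time step $t$, with states drawn from the policy's state distribution and actions sampled from $\pi_\theta$, is thus
$$E_{s_t \sim p_{\pi_\theta}(s_t),\, a_t \sim \pi_\theta(a_t \mid s_t)}\left[c(s_t, a_t)\right]$$
Note that $s_t \sim p_{\pi_\theta}(s_t)$ represents the distribution of possible states $s_t$ at time $t$ for the policy $\pi_\theta$, and it depends on past actions.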
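Let's assume $\pi_\theta(a \neq \pi^*(s) \mid s) \leq \epsilon$ for all $s \in D_{\text{train}}$. In other words, the probability of making a mistake is at most $\epsilon$ for each state in the dataset. Then, writing $M$ for the total number of mistakes over a horizon of $H$ steps, and assuming the worst case in which every step after the first mistake also incurs a cost of 1,
$$E[M] \leq \epsilon H + (1-\epsilon)\big(\epsilon(H-1) + (1-\epsilon)(\ldots)\big) \leq \epsilon H^2$$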
In other words, the expected number of mistakes increases quadratically in the horizon. This assumption, however, does not generalize to most datasets.
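The nested bound can be evaluated numerically. This is a minimal sketch, assuming the worst case in which the first mistake happens with probability exactly $\epsilon$ per step and every later step also counts as a mistake; the function name `worst_case_expected_mistakes` is hypothetical.

```python
def worst_case_expected_mistakes(eps, horizon):
    """Unrolls eps*H + (1-eps)*(eps*(H-1) + (1-eps)*(...)).

    Term t: no mistake in the first t steps, then a mistake that
    costs 1 at each of the remaining (horizon - t) steps.
    """
    total = 0.0
    for t in range(horizon):
        total += (1 - eps) ** t * eps * (horizon - t)
    return total

eps = 0.001
values = {H: worst_case_expected_mistakes(eps, H) for H in (10, 20, 40)}
# Ratio between consecutive horizons; close to 4 means roughly quadratic growth.
ratios = [values[20] / values[10], values[40] / values[20]]
```

When $\epsilon H$ is small, doubling the horizon roughly quadruples the value, in line with the $\epsilon H^2$ bound.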
Instead, we can make a more realistic assumption, in which, for any state sampled from the training distribution (not just the states observed in the training dataset), we are unlikely to make a mistake, i.e.
$$E_{s_t \sim p_{\text{train}}(s_t)}\left[\pi_\theta(a_t \neq \pi^*(s_t) \mid s_t)\right] \leq \epsilon$$
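Then, we can write the probability of some state $s_t$ under the learned policy as a mixture
$$p_{\pi_\theta}(s_t) = (1-\epsilon)^t\, p_{\text{train}}(s_t) + \left(1 - (1-\epsilon)^t\right) p_{\text{mistake}}(s_t)$$
where $(1-\epsilon)^t$ is the probability of making no mistake in the first $t$ steps and $p_{\text{mistake}}$ is some other, unknown distribution over the states reached after at least one mistake.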
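Since the total variation distance (TVD) between two distributions is at most 1, and $(1-\epsilon)^t \geq 1 - \epsilon t$ for $\epsilon \in [0,1]$, we find
$$D_{TV}(p_{\text{train}}, p_{\pi_\theta}) \leq 1 - (1-\epsilon)^t \leq \epsilon t$$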
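A quick numerical check of this bound is sketched below. It assumes the policy's state distribution is a mixture of the training distribution (weight $(1-\epsilon)^t$) and an arbitrary "after a mistake" distribution, here called `p_mistake`, and it assumes the convention $D_{TV}(p, q) = \frac{1}{2}\sum_s |p(s) - q(s)|$; both distributions are randomly generated for illustration.

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance with the 1/2 convention, so it lies in [0, 1]."""
    return 0.5 * np.abs(p - q).sum()

rng = np.random.default_rng(0)
eps, n_states = 0.05, 6
p_train = rng.dirichlet(np.ones(n_states))     # hypothetical training state distribution
p_mistake = rng.dirichlet(np.ones(n_states))   # hypothetical distribution after a mistake

rows = []
for t in range(1, 11):
    stay = (1 - eps) ** t                      # probability of no mistake in t steps
    p_policy = stay * p_train + (1 - stay) * p_mistake
    tv = total_variation(p_train, p_policy)
    rows.append((t, tv, 1 - stay, eps * t))    # tv <= 1 - (1-eps)^t <= eps * t
```

Each row holds the step, the measured distance, and the two successive upper bounds, so the chain of inequalities can be inspected directly.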
In other words, the distance between probability distributions increases linearly in t. This bound allows us to derive, after several simplifications, that
$$E[M] \leq \sum_{t=1}^{H} \left(\epsilon + 2\epsilon t\right)$$
or that $E[M] \in O(\epsilon H^2)$. Thus, the error increases quadratically with the horizon.
□
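The simplifications leading to the final bound can be sketched as follows. This is a hedged reconstruction, assuming discrete states, the convention $D_{TV}(p, q) = \frac{1}{2}\sum_s |p(s) - q(s)|$, a maximum per-step cost $c_{\max} = 1$, and the shorthand $c_t(s_t) = \pi_\theta(a_t \neq \pi^*(s_t) \mid s_t)$ for the expected cost in state $s_t$ (a symbol introduced here for illustration).

```latex
\begin{align*}
E[M] &= \sum_{t=1}^{H} \sum_{s_t} p_{\pi_\theta}(s_t)\, c_t(s_t) \\
&\leq \sum_{t=1}^{H} \Big( \sum_{s_t} p_{\text{train}}(s_t)\, c_t(s_t)
      + \sum_{s_t} \big| p_{\pi_\theta}(s_t) - p_{\text{train}}(s_t) \big|\, c_{\max} \Big) \\
&\leq \sum_{t=1}^{H} \big( \epsilon + 2\, D_{TV}(p_{\text{train}}, p_{\pi_\theta}) \big) \\
&\leq \sum_{t=1}^{H} \left( \epsilon + 2 \epsilon t \right)
 = \epsilon H + \epsilon H (H + 1)
 = \epsilon H (H + 2) \in O(\epsilon H^2)
\end{align*}
```

The factor of 2 appears because the sum of absolute differences is twice the total variation distance under the convention assumed here.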
This analysis, however, is pessimistic, as it assumes we cannot recover from mistakes. Paradoxically, then, this implies that imitation learning can achieve better performance if the data includes more mistakes and subsequent recoveries.
Solution 1: DAgger (Dataset Aggregation)
We'd like to make $p_{\text{data}}(o_t) \approx p_{\pi_\theta}(o_t)$, i.e. reduce distributional shift. DAgger focuses on being smart about $p_{\text{data}}(o_t)$ for this purpose: its goal is to effectively collect training data from $p_{\pi_\theta}(o_t)$ instead of $p_{\text{data}}(o_t)$!
Train $\pi_\theta(a_t \mid o_t)$ from human data $D$.
Run $\pi_\theta(a_t \mid o_t)$ to produce the model dataset $D_\pi$.
Ask a human to label $D_\pi$ with actions $a_t$.
Aggregate $D \leftarrow D \cup D_\pi$.
Repeat.
This iteratively aligns the training dataset to further match the states/observations that the policy πθ experiences.
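The loop above can be sketched on a toy problem. This is a minimal illustration, not a reference implementation: the 1-D environment, the scripted `expert_action` standing in for the human labeler, and the 1-nearest-neighbour `fit_policy` standing in for supervised training are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(x):
    """Hypothetical expert labeler: step toward the origin."""
    return 0 if x > 0 else 1                   # 0 = step left, 1 = step right

def step(x, action):
    """Hypothetical 1-D dynamics with a little noise."""
    return x + (0.1 if action == 1 else -0.1) + rng.normal(scale=0.05)

def fit_policy(dataset):
    """Stand-in for supervised training: a 1-nearest-neighbour policy."""
    xs = np.array([x for x, _ in dataset])
    acts = np.array([a for _, a in dataset])
    return lambda x: int(acts[np.argmin(np.abs(xs - x))])

def run_policy(policy, horizon=20):
    """Roll out the policy and return the states it visits."""
    x, visited = rng.uniform(-1.0, 1.0), []
    for _ in range(horizon):
        visited.append(x)
        x = step(x, policy(x))
    return visited

# Initial demonstrations D, covering only a narrow range of states.
dataset = [(x, expert_action(x)) for x in rng.uniform(-0.2, 0.2, size=5)]
for iteration in range(5):
    policy = fit_policy(dataset)                          # train on D
    visited = run_policy(policy)                          # run the policy to get D_pi
    labelled = [(x, expert_action(x)) for x in visited]   # expert labels D_pi
    dataset = dataset + labelled                          # aggregate D <- D U D_pi
```

The key design point is that the states added in each iteration come from the learned policy's own rollouts, while the labels still come from the expert.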
The main difficulty with DAgger, however, is the production of labels by humans. This is not only expensive, but humans may not even be able to classify each observation, depending on the task.
A common variant of DAgger aimed at fixing this issue replaces human labeling with human intervention.
Train $\pi_\theta(a_t \mid o_t)$ from human data $D$.
Run $\pi_\theta(a_t \mid o_t)$.
Ask a human to intervene or take over at some time step t.