Lecture 15: Model-Based RL
Learning a Simulator
Model-based RL, in contrast to the model-free RL methods discussed thus far, attempts to learn a model of the problem environment in order to produce an optimal policy. Here's a sketch of a simple simulator-learning algorithm.
- Sample data $\mathcal{D}$ using a base policy $\pi_0$.
- Learn a simulator $\hat{P}$ from $\mathcal{D}$ with some loss function, e.g. MSE or negative log-likelihood (MLE).
- Run your favorite RL method inside the simulator $\hat{P}$. This could simply be a model-free algorithm, or it could be a planning/trajectory optimization algorithm that takes advantage of $\hat{P}$.
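The first two steps can be sketched for a tiny tabular problem. This is a minimal illustration under assumptions not in the lecture: the 5-state chain environment `true_step`, the uniformly random base policy, and the helper names are all hypothetical. For a tabular model, the maximum-likelihood simulator is just the empirical frequency of each next state given a state-action pair.

```python
import random
from collections import Counter, defaultdict

def true_step(state, action, rng):
    """Hypothetical 5-state chain: the action (-1 or +1) succeeds with prob. 0.8."""
    move = action if rng.random() < 0.8 else -action
    return min(4, max(0, state + move))

def collect(policy, n_steps, rng):
    """Step 1: roll out the base policy and record (s, a, s') transitions."""
    data, s = [], 2
    for _ in range(n_steps):
        a = policy(s, rng)
        s_next = true_step(s, a, rng)
        data.append((s, a, s_next))
        s = s_next
    return data

def fit_simulator(data):
    """Step 2: the tabular MLE is the empirical frequency of s' given (s, a)."""
    counts = defaultdict(Counter)
    for s, a, s_next in data:
        counts[(s, a)][s_next] += 1
    return {sa: {s2: c / sum(ctr.values()) for s2, c in ctr.items()}
            for sa, ctr in counts.items()}

rng = random.Random(0)
base_policy = lambda s, rng: rng.choice([-1, 1])   # uniformly random base policy
dataset = collect(base_policy, 5000, rng)
simulator = fit_simulator(dataset)
print(simulator[(2, 1)])  # estimated next-state distribution from s=2, a=+1
```

Step 3 would then run any RL or planning method against `simulator` instead of `true_step`.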
There can be some complications, however, with each step.
- The dataset must cover the state space reasonably well, so the choice of base policy $\pi_0$ largely determines the quality of the resulting dataset. (Statistics/algorithms problem).
- The design of the simulator model is domain-dependent since the model complexity and structure depend heavily on the problem complexity and structure. (Deep learning/models problem).
- In general, model-free RL in the learned simulator may be difficult; for instance, the simulator may be very computationally inefficient, making sampling expensive. (Controls/RL problem).
Today, we'll primarily discuss the first step/problem.
Distributional Shift and Uncertainty
Let $\hat{\pi}$ be the optimal policy learned under the simulator $\hat{P}$. The key issue with problem 1 is that the state distribution in the dataset produced by the base policy $\pi_0$ is not the same as the state distribution visited by $\hat{\pi}$; in other words, we have a distributional shift problem, just like in Lecture 2 with imitation learning!
Well, what if we add the following step to our algorithm?
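- Sample data $\mathcal{D}$ using base policy $\pi_0$.
- Learn simulator $\hat{P}$ from $\mathcal{D}$.
- Learn optimal policy $\hat{\pi}$ under $\hat{P}$.
- Run $\hat{\pi}$ in the real environment, add the new data to the dataset, $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}_{\hat{\pi}}$, and repeat from step 2.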
Note the similarity between this modification and DAgger.
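The aggregation loop can be sketched as follows. Everything here is an illustrative assumption rather than the lecture's implementation: a hypothetical 5-state chain, a tabular count-based simulator, and value iteration standing in for "learn the optimal policy under the simulator".

```python
import random
from collections import Counter, defaultdict

N_STATES, ACTIONS, GOAL = 5, (-1, 1), 4   # hypothetical chain environment

def true_step(s, a, rng):
    """Real environment: the chosen move succeeds with probability 0.8."""
    move = a if rng.random() < 0.8 else -a
    return min(N_STATES - 1, max(0, s + move))

def rollout(policy, n_steps, rng):
    """Run a policy in the real environment and record (s, a, s') transitions."""
    data, s = [], 0
    for _ in range(n_steps):
        a = policy(s, rng)
        s_next = true_step(s, a, rng)
        data.append((s, a, s_next))
        s = s_next
    return data

def fit_simulator(data):
    """Tabular MLE of P(s' | s, a) from counts."""
    counts = defaultdict(Counter)
    for s, a, s_next in data:
        counts[(s, a)][s_next] += 1
    return {sa: {s2: c / sum(ctr.values()) for s2, c in ctr.items()}
            for sa, ctr in counts.items()}

def plan(model, gamma=0.9, iters=50):
    """Value iteration inside the learned simulator; reward 1 for landing on GOAL."""
    V = [0.0] * N_STATES
    def q(s, a):
        probs = model.get((s, a), {})   # unseen pairs get value 0 (sketch assumption)
        return sum(p * ((s2 == GOAL) + gamma * V[s2]) for s2, p in probs.items())
    for _ in range(iters):
        V = [max(q(s, a) for a in ACTIONS) for s in range(N_STATES)]
    return {s: max(ACTIONS, key=lambda a: q(s, a)) for s in range(N_STATES)}

rng = random.Random(0)
policy = lambda s, rng: rng.choice(ACTIONS)      # base policy
dataset = rollout(policy, 500, rng)              # step 1: initial data
for _ in range(3):
    model = fit_simulator(dataset)               # step 2: learn simulator
    greedy = plan(model)                         # step 3: optimize inside it
    policy = lambda s, rng, g=greedy: g[s]
    dataset += rollout(policy, 500, rng)         # step 4: aggregate new data
print(greedy)
```

Note that `plan` fully optimizes against the learned model on every iteration, so any error in `model` is something the resulting policy is free to exploit.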
Does this fix the problem? Unfortunately, no; it is an improvement, but it is not sufficient. In particular, step 3, which learns the optimal policy $\hat{\pi}$ under the simulator $\hat{P}$, will end up exploiting the mistakes or misrepresentations in $\hat{P}$. This encourages extreme policies, which slows training considerably: such policies generalize poorly, and they also produce wildly varying data.
So, what if we replace step 3 with just a small improvement to our policy, like in PPO? This is a good idea, but what does "small" mean here? Ideally, we'd like improvements large enough to learn quickly, yet small enough that the policy does not overfit to the errors in our simulator $\hat{P}$. Generally speaking, answering this question is a matter of hyperparameter tuning during training.
Also, how exactly do we enforce this "small" improvement? We have a few options.
- Add a trust region (as in TRPO) or a KL divergence constraint between the updated policy and the previous one.
- Stick to regions where the simulator model is "confident." This may be implemented with a probabilistic, uncertainty-aware model or a pessimism penalty when presented with any model uncertainty. For the first, in step 3, we additionally extract the best policy in expectation under our probabilistic model.
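One way to write the two options above, as a sketch only. The notation is assumed here rather than taken from the lecture: $\hat{P}$ is the learned simulator, $\pi_k$ the current policy, $\delta$ a step-size budget, $u(s, a)$ some estimate of model uncertainty, and $\lambda$ a penalty weight. The KL direction shown follows the TRPO convention; other methods use the reverse.

```latex
% Option 1: small policy step, measured by KL divergence (trust region)
\pi_{k+1} = \arg\max_{\pi} \;
  \mathbb{E}_{\tau \sim (\hat{P}, \pi)}\Big[ \sum_t \gamma^t r(s_t, a_t) \Big]
\quad \text{s.t.} \quad
\mathbb{E}_{s}\Big[ D_{\mathrm{KL}}\big( \pi_k(\cdot \mid s) \,\|\, \pi(\cdot \mid s) \big) \Big] \le \delta

% Option 2: pessimism penalty on uncertain state-action pairs
\tilde{r}(s, a) = r(s, a) - \lambda \, u(s, a)
```

In the second form, the policy is optimized against $\tilde{r}$ instead of $r$, which discourages it from visiting regions where the simulator is unsure.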
For the probabilistic model, there are notably a few caveats to consider.
- We'd like to promote exploration when collecting our dataset to better improve our simulator model.
- Extracting the best policy in expectation is not the same as optimizing a pessimistic or optimistic value.
Uncertainty-Aware Neural Networks
Consider first a neural network that outputs a distribution over the state space. What if we use the entropy of the output distribution to measure confidence? Unfortunately, this is problematic because the simulator model often becomes erroneously confident due to excessive overfitting.
In particular, there are really two types of uncertainty.
- Aleatoric or statistical uncertainty: "how random does the model think the data is?"
- Epistemic or model uncertainty: "the model is certain about the data, but we are uncertain about the model itself!"
Epistemic uncertainty is what we are trying to reduce; let's try and estimate it!
In typical MLE, we estimate $\hat{\theta} = \arg\max_{\theta} \log p(\mathcal{D} \mid \theta)$. However, what if, instead of learning just a single, likelihood-maximizing parameter $\hat{\theta}$, we learned the conditional distribution of the parameters given the data, i.e. we learned $p(\theta \mid \mathcal{D})$? This distribution's entropy captures our epistemic uncertainty: if we're certain about the model, the distribution will have extremely low entropy, with most probability mass concentrated on a single point. In contrast, if we're uncertain, the distribution will have high entropy.
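Moreover, with the parameter posterior $p(\theta \mid \mathcal{D})$, we can make predictions based on the entire distribution of our parameters, i.e.
$$p(s_{t+1} \mid s_t, a_t) = \int p(s_{t+1} \mid s_t, a_t, \theta) \, p(\theta \mid \mathcal{D}) \, d\theta.$$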
Learning an entire distribution is generally intractable, or at least much harder than learning a single parameter estimate, so in practice we approximate the distribution.
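To make the difference between a point estimate and a parameter distribution concrete, here is a one-parameter toy sketch (not a neural network). The data, the uniform prior, and the variable names are all assumptions chosen for illustration: a single transition outcome is modeled as a Bernoulli with unknown success probability `theta`.

```python
import random

# Hypothetical data: a transition was tried 3 times and succeeded 3 times.
successes, failures = 3, 0

# MLE plug-in estimate: a single parameter, here fully (over)confident.
theta_mle = successes / (successes + failures)

# Bayesian alternative: with a uniform Beta(1, 1) prior, the posterior over
# theta is Beta(1 + successes, 1 + failures).
alpha, beta = 1 + successes, 1 + failures

# Monte Carlo estimate of the posterior predictive:
# p(success | D) = E_{theta ~ p(theta | D)}[theta].
rng = random.Random(0)
samples = [rng.betavariate(alpha, beta) for _ in range(20000)]
predictive_mc = sum(samples) / len(samples)
predictive_exact = alpha / (alpha + beta)   # closed form for this conjugate pair

print(theta_mle, predictive_exact, predictive_mc)
```

The MLE predicts success with probability 1 after only three observations, while averaging over the posterior gives a more cautious prediction. For neural networks no such closed form exists, which is why approximations are needed.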
One such approximation is a Bayesian neural network. A Bayesian neural network replaces each weight parameter with a distribution over that weight. This corresponds to a fully factorized posterior, as we are approximating our distribution with $p(\theta \mid \mathcal{D}) \approx \prod_i p(\theta_i \mid \mathcal{D})$, which includes a posterior distribution for every individual parameter. Notably, this cannot represent covariances between parameters, but it is simple. Also, we typically represent each factor as a Gaussian, $p(\theta_i \mid \mathcal{D}) = \mathcal{N}(\mu_i, \sigma_i^2)$, and thus the number of parameters only doubles. How are they trained? With variational inference; we won't discuss the derivation in detail.
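A minimal sketch of predicting with a fully factorized Gaussian posterior, assuming a one-layer model `y = w * x + b`. The means and standard deviations below are hypothetical placeholders, not the result of variational training; the point is only that each weight carries two numbers and that predictions average over sampled weights.

```python
import random
import statistics

# Hypothetical variational parameters: one (mu, sigma) pair per weight,
# so the parameter count is twice that of a point-estimate model.
mu = {"w": 1.5, "b": -0.2}
sigma = {"w": 0.3, "b": 0.1}

def sample_weights(rng):
    """Draw each weight independently from its own Gaussian (no covariances)."""
    return {k: rng.gauss(mu[k], sigma[k]) for k in mu}

def predict(x, n_samples, rng):
    """Monte Carlo predictive mean and spread for y = w * x + b."""
    ys = []
    for _ in range(n_samples):
        theta = sample_weights(rng)
        ys.append(theta["w"] * x + theta["b"])
    return statistics.mean(ys), statistics.stdev(ys)

rng = random.Random(0)
mean_near, std_near = predict(0.0, 10000, rng)
mean_far, std_far = predict(2.0, 10000, rng)
print(mean_near, std_near)
print(mean_far, std_far)
```

In this toy model the predictive spread grows with `|x|`, because uncertainty in `w` is multiplied by the input.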
Another, simpler approximation is bootstrap ensembles, in which we learn multiple models with parameters $\theta_1, \dots, \theta_N$ and average their predictions, $p(s_{t+1} \mid s_t, a_t) \approx \frac{1}{N} \sum_{i=1}^{N} p(s_{t+1} \mid s_t, a_t, \theta_i)$. (See chapter 7 from Goodfellow et al.). But how do we train these models so that they don't simply learn the same parameters? We could partition our dataset $\mathcal{D}$ into disjoint sets $\mathcal{D}_1, \dots, \mathcal{D}_N$ for the $N$ models; however, this may be a very inefficient use of our dataset. Instead, we can produce each $\mathcal{D}_i$ by sampling with replacement from $\mathcal{D}$, drawing as many samples as there are data points in $\mathcal{D}$. Sampling with replacement produces sufficiently independent datasets, which translate to sufficiently independent models.
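The resampling step can be sketched in a few lines. The function name and the integer stand-in dataset are hypothetical; in practice each element would be a transition tuple.

```python
import random

def bootstrap_datasets(dataset, n_models, rng):
    """Return n_models resampled copies of dataset, each drawn with
    replacement and each the same size as the original."""
    n = len(dataset)
    return [rng.choices(dataset, k=n) for _ in range(n_models)]

rng = random.Random(0)
dataset = list(range(2000))          # stand-in for (s, a, s') transitions
resampled = bootstrap_datasets(dataset, n_models=5, rng=rng)

for i, d_i in enumerate(resampled):
    unique_fraction = len(set(d_i)) / len(dataset)
    print(i, unique_fraction)
```

Each resampled dataset contains roughly 63% of the distinct original points (about $1 - 1/e$), with the rest being duplicates, so every model sees a somewhat different view of the data while still using a full-size training set.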
In practice, bootstrap ensembles come with a couple of caveats: the number of models is typically small for efficiency reasons, which lowers approximation accuracy, and SGD with random parameter initialization typically produces sufficiently independent models even without resampling with replacement.
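Given an ensemble, one common way to separate the two kinds of uncertainty is the law of total variance. The sketch below assumes each ensemble member outputs a Gaussian mean and variance for a scalar prediction; the member outputs are made-up numbers, and `decompose` is a hypothetical helper name.

```python
from statistics import fmean, pvariance

def decompose(member_preds):
    """Split the variance of a uniform mixture of Gaussian predictions.

    member_preds: list of (mean, variance) pairs, one per ensemble member.
    Returns (mixture mean, aleatoric part, epistemic part), where
    total variance = aleatoric + epistemic.
    """
    means = [m for m, _ in member_preds]
    variances = [v for _, v in member_preds]
    aleatoric = fmean(variances)     # average noise the members predict
    epistemic = pvariance(means)     # disagreement between members
    return fmean(means), aleatoric, epistemic

# Hypothetical member outputs at a state-action pair seen often in training...
in_dist = [(1.0, 0.2), (1.1, 0.2), (0.9, 0.2)]
# ...and at one far from the training data, where the members disagree.
out_dist = [(0.0, 0.2), (2.0, 0.2), (-1.5, 0.2)]

_, alea_in, epi_in = decompose(in_dist)
_, alea_out, epi_out = decompose(out_dist)
print(alea_in, epi_in)
print(alea_out, epi_out)
```

Both inputs have the same predicted noise level, but the disagreement term is much larger for the second one. That disagreement is the quantity a pessimism penalty or a "stay where the model is confident" rule would act on.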