Chapter 11: Practical Methodology

High-level design process:

Determine goals: error metric, target error value
Build pipeline to estimate metrics
Instrument systems, diagnose performance bottlenecks
Incrementally improve algorithm

11.1 Performance Metrics

For the target error value, consider

What errors have been achieved on previously published benchmark results?
What is the largest error the application can tolerate and still be safe or useful?

For the choice of target metric, consider

Which mistakes are more costly than others (e.g. false positive vs false negative)?
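If the event is rare, accuracy is uninformative (always predicting "no event" already scores well); use precision and recall instead, or the F-score to combine both into a single number.
If the model can estimate its confidence in a decision, it can decline to answer when unsure. Coverage, i.e. the fraction of inputs for which the model produces a response, can then be traded off against accuracy on the answered inputs.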
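
A minimal sketch of these metrics for a binary task, assuming labels are 0/1 with 1 as the rare event. The function names and the confidence threshold are illustrative choices, not part of the chapter.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = rare positive event)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def coverage_and_accuracy(confidences, correct, threshold):
    """Answer only when confidence >= threshold; defer everything else.

    Returns (fraction of inputs answered, accuracy on the answered inputs).
    """
    answered = [c for conf, c in zip(confidences, correct) if conf >= threshold]
    coverage = len(answered) / len(confidences)
    accuracy = sum(answered) / len(answered) if answered else 0.0
    return coverage, accuracy


# A classifier that never predicts the rare event: 98% accurate, zero recall.
y_true = [0] * 98 + [1] * 2
y_pred = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Raising the threshold in `coverage_and_accuracy` typically lowers coverage and raises accuracy, which is the trade-off the coverage metric is meant to expose.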

11.2 Default Baseline Models

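Choose the model family from the structure of the input (e.g. fully connected for fixed-size vectors, convolutional for images, recurrent for sequences) and the model complexity from the complexity of the problem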
Start with Adam or a similar optimization algorithm
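Include some mild regularization from the start, e.g. dropout. Batch normalization mainly helps optimization, but it can also reduce generalization error enough that dropout becomes unnecessary. Early stopping should be used almost universally.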

11.3 Determining Whether to Gather More Data

Go through a checklist

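Training set performance is poor $\to$ Improve the model (more capacity, better-tuned optimization); more data will not help yet
A high-capacity model with carefully tuned optimization still fails to fit the training set $\to$ Suspect data quality (noisy labels, uninformative inputs) and collect cleaner data or richer features
Once training set performance is acceptable, measure test set performance. If it is much worse $\to$ Gather more data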

Note that, if gathering more data is expensive, it may be useful to add regularization, adjust hyperparameters, etc. before resorting to gathering data.
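
The checklist can be sketched as a small decision function. This is only an illustration under the assumption that a single scalar error and a single target are being compared; the function name, the flag, and the returned strings are hypothetical.

```python
IMPROVE_MODEL = "improve model"
BETTER_DATA = "collect better-quality data"
MORE_DATA = "gather more data (or regularize / tune first if data is expensive)"
DONE = "target met"


def next_step(train_error, test_error, target_error, large_tuned_model=False):
    """Walk the checklist top to bottom and return the suggested action."""
    if train_error > target_error:
        # Cannot even fit the training set: more data will not help yet.
        if large_tuned_model:
            # Capacity and optimization are already pushed, so blame the data.
            return BETTER_DATA
        return IMPROVE_MODEL
    if test_error > target_error:
        # Training set is fine but the model does not generalize.
        return MORE_DATA
    return DONE
```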

11.4 Selecting Hyperparameters

Note that library defaults are generally a good starting point; tune further only when they fall short of the target error.

Manual Hyperparameter Tuning

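The goal of manual hyperparameter tuning is to adjust effective capacity to match problem complexity. Effective capacity is constrained by three factors (an up arrow means that increasing the factor raises effective capacity, a down arrow that it lowers it):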

Representational capacity ( $\uparrow$ )
Capability of learning algorithm to minimize cost function ( $\uparrow$ )
Degree of regularization ( $\downarrow$ )

The optimal hyperparameters usually lie in a middle ground between underfitting and overfitting: generalization error is typically U-shaped as a function of a single hyperparameter.

Tip: The learning rate is typically the most important hyperparameter!

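If training set error is higher than the target error rate, increase effective capacity. Otherwise, if test error is too high, close the gap between training and test error by decreasing effective capacity (e.g. stronger regularization), taking care that training error does not rise faster than the gap shrinks.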

Automatic Hyperparameter Optimization Algorithms

Essentially, wrap the learning algorithm in an outer procedure that chooses its hyperparameters. That outer procedure has hyperparameters of its own (e.g. the range to explore for each value), but these are usually easier to choose.

Grid Search

When there are $\leq3$ hyperparameters, grid search is practical: pick a small finite set of candidate values for each hyperparameter (often on a logarithmic scale), train one model for every combination, and keep the one with the lowest validation error. With $m$ hyperparameters and $n$ values each, this costs $O(n^m)$ training runs, so it quickly becomes intractable as $m$ grows.
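
A minimal sketch of grid search using only the standard library. The `toy_validation_error` function is a hypothetical stand-in for "train a model and return its validation error", and the hyperparameter names and candidate values are assumptions made for illustration.

```python
import itertools
import math


def grid_search(grid, evaluate):
    """Try every combination in `grid`; return (best config, lowest error)."""
    names = list(grid)
    best_config, best_score = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_validation_error(config):
    # Hypothetical objective: minimized at learning_rate=1e-2, dropout=0.5.
    log_lr = math.log10(config["learning_rate"])
    return (log_lr + 2.0) ** 2 + (config["dropout"] - 0.5) ** 2


grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],  # logarithmic scale
    "dropout": [0.0, 0.25, 0.5],
}
n_runs = math.prod(len(values) for values in grid.values())  # 4 * 3 = 12
best_config, best_score = grid_search(grid, toy_validation_error)
```

Adding a third hyperparameter with four candidate values would already quadruple `n_runs`, which is the exponential growth noted above.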

Random Search

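Instead of brute-forcing all hyperparameter values, we define a marginal distribution for each hyperparameter (e.g. log-uniform for the learning rate). Over several iterations, we sample a configuration from these distributions, evaluate it on the validation set, and keep the best one. Unlike grid search, no trials are wasted repeating the same value of a hyperparameter that turns out to matter little.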
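
A minimal sketch of random search, again with a hypothetical toy objective in place of real training. The sampling ranges, the hyperparameter names, and the number of trials are assumptions for illustration only.

```python
import math
import random


def sample_config(rng):
    """Draw one configuration from per-hyperparameter marginal distributions."""
    return {
        # Log-uniform: the exponent is uniform, so each decade is equally likely.
        "learning_rate": 10 ** rng.uniform(-5.0, -1.0),
        "dropout": rng.uniform(0.0, 0.7),
    }


def random_search(evaluate, n_trials, seed=0):
    """Sample `n_trials` configurations; return (best config, lowest error)."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_validation_error(config):
    # Hypothetical objective: minimized at learning_rate=1e-2, dropout=0.5.
    log_lr = math.log10(config["learning_rate"])
    return (log_lr + 2.0) ** 2 + (config["dropout"] - 0.5) ** 2


best_config, best_score = random_search(toy_validation_error, n_trials=200)
```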

Model-Based Hyperparameter Optimization

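Treat hyperparameter search itself as an optimization problem: fit a model (e.g. Bayesian regression) that predicts validation error from the hyperparameters, then choose the next configuration to try by optimizing within that model, balancing exploration of uncertain regions against exploitation of regions already known to perform well.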

11.5 Debugging Strategies

To debug model performance

Directly observe model on random examples
Directly observe model on worst mistakes
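
One way to surface the worst mistakes, sketched for a binary classifier: rank examples by the probability the model assigned to the correct class and look at the lowest ones. The function name and the input layout (one predicted probability of class 1 per example) are assumptions for illustration.

```python
def worst_mistakes(examples, probs, labels, k=5):
    """Return the k examples where the true class got the lowest probability.

    probs[i] is the predicted probability of class 1; labels are 0 or 1.
    """
    scored = []
    for example, p, y in zip(examples, probs, labels):
        p_true = p if y == 1 else 1.0 - p  # probability given to the right answer
        scored.append((p_true, example, p, y))
    scored.sort(key=lambda item: item[0])  # most confidently wrong first
    return [(example, p, y) for _, example, p, y in scored[:k]]


worst = worst_mistakes(["a", "b", "c"], [0.9, 0.2, 0.6], [0, 1, 1], k=2)
```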

To debug software implementation bugs

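If training error is high, try to fit a tiny dataset, even a single example (a correct implementation should be able to overfit it, so failure points to a bug)
Compare back-propagated derivatives to numerical (finite-difference) derivatives
Monitor histograms of activations and gradients during training (e.g. to spot saturated units or vanishing/exploding gradients)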
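
The derivative comparison can be sketched with centered finite differences, $\partial f / \partial x_i \approx \big(f(x + \epsilon e_i) - f(x - \epsilon e_i)\big) / 2\epsilon$. In the sketch below, `loss` and `loss_grad` are hypothetical stand-ins for a model's loss and its hand-written (or back-propagated) gradient, and the step size `eps` is an assumed value.

```python
import math


def numerical_gradient(f, x, eps=1e-5):
    """Centered finite-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * eps))
    return grad


def max_relative_error(analytic, numeric, tiny=1e-12):
    """Largest per-coordinate relative difference between two gradients."""
    return max(
        abs(a - n) / max(abs(a), abs(n), tiny)
        for a, n in zip(analytic, numeric)
    )


def loss(w):
    # Hypothetical loss with a known closed-form gradient.
    return w[0] ** 2 + 3.0 * w[0] * w[1] + math.sin(w[1])


def loss_grad(w):
    return [2.0 * w[0] + 3.0 * w[1], 3.0 * w[0] + math.cos(w[1])]


def buggy_grad(w):
    # Deliberate bug: the cross term is missing from the first component.
    return [2.0 * w[0], 3.0 * w[0] + math.cos(w[1])]


w = [0.7, -1.3]
numeric = numerical_gradient(loss, w)
error_ok = max_relative_error(loss_grad(w), numeric)
error_bug = max_relative_error(buggy_grad(w), numeric)
```

A correct gradient should give a very small `error_ok`, while the deliberately broken one gives an `error_bug` of order one. The check is best run in double precision and on a handful of parameters, since it needs two loss evaluations per coordinate.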