Chapter 5: Machine Learning Basics

5.1 Learning Algorithms

"A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$ , if its performance at tasks in $T$ , as measured by $P$ , improves with experience $E$ ."

The Task, $T$

ML tasks are typically described in terms of an example, a sample collection of quantitative features of an actual object/event. It is typically represented as a vector $\boldsymbol{x}\in \mathbb{R}^{n}$ of features. Some common tasks include

Classification: produce $f:\mathbb{R}^{n}\to \{ 1,\dots,k \}$ .
Classification w/ missing inputs.
Regression: produce $f:\mathbb{R}^{n}\to \mathbb{R}$ .
Transcription: images $\to$ words.
Machine translation: languages.
Structured output: output is a vector with relationships between elements, e.g. sentence
Anomaly detection: unusual behavior/data
Synthesis and sampling: create novel output, variation desired
Data imputation: fill in missing values
Denoising: clean corrupt example, i.e. predict $p(\boldsymbol{x}\mid \boldsymbol{\tilde{x}})$ .
Density estimation or probability mass function estimation: learn $p_{\text{model}}:\mathbb{R}^{n}\to \mathbb{R}$ .

The Performance Measure, $P$

For classification and similar tasks, accuracy is measured. The error rate is what's actually used, and is called the expected $0$ - $1$ loss, where $0$ refers to a correct classification and $1$ refers to an incorrect classification.

For density estimation and similar tasks, metrics like the average log-probability are used.

Additionally, the model is typically trained on a subset of the dataset known as the training data, while testing is performed on the rest of the dataset, known as testing data. This allows evaluating if the model generalizes to real world data.

The Experience, $E$

ML algorithms are broadly categorized as unsupervised (learn properties/structure of unlabeled dataset, i.e. learn $p(\mathbf{x})$ ) or supervised (associate labels with data patterns in labeled dataset, i.e. learn $p(\mathbf{y}\mid \mathbf{x})$ ).

Notably, the chain rule of probability shows that an unsupervised learning problem can be solved by splitting it into $n$ supervised learning problems. Similarly, a supervised learning problem can be solved by solving several related unsupervised learning problems. Thus, they are not entirely separate concepts; however, they do roughly categorize the algorithms we consider.

Other learning algorithms include:

Semi-supervised learning: dataset with labeled and unlabeled data
Multi-instance learning
Reinforcement learning: feedback loop with environment

A dataset can frequently be described with a design matrix, which just consists of a different example in each row. This does, though, require each data point to have the same features. Alternatively, for heterogeneous data points, they may be described as a set containing $m$ vectors of possibly different sizes. Supervised learning is additionally provided a vector of labels for each data point.

Example: Linear Regression

Linear regression attempts to build a linear (or, really, affine) function $f:\mathbb{R}^{n}\to \mathbb{R}$ . Let $\hat{y}$ be the model's prediction. In the linear-only model,

\hat{y}=\boldsymbol{w}^{\top}\boldsymbol{x}

where $\boldsymbol{w}\in \mathbb{R}^{n}$ is a vector of parameters which essentially act as weights for the features.

For the performance matrix, one possible measure is the mean squared error of the model on the test set $\boldsymbol{\hat{y}}^{(test)}$ .

\text{MSE}_{\text{test}}=\frac{1}{m}\sum_{i}(\hat{\boldsymbol{y}}^{(test)}-\boldsymbol{y}^{(test)})^{2}_{i}

\text{MSE}_{\text{test}}=\frac{1}{m}\lVert \hat{\boldsymbol{y}}^{(test)}-\boldsymbol{y}^{(test)} \rVert ^{2}_{2}

Training the model involves minimizing $\text{MSE}_{\text{train}}$ , which requires solving for when the gradient is $0$ .

\nabla_{\boldsymbol{w}}\text{MSE}_{\text{train}}=0

Which eventually leads to

\boldsymbol{w}=(\boldsymbol{X}^{(train)\top}\boldsymbol{X}^{(train)})^{-1}\boldsymbol{X}^{(train)\top}\boldsymbol{y}^{(train)}

Where $\boldsymbol{X}^{(train)}$ is the design matrix of the training data.

Note that the system of equations that leads to this solution is known as the normal equations.

5.2 Capacity, Overfitting and Underfitting

The central challenge of ML is the generalization of a model to unseen input.

This is the motivation behind separating the dataset into a training set and test set, and running the algorithm to minimize training error while evaluating on the test set to observe the test or generalization error, which we also want to minimize (though not directly).

So, how exactly can we relate the two to ensure that minimizing the training error also minimizes the test error? We can turn to some ideas from statistical learning theory.

First, we typically make some assumptions about the data generation—in particular, that the individual data points are sampled i.i.d, independently and identically distributed, from the same probability distribution, known as the data generating distribution $p_{\text{data}}$ .

This immediately ensures that $\mathbb{E}[\text{error}_{\text{train}}]=\mathbb{E}[\text{error}_{\text{test}}]$ for any randomly selected model (not trained, just any arbitrary model). This is simply because the data sampling process was the same for both datasets.

For a trained model, however, this is different since we explicitly minimize the training error. Thus, for a trained model, $\mathbb{E}[\text{error}_{\text{train}}]\leq \mathbb{E}[\text{error}_{\text{test}}]$ . A model's effectiveness is determined by its ability to

Minimize the training error
Minimize the gap between the training and test error

These two factors correspond precisely to two important challenges in ML: underfitting and overfitting. Underfitting describes when the model is not able to sufficiently minimize training error. Overfitting describes when the gap between the training and test error is excessive.

We can control this, though, by changing a model's capacity—a measure of how many functions a model can fit. Low capacity models tend to underfit as it has a limited selection of functions, while high capacity models tend to overfit as it may select a function too specific to the training dataset.

The most intuitive way to control a model's capacity is to control its hypothesis space, the functions that the learning algorithm is allowed to select as a solution. For instance, the linear regression algorithm allows us to choose any affine function. However, we can also produce quadratic, cubic, quartic, or even higher degree functions to expand the model's hypothesis space, and consequently its capacity as well.

Selecting a model's hypothesis space is, in general, not an easy task and often specific to the problem itself and its complexity. For example, here's the fitting of a quadratic dataset by a linear, quadratic, and degree $9$ polynomial, respectively.

Notably, there are some additional terms for capacity. Representational capacity refers to the family of functions a learning algorithm can choose from when varying parameters. However, finding the best function in this family is generally a very difficult optimization problem; thus, the algorithm's effective capacity may be less than the model's representational capacity.

There are also ways to quantify capacity, such as the Vapnik-Chervonenkis dimension (VC dimension), which measures the capacity of a binary classifier (i.e. high VC dimension means high capacity). It is the largest possible value of $m$ that a model can "shatter," i.e. for which there exists a training set of $m$ different $\boldsymbol{x}$ points that the classifier can correctly label. Such quantifiers allow upper-bounding a mode's generalization error; however, they are typically too loose to use in practice for deep learning.

info

Note the following graph of underfitting/overfitting vs. capacity.

So, how high can capacity be? We can reach arbitrarily high levels via non-parametric models. So far, we have only observed parametric models, which learn a function described by a finite, fixed-size parameter vector. In contrast, non-parametric functions can change the size of their parameter vector while observing data, i.e. update the size while training.

One simple non-parametric algorithm is nearest neighbor regression. This algorithm simply stores the entire training set, and when asked to classify an example, simply returns the value of the nearest "neighbor" in the training set. Mathematically, $\hat{y}=y_{i}$ where $i=\arg\min\lVert \boldsymbol{X}_{i}-\boldsymbol{x} \rVert^{2}_{2}$ (or some other norm, even a learned distance metric!).

One could also simply wrap a parametric learning algorithm inside another that changes the number of parameters as needed.

So, how good can these models become? Well, the ideal model is an oracle that knows the true data generating probability distribution itself. However, due to noise in the distribution, it will still incur some error—this error is known as Bayes error. No model can do better than Bayes error, clearly. Non-parametric models asymptote to Bayes error provided more and more data, while fixed parametric models with suboptimal capacity asymptote to an error value higher than Bayes error. (This applies only to test error; training error can become arbitrarily low).

The No Free Lunch Theorem

The No Free Lunch Theorem states that, averaged over all possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points. That is, no ML algorithm should perform universally better than every other algorithm on every arbitrary dataset. Of course, for real world applications, we only really care about the probability distributions we encounter in the real-world, which we can design algorithms for since we're not seeking some universal learning algorithm.

Regularization

Regularization is essentially the process of incentivizing an algorithm to prefer one type of solution over another.

One type of regularization is weight decay, which motivates the algorithm to choose a set of weights $\boldsymbol{w}$ that has a smaller $L^{2}$ norm. In particular, we describe the objective to minimize as $J(\boldsymbol{w})$ .

J(\boldsymbol{w})=\text{MSE}_{\text{train}}+\lambda \boldsymbol{w}^{\top}\boldsymbol{w}

where $\lambda$ is chosen before training; higher values of $\lambda$ encode a stronger preference for small weights. Like capacity tuning, this requires careful selection. Large $\lambda$ could underfit, and small $\lambda$ could overfit.

Choice of

\lambda

The choice of $\lambda$ is actually derived from the generalized Lagrangian we discussed in section 4.4. After placing a constraint of $\boldsymbol{w}^{\top}\boldsymbol{w}\leq T$ for some threshold $T$ , $\lambda$ is derived and then included in the cost function being minimized, as shown in the generalized Lagrangian. Generally speaking, $\lambda \approx \frac{1}{T}$ . (Source).

More generally, we can apply a penalty $\Omega$ known as a regularizer to the cost function. In this case, $\Omega(\boldsymbol{w})=\boldsymbol{w}^{\top}\boldsymbol{w}$ . This is more general than simply including/excluding functions from a hypothesis space, as an exclusion of a function represents an infinitely strong preference against it. Regularizers apply a sliding scale of preference.

Formally, regularization is any modification to a learning algorithm intended to reduce its generalization/test error but not its training error.

info

We discuss regularization in much greater detail in Chapter 7.

5.3 Hyperparameters and Validation Sets

ML algorithms typically have several tunable settings—think capacity, learning rate, etc. These are known as hyperparameters.

A setting is typically chosen because $(1)$ the hyperparameter is difficult for the learning algorithm to optimize itself or, more usually, $(2)$ it doesn't make sense to learn the hyperparameter on the training set. For instance, if capacity was learned, the model would always choose the maximum capacity to minimize training error as best as possible, resulting in overfitting.

Therefore, it's often necessary to use a validation set of data points. The test set can't be used, so the training data is typically split into a training set, for learning model parameters, and a validation set, for tweaking hyperparameters by providing a rough estimate of generalization error. An 80/20 training/validation split is common.

warning

Repeated use of the same test set actually results in optimistic evaluations of the generalization error, as models are optimized for the test set error. Thus, the introduction of new, different benchmarks is important for keeping the generalization error measurement true to the real world!

Cross-Validation

Sometimes, a dataset is too small to easily split into a fixed training and test set, as this would lead to statistical uncertainty about the estimated test error. One solution to this is $k$ -fold cross-validation. In this procedure, a dataset is split into $k$ disjoint subsets. Then, for $k$ trials, the $i$ th subset of the data is used as the test set and the rest of the data is used for training. This, of course, increases computational cost by a factor of $k$ ; however, it results in decent approximations of test error.

5.4 Estimators, Bias, and Variance

Point Estimation

Point estimation attempts to provide the single best prediction of some quantity of interest. In essence, when making a prediction, we simply choose the best option or option associated with the highest probability, as opposed to choosing by randomly sampling from the probabiltiy distribution of the options.

Notation-wise, a point estimate of a parameter $\boldsymbol{\theta}$ is denoted by $\hat{\boldsymbol{\theta}}$ .

Formally, let $\{ \boldsymbol{x}^{(1)},\dots,\boldsymbol{x}^{(m)} \}$ be a set of $m$ i.i.d. data points. A point estimator or statistic is any function of the data

\hat{\boldsymbol{\theta}}_{m}=g(\boldsymbol{x}^{(1)},\dots ,\boldsymbol{x}^{(m)})

Of course, a good estimator minimizes the distance from $\boldsymbol{\theta}$ .

We will now approach point estimation from the frequentist perspective. Frequentist statistics, in short, treats probabilities as equivalent to frequencies over a large dataset. The opposing perspective is Bayesian, which we will discuss in further detail soon. For now, for our purposes, the frequentist perspective means we treat the true $\boldsymbol{\theta}$ , as a fixed, but unknown value. Meanwhile, the point estimate $\hat{\boldsymbol{\theta}}$ is just treated as a function of the data. The most critical distinction, however, is that we treat the data as a random process/sampled from a probability distribution. (And $\boldsymbol{\theta}$ as the parameter underlying this distribution). Therefore, $\hat{\boldsymbol{\theta}}$ is considered a random variable.

Function Estimation

Frequently, we are interested in approximating some function $f(\boldsymbol{x})$ , i.e. predict $\boldsymbol{y}$ given $\boldsymbol{x}$ . We may assume that $\boldsymbol{y}=f(\boldsymbol{x})+\boldsymbol{\epsilon}$ , where $\boldsymbol{\epsilon}$ represents the part of the $\boldsymbol{y}$ that is not predictable from $\boldsymbol{x}$ (e.g. noise). In function estimation, we attempt to approximate $f$ with a model or estimate $\hat{f}$ . In reality, $\hat{f}$ is really just a point estimator in the function space! For example, one may note that linear regression is a constrained version of function estimation.

Bias

An estimator's bias is described by

\text{bias}(\hat{\boldsymbol{\theta}}_{m})=\mathbb{E}(\hat{\boldsymbol{\theta}}_{m})-\boldsymbol{\theta}

Recall, of course, that $\hat{\boldsymbol{\theta}}_{m}$ is treated as a random variable, hence we may take its expectation.

An estimator is said to be unbiased if $\text{bias}(\hat{\boldsymbol{\theta}}_{m})=\mathbf{0}$ , i.e. $\mathbb{E}(\hat{\boldsymbol{\theta}}_{m})=\boldsymbol{\theta}$ . Or, an estimator $\hat{\boldsymbol{\theta}}_{m}$ is said to be asymptotically unbiased if $\lim_{ m \to \infty }\text{bias}(\hat{\boldsymbol{\theta}}_{m})=\mathbf{0}$ , i.e. $\lim_{ m \to \infty }\mathbb{E}(\hat{\boldsymbol{\theta}}_{m})=\boldsymbol{\theta}$ .

Example: Bernoulli Distribution

Consider a set of examples $\{ x^{(1)},\dots,x^{(m)} \}$ sampled i.i.d. from a Bernoulli distribution with mean $\theta$ . (Recall that $x^{(i)}\in \{ 0,1 \},\ \forall i$ ).

P(x^{(i)}; \theta)=\theta^{x^{(i)}}(1-\theta)^{(1-x^{(i)})}

A simple estimator for this distribution is the mean of the training examples.

\hat{\theta}_{m}=\frac{1}{m}\sum_{i=1}^{m} x^{(i)}

Whose expectation we can calculate as

\begin{align*} \mathbb{E}[\hat{\theta}_{m}] &= \mathbb{E}\left[ \frac{1}{m}\sum_{i=1}^{m} x^{(i)} \right] \\ &= \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}[x^{(i)}] \\ &= \frac{1}{m}\sum_{i=1}^{m} \sum_{x^{(i)}\in \{ 0,1 \}} P(x^{(i)};\theta) \\ &= \frac{1}{m}\sum_{i=1}^{m} \theta \\ &= \theta \end{align*}

Thus, the estimator $\hat{\theta}_{m}$ is unbiased.

Example: Gaussian Distribution Estimators

Consider a set of examples $\{ x^{(1)},\dots,x^{(m)} \}$ sampled i.i.d. from a Gaussian distribution $p(x^{(i)})=\mathcal{N}(x^{(i)};\mu,\sigma^{2})$ , where $i\in \{ 1,\dots,m \}$ . Recall

p(x^{(i)};\mu,\sigma^{2}) = \frac{1}{\sqrt{ 2\pi\sigma^{2} }}\exp\left( -\frac{1}{2} \frac{(x^{(i)}-\mu)^{2}}{\sigma^{2}} \right)

A common estimator of $\mu$ is the sample mean

\hat{\mu}_{m}=\frac{1}{m}\sum_{i=1}^{m} x^{(i)}

Again, we can calculate its expectation

\begin{align*} \mathbb{E}[\hat{\mu}_{m}] &= \mathbb{E}\left[ \frac{1}{m}\sum_{i=1}^{m} x^{(i)} \right] \\ &= \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}[x^{(i)}] \\ &= \frac{1}{m}\sum_{i=1}^{m} \mu \\ &= \mu \end{align*}

Thus, the sample mean is an unbiased estimator of the Gaussian mean parameter.

A common estimator of $\sigma^{2}$ is the sample variance

\hat{\sigma}^{2}_{m}=\frac{1}{m}\sum_{i=1}^{m} (x^{(i)}-\hat{\mu}_{m})^{2}

Again, compute expectation.

\begin{align*} \mathbb{E}[\hat{\sigma}^{2}_{m}]&=\mathbb{E}\left[ \frac{1}{m}\sum_{i=1}^{m} (x^{(i)}-\hat{\mu}_{m})^{2} \right] && (1) \\ &= \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}[(x^{(i)}-\hat{\mu}_{m})^{2}] && (2) \\ &= \frac{1}{m}\sum_{i=1}^{m} \mathrm{Var}(x^{(i)}-\hat{\mu}_{m})+(\mathbb{E}[x^{(i)}-\hat{\mu}_{m}])^{2} && (3) \\ \mathrm{Var}(x^{(i)}-\hat{\mu}_{m}) &= \mathrm{Var}(x^{(i)}) + \mathrm{Var}(\hat{\mu}_{m}) -2\mathrm{Cov}(x^{(i)}, \hat{\mu}_{m}) && (4) \\ \mathrm{Var}(x^{(i)}) &= \sigma^{2} && (5) \\ \mathrm{Var}(\hat{\mu}_{m}) &= \mathrm{Var}\left( \frac{1}{m}\sum_{i=1}^{m} x^{(i)} \right) && (6) \\ &= \frac{1}{m^{2}}\mathrm{Var}\left( \sum_{i=1}^{m} x^{(i)} \right) && (7) \\ &= \frac{1}{m^{2}}\left[ \sum_{i=1}^{m} \mathrm{Var}(x^{(i)}) + 2\sum_{i<j}\mathrm{Cov}(x^{(i)}, x^{(j)}) \right] && (8) \\ \mathrm{Cov}(x^{(i)}, x^{(j)}) &= 0 && (9) \\ \mathrm{Var}(\hat{\mu}_{m}) &= \frac{1}{m^{2}}\sum_{i=1}^{m} \sigma^{2}=\frac{\sigma^{2}}{m} && (10) \\ \mathrm{Cov}(x^{(i)}, \hat{\mu}_{m}) &= \mathrm{Cov}\left( x^{(i)}, \frac{1}{m}\sum_{j=1}^{m} x^{(j)} \right) && (11) \\ &= \frac{1}{m} \sum_{j=1}^{m} \mathrm{Cov}(x^{(i)}, x^{(j)}) && (12) \\ &= \frac{1}{m}\mathrm{Cov}(x^{(i)}, x^{(i)}) && (13) \\ &= \frac{1}{m}\mathrm{Var}(x^{(i)}) && (14) \\ &= \frac{\sigma^{2}}{m} && (15) \\ \mathbb{E}[x^{(i)}-\hat{\mu}_{m}] &= \mathbb{E}[x^{(i)}]-\mathbb{E}[\hat{\mu}_{m}] && (16) \\ &= \mu-\mu=0 && (17) \\ \mathbb{E}[\hat{\sigma}^{2}_{m}] &= \frac{1}{m}\sum_{i=1}^{m} \left( \sigma^{2}+\frac{\sigma^{2}}{m}-\frac{2\sigma^{2}}{m} \right)=\boxed{ \frac{m-1}{m}\sigma^{2} } && (18) \end{align*}

Sorry for how... intimidating the above proof looks like. There's no need to concern yourself too much with the details of the derivation, but here's an explanation of all the nontrivial steps for completeness and your curiosity.

$\mathrm{Var}(X)=\mathbb{E}[X^{2}]-(\mathbb{E}[X])^{2}$ .
$\mathrm{Var}(X-Y)=\mathrm{Var}(X)+\mathrm{Var}(Y)-2\mathrm{Cov}(X, Y)$ .
$x^{(i)}$ is sampled from the Gaussian distribution, and thus has variance $\sigma^{2}$ .
Substituting $\hat{\mu}_{m}=\frac{1}{m}\sum_{i=1}^{m}x^{(i)}$ .
$\mathrm{Var}(\alpha X)=\alpha^{2}\mathrm{Var}(X)$ .
$\mathrm{Var}(X+Y)=\mathrm{Var}(X)+\mathrm{Var}(Y)+2\mathrm{Cov}(X, Y)$ , but applied over a large sum.
$x^{(i)}$ and $x^{(j)}$ are i.i.d, which includes independence $\implies$ covariance is $0$ .
Result of $(5)$ and $(9)$ substituted into $(8)$ .
Substituting $\hat{\mu}_{m}=\frac{1}{m}\sum_{i=1}^{m}x^{(i)}$ .
Linearity of Covariance.
Same reason as $(9)$ .
$\mathrm{Cov}(x^{(i)}, x^{(i)})=\mathrm{Var}(x^{(i)})$ .
Same reason as $(5)$ .
Linearity of Expectation.
Same reason as $(5)$ for $\mathbb{E}[x^{(i)}]$ , and $\mathbb{E}[\hat{\mu}_{m}]$ was just computed before.
Final step, combining everything together and simplifying.

Also, note that there is a nicer derivation that relies on a clever algebraic trick!

In conclusion, though, the sample variance is a biased estimator, with a bias of

\mathbb{E}[\hat{\sigma}_{m}^{2}]-\sigma^{2}=-\frac{\sigma^{2}}{m}

And consequently, we must apply what is known as Bessel's correction to obtain an unbiased sample variance estimator.

\tilde{\sigma}_{m}^{2}=\frac{m}{m-1}\hat{\sigma}_{m}^{2}=\frac{1}{m-1}\sum_{i=1}^{m} (x^{(i)}-\hat{\mu}_{m})^{2}

\begin{align*} \mathbb{E}[\tilde{\sigma}_{m}^{2}] &= \frac{m}{m-1}\mathbb{E}[\hat{\sigma}_{m}^{2}] \\ &= \frac{m}{m-1}\cdot \frac{m-1}{m}\sigma^{2} \\ &= \sigma^{2} \end{align*}

Intuition

You may be wondering—why is this correction needed? The book doesn't discuss this, but the idea is that the sample variance has only $n-1$ degrees of freedom.

The sample variance is calculated using the sample mean, which is the center of the sample data. Before calculating the sample mean, you have $n$ independent pieces of information, your $n$ data points. However, after calculating the center, one might note that the data points themselves are not all independent with respect to the center! After all, the center is calculated using the $n$ data points. In fact, you really have $n-1$ independent pieces of information points now—if you disregard one piece of information, then the center appears independent to the rest. In other words, $n-1$ degrees of freedom.

Alternatively, consider the fact that the sample mean is the variance-minimizing value. If the variance of the sample is calculated with respect to the population mean, it will certainly be greater than or equal to the variance with respect to the sample mean. Therefore, dividing by $n-1$ instead of $n$ applies a small correcting factor. And the choice of $n-1$ is motivated by the degrees of freedom.

Variance and Standard Error

It's also often useful to compute the variance or standard error (standard deviation) of an estimator. An ideal estimator should exhibit low variance.

warning

Note that we are computing the variance of the estimator. This is not the same as computing an estimator for the variance of the Gaussian, as we did above.

The standard error of the sample mean estimator is

\text{SE}(\hat{\mu}_{m})=\sqrt{ \mathrm{Var}\left[ \frac{1}{m}\sum_{i=1}^{m} x^{(i)} \right] }=\frac{\sigma}{\sqrt{ m }}

Unfortunately, neither $\sqrt{ \hat{\sigma}_{m}^{2} }$ or $\sqrt{ \tilde{\sigma}_{m}^{2} }$ can substitute for $\sigma$ and provide an unbiased estimator of the standard error. However, they are both used in practice, as the square root of the unbiased estimator of variance is a small underestimate, especially for large $m$ .

Also, consider, for instance, the variance of the estimator $\hat{\theta}_{m}=\frac{1}{m}\sum_{i=1}^{m}x^{(i)}$ of a Bernoulli distribution. (Derivation left out).

\mathrm{Var}(\hat{\theta}_{m})=\frac{1}{m}\theta(1-\theta)

The variance of this estimator decreases as a function of $m$ , the number of data samples; this is common in popular estimators.

Balancing Bias and Variance to Minimize MSE

The bias and variance of an estimator measure two different sources of error. Bias measures the expected deviation from the true value. Variance measures the expected deviation from the expected estimator value. It's desirable to minimize both, of course, but tradeoffs exist.

Cross-validation is often used to balance the two. Another method is choosing to minimize the mean squared error, as it actually is a combination of the two metrics.

\begin{align*} \text{MSE}&=\mathbb{E}[(\hat{\theta}_{m}-\theta)^{2}] \\ &= \text{Bias}(\hat{\theta}_{m})^{2}+\mathrm{Var}(\hat{\theta}_{m}) \end{align*}

The below diagram shows, for instance, how balancing bias and variance tends to result in minimal generalization error. It also shows the relationship between underfitting/overfitting with bias/variance.

Consistency

Finally, we discuss a desirable property of estimators: consistency, a measure of estimator behavior as the amount of training data, $m$ , grows. An estimator is consistent if

\mathrm{plim}_{m\to \infty}\ \hat{\theta}_{m}=\theta

where $\text{plim}$ means that, for any $\epsilon>0$ , $P(\lvert \hat{\theta}_{m}-\theta \rvert>\epsilon)\to0$ as $m\to0$ . This is actually sometimes referred to as weak consistency, with strong consistency meaning almost sure convergence of $\hat{\theta}$ to $\theta$ , i.e. when $p(\lim_{ m \to \infty }\mathbf{x}^{(m)}=x)=1$ for a sequence of random variables $\mathbf{x}^{(1)},\dots$ converging to $x$ .

Consistency implies that bias diminishes as $m$ , the dataset size, grows (i.e. asymptotic unbiasedness). Though, one should note that the converse is not true. (e.g. Consider an estimator that bases its estimate off of only the first sample, regardless of $m$ ).

5.5 Maximum Likelihood Estimation

The maximum likelihood principle helps guide the derivation of good estimators for different models.

Consider a set of $m$ examples $\mathbb{X}=\{ x^{(1)},..,x^{(m)} \}$ independently sampled from the true, but unknown data generating distribution $p_{\text{data}}(\mathbf{x})$ . Let $p_{\text{model}}(\mathbf{x};\boldsymbol{\theta})$ be a parametric family of probability distributions over the same space, where $\boldsymbol{\theta}$ are the parameters (that effectively choose which distribution is used). That is, for some sample $\boldsymbol{x}$ , $p_{\text{model}}(\boldsymbol{x};\boldsymbol{\theta})$ serves as an estimator for this sample's true probability of being sampled from the distribution, $p_{\text{data}}(\boldsymbol{x})$ .

Then, we define the maximum likelihood estimator (MLE) for $\theta$ as

\begin{align*} \boldsymbol{\theta}_{\text{ML}} &= \underset{ \boldsymbol{\theta} }{ \arg\max }\ p_{\text{model}}(\mathbb{X};\boldsymbol{\theta}) \\ &= \underset{ \boldsymbol{\theta} }{ \arg\max }\ \prod_{i=1}^{m} p_{\text{model}}(\boldsymbol{x}^{(i)};\boldsymbol{\theta}) \\ \end{align*}

What does this mean? Well, for each sample $\boldsymbol{x}$ from the set of examples $\mathbb{X}$ , it computes its estimated probability from the model, and then takes the aggregate product. $\boldsymbol{\theta}_{\text{ML}}$ is just the maximizing parameter, i.e. the parameter that maximizes this product.

How does this work?

We formalize the actual reasoning behind the MLE below. But, for intuition, think about a simple example. Let $\mathbb{X}=\{ 0,1,1,1 \}$ . From the sample, it seems that the true distribution is biased towards $1$ . The MLE formula accounts for this, as $\boldsymbol{\theta}_{\text{ML}}=\underset{ \boldsymbol{\theta} }{ \arg\max }\ [(p_{\text{model}}(0;\boldsymbol{\theta}))(p_{\text{model}}(1;\boldsymbol{\theta}))^{3}]$ . Therefore, the MLE will naturally assign a higher estimated probability to $1$ , as is implied by the sample dataset.

Due to the imprecision of floating point, and thus the possibility for numerical underflow, MLE is typically computed using logarithms instead.

\boldsymbol{\theta}_{\text{ML}}=\underset{ \boldsymbol{\theta} }{ \arg\max }\sum_{i=1}^{m} \log p_{\text{model}}(\boldsymbol{x},^{(i)};\boldsymbol{\theta})

This doesn't change the arg max because $\log$ is a monotonically increasing function. Rescaling the cost function also doesn't change the arg max, so this is equivalent to

\begin{align*} \boldsymbol{\theta}_{ML}&=\underset{ \boldsymbol{\theta} }{ \arg\max }\ \frac{1}{m}\sum_{i=1}^{m} \log p_{\text{model}}(\boldsymbol{x}^{(i)};\boldsymbol{\theta}) \\ &=\underset{ \boldsymbol{\theta} }{ \arg\max }\ \mathbb{E}_{\mathbf{x}\sim \hat{p}_{\text{data}}}\log p_{\text{model}}(\boldsymbol{x}^{(i)};\boldsymbol{\theta}) \\ \end{align*}

Where $\hat{p}_{\text{data}}$ is the empirical distribution defined by the training data.

Now, recall KL divergence from section 3.13.

D_{\text{KL}}(P\parallel Q)=\mathbb{E}_{\mathbf{x}\sim P}[\log P(\boldsymbol{x})-\log Q(\boldsymbol{x})]

We can apply this with $P=\hat{p}_{\text{data}}$ and $Q=p_{\text{model}}$ .

D_{\text{KL}}(\hat{p}_{\text{data}}\parallel p_{\text{model}})=\mathbb{E}_{\mathbf{x}\sim \hat{p}_{\text{data}}}[\log \hat{p}_{\text{data}}(\boldsymbol{x})-\log p_{\text{model}}(\boldsymbol{x})]

The first term is a function only of the data generating process, and is thus not affected by the model or the parameter $\boldsymbol{\theta}$ at all. Critically, though, notice how the second term $-\mathbb{E}_{\mathbf{x}\sim \hat{p}_{\text{data}}}\log p_{\text{model}}(\boldsymbol{x})$ (after applying Linearity of Expectation) is precisely the expression in the arg max of the MLE! Thus, computing $\boldsymbol{\theta}_{\text{ML}}$ , and thus maximizing the expression $\mathbb{E}_{\mathbf{x}\sim p_{\hat{}_{\text{data}}}}\log p_{\text{model}}(\boldsymbol{x})$ , is equivalent to minimizing the KL divergence between the empirical distribution $\hat{p}_{\text{data}}$ (really, the training set) and the model's distribution $p_{\text{model}}$ .

This is also the same as minimizing cross-entropy, which, if you recall from section 3.13, is defined as

H(P,Q)=-\mathbb{E}_{\mathbf{x}\sim P}\log Q(\boldsymbol{x})

In fact, any loss expressed as a negative log-likelihood (NLL) (log-likelihood is $\log p(\boldsymbol{x})$ ) is really just a measure of the cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model. MSE, for instance, is the cross-entropy between the empirical distribution and a Gaussian model.

Reread the above paragraph! Make sure you note its conclusion.

Thus, we observe that the maximum likelihood estimator is simply matching the model distribution to the empirical distribution as much as possible, which itself is an approximation of the true data generating distribution (which we, of course, cannot access—otherwise the problem's already done :P).

In software, we typically minimize a cost function—thus, in practice, we follow the maximum likelihood principle by minimizing the NLL/cross-entropy.

Conditional Log-Likelihood and MSE

MLE can be generalized to the prediction of a conditional probability $P(\mathbf{y}\mid \mathbf{x};\boldsymbol{\theta})$ . This is important because it's actually what's desired in supervised learning.

Let $\boldsymbol{X}$ represent all our inputs and $\boldsymbol{Y}$ represent all of our observed targets/labels. Then, the conditional MLE is

\boldsymbol{\theta}_{\text{ML}}=\underset{ \boldsymbol{\theta} }{ \arg\max }\ P(\boldsymbol{Y}\mid \boldsymbol{X};\boldsymbol{\theta})

If the examples are assumed i.i.d., then this is equivalent to

\boldsymbol{\theta}_{\text{ML}}=\underset{ \boldsymbol{\theta} }{ \arg\max }\ \sum_{i=1}^{m} \log P(\boldsymbol{y}^{(i)}\mid \boldsymbol{x}^{(i)};\boldsymbol{\theta})

Again, we observe the implicit negative log-likelihood.

MLE: Linear Regression

With this, we can motivate the choice of MSE as the cost function for linear regression.

We consider the model as producing a conditional distribution $p(y\mid \boldsymbol{x})$ , and imagine an infinitely large training set, where they may exist several samples with the same input value $\boldsymbol{x}$ but different outputs $y$ . We define $p(y\mid \boldsymbol{x})=\mathcal{N}(y;\hat{y}(\boldsymbol{x};\boldsymbol{w}),\sigma^{2})$ , where $\hat{y}$ predicts the mean of the Gaussian. This is because we expect the different outputs $y$ to be normally distributed around the trend's prediction. Since the examples are assumed to be i.i.d., the conditional log likelihood is given by

\sum_{i=1}^{m} \log p(y^{(i)}\mid \boldsymbol{x}^{(i)};\boldsymbol{\theta})

which simplifies to

-m\log\sigma-\frac{m}{2}\log (2\pi)-\sum_{i=1}^{m} \frac{\lVert \hat{y}^{(i)}-y^{(i)} \rVert ^{2}}{2\sigma^{2}}

Where $\hat{y}^{(i)}$ is the output of the linear regression on the $i$ th input $\boldsymbol{x}^{(i)}$ . Recall that the MSE is

\text{MSE}_{\text{train}}=\frac{1}{m}\sum_{i=1}^{m} \lVert \hat{y}^{(i)}-y^{(i)} \rVert ^{2}

and we observe that minimizing the MSE is equivalent to maximizing the conditional log likelihood or minimizing the negative conditional log-likelihood, which, as explained previously, produces the maximum likelihood estimator!

Properties of Maximum Likelihood

The primary appeal of MLE is that it is proven to be the best estimator asymptotically, as $m\to \infty$ , in terms of rate of convergence. Under certain conditions, MLE is also consistent. In particular, the conditions are

The true distribution $p_{\text{data}}$ must lie within the model family $p_{\text{model}}(\cdot;\boldsymbol{\theta})$ .
The true distribution $p_{\text{data}}$ must correspond to exactly one value of $\boldsymbol{\theta}$ .

There are other consistent estimators. However, they all differ in statistic efficiency—essentially, rate of convergence. It's typically studied in the parametric case, where the MSE of the parameter itself (the estimated vs true parameter) is computed. The Cramer-Rao lower bound shows that no consistent estimator has a lower parameter MSE than the MLE. Thus, maximum likelihood is often the preferred estimator for machine learning.

info

With smaller datasets, regularization strategies may be applied to increase the bias of MLE but reduce variance.

5.6 Bayesian Statistics

The previous discussions have all been based on frequentist statistics, which make a single estimate of a parameter $\boldsymbol{\theta}$ and make predictions thereafter based on the estimate. The Bayesian approach involves, instead, considering all possible values of $\boldsymbol{\theta}$ when making a prediction. In short, instead of considering the dataset as a random variable sampled from some distribution and the true underlying parameter as some fixed value; Bayesian statistics considers the dataset as fixed, and treats the parameter $\boldsymbol{\theta}$ as a random variable, i.e. treats $\boldsymbol{\theta}$ as if it has an underlying probability distribution.

info

Bayesian statistics is the more intuitive approach, i.e. you've probably unconsciously used the Bayesian approach in daily life! Recall that frequentist statistics equates probability with long-run frequencies. Bayesian statistics treats probabilities as beliefs, and updates those beliefs as new data becomes available.

Before observing the data, we represent our knowledge of $\boldsymbol{\theta}$ with the prior probability distribution $p(\boldsymbol{\theta})$ . Generally, the prior is selected to have high entropy to reflect the high degree of uncertainty regarding $\boldsymbol{\theta}$ , such as a Gaussian or uniform distribution.

example

For some real-world distributions, it might be safe to assume $\boldsymbol{\theta}$ lies in some known range, prior to observing any data.

Many priors are designed to reflect a preference for "simpler" solutions, e.g. smaller parameters.

Now, consider the samples $\{ x^{(1)},\dots,x^{(m)} \}$ from the dataset. The observation of this data should affect our beliefs about the distribution underlying $\boldsymbol{\theta}$ ; we can compute this via Bayes' rule.

p(\boldsymbol{\theta}\mid x^{(1),\dots ,x^{(m)}})= \frac{p(x^{(1)},\dots ,x^{(m)}\mid\boldsymbol{\theta})p(\boldsymbol{\theta})}{p(x^{(1)},\dots ,x^{(m)})}

Note that $p(x^{(1)},\dots,x^{(m)}\mid \boldsymbol{\theta})$ represents the probability of this dataset appearing if our belief about $\boldsymbol{\theta}$ parameterized the probability distribution sampled from. Regardless, the observation of the data induces updates in our posterior probability distribution, i.e. the distribution after observing data, to concentrate around the more likely values of the parameter $\boldsymbol{\theta}$ . This is $p(\boldsymbol{\theta}\mid x^{(1)},\dots,x^{(m)})$ .

Then, after observing the $m$ samples, we can perform predictions on what another ( $m+1$ th) sample would be!

p(x^{(m+1)}\mid x^{(1)},\dots ,x^{(m)})=\int p(x^{(m+1)}\mid\boldsymbol{\theta})p(\boldsymbol{\theta}\mid x^{(1)},\dots ,x^{(m)}) \,\mathrm{d}\boldsymbol{\theta}

In essence, $\boldsymbol{\theta}$ is not just one fixed value; we consider each possible value of $\boldsymbol{\theta}$ with positive probability density, and add its contribution of the probability to the calculation, weighted by the posterior density $p(\boldsymbol{\theta}\mid x^{(1)},\dots,x^{(m)})$ . Thus, any uncertainty about $\boldsymbol{\theta}$ is considered in every prediction made.

This is in contrast to the frequentist approach, which addresses uncertainty by attempting to minimize variance (and thus minimize uncertainty).

Additionally, unlike the frequentist approach, the Bayesian approach allows the selection of a prior distribution, enabling the expression of preferences like small parameters or smoother models.

However, while Bayesian methods minimize generalization error much better when training data is limited, they incur high computational costs when training data grows large.

Bayesian Linear Regression

I will leave this out for brevity. Refer to the book for a detailed derivation.

Maximum A Posteriori Estimation

Unfortunately, making predictions using the entire Bayesian posterior distribution is generally intractable for large datasets. Therefore, it's common to derive a single point estimate from the Bayesian approach: maximum a posteriori (MAP) is one such method, which simply chooses the $\boldsymbol{\theta}$ with maximal posterior probability density.

\boldsymbol{\theta}_{\text{MAP}}=\underset{ \boldsymbol{\theta} }{ \arg\max }\ p(\boldsymbol{\theta}\mid \boldsymbol{x}) = \underset{ \boldsymbol{\theta} }{ \arg\max }\log p(\boldsymbol{x}\mid\boldsymbol{\theta})+\log p(\boldsymbol{\theta})

On the RHS, the two terms correspond to the usual log-likelihood term $\log p(\boldsymbol{x}\mid\boldsymbol{\theta})$ and the prior distribution $\log p(\boldsymbol{\theta})$ .

As with full Bayesian inference, MAP Bayesian inference may take advantage of the preferences expressed by the prior. For instance, if the prior is $\mathcal{N}\left( \boldsymbol{w};\mathbf{0}, \frac{1}{\lambda}\boldsymbol{I}^{2} \right)$ , then the log-prior term is proportional to $\lambda \boldsymbol{w}^{\top}\boldsymbol{w}$ weight decay penalty, and acts the same. This information reduces the variance of the MAP point estimate, relative to MLE, but increases bias.

Actually, some regularized estimation strategies (e.g. weight decay) can often be interpreted as a MAP approximation to Bayesian inference, at least when the regularization involves adding an extra term to the objective function that can correspond proportionally to $\log p(\boldsymbol{\theta})$ . This does mean, though, that MAP Bayesian inference provides a more straightforward method of designing complicated regularization terms.

5.7 Supervised Learning Algorithms

Probabilistic Supervised Learning

Most supervised learning algorithms estimate the probability distribution $p(y\mid \boldsymbol{x})$ . Maximum likelihood estimation does this for us, finding the best parameter vector $\boldsymbol{\theta}$ for the parametric family of distributions $p(y\mid \boldsymbol{x};\boldsymbol{\theta})$ .

We've already seen the application of MLE to linear regression, i.e.

p(y\mid \boldsymbol{x};\boldsymbol{\theta})=\mathcal{N}(y;\boldsymbol{\theta}^{\top}\boldsymbol{x},\boldsymbol{I})

and we can generalize it to the classification scenario by defining a different family of probability distributions.

Let there be two classes, $c_{0}$ and $c_{1}$ . As this is a binary classification problem, we only need to define the probability of one class; the other is complementary.

We can transform linear regression into a binary classifier by simply feeding the output of the linear function to the logistic sigmoid function and interpret the output of the logistic sigmoid as a probability (since its range is $(0,1)$ ).

p(y=1\mid \boldsymbol{x};\boldsymbol{\theta})=\sigma(\boldsymbol{\theta}^{\top}\boldsymbol{x})

This is known as logistic regression (but is used for binary classification).

With linear regression, it was possible to directly solve for the optimal weights. With logistic regression, there is no closed-form solution for the optimal parameters; instead, we minimize NLL via gradient descent.

The same strategy—identifying a parametric family of conditional probability distributions with the right input and output—extends to practically any supervised learning problem.

Support Vector Machines

Support vector machines (SVMs) are also based on a linear function $\boldsymbol{w}^{\top}\boldsymbol{x}+b$ . However, in contrast to logistic regression, it outputs only a class, not a probability. An SVM predicts the positive class when $\boldsymbol{w}^{\top}\boldsymbol{x}+b>0$ , and similarly for the negative class.

The kernel trick is a particularly interesting innovation behind SVMs. It relies on the observation that many ML algorithms can be written exclusively in terms of dot products between examples. The SVM, for instance, can be re-written as

\boldsymbol{w}^{\top}\boldsymbol{x}+b=b+\sum_{i=1}^{m} \alpha_{i}\boldsymbol{x}^{\top}\boldsymbol{x}^{(i)}

Where $\boldsymbol{x}^{(i)}$ is training example $i$ and $\boldsymbol{\alpha}$ is a vector of coefficients. Note that the expression $\boldsymbol{\alpha}_{i}\boldsymbol{x}^{\top}\boldsymbol{x}^{(i)}$ is essentially measuring the similarity between this training example $\boldsymbol{x}^{(i)}$ and the input being evaluated $\boldsymbol{x}$ , and $\boldsymbol{\alpha}_{i}$ dictates how much this should be weighted for the training example $\boldsymbol{x}^{(i)}$ .

This small change allows replacing $\boldsymbol{x}$ with the output of a given feature function $\phi(\boldsymbol{x})$ and replacing the dot product with a function $k(\boldsymbol{x},\boldsymbol{x}^{(i)})=\phi(\boldsymbol{x})\cdot \phi(\boldsymbol{x}^{(i)})$ known as a kernel.

warning

Note that $\cdot$ may not always be the vector inner product/dot product. In infinite dimensional spaces, for example, other kinds of inner products are substituted.

Thus, we have

f(\boldsymbol{x})=b+\sum_{i}\alpha_{i}k(\boldsymbol{x},\boldsymbol{x}^{(i)})

which is nonlinear with respect to $\boldsymbol{x}$ , but linear with respect to $\phi(\boldsymbol{x})$ and $\boldsymbol{\alpha}$ .

So what does this really do? Well, this kernel-based function is equivalent to preprocessing the data by applying $\phi(\boldsymbol{x})$ to all inputs, and then training a linear model in the new, transformed space. This is important for two reasons.

It allows learning models nonlinear in $\boldsymbol{x}$ using efficient convex optimization techniques, as we consider $\phi$ fixed and optimize only $\boldsymbol{\alpha}$ . In particular, $\phi$ may be some nonlinear transformation on the original dataset, i.e. brings a sample $\boldsymbol{x}$ in the original input space to a new input space in which the problem solution is linearly representable.
The kernel function $k$ can often be computed more efficiently than the construction and subsequent dot product of two $\phi(\boldsymbol{x})$ vectors. In essence, $k(\boldsymbol{x},\boldsymbol{x}^{(i)})=\phi(\boldsymbol{x})\cdot \phi(\boldsymbol{x}^{(i)})$ for all inputs, but it is implemented differently!

tip

I don't think the book explained the functionality or advantages of the kernel trick particularly well, as a beginner in this field. This Reddit post helped me understand. Wikipedia's explanation is also good.

The most common kernel is the Gaussian kernel.

k(\boldsymbol{u},\boldsymbol{v})=\mathcal{N}(\boldsymbol{u}-\boldsymbol{v};0,\sigma^{2}\boldsymbol{I})

It's also known as the radial basis function (RBF) kernel, since its "value decreases along lines in $\boldsymbol{v}$ space radiating outward from $\boldsymbol{u}$ ." Also, note that the RBF corresponds to a dot product in an infinite-dimensional space.

If you're confused by the above, don't worry—just think of the RBF as giving higher weights to closer points. That is, the Gaussian kernel template matches test examples to training examples for prediction. Each training example becomes a template for its label, and, for a test example, the labels of nearby training examples (according to $L_{2}$ norm) are given higher weights.

Why is it named the Radial Basis Function?

You can think of this kernel as taking an input $\boldsymbol{v}$ and an example in the training set $\boldsymbol{u}$ , and mapping the input $\boldsymbol{v}$ into its distance from $\boldsymbol{u}$ . In 2D, this can be interpreted as a transformation from the original input space into a polar space with the origin at $\boldsymbol{v}$ —at which point $k(\boldsymbol{u},\boldsymbol{v})$ is simply equivalent to the radius corresponding to $\boldsymbol{v}$ . In $n$ dimensions, the interpretation is analogous.

The SVM is not the only algorithm enhanced by the kernel trick, and instead belongs to a larger class of kernel machines or kernel methods.

Kernel machines have some major drawbacks, however.

For instance, the cost of making a prediction is linear in the number of training examples, due to $\sum_{i}\alpha_{i}k(\boldsymbol{x},\boldsymbol{x}^{(i)})$ . SVMs are able to mitigate this somewhat by learning a sparse $\boldsymbol{\alpha}$ vector, i.e. one with few nonzero elements. Then, classifying a new example only requires evaluating the kernel function for the training examples with nonzero $\alpha_{i}$ ; these examples are known as the support vectors, and motivate the nomenclature.

Other Simple Supervised Learning Algorithms

We briefly covered nearest neighbor regression previously. $k$ -nearest neighbors (kNN) is the family of techniques that extends this algorithm, and is a non-parametric, non-probabilistic supervised learning algorithm. $k$ -nearest neighbors, intuitively, doesn't even really learn parameters; instead, during prediction, it simply finds the $k$ -nearest neighbors and averages the corresponding $y$ values in the training set. For classification, we typically use one-hot encoding, i.e. a vector $\boldsymbol{c}$ with $c_{y}=1$ and $c_{i}=0$ for all other $i$ represents the class $y$ . The average of the one-hot codes then gives a probability distribution over the classes.

$k$ -nearest neighbors can obtain very high accuracy, and it converges to the Bayes error rate as $m$ , the training set size, increases. However, it has high computational cost, and generalizes badly with a small training set. Moreover, it cannot distinguish which features are important (i.e. it can't really do what PCA does in identifying the "principal components"). Consider, for instance, a regression task with $\boldsymbol{x}\in \mathbb{R}^{100}$ but only $x_{1}$ matters for the output. $k$ -nearest neighbors will not recognize this, and would require a large dataset to start achieving high accuracy.

A decision tree is another simple supervised learning algorithm. A diagram of one is displayed below.

The top half displays the decision tree itself, while the bottom half displays how the decision tree splits the input space. If you're aware of what a K-D tree is, it's similar to that! Essentially, each leaf node corresponds to a category/output, while intermediate nodes correspond to the lines that split the input space. For any test example, it suffices to find the region it corresponds to in the input space by walking the tree, and then returning the output associated with that input space. Decision trees, though, while great for resource-constrained environments for certain types of problems, struggle to solve some problems even easy for logistic regression.

5.8 Unsupervised Learning Algorithms

Principal Component Analysis

PCA, at its core, learns which "axes" of information are more important in the data—allowing it to compress dimensionality by keeping only the most important data.

The directions of greatest variance are the principal components, and represent the axes of the new space PCA will transform the data into, in order to compress the data while preserving the most information.

Let some original data representation be $\boldsymbol{X}$ , and $m\times n$ design matrix. Assume that the data has mean of zero, i.e. $E[\boldsymbol{x}]=\mathbf{0}$ . If not, the data can be easily demeaned by subtracting the mean from all examples.

The unbiased sample covariance matrix associated with $\boldsymbol{X}$ is described by

\mathrm{Var}(\boldsymbol{x}) =\frac{1}{m-1}\boldsymbol{X}^{\top}\boldsymbol{X}

PCA effectively performs a linear transformation to a new representation $\boldsymbol{z}=\boldsymbol{x}^{\top}\boldsymbol{W}$ , where $\mathrm{Var}(\boldsymbol{z})$ is diagonal and $\boldsymbol{W}$ is the matrix of the principal components. Note that, typically, $\boldsymbol{x} \in \mathbb{R}^{m}$ , $\boldsymbol{W}\in \mathbb{R}^{m}\times \mathbb{R}^{n}$ , and $\boldsymbol{z}\in \mathbb{R}^{n}$ , where $m>n$ . This provides the desired dimensionality reduction! The principal components are also the columns of $\boldsymbol{W}$ , corresponding to the $n$ new axes, as opposed to the $m$ dimensions of $\boldsymbol{x}$ .

Notably, the principal components of a design matrix $\boldsymbol{X}$ are given by the eigenvectors of $\boldsymbol{X}^{\top}\boldsymbol{X}$ . (Proof left out). Thus,

\boldsymbol{X}^{\top}\boldsymbol{X}=\boldsymbol{W}\boldsymbol{\Lambda}\boldsymbol{W}^{\top}

Alternatively, they can also be obtained via SVD. In particular, the principal components are the right singular vectors of $\boldsymbol{X}$ .

\boldsymbol{X}^{\top}\boldsymbol{X}=(\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{W}^{\top})^{\top}(\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{W}^{\top})=\boldsymbol{W}\Sigma^{2}\boldsymbol{W}^{\top}

The above simplification results from $\boldsymbol{U}$ 's orthogonality implying $\boldsymbol{U}^{\top}\boldsymbol{U}=\boldsymbol{I}$ .

SVD also helps show that PCA provides a diagonal $\mathrm{Var}(\boldsymbol{z})$ . Noting that $\boldsymbol{z}=\boldsymbol{x}^{\top}\boldsymbol{W}$ means $\boldsymbol{Z}=\boldsymbol{X}\boldsymbol{W}$

\begin{align*} \mathrm{Var}(\boldsymbol{z}) &=\frac{1}{m-1}\boldsymbol{Z}^{\top}\boldsymbol{Z} \\ &= \frac{1}{m-1}\boldsymbol{W}^{\top}\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{W} \\ &= \frac{1}{m-1}\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{\Sigma}\boldsymbol{W}^{\top}\boldsymbol{W} \\ &= \frac{1}{m-1}\boldsymbol{\Sigma}^{2} \end{align*}

Again, relying on $\boldsymbol{W}$ 's orthogonality for the last step.

The diagonal $\mathrm{Var}(\boldsymbol{z})$ , meanwhile, implies that the new $\boldsymbol{z}$ -representation of the original data via the linear transformation $\boldsymbol{W}$ has a diagonal covariance matrix, implying the individual elements of $\boldsymbol{z}$ are mutually uncorrelated. In other words, PCA is able to disentangle the unknown factors of variation underlying the data.

$k$ -means Clustering

$k$ -means clustering partitions the training set into $k$ clusters of examples that are near each other (clustering together in the input space). For each input $\boldsymbol{x}$ , it provides a $k$ -dimensional one-hot code vector $\boldsymbol{h}$ such that $h_{i}=1$ and all other entries are zero if $\boldsymbol{x}$ is in cluster $i$ .

The algorithm is simple. First, it initializes $k$ different centroids (cluster centers) $\{ \boldsymbol{\mu}^{(1)},\dots,\boldsymbol{\mu}^{(k)} \}$ . Then, it repeats the following two steps until convergence.

Assign each training example to the nearest centroid $\boldsymbol{\mu}^{(i)}$ 's cluster $i$ .
Update $\boldsymbol{\mu}^{(i)}$ to the mean of all training examples assigned to cluster $i$ .

Note that, in general, $k$ -means clustering is an NP-Complete problem; however, this algorithm efficiently finds a good approximation for the optimal solution.

5.9 Stochastic Gradient Descent

Stochastic gradient descent (SGD) is an incredibly important deep learning algorithm, and serves as an efficient extension of the classical gradient descent.

Frequently, the cost function of an ML algorithm decomposes as a sum over all the training examples. At each step of gradient descent, the gradient of this cost function must be computed, which has complexity $O(m)$ , linear in the number of samples. As the training set grows, this quickly becomes prohibitively slows.

SGD takes advantage of the fact that the gradient is an expectation, and thus may be estimated using only a small subset of samples. For each step, SGD samples a minibatch of examples $\mathbb{B}=\{ x^{(1)},\dots,x^{(m')} \}$ from the training set where $m'$ is usually some fixed number much smaller than the training set size. Then, the gradient is estimated as

\boldsymbol{g}=\frac{1}{m'}\nabla_{\boldsymbol{\theta}}\sum_{i=1}^{m'} L(\boldsymbol{x}^{(i)},y^{(i)},\boldsymbol{\theta})

where $L$ is the negative log-likelihood function $L(\boldsymbol{x},y,\boldsymbol{\theta})=-\log p(y\mid \boldsymbol{x};\boldsymbol{\theta})$ . (This is just replacing the entire training set with the minibatch). Everything else proceeds equivalently.

SGD has effectively $O(1)$ performance has $m$ grows, and has phased out the previous popular approach of the kernel trick, which frequently required construction of an $m\times n$ matrix (cost of $O(m^{2})$ ). We build further on SGD later, in Chapter 8.

5.10 Building an ML Algorithm

Not too much to discuss here.

5.11 Challenges Motivating Deep Learning

Of course, the algorithms described here, while effective, did not generalize to many of the important problems presented in machine learning. Thus, we turn to why these algorithms failed, and why researchers turned to deep learning as a solution.

The Curse of Dimensionality

Frequently, ML problems become exceedingly difficult when the number of dimensions increases. Why? The number of distinct configurations of a set of variables increases exponentially in the number of dimensions.

This poses, for instance, a statistical challenge. The number of distinct configurations may be much larger than the number of training examples—how do we handle configurations for which we have seen no similar example?

Local Constancy and Smoothness Regularization

Recall the purpose of priors—implicit preferences imposed on the learning algorithm. One of the most popular prior is the smoothness or local constancy prior, which states that the learned function should not change much within a small region. Many simpler algorithms rely exclusively on this prior; however, it is insufficient for more sophisticated tasks.

Consider, for instance, $k$ -nearest neighbors. In the extreme case, $k=1$ , there are only $m$ distinguishable regions. In general, there are only $mk=O(k)$ distinguishable regions. Likewise with the decision tree, only $O(m)$ distinguishable regions can really be made to achieve some level of statistical confidence in predictions. And, most kernel machines interpolate between nearby training examples, like the $k$ -nearest neighbors family, effectively performing what's known as local template matching—suffering from the same problem.

This linearity with the number of training examples is the critical flaw. The space of different configurations is exponential in $m$ , but these algorithms' distinguishing regions in the input space are all linear in $m$ .

Consider, for instance, extending the checkerboard pattern to $n$ dimensions, for some large $n$ . Using only the local constancy/smoothness prior and local generalization, we can only really identify points lying in the same checkerboard "square" (hypercube, in $n$ -dimensional space) as an existing training example.

Thus, the smoothness assumption and associated non-parametric learning algorithm only work well if there are enough examples for it to observe the most peaks and valleys of the underlying distribution. This is generally true if the distribution is smooth and varies in few enough dimensions.

So how do we represent a more complicated, high-dimensional function, i.e. $O(2^{m})$ distinct regions/configurations, with only $O(m)$ examples? Well, consider that the checkerboard problem becomes easy if the algorithm is given a prior that implies the periodicity of the checkerboard. In general, the introduction of some additional assumptions about the underlying distribution can help the algorithm.

Of course, strong, task-specific assumptions like the periodicity of the checkerboard are typically not encoded into ML algorithms, so they can generalize well; instead, the core idea of deep learning is that we make the assumption that the underlying data was generated via a composition of factors or features, potentially in some sort of hierarchical fashion. This, and other generic assumptions, actually enable the exponential gain in relationship between the number of examples and the number of distinct regions identifiable by the algorithm. This also helps with the curse of dimensionality.

Manifold Learning

A manifold is a set of points where the local neighborhood around any given point appears to be a Euclidean space. For instance, the Earth is a spherical manifold in 3D space, but is experienced point-wise, by us, as a 2D plane. There is a more formal definition, but this is sufficient for our purposes—take a class on algebraic topology to learn more :)

In machine learning, a manifold tends to mean a connected set of points that can be approximated well considering only a small number of degrees of freedom, or dimensions, embedded in a higher-dimensional space. Think about what PCA does, for instance.

info

The dimensionality of a manifold may vary between points. A figure eight is a manifold with a single dimension at most points, but two dimensions at the intersection.

Manifold learning algorithms take advantage of this low dimensionality embedded in a high dimensional space by assuming most of the input space $\mathbb{R}^{n}$ is invalid, and that interesting inputs and variations occur only along a collection of small manifolds. This assumption is approximately correct with regards to most AI tasks, and is known as the manifold hypothesis.

First, the probability distribution over real-world items of interest (e.g. images, text, sound) is highly concentrated. Uniform noise never resembles real-world, structured input. This implies the underlying probability distribution is, in reality, highly patterned.

Second, there are frequently many transformations that can be applied to reach similar yet distinct examples in the local neighborhood. A simple example is image transformations, for instance. This motivates the traversal of manifolds, i.e. examples connected to similar examples, and the existence of various small manifolds scattered throughout the input space.

Thus, it's more natural and efficient for machine learning algorithms to learn the low-dimensional manifold underlying the high-dimensional data.

Chapter 5: Machine Learning Basics

5.1 Learning Algorithms

The Task, TTT

The Performance Measure, PPP

The Experience, EEE

Example: Linear Regression

5.2 Capacity, Overfitting and Underfitting

The No Free Lunch Theorem

Regularization

5.3 Hyperparameters and Validation Sets

Cross-Validation

5.4 Estimators, Bias, and Variance

Point Estimation

Bias

Example: Bernoulli Distribution

Example: Gaussian Distribution Estimators

Variance and Standard Error

Balancing Bias and Variance to Minimize MSE

Consistency

5.5 Maximum Likelihood Estimation

Conditional Log-Likelihood and MSE

MLE: Linear Regression

Properties of Maximum Likelihood

5.6 Bayesian Statistics

Bayesian Linear Regression

Maximum A Posteriori Estimation

5.7 Supervised Learning Algorithms

Probabilistic Supervised Learning

Support Vector Machines

Other Simple Supervised Learning Algorithms

5.8 Unsupervised Learning Algorithms

Principal Component Analysis

kkk-means Clustering

5.9 Stochastic Gradient Descent

5.10 Building an ML Algorithm

5.11 Challenges Motivating Deep Learning

The Curse of Dimensionality

Local Constancy and Smoothness Regularization

Manifold Learning

The Task, $T$

The Performance Measure, $P$

The Experience, $E$

$k$ -means Clustering