Chapter 2: Linear Algebra

warning

Most of these concepts are discussed in a linear algebra class. Thus, I only included concepts that I felt were helpful for me to remember (admittedly, this is most of the nontrivial topics in this chapter). This is not a comprehensive review of all the linear algebra explained in the book.

2.4 Linear Dependence and Span

A square matrix with linearly dependent columns is known as singular.

2.5 Norms

The $L^{p}$ norm is given by

\lVert \boldsymbol{x} \rVert_{p} = \left( \sum_{i}\lvert x_{i} \rvert ^{p} \right)^{1/p}

for $p \in \mathbb{R},\ p\geq1$.
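To make the formula concrete, here is a minimal sketch in plain Python. The helper name `lp_norm` and the example vector are illustrative choices, not something from the book.

```python
def lp_norm(x, p):
    """L^p norm of a vector given as a list of numbers, for p >= 1."""
    if p < 1:
        raise ValueError("p must be at least 1 for this to be a norm")
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

x = [3.0, -4.0]
l1 = lp_norm(x, 1)  # |3| + |-4| = 7
l2 = lp_norm(x, 2)  # sqrt(3^2 + 4^2) = 5
```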

The term norm is not limited to $L^{p}$ norms, but rather describes any function $f$ mapping vectors to non-negative real numbers ( $\mathbb{R}\;\backslash\;\mathbb{R}^{-}$ ) such that

$f(\boldsymbol{x})=0\implies \boldsymbol{x}=\mathbf{0}$ .
$f(\boldsymbol{x+y})\leq f(\boldsymbol{x})+f(\boldsymbol{y})$ . (triangle inequality)
$\forall\alpha \in \mathbb{R},\ f(\alpha \boldsymbol{x})=\lvert \alpha \rvert f(\boldsymbol{x})$ .

The $L^{2}$ norm is the Euclidean norm. The $L^{\infty}$ norm is the max norm, and is equivalent to $\lVert \boldsymbol{x} \rVert_{\infty}=\mathrm{max}_{i}\lvert x_{i} \rvert$ . Finally, the Frobenius norm is a non- $L^{p}$ norm described by

\lVert A \rVert _{F}=\sqrt{ \sum_{i,j}{A_{i,j}}^{2} }

which measures the size of a matrix $A$. It is analogous to the $L^{2}$ norm, but for matrices instead of vectors.
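A small sketch of the max norm and the Frobenius norm in plain Python, assuming a matrix is stored as a list of rows. The function names are hypothetical helpers chosen for this example.

```python
import math

def max_norm(x):
    """L^infinity norm: the largest absolute entry of the vector."""
    return max(abs(xi) for xi in x)

def frobenius_norm(A):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(a * a for row in A for a in row))

x = [3.0, -4.0, 1.0]
A = [[1.0, 2.0],
     [2.0, 4.0]]

x_max = max_norm(x)        # largest |x_i| is 4
A_fro = frobenius_norm(A)  # sqrt(1 + 4 + 4 + 16) = 5
```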

2.6 Special Kinds of Matrices and Vectors

An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal.

A^{\top}A=AA^{\top}=I

This follows from the definition of orthonormality: any two distinct vectors $\boldsymbol{x},\boldsymbol{y}$ in an orthonormal set satisfy

$\boldsymbol{x}^{\top}\boldsymbol{y}=0$ .
$\lVert \boldsymbol{x} \rVert_{2}=1$ .

Notably, matrix orthogonality implies that $A^{-1}=A^{\top}$ .
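As a sanity check, a 2D rotation matrix is a standard example of an orthogonal matrix. The sketch below (plain Python, with helper names of my own choosing) verifies $A^{\top}A=AA^{\top}=I$ numerically for one rotation angle.

```python
import math

def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def close(A, B, tol=1e-12):
    """Entrywise comparison up to a small tolerance."""
    return all(abs(a - b) <= tol for ra, rb in zip(A, B) for a, b in zip(ra, rb))

theta = math.pi / 6  # arbitrary rotation angle for the example
Q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]
identity = [[1.0, 0.0], [0.0, 1.0]]

# Both products should equal the identity, so the transpose acts as the inverse.
is_orthogonal = close(matmul(transpose(Q), Q), identity) and close(matmul(Q, transpose(Q)), identity)
```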

2.7 Eigendecomposition

An eigenvector of a square matrix $\boldsymbol{A}$ is a non-zero vector $\boldsymbol{v}$ such that multiplication by $\boldsymbol{A}$ alters only the scale of $\boldsymbol{v}$ , and not the angle.

\boldsymbol{Av}=\lambda \boldsymbol{v}

The scalar $\lambda$ is the eigenvalue corresponding to this eigenvector $\boldsymbol{v}$ . By convention, we typically concern ourselves with only unit eigenvectors (since $\alpha\boldsymbol{v}$ is still an eigenvector for any $\alpha \in \mathbb{R}\;\backslash\;\{ 0 \}$ ).
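A quick numerical check of the definition, using a small symmetric matrix whose unit eigenvectors I worked out by hand. The matrix and helper names are illustrative only.

```python
import math

def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]

A = [[2.0, 1.0],
     [1.0, 2.0]]

s = 1.0 / math.sqrt(2.0)
# (eigenvalue, unit eigenvector) pairs, derived by hand for this A.
eigenpairs = [(3.0, [s, s]), (1.0, [s, -s])]

# For each pair, measure the largest entry of A v - lambda v; it should be ~0.
residuals = []
for lam, v in eigenpairs:
    Av = matvec(A, v)
    residuals.append(max(abs(av - lam * vi) for av, vi in zip(Av, v)))
```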

Suppose a matrix $\boldsymbol{A}$ has $n$ linearly independent eigenvectors $\{ \boldsymbol{v}^{(1)},\dots,\boldsymbol{v}^{(n)} \}$ with eigenvalues $\{ \lambda_{1},\dots,\lambda_{n} \}$ . Let $\boldsymbol{V}=\begin{bmatrix}\boldsymbol{v}^{(1)} & \dots & \boldsymbol{v}^{(n)}\end{bmatrix}$ and $\boldsymbol\lambda=\begin{bmatrix}\lambda_{1} & \dots & \lambda_{n}\end{bmatrix}^{\top}$ . Then, the eigendecomposition of $\boldsymbol{A}$ is

\boldsymbol{A}=\boldsymbol{V}\mathrm{diag}(\boldsymbol\lambda)\boldsymbol{V}^{-1}

where $\mathrm{diag}(\boldsymbol\lambda)$ is the diagonal matrix constructed by placing the elements of $\boldsymbol\lambda$ on the diagonal. Frequently, $\mathrm{diag}(\boldsymbol\lambda)$ is written as $\boldsymbol{\Lambda}$.

Notably, every real symmetric matrix can be eigendecomposed with only real-valued eigenvectors and eigenvalues.

\boldsymbol{A}=\boldsymbol{Q}\boldsymbol{\Lambda}\boldsymbol{Q}^{\top}

where $\boldsymbol{Q}$ is an orthogonal matrix composed of eigenvectors of $\boldsymbol{A}$ and $\boldsymbol{\Lambda}$ is a diagonal matrix. Since $\boldsymbol{Q}$ is orthogonal, $\boldsymbol{A}$ can be interpreted as scaling space by $\lambda_{i}$ in direction $\boldsymbol{v}^{(i)}$.
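The factorization $\boldsymbol{Q}\boldsymbol{\Lambda}\boldsymbol{Q}^{\top}$ can be checked by multiplying it back out. This is a minimal sketch in plain Python for one hand-picked symmetric matrix; the eigenvectors and eigenvalues are assumed known (computed by hand), and the helper names are my own.

```python
import math

def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

A = [[2.0, 1.0],
     [1.0, 2.0]]

s = 1.0 / math.sqrt(2.0)
Q = [[s,  s],
     [s, -s]]           # columns are unit eigenvectors of A
Lam = [[3.0, 0.0],
       [0.0, 1.0]]      # matching eigenvalues on the diagonal

reconstructed = matmul(matmul(Q, Lam), transpose(Q))
max_error = max(abs(r - a) for rr, ra in zip(reconstructed, A) for r, a in zip(rr, ra))
```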

Also, note that eigendecompositions are not necessarily unique; if two or more eigenvectors share the same eigenvalue, any set of orthogonal vectors lying in their span is also a set of eigenvectors with that eigenvalue. Additionally, a matrix is singular $\iff$ at least one of its eigenvalues is zero.

The eigendecomposition of a real symmetric matrix can also be used to optimize quadratic expressions of the form $f(\boldsymbol{x})=\boldsymbol{x}^{\top}\boldsymbol{A}\boldsymbol{x}$ subject to the constraint $\lVert \boldsymbol{x} \rVert_{2}=1$. The maximum value of $f$ is the maximum eigenvalue, attained when $\boldsymbol{x}$ is a corresponding unit eigenvector, and the minimum is the minimum eigenvalue.
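To see the constrained optimum numerically, the sketch below evaluates $\boldsymbol{x}^{\top}\boldsymbol{A}\boldsymbol{x}$ on unit vectors sampled around the circle. It assumes a 2x2 example matrix with eigenvalues 3 and 1 (worked out by hand); the sampling grid and helper name are arbitrary choices for illustration.

```python
import math

def quad_form(A, x):
    """Evaluate x^T A x for a matrix A (list of rows) and a vector x."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

A = [[2.0, 1.0],
     [1.0, 2.0]]  # symmetric, eigenvalues 3 and 1

# Unit vectors (cos t, sin t), one per degree around the circle.
values = []
for k in range(360):
    t = math.radians(k)
    values.append(quad_form(A, [math.cos(t), math.sin(t)]))

f_max = max(values)  # approximately the largest eigenvalue
f_min = min(values)  # approximately the smallest eigenvalue
```

The extremes land at 45 and 135 degrees, which are the directions of the two eigenvectors of this particular matrix.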

Finally, we can characterize some matrices based on their eigenvalues.

Positive Definite: $\lambda_{i}>0,\ \forall i$ .
Positive Semidefinite: $\lambda_{i}\geq0,\ \forall i$ .
Negative Definite: $\lambda_{i}<0,\ \forall i$ .
Negative Semidefinite: $\lambda_{i}\leq0,\ \forall i$ .

Positive semidefinite matrices guarantee that $\forall \boldsymbol{x},\ \boldsymbol{x}^{\top}\boldsymbol{Ax}\geq0$ . Positive definite matrices additionally guarantee $\boldsymbol{x}^{\top}\boldsymbol{Ax}=0\implies \boldsymbol{x}=\mathbf{0}$ .
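The difference between definite and semidefinite shows up on a single vector. In this sketch (plain Python, example matrices picked by hand, hypothetical helper name), the semidefinite matrix has a zero eigenvalue, so a non-zero $\boldsymbol{x}$ can give $\boldsymbol{x}^{\top}\boldsymbol{Ax}=0$.

```python
def quad_form(A, x):
    """Evaluate x^T A x for a matrix A (list of rows) and a vector x."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

pd = [[2.0, 1.0],
      [1.0, 2.0]]   # eigenvalues 3 and 1: positive definite
psd = [[1.0, 1.0],
       [1.0, 1.0]]  # eigenvalues 2 and 0: positive semidefinite only

x = [1.0, -1.0]     # non-zero vector along the zero-eigenvalue direction of psd

pd_value = quad_form(pd, x)    # 2 - 1 - 1 + 2 = 2, strictly positive
psd_value = quad_form(psd, x)  # 1 - 1 - 1 + 1 = 0, despite x being non-zero
```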

2.8 Singular Value Decomposition

Singular value decomposition (SVD) is another method of factorizing a matrix, this time into singular vectors and singular values. SVD is more general than eigendecomposition because it applies to all matrices, while eigendecomposition applies to only some square matrices. SVD has the following form

\boldsymbol{A}=\boldsymbol{UDV}^{\top}

Let $\boldsymbol{A}$ be an $m\times n$ matrix. Then $\boldsymbol{U}$ is an $m\times m$ matrix, $\boldsymbol{D}$ is an $m\times n$ matrix, and $\boldsymbol{V}$ is an $n\times n$ matrix. Additionally, $\boldsymbol{U}$ and $\boldsymbol{V}$ are orthogonal matrices, and $\boldsymbol{D}$ is a diagonal matrix.

The elements along the diagonal of $\boldsymbol{D}$ are the singular values of the matrix $\boldsymbol{A}$ . The columns of $\boldsymbol{U}$ are the left-singular vectors, and the columns of $\boldsymbol{V}$ are known as the right-singular vectors.

We can further interpret/compute the SVD of $\boldsymbol{A}$ via the eigendecomposition of related matrices. The left-singular vectors of $\boldsymbol{A}$ are the eigenvectors of $\boldsymbol{AA}^{\top}$, and the right-singular vectors are the eigenvectors of $\boldsymbol{A^{\top}A}$. The non-zero singular values of $\boldsymbol{A}$ are the square roots of the non-zero eigenvalues of $\boldsymbol{A^{\top}A}$, which are the same as the non-zero eigenvalues of $\boldsymbol{AA^{\top}}$.
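The relationship between singular values and the eigenvalues of $\boldsymbol{A^{\top}A}$ and $\boldsymbol{AA^{\top}}$ can be checked on a small case. This sketch only handles 2x2 matrices, using the closed-form eigenvalues of a symmetric 2x2 matrix from its trace and determinant; the example matrix and helper names are my own assumptions, and a real implementation would call a numerical library instead.

```python
import math

def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def symmetric_2x2_eigenvalues(S):
    """Eigenvalues of a symmetric 2x2 matrix, largest first, via trace and determinant."""
    tr = S[0][0] + S[1][1]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    disc = math.sqrt(tr * tr - 4.0 * det)
    return [(tr + disc) / 2.0, (tr - disc) / 2.0]

A = [[3.0, 0.0],
     [4.0, 5.0]]

eig_AtA = symmetric_2x2_eigenvalues(matmul(transpose(A), A))  # eigenvalues of A^T A
eig_AAt = symmetric_2x2_eigenvalues(matmul(A, transpose(A)))  # eigenvalues of A A^T
singular_values = [math.sqrt(lam) for lam in eig_AtA]         # sqrt(45) and sqrt(5)
```

For this matrix both products have eigenvalues 45 and 5, even though the products themselves are different matrices.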

2.9 The Moore-Penrose Pseudoinverse

Matrix inversion is undefined for non-square matrices. The Moore-Penrose pseudoinverse is designed to compute an inverse $\boldsymbol{A}^{+}$ for a matrix $\boldsymbol{A}$ in the equation $\boldsymbol{Ax}=\boldsymbol{y}$ that:

For wide matrices, produces a solution $\boldsymbol{x}=\boldsymbol{A^{+}y}$ with minimal Euclidean norm $\lVert \boldsymbol{x} \rVert_{2}$ among all possible solutions (since a wide system can have more than one solution).
For tall matrices, produces the $\boldsymbol{x}=\boldsymbol{A^{+}y}$ that minimizes the Euclidean norm $\lVert \boldsymbol{Ax}-\boldsymbol{y} \rVert_{2}$ (since a tall system can have no solution).

The pseudoinverse is defined formally as

\boldsymbol{A}^{+}=\lim_{ \alpha \to 0^{+} } (\boldsymbol{A^{\top}A}+\alpha \boldsymbol{I})^{-1}\boldsymbol{A^{\top}}

and practically calculated using the matrices in SVD

\boldsymbol{A}^{+}=\boldsymbol{VD}^{+}\boldsymbol{U}^{\top}

Note that the pseudoinverse of any diagonal matrix, including $\boldsymbol{D}$, can be computed by taking the reciprocal of its nonzero elements and then transposing the matrix.

Why transpose?

The matrix may not be square!
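A minimal sketch of the diagonal case in plain Python, using a tall $3\times 2$ example so that the change of shape is visible. The function names and the example values are illustrative assumptions; in floating-point practice the exact `!= 0.0` test would usually be replaced by a small tolerance.

```python
def diag_pinv(D):
    """Pseudoinverse of a (possibly rectangular) diagonal matrix stored as a list of rows."""
    m, n = len(D), len(D[0])
    # The result is n x m: reciprocals of the nonzero diagonal entries, transposed shape.
    P = [[0.0] * m for _ in range(n)]
    for i in range(min(m, n)):
        if D[i][i] != 0.0:
            P[i][i] = 1.0 / D[i][i]
    return P

def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]

D = [[2.0, 0.0],
     [0.0, 4.0],
     [0.0, 0.0]]           # tall 3 x 2 diagonal matrix

D_plus = diag_pinv(D)      # 2 x 3 matrix with 1/2 and 1/4 on its diagonal
y = [2.0, 4.0, 5.0]
x = matvec(D_plus, y)      # least-squares solution of D x = y
residual = [dx - yi for dx, yi in zip(matvec(D, x), y)]
```

Here no $\boldsymbol{x}$ can match the third entry of $\boldsymbol{y}$, so the residual is zero in the first two entries and non-zero only in the third.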

2.10 The Trace Operator

The trace operator represents the summation of a matrix's diagonal.

\mathrm{Tr}(\boldsymbol{A})=\sum_{i}\boldsymbol{A}_{i,i}

It is a useful and interesting operator for several reasons. For instance, the Frobenius norm of a matrix can be expressed as

\lVert \boldsymbol{A} \rVert _{F}=\sqrt{ \mathrm{Tr}(\boldsymbol{AA}^{\top}) }

It is invariant under the transpose operator.

\mathrm{Tr}(\boldsymbol{A})=\mathrm{Tr}(\boldsymbol{A}^{\top})

And under "cyclic permutation" of a product.

\mathrm{Tr}(\boldsymbol{ABC})=\mathrm{Tr}(\boldsymbol{CAB})=\mathrm{Tr}(\boldsymbol{BCA})

Also, note that a scalar is its own trace.

a=\mathrm{Tr}(a)
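The trace identities above are easy to spot-check. This is a small sketch in plain Python with three arbitrary 2x2 matrices; the matrices and helper names are made up for the example.

```python
import math

def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def trace(A):
    """Sum of the diagonal entries of a square matrix."""
    return sum(A[i][i] for i in range(len(A)))

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.0, 1.0], [1.0, 0.0]]
C = [[2.0, 0.0], [1.0, 3.0]]

# Cyclic permutations of the product share the same trace.
tr_ABC = trace(matmul(matmul(A, B), C))
tr_CAB = trace(matmul(matmul(C, A), B))
tr_BCA = trace(matmul(matmul(B, C), A))

# Frobenius norm two ways: entrywise sum of squares, and via the trace.
fro_direct = math.sqrt(sum(a * a for row in A for a in row))
fro_trace = math.sqrt(trace(matmul(A, transpose(A))))
```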

2.11 The Determinant

The determinant $\det(\boldsymbol{A})$ is a function mapping a square matrix $\boldsymbol{A}$ to a real scalar. It is equal to the product of all the eigenvalues of the matrix, and its absolute value can be interpreted as a measure of how much multiplication by the matrix $\boldsymbol{A}$ expands or contracts space. If $\det(\boldsymbol{A})=0$, space is contracted completely along at least one dimension. If $\det(\boldsymbol{A})=1$, the transformation preserves volume.
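A short sketch for the 2x2 case in plain Python, covering the three situations mentioned above. The helper `det2` and the example matrices are my own choices, and the eigenvalues quoted in the comments were worked out by hand.

```python
import math

def det2(A):
    """Determinant of a 2x2 matrix stored as a list of rows."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

A = [[2.0, 1.0],
     [1.0, 2.0]]          # eigenvalues 3 and 1, so the determinant is 3 * 1
singular = [[1.0, 2.0],
            [2.0, 4.0]]   # second column is twice the first: linearly dependent
theta = math.pi / 3
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta),  math.cos(theta)]]

det_A = det2(A)                # product of the eigenvalues
det_singular = det2(singular)  # zero: space collapses onto a line
det_rotation = det2(rotation)  # one (up to rounding): volume is preserved
```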