Chapter 3: Probability and Information Theory

3.2 Random Variables

A random variable is a variable that can take on different values, each associated with some probability. It may be discrete (finite/countably infinite number of states) or continuous (any real value).

3.3 Probability Distributions

A probability distribution describes the probabilities associated with the values of a random variable. For discrete variables, it is described by a probability mass function (PMF). For continuous variables, the analogue is a probability density function (PDF).

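A PMF $P$ must satisfy the following properties:

  1. The domain of $P$ is the set of all possible states of $\mathrm{x}$.
  2. $0\leq P(x)\leq 1$ for every state $x$.
  3. $\sum_{x}P(x)=1$.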

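A PDF $p$ must satisfy the following, similar properties:

  1. The domain of $p$ is the set of all possible states of $\mathrm{x}$.
  2. $p(x)\geq 0$ for every state $x$.
  3. $\int p(x)\,\mathrm{d}x=1$.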

$p(x)\leq 1$ is not required for a PDF because $\mathrm{x}$ has an uncountably infinite number of states: $p(x)>1$ on a sufficiently small region does not force $\int p(x)\,\mathrm{d}x>1$.

Also, the nomenclature of a PDF (density instead of mass) is motivated by the fact that it does not provide the probability of a specific state directly; instead, the probability of landing inside an infinitesimal region with volume $\delta x$ is given by $p(x)\cdot\delta x$. (After all, the probability of each specific state is effectively $0$, considering the uncountably infinite set of states.)

Finally, we note the existence of joint probability distributions, which are distributions over many variables simultaneously, represented by $P(x_{1},\dots,x_{n})$.

3.4 Marginal Probability

Consider a joint probability distribution. A marginal probability distribution is a probability distribution over a subset of its parameters/variables.

For instance, consider the joint probability distribution $P(\mathrm{x},\mathrm{y})$. Then,

$$P(\mathrm{x})=\sum_{y}P(\mathrm{x},\mathrm{y}=y)$$

For continuous variables, summation is simply replaced by integration.

$$p(x)=\int p(x,y)\,\mathrm{d}y$$
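As a minimal sketch of the discrete sum rule, the snippet below marginalizes a small joint table. The table values and the names `joint` and `marginal_x` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical joint distribution P(x, y) over x in {0, 1} and y in {"a", "b"}.
joint = {
    (0, "a"): 0.1,
    (0, "b"): 0.3,
    (1, "a"): 0.2,
    (1, "b"): 0.4,
}

def marginal_x(joint):
    """Sum the joint over y to obtain P(x)."""
    p_x = defaultdict(float)
    for (x, _y), p in joint.items():
        p_x[x] += p
    return dict(p_x)

p_x = marginal_x(joint)
```

Here `p_x` maps each value of `x` to its marginal probability, and the entries still sum to one.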

3.5 Conditional Probability

Conditional probability is the probability of an event $y$ occurring given that another event $x$ has occurred. This is denoted by $P(y\mid x)$ and is calculated as

$$P(y\mid x)=\frac{P(x,y)}{P(x)}$$

It is only defined when $P(x)>0$.
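The definition can be sketched directly on a joint table. The table and the helper name `conditional_y_given_x` below are illustrative assumptions, not part of the original notes.

```python
# Hypothetical joint distribution P(x, y); keys are (x, y) pairs.
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def conditional_y_given_x(joint, x):
    """Return P(y | x) as a dict, or raise if P(x) = 0."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    if p_x <= 0:
        raise ValueError("P(y | x) is undefined when P(x) = 0")
    # Divide each joint entry with this x by the marginal P(x).
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}

p_y_given_x0 = conditional_y_given_x(joint, 0)
```

Conditioning on a value of `x` that never occurs raises an error, mirroring the requirement that the marginal probability be positive.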

3.6 The Chain Rule of Conditional Probabilities

A joint probability distribution over several random variables can be decomposed into the product of several conditional distributions, each over only one variable.

$$P(x_{1},\dots,x_{n})=P(x_{1})\prod_{i=2}^{n}P(x_{i}\mid x_{1},\dots,x_{i-1})$$

For instance,

$$P(a,b,c)=P(a\mid b,c)P(b\mid c)P(c)$$

The proof of this follows inductively from the definition of conditional probability.
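For the three-variable case, the argument can be written out by applying the definition of conditional probability twice (assuming every conditioning event has nonzero probability):

```latex
\begin{align*}
P(a,b,c) &= P(a\mid b,c)\,P(b,c) \\
         &= P(a\mid b,c)\,P(b\mid c)\,P(c)
\end{align*}
```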

3.7 Independence and Conditional Independence

Two random variables $x$ and $y$ are independent if their joint probability distribution can be expressed as

$$P(x,y)=P(x)P(y)$$

Two random variables $x$ and $y$ are conditionally independent given a random variable $z$ if the conditional probability distribution over $x$ and $y$ factorizes in the same way for every value of $z$.

$$P(x,y\mid z)=P(x\mid z)P(y\mid z)$$

Independence and conditional independence are denoted $x\perp y$ and $x\perp y\mid z$, respectively.

3.8 Expectation, Variance, and Covariance

The expectation or expected value of a function $f(x)$ with respect to a probability distribution $P(x)$ is the mean value of $f(x)$ when $x$ is randomly sampled from $P$.

$$\begin{align*} \text{Discrete: }& \mathbb{E}_{x\sim P}[f(x)]=\sum_{x}P(x)f(x) \\ \text{Continuous: }& \mathbb{E}_{x\sim p}[f(x)]=\int p(x)f(x)\,\mathrm{d}x \end{align*}$$

Expectation is linear, i.e. the Linearity of Expectation tells us that

$$\mathbb{E}[\alpha f(x)+\beta g(x)]=\alpha\mathbb{E}[f(x)]+\beta\mathbb{E}[g(x)]$$

Variance is a measure of how much a function of a random variable $x$ varies across samplings, and is described by

$$\mathrm{Var}(f(x))=\mathbb{E}[f(x)^{2}]-(\mathbb{E}[f(x)])^{2}$$

Standard deviation is $\sqrt{\mathrm{Var}(f(x))}$.

Covariance describes how closely two values are linearly related.

$$\mathrm{Cov}(f(x),g(y))=\mathbb{E}[(f(x)-\mathbb{E}[f(x)])(g(y)-\mathbb{E}[g(y)])]$$

The covariance of a value with itself is simply the value's variance.

$$\mathrm{Cov}(f(x),f(x))=\mathrm{Var}(f(x))$$

Notably,

$$\mathrm{Var}(f(x)+g(y))=\mathrm{Var}(f(x))+\mathrm{Var}(g(y))+2\mathrm{Cov}(f(x),g(y))$$

Correlation is essentially covariance, except that each variable is normalized so that only the strength of the relationship is measured, not the scale of the variables.

info

Note that independent variables have zero covariance, and variables with nonzero covariance are dependent. However, independence and zero covariance are not the same: two variables may be dependent and still have zero covariance.
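The last claim can be checked on a small example. The sketch below assumes $x$ is uniform on $\{-1,0,1\}$ and $y=x^{2}$, a choice made here purely for illustration; exact fractions avoid rounding noise.

```python
from fractions import Fraction

# x is uniform on {-1, 0, 1} and y = x**2, so y is fully determined by x.
xs = [-1, 0, 1]
p = Fraction(1, 3)

def expect(f):
    """Expectation of f(x) under the uniform distribution on xs."""
    return sum(p * f(x) for x in xs)

mean_x = expect(lambda x: x)
mean_y = expect(lambda x: x**2)
cov_xy = expect(lambda x: (x - mean_x) * (x**2 - mean_y))
var_x = expect(lambda x: x**2) - mean_x**2

# Dependence check: P(x=0, y=0) differs from P(x=0) * P(y=0).
p_x0_y0 = Fraction(1, 3)  # y = 0 exactly when x = 0
p_x0 = Fraction(1, 3)
p_y0 = Fraction(1, 3)
is_dependent = p_x0_y0 != p_x0 * p_y0
```

The covariance comes out to exactly zero even though the joint probability does not factorize.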

The covariance matrix of a random vector $\mathbf{x}\in\mathbb{R}^{n}$ is an $n\times n$ matrix such that $\mathrm{Cov}(\mathbf{x})_{i,j}=\mathrm{Cov}(x_{i},x_{j})$. We note that the diagonal elements of the covariance matrix simply give the variances $\mathrm{Var}(x_{i})$.

3.9 Common Probability Distributions

Bernoulli

A distribution over a single binary random variable, parameterized by a single $\phi\in[0,1]$ such that $P(X=1)=\phi$. It possesses the following properties.

$$\begin{align*} P(X=1)&=\phi \\ P(X=0)&=1-\phi \\ P(X=x)&=\phi^{x}(1-\phi)^{1-x},\ x\in\{0,1\} \\ \mathbb{E}[X]&=\phi \\ \mathrm{Var}(X)&=\phi(1-\phi) \end{align*}$$

Multinoulli

Also known as the categorical distribution, it describes a single random variable with a finite number $k$ of different states, effectively generalizing a Bernoulli variable. It is parameterized by a vector $\mathbf{p}\in[0,1]^{k-1}$, where $p_{i}$ denotes the probability of the $i$th state, and the probability of the final, $k$th state is given by $1-\mathbf{1}^{T}\mathbf{p}$. Note that $\mathbf{1}^{T}\mathbf{p}\leq 1$ is a necessary constraint.

Gaussian Distribution

Also known as the normal distribution, it is parameterized by $\mu\in\mathbb{R}$, the mean, and $\sigma\in(0,\infty)$, the standard deviation. Its probability density function (PDF) is described by

$$\mathcal{N}(x;\ \mu,\sigma)=\sqrt{\frac{1}{2\pi\sigma^{2}}}\exp\left(-\frac{1}{2\sigma^{2}}(x-\mu)^{2}\right)$$

Frequently, the distribution is parameterized by $\beta\in(0,\infty)$ instead of $\sigma$, where the two are related by $\beta=\frac{1}{\sigma^{2}}$, to make evaluation of the PDF more efficient. $\beta$ is known as the precision or inverse variance. This does not change the distribution's behavior.

$$\mathcal{N}(x;\ \mu,\beta)=\sqrt{\frac{\beta}{2\pi}}\exp\left(-\frac{1}{2}\beta(x-\mu)^{2}\right)$$
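A quick way to see that the two parameterizations describe the same density is to evaluate both at the same point. The function names `normal_pdf_sigma` and `normal_pdf_beta` are hypothetical helpers written for this sketch.

```python
import math

def normal_pdf_sigma(x, mu, sigma):
    """Density parameterized by the standard deviation sigma > 0."""
    return math.sqrt(1.0 / (2.0 * math.pi * sigma**2)) * math.exp(
        -((x - mu) ** 2) / (2.0 * sigma**2)
    )

def normal_pdf_beta(x, mu, beta):
    """Density parameterized by the precision beta = 1 / sigma**2 > 0."""
    return math.sqrt(beta / (2.0 * math.pi)) * math.exp(-0.5 * beta * (x - mu) ** 2)

sigma = 2.0
beta = 1.0 / sigma**2
a = normal_pdf_sigma(0.5, 1.0, sigma)
b = normal_pdf_beta(0.5, 1.0, beta)  # agrees with a up to rounding
```

The precision form avoids a division inside the exponent, which is the efficiency argument made above.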

The normal distribution is frequently used as a default choice for two reasons.

  1. The central limit theorem states that the sum of many independent random variables is approximately normal.
  2. Out of all probability distributions over $\mathbb{R}$ with the same variance, the normal distribution encodes the maximum amount of uncertainty.

The normal distribution also generalizes to $\mathbb{R}^{n}$ as the multivariate normal distribution, where the variance $\sigma^{2}$ is replaced with a positive definite symmetric matrix $\Sigma$ that gives the covariance matrix of the distribution.

$$\mathcal{N}(x;\ \mu,\Sigma)=\sqrt{\frac{1}{(2\pi)^{n}\det(\Sigma)}}\exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)$$

Here, $x$ and $\mu$ are now vectors instead of scalars. As before, it is possible to replace $\Sigma$ with a precision matrix $\beta=\Sigma^{-1}$ for increased efficiency of evaluation.

$$\mathcal{N}(x;\ \mu,\beta)=\sqrt{\frac{\det(\beta)}{(2\pi)^{n}}}\exp\left(-\frac{1}{2}(x-\mu)^{T}\beta(x-\mu)\right)$$

Frequently, the covariance matrix is fixed to be a diagonal matrix. Sometimes, the covariance matrix is also a scalar multiple of the identity matrix, in which case the distribution is known as an isotropic Gaussian distribution.

Exponential and Laplace Distributions

In deep learning, it's often desirable to have a probability distribution with a sharp point at $x=0$. The exponential distribution provides this. It is parameterized by $\lambda>0$ and is described by

$$p(x;\ \lambda)=\lambda\mathbf{1}_{x\geq 0}\exp(-\lambda x)$$

Here, $\mathbf{1}_{x\geq 0}$ is an indicator function that assigns probability $0$ to all $x<0$.

A related distribution is the Laplace distribution, which places a sharp peak of probability mass at an arbitrary point $\mu$. It is parameterized by $\mu$ and a scale parameter $\gamma>0$.

$$\mathrm{Laplace}(x;\ \mu,\gamma)=\frac{1}{2\gamma}\exp\left(-\frac{\lvert x-\mu\rvert}{\gamma}\right)$$

The Dirac Distribution and Empirical Distribution

Occasionally, it is desirable to place all probability mass in a distribution at a single point. This can be done by defining a PDF with the Dirac delta function $\delta(x)$, shifted to the desired point $\mu$.

$$p(x)=\delta(x-\mu)$$

This is essentially defined such that it is zero-valued everywhere except at $\mu$, but integrates to $1$.

info

The Dirac delta function is not an ordinary function that associates each input with an output. Instead, it is known as a generalized function, which is defined in terms of its properties when integrated.

The Dirac delta distribution is also commonly used to construct an empirical distribution for continuous variables.

$$p(x)=\frac{1}{m}\sum_{i=1}^{m}\delta(x-x_{i})$$

This places probability mass $\frac{1}{m}$ on each of the $m$ points $x_{1},\dots,x_{m}$ of some dataset of size $m$, which essentially discretizes the distribution. (For discrete variables, the empirical distribution is trivially represented with a multinoulli distribution.)

In the context of deep learning, the empirical distribution assigns each training example a probability equal to the proportion of the dataset it makes up, which accounts for examples that occur more often than others.

Mixtures of Distributions

It's also possible to define probability distributions by combining several others together. Sampling from such a mixture distribution involves first randomly choosing a component distribution, according to some multinoulli distribution, and then sampling from this component distribution. Thus, the probability of choosing some $x$ from this mixture distribution is

$$P(x)=\sum_{i}P(c=i)P(x\mid c=i)$$

Here, $P(c)$ is the multinoulli distribution over the different component distributions. $c$ is known as the component identity random variable.
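The two-stage sampling procedure can be sketched as follows. The component weights, the Gaussian components, and the helper `sample_mixture` are all assumptions chosen for illustration.

```python
import random

def sample_mixture(weights, samplers, rng):
    """Draw one sample: pick a component c ~ P(c), then sample x ~ P(x | c)."""
    c = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return c, samplers[c](rng)

# Hypothetical two-component mixture of Gaussians (means -2 and 3, unit variance).
weights = [0.3, 0.7]
samplers = [
    lambda rng: rng.gauss(-2.0, 1.0),
    lambda rng: rng.gauss(3.0, 1.0),
]

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_mixture(weights, samplers, rng) for _ in range(10_000)]

# Fraction of draws that came from component 1; should be close to 0.7.
frac_c1 = sum(1 for c, _ in draws if c == 1) / len(draws)
```

In practice only the sampled value would be observed; the component index is returned here only so the mixing proportions can be checked.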

example

The empirical distribution is a mixture distribution composed of several Dirac components.

The mixture model hides a much more interesting concept, which will be discussed in depth later in section 16.5. A latent variable is a random variable that cannot be observed directly. In this case, the component identity random variable $c$ is a latent variable, and is related to the random variable $x$ through a joint distribution, i.e. $P(x,c)=P(x\mid c)P(c)$.

A common type of mixture model is the Gaussian mixture model, in which the components are distinct Gaussian distributions. The parameters of a Gaussian mixture model, beyond the means $\mu^{(i)}$ and covariances $\Sigma^{(i)}$ of its components, also include a prior probability $\alpha_{i}=P(c=i)$ for each component $i$. This is just the parameter $\mathbf{p}$ of the multinoulli distribution over the component distributions. It is called a prior because $\alpha_{i}$ represents the model's beliefs about $c$ before it has observed $x$, the result of sampling the Gaussian mixture. Meanwhile, $P(c\mid x)$ is a posterior probability, as it represents the model's beliefs about $c$ after it has observed $x$.

What does this even mean, though? Well, in real-world instances, where we don't know the distribution that describes a dataset, it's desirable to iteratively compute a closer and closer approximation of the distribution. This is known as Bayesian inference, and Gaussian mixture models are frequently used for this process because they are universal approximators of densities, i.e. any smooth density function/distribution can be approximated with arbitrarily small, nonzero error by a Gaussian mixture model with enough components.

Thus, during Bayesian inference, the prior probabilities represent the model's beliefs about the distribution of the component Gaussians, and the posterior probabilities represent the model's updated beliefs about the distribution after observing a new data point $x$ sampled from the real-world distribution.
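To make the prior and posterior concrete, here is a rough sketch that computes $P(c=i\mid x)$ for a one-dimensional Gaussian mixture. The mixture parameters and the function names are hypothetical, and the update is simply the prior times the component density, renormalized.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a 1D Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma**2)) / (sigma * math.sqrt(2.0 * math.pi))

def posterior(x, alphas, mus, sigmas):
    """P(c = i | x) for a 1D Gaussian mixture with priors alphas."""
    joint = [a * normal_pdf(x, m, s) for a, m, s in zip(alphas, mus, sigmas)]
    total = sum(joint)  # p(x), the mixture density at x
    return [j / total for j in joint]

# Hypothetical mixture: equal priors, components centered at -1 and +1.
alphas = [0.5, 0.5]
mus = [-1.0, 1.0]
sigmas = [1.0, 1.0]

post_at_zero = posterior(0.0, alphas, mus, sigmas)  # x is equidistant from both means
post_at_two = posterior(2.0, alphas, mus, sigmas)   # x is much closer to the second mean
```

Observing a point midway between the components leaves the beliefs at the prior, while a point near one mean shifts almost all posterior mass to that component.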

tip

Think of Gaussian mixture models as the equivalent of a Taylor series for density distributions.

3.10 Useful Properties of Common Functions

Some functions commonly appear when working with probability distributions.

The logistic sigmoid is described by

$$\sigma(x)=\frac{1}{1+\exp(-x)}$$

and is frequently used to produce the $\phi$ parameter of a Bernoulli distribution because its range is $(0,1)$. Notably, the sigmoid function saturates when its argument is very positive or very negative, i.e. $\left\lvert\frac{\mathrm{d}}{\mathrm{d}x}\sigma(x)\right\rvert$ becomes small, so the output is insensitive to small changes in the input.

logistic-sigmoid.png

The softplus function is described by

$$\zeta(x)=\log(1+\exp(x))$$

and is frequently used to produce the $\beta$ or $\sigma$ parameter of a normal distribution because its range is $(0,\infty)$. It's also frequently observed when working with sigmoids. Notably, its name describes its initial design: a smoothed or "softened" version of $x^{+}=\max(0,x)$.

softplus.png

Now, a list of some mathematical formulas involving these functions.

$$\begin{align*} \sigma(x)&=\frac{\exp(x)}{\exp(x)+\exp(0)} \\ \frac{\mathrm{d}}{\mathrm{d}x}\sigma(x)&=\sigma(x)(1-\sigma(x)) \\ 1-\sigma(x)&=\sigma(-x) \\ \log\sigma(x)&=-\zeta(-x) \\ \frac{\mathrm{d}}{\mathrm{d}x}\zeta(x)&=\sigma(x) \\ \sigma^{-1}(x)&=\log\left(\frac{x}{1-x}\right),\ \forall x\in(0,1) \\ \zeta^{-1}(x)&=\log(\exp(x)-1),\ \forall x>0 \\ \zeta(x)&=\int_{-\infty}^{x}\sigma(y)\,\mathrm{d}y \\ \zeta(x)-\zeta(-x)&=x \end{align*}$$

info

$\sigma^{-1}(x)$ is called the logit.

Also, note that the last equation, $\zeta(x)-\zeta(-x)=x$, resembles $x^{+}-x^{-}=x$, where $x^{+}=\max(0,x)$ and $x^{-}=\max(0,-x)$. This is partly why $\zeta$ is a natural choice for a "softplus" function.
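Several of these identities can be spot-checked numerically. The sketch below uses overflow-avoiding forms of the sigmoid and softplus; the function names and the test point `x = 1.7` are arbitrary choices made here.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def softplus(x):
    """log(1 + exp(x)), written to avoid overflow for large x."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def logit(p):
    """Inverse of the sigmoid on (0, 1)."""
    return math.log(p / (1.0 - p))

x = 1.7
# Each value is the absolute residual of one identity; all should be near zero.
checks = {
    "1 - sigma(x) = sigma(-x)": abs((1 - sigmoid(x)) - sigmoid(-x)),
    "log sigma(x) = -zeta(-x)": abs(math.log(sigmoid(x)) + softplus(-x)),
    "zeta(x) - zeta(-x) = x": abs(softplus(x) - softplus(-x) - x),
    "logit(sigma(x)) = x": abs(logit(sigmoid(x)) - x),
}
```

A numeric check at one point is not a proof, but it is a cheap way to catch sign errors when using these formulas.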

3.11 Bayes' Rule

In essence,

$$P(x\mid y)=\frac{P(x)P(y\mid x)}{P(y)}$$
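As a worked example with made-up numbers, suppose $x$ is a rare condition and $y$ is a positive test result. The denominator $P(y)$ is obtained by summing $P(y\mid x)P(x)$ over both values of $x$.

```python
# Hypothetical numbers: x is "has condition", y is "test is positive".
p_x = 0.01              # prior P(x)
p_y_given_x = 0.95      # P(y | x)
p_y_given_not_x = 0.05  # P(y | not x)

# P(y) = sum over x of P(y | x) P(x)
p_y = p_y_given_x * p_x + p_y_given_not_x * (1.0 - p_x)

# Bayes' rule: P(x | y) = P(x) P(y | x) / P(y)
p_x_given_y = p_x * p_y_given_x / p_y
```

With these assumed values the posterior is roughly 0.16, far below the 0.95 sensitivity, because the prior is so small.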

3.12 Technical Details of Continuous Variables

It's necessary to briefly discuss some details of measure theory to formalize some of our notions of continuous variables. For our purposes, we mostly care about measure theory when describing theorems that apply to most, but not all, points in $\mathbb{R}^{n}$.

A set of points that is negligibly small is said to have measure zero. Intuitively, this means such a set occupies no volume in the space we are measuring. For instance, in $\mathbb{R}^{3}$, a 3D object has positive measure, but a 2D object (polygon) or a 1D object (line) has measure zero. Also, note that the union of countably many sets that each have measure zero also has measure zero. In particular, the set of all rational numbers $\mathbb{Q}$ has measure zero in $\mathbb{R}$: $\mathbb{Q}$ is countable (every rational can be written as a pair of integers in $\mathbb{Z}\times\mathbb{Z}$), so it is a countable union of single points, each of which has measure zero.

Almost everywhere is also a useful term for us from measure theory. It is a qualifier for a property that holds throughout all of space except on a set of measure zero. As the exceptions occupy a negligible amount of space, they are ignored for most applications. There are some results in probability theory that hold for all discrete values but hold only "almost everywhere" for continuous values.

Finally, another technical detail of continuous variables involves continuous random variables that are functions of one another. Consider random variables $\mathbf{x}$ and $\mathbf{y}$, such that $\mathbf{y}=g(\mathbf{x})$ and $g$ is an invertible, continuous, differentiable transformation. Perhaps contrary to expectations, $p_{y}(\mathbf{y})\neq p_{x}(g^{-1}(\mathbf{y}))$ in general.

Consider, for instance, $g(x)=\frac{x}{2}$ and $\mathbf{x}\sim U(0,1)$, i.e. the uniform distribution over $[0,1]$. If we used the false equation above, i.e. $p_{y}(\mathbf{y})=p_{x}(g^{-1}(\mathbf{y}))$, this would imply $p_{y}(y)=p_{x}(2y)$, which leads to the conclusion $p_{y}(y)=1$ when $y\in\left[0,\frac{1}{2}\right]$, and $0$ otherwise. Then, this would imply

$$\int p_{y}(y)\,\mathrm{d}y=\frac{1}{2}$$

which is invalid for a probability density, since a density must integrate to $1$.

Why is this approach wrong, then? It fails to account for the distortion of space caused by $g$. The probability of $x$ lying in an infinitesimally small region with volume $\delta x$ is given by $p(x)\cdot\delta x$. However, the infinitesimal volume surrounding $x$ in $x$ space may have a different volume in $y$ space, since $g$ can expand or contract space.

We can correct the issue, though. We just need to preserve

$$\lvert p_{y}(g(x))\cdot\delta y\rvert=\lvert p_{x}(x)\cdot\delta x\rvert$$

We can derive from this

$$p_{x}(x)=p_{y}(g(x))\left\lvert\frac{\partial g(x)}{\partial x}\right\rvert$$

In higher dimensions, $\frac{\partial g(x)}{\partial x}$ generalizes to the Jacobian matrix of $g$, i.e. the matrix with $J_{i,j}=\frac{\partial y_{i}}{\partial x_{j}}$, and the absolute value of the derivative becomes the absolute value of the Jacobian's determinant.

$$p_{x}(x)=p_{y}(g(x))\left\lvert\det\left(\frac{\partial g(x)}{\partial x}\right)\right\rvert$$
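The $g(x)=\frac{x}{2}$ example can be checked numerically. The sketch below rearranges the scalar formula to $p_{y}(g(x))=p_{x}(x)/\lvert g'(x)\rvert$ with $g'(x)=\frac{1}{2}$, and uses a simple midpoint rule written here as a hypothetical helper `integrate`.

```python
G_PRIME = 0.5  # derivative of g(x) = x / 2

def p_x(x):
    """Density of U(0, 1)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def g_inverse(y):
    """Inverse of g(x) = x / 2."""
    return 2.0 * y

def p_y_naive(y):
    """The incorrect density that ignores the change of scale."""
    return p_x(g_inverse(y))

def p_y_corrected(y):
    """p_y(y) = p_x(g^{-1}(y)) / |g'(g^{-1}(y))|."""
    return p_x(g_inverse(y)) / abs(G_PRIME)

def integrate(f, lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

naive_total = integrate(p_y_naive, -1.0, 1.0)          # close to 0.5
corrected_total = integrate(p_y_corrected, -1.0, 1.0)  # close to 1.0
```

Only the corrected density integrates to one, which is what the derivative factor is there to guarantee.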

3.13 Information Theory

Information theory is the study of quantifying how much information is present in a signal. For our purposes, it's mostly useful for characterizing probability distributions and quantifying the similarity between distributions.

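The fundamental idea in information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. More specifically,

  1. Likely events should have low information content, and events that are guaranteed to happen should have no information content at all.
  2. Less likely events should have higher information content.
  3. Independent events should have additive information.

A quantity that satisfies all three of the above properties is the self-information of an event $x$.

$$I(x)=-\log P(x)$$
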
warning

In this book, $\log$ always means the natural logarithm.

With $\log=\ln$, $I(x)$ is in units of nats. One nat is the amount of information gained by observing an event of probability $\frac{1}{e}$. Had we used $\log=\log_{2}$, the units would be bits or shannons. The idea is the same, however.

This function can naturally be applied to discrete variables. However, some properties are lost for continuous variables: for example, a point where the density equals $1$ has zero information, despite not being guaranteed to occur.

Self-information also considers only a single outcome. The uncertainty in an entire probability distribution can be quantified with Shannon entropy.

$$H(x)=\mathbb{E}_{x\sim P}[I(x)]=-\mathbb{E}_{x\sim P}[\log P(x)]$$

This is also denoted $H(P)$, and describes the expected amount of information provided by an event drawn from the distribution $P$. It also gives a lower bound on the average number of bits (when the logarithm is base 2; nats for the natural logarithm) needed to encode symbols drawn from $P$. Nearly deterministic distributions have low entropy; near uniform distributions have high entropy. When $x$ is continuous, Shannon entropy is additionally known as differential entropy.

We can also measure the difference between two distributions $P(x)$ and $Q(x)$ over the same random variable $x$ using Kullback-Leibler (KL) divergence.

$$D_{KL}(P\parallel Q)=\mathbb{E}_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right]=\mathbb{E}_{x\sim P}[\log P(x)-\log Q(x)]$$

For discrete variables, it can be interpreted (in information theory terms) as the extra amount of bits/nats required to send a message containing symbols drawn from a probability distribution $P$ when using a code designed to minimize the length of messages drawn from probability distribution $Q$.

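KL divergence has several useful properties. It is nonnegative, and it is $0$ if and only if $P$ and $Q$ are the same distribution in the case of discrete variables, or equal almost everywhere in the case of continuous variables.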

warning

$D_{KL}$ is not a true distance measure between distributions because it is asymmetric: in general, $D_{KL}(P\parallel Q)\neq D_{KL}(Q\parallel P)$.

A closely related quantity to KL divergence is the cross-entropy $H(P,Q)=H(P)+D_{KL}(P\parallel Q)$.

$$H(P,Q)=-\mathbb{E}_{x\sim P}[\log Q(x)]$$

Minimizing cross-entropy with respect to $Q$ is equivalent to minimizing the KL divergence, since $H(P)$ depends only on $P$ and not on $Q$ (recall that $H(P)$ is the Shannon entropy of the distribution $P$).

warning

$0\log 0$ frequently appears in these expressions. By convention, it is treated as $\lim_{x\to 0}x\log x=0$.
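For discrete distributions given as lists of probabilities, these quantities can be sketched in a few lines. The function names and the example distributions `p` and `q` are assumptions made here; terms with $P(x)=0$ are skipped, following the $0\log 0=0$ convention.

```python
import math

def entropy(p):
    """Shannon entropy H(P) in nats; terms with P(x) = 0 contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -E_{x~P}[log Q(x)] in nats."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # P puts mass where Q has none
            total -= pi * math.log(qi)
    return total

def kl_divergence(p, q):
    """D_KL(P || Q) = E_{x~P}[log P(x) - log Q(x)] in nats."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # P puts mass where Q has none
            total += pi * (math.log(pi) - math.log(qi))
    return total

# Hypothetical distributions over three states.
p = [0.5, 0.5, 0.0]
q = [0.8, 0.1, 0.1]
```

With these values, `kl_divergence(p, q)` is finite while `kl_divergence(q, p)` is infinite, which illustrates the asymmetry, and `cross_entropy(p, q)` equals `entropy(p) + kl_divergence(p, q)` up to rounding.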

3.14 Structured Probabilistic Models

In machine learning, we frequently encounter probability distributions over a very large number of random variables, but with relatively few direct interactions between variables. Using a single function to describe the joint probability distribution is, thus, very inefficient. Instead, it's better to split the distribution into many factors.

Consider three random variables $a,b,c$ such that $a$ influences $b$ and $b$ influences $c$, but $a$ and $c$ are independent given $b$. Then, we can factor the probability distribution as

$$p(a,b,c)=p(a)p(b\mid a)p(c\mid b)$$

These factorizations significantly reduce the number of parameters required to describe the distribution, since the number of parameters of each factor $p(\dots)$ is exponential in the number of variables it involves. For example, if each variable takes $n$ values, a table for $p(a,b,c)$ has $O(n^{3})$ entries, while $p(a)$ has $O(n)$ and $p(b\mid a)$ has $O(n^{2})$.
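A back-of-the-envelope count makes the saving concrete. The sketch assumes every variable has the same number of states `n` and counts raw table entries (ignoring the small reduction from normalization constraints); `table_size` is a hypothetical helper.

```python
def table_size(num_states, num_variables):
    """Entries in a full probability table over num_variables variables."""
    return num_states ** num_variables

n = 10  # hypothetical number of states per variable

full_joint = table_size(n, 3)       # p(a, b, c)
factored = (
    table_size(n, 1)                # p(a)
    + table_size(n, 2)              # p(b | a)
    + table_size(n, 2)              # p(c | b)
)
```

With ten states per variable, the full table needs 1000 entries while the factored form needs 210, and the gap widens quickly as more variables are added.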

A probability distribution's factorization is frequently described using a graph, known as a structured probabilistic model or graphical model.

There are two types of structured probabilistic models: directed and undirected. In both kinds of graph, each node corresponds to a random variable, and each edge $(u,v)$ denotes a direct interaction between $u$ and $v$.

Directed models represent factorizations into conditional probability distributions, like above. In particular, a directed model contains one factor for every random variable $x_{i}$ in the distribution, and that factor consists of the conditional distribution over $x_{i}$ given the parents of $x_{i}$, denoted $Pa_{\mathcal{G}}(x_{i})$.

$$p(x)=\prod_{i}p(x_{i}\mid Pa_{\mathcal{G}}(x_{i}))$$

directed-graph-model.png

Undirected models represent factorizations into a set of functions that are typically not probability distributions. Any subset of nodes that is fully connected in $\mathcal{G}$ is called a clique. Each clique $\mathcal{C}^{(i)}$ in an undirected model is associated with a factor $\phi^{(i)}(\mathcal{C}^{(i)})$, which, again, need not be a probability distribution; it can be any function whose output is nonnegative.

The probability of a particular assignment of the random variables is proportional to the product of all of these factors evaluated at that assignment. To convert this into a probability, we divide by a normalizing constant $Z$, defined as the sum or integral of the product of the $\phi$ functions over all states, so that

$$p(x)=\frac{1}{Z}\prod_{i}\phi^{(i)}(\mathcal{C}^{(i)})$$

undirected-graph-model.png
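For a tiny undirected model, the normalizing constant can be computed by brute force. The sketch below assumes a chain $a - b - c$ of binary variables with two made-up clique factors, `phi_ab` and `phi_bc`, that reward neighboring variables for agreeing.

```python
from itertools import product

# Hypothetical clique factors; any nonnegative function would do.
def phi_ab(a, b):
    return 2.0 if a == b else 1.0

def phi_bc(b, c):
    return 3.0 if b == c else 1.0

def unnormalized(a, b, c):
    """Product of all clique factors for one assignment."""
    return phi_ab(a, b) * phi_bc(b, c)

# Enumerate every joint state (a, b, c) of three binary variables.
states = list(product([0, 1], repeat=3))

# Z sums the unnormalized product over all states.
Z = sum(unnormalized(*s) for s in states)
p = {s: unnormalized(*s) / Z for s in states}
```

Brute-force enumeration only works for very small models, since the number of states grows exponentially with the number of variables.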

Notably, these two graphical representations of factorizations do not denote two mutually exclusive families of probability distributions; a probability distribution may be described with either method.