Bayesian inference: Introduction

On the notations

Consider the following function defined as:

f:A\times \Theta\to B \\f:(x;\theta) \mapsto y\quad \text{i.e.} \ y:=f(x;\theta)

The set A is the domain of the function and x is its argument. The set \Theta is the parameter space, and the parameters \theta of the function belong to that set. For example, f(x,y;\alpha,\beta):=\alpha \cos(x^y/\beta)\ \text{ s.t. } (\alpha,\beta)\in \R^2 has two parameters. The following notation is sometimes used to differentiate between the argument and parameter variables of a function:


f(x|\theta)\equiv f(x;\theta)

which reads “f of x given \theta”.
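In code, fixing the parameters while leaving the argument free can be mirrored by a closure. The following sketch (function and parameter values are made up for illustration) shows “f of x given \theta” for the example f(x;\alpha,\beta)=\alpha\cos(x/\beta):

```python
import math

# Hypothetical sketch: f(x; alpha, beta) = alpha*cos(x/beta).
# Fixing the parameters (alpha, beta) yields a function of the argument x alone,
# i.e. "f of x given theta".
def make_f(alpha, beta):
    def f(x):
        return alpha * math.cos(x / beta)
    return f

f_given_theta = make_f(alpha=2.0, beta=0.5)
print(f_given_theta(0.0))  # cos(0) = 1, so prints 2.0
```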

In probability theory, the concept of conditional probability uses the same notation, i.e. P(A|B), which reads “the probability of the event A given that the event B has occurred”; or, the probability of -the event- A conditional on -the event- B. In the case of random variables/vectors, we write P(Y\in R|X\in S).

The definition of the probability density/mass functions also uses the aforementioned notation. Consider a random variable/vector X with a probability density function (pdf) having some parameters. Then the pdf is written as:


f_X(x|\theta)\equiv f_X(x;\theta)

It should be understood from the context that \theta is not a random variable/vector and hence the above does not denote a conditional pdf. As \theta pertains to a probability distribution (model), another letter is sometimes added to the above notation in order to include the model’s type or some information about the model:


f_X(x|\theta,M)

A conditional pdf is identified with the same notation. Considering two random variables/vectors X and Y, we write:

f_{X|Y=y}(x|y)\equiv f_{X|Y}(x|y)\equiv f(x|y):=\frac{f_{X,Y}(x,y)}{f_Y(y)}\\ \ \\ f_{X|Y}(x|y):\R^m\times\R^n \to \R^+

which is a function of x if y is already fixed. This reads the pdf of X given -the observed and fixed value of- Y. Note that the notation f_{X|Y}(x|y) is just a function notation.

A conditional pdf (as a result of normalizing a joint pdf which has parameters) has parameters; if the parameters of the conditional pdf are collected in a vector and denoted by the vector \theta, the conditional pdf can be represented as:

f_{X|Y}(x|y,\theta,M)\equiv f(x|y,\theta,M)

such that it is a function of x for a fixed value y of the random variable Y and fixed values of the parameters \theta. So, the notation “|” is used to indicate both the parameters of a function and conditions on/of random variables involved in a pdf. Whatever precedes “|” is (are) the function argument(s). Henceforward, I use the notation “|” for both purposes and the context will imply the meaning.

Parametric inference

For a continuous random vector X\in R\subset \R^n, a continuous joint pdf f_X(x) can be defined as:

f_X:R\to\R^{+}\\ \ \\ P[X\in A]=\int_A f_X(x|\theta)\, dx

where \theta \in D_\theta is a set of parameters involved in the definition of the pdf.

The form f_X(x;\theta) can also be interpreted as a function of \theta at a constant x:

L:D_\theta\to \R^{+}\\ L(\theta)\equiv L_x(\theta)\equiv L(\theta|x):=f_X(x|\theta)\tag{1}

This function is called the likelihood function. The notation L(\theta|x) reads “the likelihood of \theta given the (observed) data x”. Because \theta is not a random variable/vector, the concept of a conditional pdf is irrelevant here. This is a function of the parameter \theta, not of x; hence, it is not a density function. Depending on the variable in consideration, the function f_X(x;\theta) can be interpreted as either the density function or the likelihood function. It should be noted that the parameter(s) belong to a model; this is usually implied implicitly. For example, the pdf of a normally distributed random variable has two parameters, \mu and \sigma^2.
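As a small numeric illustration (a sketch with made-up numbers, not taken from the text): the Gaussian pdf evaluated at a fixed observation, read as a function of \mu, is a likelihood, and for a single observation it peaks at \mu = x:

```python
import math

def normal_pdf(x, mu, sigma2):
    """Gaussian density f_X(x; mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

x_obs = 1.3                                 # a fixed, made-up observation
L = lambda mu: normal_pdf(x_obs, mu, 1.0)   # likelihood L(mu | x_obs), sigma^2 fixed
# Same formula as the pdf, but now read as a function of the parameter mu, not of x:
print(L(1.3) > L(0.0))   # True: the likelihood peaks at mu = x_obs
```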

Usually, the parameters of a probability distribution are unknown. Therefore, there is uncertainty about the parameters. The parameters can then be collected in a random vector \Theta. Now both the data and the parameters are random vectors (or variables). A joint pdf can be imagined for the data and the parameters of the pdf of the data:

f_{X,\Theta}(x,\theta)\equiv f_{X,\Theta}(x,\theta|M)

The notation |M reads “given a/the model M“. This notation is used merely to convey the fact that \theta pertains to a model and also to express extra information about the model; indeed, it does not indicate a (quantitative) conditional on M.

Consequently, the following conditional pdf of \Theta given the data X=x can be written as:

f_{\Theta|X=x}(\theta|x)\equiv f_{\Theta|X}(\theta|x)\equiv f(\theta|x):=\frac{f_{X,\Theta}(x,\theta)}{f_X(x)}

For a given x, i.e. data, the above is a function of \theta. Applying the definition of the conditional pdf again, we can write:

f_{\Theta|X}(\theta|x)=\frac{f_{X|\Theta}(x|\theta)f_\Theta(\theta)}{f_X(x)}\\ \text{or} \\ \ \tag{2}\\ f_{\Theta|X}(\theta|x,M)=\frac{f_{X|\Theta}(x|\theta,M)f_\Theta(\theta|M)}{f_X(x|M)}

M indicates that a model (joint distribution of data and parameters) is already assumed; it can be an implicit assumption. Also, any other background information about the model (e.g. the range of the parameters) is integrated into M. The terms are named and explained as:

1) The likelihood: f_{X|\Theta}(x|\theta) is called the sampling distribution. This is the distribution of the (observed) data conditional on the parameters. In other words, it gives the pdf value of the (measured) data for some values of the parameters. This term is the pdf of the data x conditioned on \theta; however, once the data have been observed/measured and hence become fixed, the term becomes a function of the parameters \theta. Therefore, this term is referred to as the likelihood of the data. There is a subtle point here. The term is naturally a pdf in x and hence can be imagined as:


h_X(x;\theta):=f_{X|\Theta}(x|\theta)

Therefore, h_X(x;\theta) becomes a likelihood function if considered as a function of \theta, i.e. L(\theta|x):=h_X(x;\theta) (once the data have been observed/measured, it automatically becomes a function of the parameters). This likelihood function is the key function describing both the phenomenon (expressed by the model) and the measurements. It is sometimes referred to as the likelihood of the parameters, but that is not really a suitable name, because the likelihood has the same functional form as the conditional pdf of the data. Moreover, the unit of f_{X|\Theta}(x|\theta) is \frac{1}{D} where D is the unit of the data. This is because the unit of f_{\Theta}(\theta) is \frac{1}{M} (M being the unit of the parameters) and the unit of f_{X,\Theta}(x,\theta) is \frac{1}{D}\frac{1}{M}.

2) The prior: f_\Theta(\theta) is called the prior distribution of \Theta. This expresses the distribution of \Theta before observing the data (regardless of data). It encapsulates the information we have or assumed about the possible values of the model parameters regardless of the data.

3) The evidence: f_X(x) is the distribution (marginal pdf) of data. In mathematical notation: f_X(x|M)= \int_{\text{all values of }\theta} f_{\Theta,X}(\theta,x|M)d\theta. The evidence is not a function of the parameters, therefore, it is just a normalization constant for the posterior.

4) The posterior: f_{\Theta|X}(\theta|x) is called the posterior distribution (pdf) of \Theta. This is the pdf of the parameters \Theta (function of the parameters) given the model information and (observed) data.

Keeping in mind that each of the different pdf’s in Eq. 2 is a function of the variable preceding “|” and pertains to the pdf of the function’s argument, we usually write the equation as (f is replaced with p):

p(\theta|x,M)=\frac{\textcolor{blue}p(x|\theta,M)\textcolor{LimeGreen}p(\theta|M)}{\textcolor{red}p(x|M)}\\ \ \\ \text{conditional pdf}=\frac{\text{conditional pdf}\times \text{marginal pdf}}{\text{marginal pdf}} \tag{3}

By the aforementioned notation, the term(s) after “|” convey that the pdf is a conditional pdf if the term(s) pertain to (are values of) a random vector/variable. E.g.: p(x|\theta,M)\equiv p_{X|\Theta}(x|\theta,M).

For given/measured data (fixed), the numerator of Eq. 3 is a function of the parameters while the denominator is a constant. Therefore, the following representations are defined and usually used in inference:

p(\theta|x,M)=\frac{1}{Z}p(x|\theta,M)p(\theta|M)\\ \ \\ p^*(\theta|x,M):=p(x|\theta,M)p(\theta|M)\tag{4}
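Eq. 4 can be sketched numerically. The toy model below is an assumption for illustration (a Bernoulli likelihood with a flat prior; all numbers made up): the unnormalized posterior p^* is computed on a grid, and the normalizer Z plays the role of the evidence.

```python
# Toy sketch of Eq. 4 (assumed Bernoulli model, flat prior; numbers are made up).
thetas = [i / 100 for i in range(1, 100)]   # grid over the parameter theta
x = [1, 0, 1, 1]                            # hypothetical observed coin flips

def p_star(th):
    """Unnormalized posterior p*(theta|x) = p(x|theta) * p(theta), flat prior."""
    lik = 1.0
    for xi in x:
        lik *= th if xi == 1 else (1.0 - th)
    return lik * 1.0                        # times the flat prior p(theta) = 1

Z = sum(p_star(t) for t in thetas) * 0.01   # Riemann-sum approximation of the evidence
post = [p_star(t) / Z for t in thetas]      # normalized posterior on the grid
print(abs(sum(p * 0.01 for p in post) - 1.0) < 1e-9)  # True: integrates to ~1
```

With the flat prior, the posterior peaks at the sample proportion 3/4, as expected.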

Application: Data modeling with parametric models

A generative model (also called the forward model) is the theoretical agent (usually a mathematical equation) that simulates/generates/predicts the observable data; the model is identified by its form (equation) and its parameters. A general form of a non-stochastic or deterministic generative model can be written as:

f:\R^m \to\R^n\\ y_s:=f(x|\theta_{f})\equiv f(x;\theta_{f})

where y_s is the simulated data/measurements, x is the index at which the data is measured (e.g. time, a point in space, etc.), and \theta_{f} \in \R^k contains the parameters of the function (model). A stochastic model possesses some randomness; unlike a deterministic model that produces the same output for the same parameters (and initial conditions), a stochastic model produces different outputs. Stochastic models predict distribution(s) over the outputs.

A generative model predicts the data in the absence of measurement noise, i.e. y\approx y_s for measurements y. It means that the model gives the true value, while the measurements are polluted with noise, introducing uncertainty into the measurements. We should distinguish between the true/ideal data and the noisy data/measurements. Noisy data (out of measurements) are what we usually have. The model form and its parameters are always determined based on the observed noisy data/measurements. Therefore, we should know how the measurement process influences the true data/values. To this end, a measurement model, also called a noise model, is also to be defined. This model contains the uncertainty/noise in the measured data. The measurement model can then be written as y=y_s+\epsilon. The variable \epsilon is a random variable/vector introducing the noise. Consequently, the measurement y becomes a random variable/vector as well. Having said that, and using capital or bold letters for random vectors/variables and small or non-bold letters for non-random variables, we write:

Y=y_s+\boldsymbol {\epsilon} \quad \in \Omega \to \R^n\\ \text{ s.t } y_s:=f(x|\theta_{f})\quad \text{and }\ \boldsymbol \epsilon:=\epsilon(\omega):\Omega\to \R^n\tag{5}

where \Omega is a probability sample space. Note that f(x|\theta_{f}) is a function of x, not a pdf. Y is inherently a stochastic process/random field, i.e. at each x (index), a random variable exists. The noise \epsilon has a distribution and indeed a pdf, p_{\boldsymbol \epsilon}(\epsilon|\theta_{\epsilon}), with its own parameters \theta_{\epsilon}. In such a model, the noise average is set to zero, i.e. E[\boldsymbol\epsilon]=0; therefore, the mean of a measurement at each point is equal to the ideal data. This is simply because the simulating part is not a random vector, hence E[Y]=f(x;\theta_{f})+0.

Because y_s is not a random variable/vector, the probability distributions of Y and of the noise (model) are the same, only with different means. In such a data modeling, the noise distribution is assumed (known) and the simulating function parameters are sought. This approach is not Bayesian.
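A Monte-Carlo sketch of Eq. 5 (the forward model, parameter values, and Gaussian noise below are all assumed for illustration): since E[\epsilon]=0, the sample mean of Y approaches the noise-free prediction f(x;\theta_f).

```python
import math
import random

random.seed(0)

def f(x, a, b):                       # hypothetical deterministic generative model
    return a * math.cos(x / b)

a, b, x = 2.0, 0.5, 0.3               # made-up parameters and index point
# Y = f(x; theta_f) + eps with E[eps] = 0 (Eq. 5):
ys = [f(x, a, b) + random.gauss(0.0, 0.1) for _ in range(20000)]
mean_y = sum(ys) / len(ys)
print(abs(mean_y - f(x, a, b)) < 0.01)   # True: E[Y] = f(x; theta_f)
```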

A Bayesian approach to data modeling arises when the parameters of the generative model (simulating function) are also considered to be random variables, \Theta_f; this means uncertainty is introduced into the parameters. Hence, a probability model is assigned to the parameters. The data model can now be written as:

Y=Y_s+\boldsymbol {\epsilon} \quad \in \Omega \to \R^n\\ \ \\ \text{ s.t }\quad Y_s:=f(x|\Theta_{f})\equiv f(x;\Theta_{f}) \quad \text{and }\ \boldsymbol \epsilon:=\epsilon(\omega):\Omega\to \R^n\tag{6}

where x indexes Y_s. Now the measurement model is the result of sum of the two random variables/vectors, i.e. a function of a random vector/variable, and the noise. The random variable/vector \Theta_f has a pdf with its own parameters:


p_{\Theta_f}(\theta_f|w)\tag{7}

where w is a vector of parameters. The assumption on the probability distribution of the parameters of a generative model enforces a prior on the model’s parameters. This prior does not depend on the data or the noise.

The probability distribution of noise is usually/initially assumed. What remains unknown are the probability distributions of Y and \Theta_{f}. In this regard, the joint probability distribution and hence the joint pdf of these two unknown random variables/vectors can be considered at each index x:

p_{Y,\Theta_{f}}(y,\theta_{f}|w,\theta_{\epsilon},M) \quad \text{or}\tag{8}\\ \ \\ p_{Y,\Theta_{f}}(y,\theta_{f}|x,w,\theta_{\epsilon},M)\quad \text{or}\\ \ \\ p_{Y,\Theta_{f}}(y(x),\theta_{f}|w,\theta_{\epsilon},M)

Using Eq. 3 (Bayes’ theorem), we write:


p(\theta_f|y,M)=\frac{p(y|\theta_f,M)p(\theta_f|M)}{p(y|M)}\tag{9}

where the assumed/preset parameters (in Eq. 8) are integrated into the model information.

The term p(y|\theta_f,M), namely the likelihood of the data once observed, can be written in terms of the noise pdf if the noise random vector/variable is assumed to be independent of the parameters, which is usually the case. Denoting the cumulative distribution function by F_X(\cdot), and considering Eq. 6 at one fixed indexing point x, we can write:

\begin{aligned} Y=Y_s+\boldsymbol \epsilon\quad \text{s.t}\quad Y_s=g(\Theta),\quad \Theta\equiv\Theta_f, \quad \Theta\text{ and } \boldsymbol \epsilon\text{ are independent} \\ \ \\ F_{Y|\Theta}(y|\Theta=\theta)=P(Y \le y|\Theta=\theta)=P(g(\Theta)+\boldsymbol \epsilon \le y|\Theta=\theta)=\\ \ \\ \frac{P(g(\Theta)+\boldsymbol \epsilon \le y,\Theta=\theta)}{P(\Theta=\theta)}=\frac{P(\boldsymbol \epsilon \le y-g(\theta),\Theta=\theta)}{P(\Theta=\theta)}=\frac{P(\boldsymbol \epsilon \le y-g(\theta))P(\Theta=\theta)}{P(\Theta=\theta)}=\\ \ \\ P(\boldsymbol \epsilon \le y-g(\theta))=F_{\boldsymbol \epsilon}(y-g(\theta)) \Rightarrow\\ \ \\ p_{Y|\Theta}(y|\theta)\equiv p(y|\theta)=p_{\boldsymbol \epsilon}(y-g(\theta))\\ \ \\ \text{Resuming the notations and considering the index x:} \\ \ \\ p(y|\theta_f,x,w,\theta_{\epsilon},M)=p_{\boldsymbol \epsilon}(y-f(x;\theta_f)) \end{aligned} \tag{10}
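Eq. 10 in code (a sketch with an assumed linear forward model and assumed Gaussian noise; the measurement value is made up): the likelihood of y given \theta is the noise density evaluated at the residual y-f(x;\theta).

```python
import math

def noise_pdf(e, sigma=0.1):
    """Assumed Gaussian noise density p_eps(e)."""
    return math.exp(-e ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def f(x, theta):                      # hypothetical forward model f(x; theta)
    return theta * x

x, y = 2.0, 4.1                       # one index point and one made-up measurement
likelihood = lambda theta: noise_pdf(y - f(x, theta))   # Eq. 10
print(likelihood(2.05) > likelihood(1.0))  # True: zero residual maximizes it
```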

It is worthy of note that if the parameter vector \Theta_f is not a random vector (as in the model in Eq. 5) but just a deterministic parameter, denoted by \theta_f, then at a particular point indexed by x, we have Y=g(\theta_f)+\boldsymbol \epsilon. This Y is different from the random vector Y=Y_s+\boldsymbol \epsilon; therefore, we denote it by Y^*. For a deterministic \theta_f, we have p_{Y^*}(y^*)=p_{\boldsymbol \epsilon}(y^*-g(\theta)), because:

\begin{aligned} Y^*=g(\theta)+\boldsymbol \epsilon \Rightarrow F_{Y^*}(y^*)=P(g(\theta)+\boldsymbol \epsilon\le y^*)=P(\boldsymbol \epsilon \le y^*-g(\theta))=F_{\boldsymbol \epsilon} (y^*-g(\theta)) \\ \ \\ \text{implying}\quad p_{Y^*}(y^*)=p_{\boldsymbol \epsilon}(y^*-g(\theta)) \end{aligned} \tag{11}

As \theta becomes the parameter of the function p_{Y^*}(y^*), we can write p_{Y^*}(y^*|\theta)= p_{\boldsymbol \epsilon}(y^*-g(\theta)). Comparing with the case of a probabilistic parameter, we can say that for a given value of \Theta, the random vector Y becomes Y^*, i.e. Y|\Theta=\theta with the pdf as in Eq. 11. In other words,

\Theta=\theta\quad\Rightarrow\quad Y|(\Theta=\theta):=Y^*\\ \ \\ p_{Y|\Theta}(y|\theta)=p_{Y*}(y^*)=p_{\boldsymbol \epsilon}(y^*-g(\theta))

Note that for any two random vectors X and Y, the term Y|X=x (or vice versa) is another random vector, say Z:=Y|X=x, whose distribution is the conditional distribution of Y given X=x.

In either of the approaches on the parameters, Y or Y^* is to model the data. Therefore, samples of observed data values are substituted for them in order to find the model parameters.

It is common to assume a (multivariate) Gaussian distribution for the noise, \mathscr N(0,\Sigma_{\epsilon \epsilon}); the noise distribution then has a single parameter, \Sigma_{\epsilon \epsilon}. Moreover, the noises (the random vector) of all measurements (a vector of measurements indexed by x) are often assumed to be iid, meaning that they have the same variance, i.e. identically distributed, and are probabilistically independent of each other, i.e. independent; this implies that the covariance matrix of the noise is diagonal. In this regard (for y or y^*):

p_{\boldsymbol \epsilon}(y-g(\theta))=\frac{1}{(2\pi)^{n/2}|\Sigma_{\epsilon \epsilon}|^{\frac{1}{2}}}\exp\big(-\frac{1}{2}(y-f(x;\theta_{f}))^\text T \Sigma_{\epsilon \epsilon}^{-1}(y-f(x;\theta_{f}))\big)

and with the iid assumption for the noise, its covariance matrix becomes diagonal and isotropic, having \sigma_{\epsilon}^2 as its diagonal elements:

p_{\boldsymbol \epsilon}(y-g(\theta))=\frac{1}{(2\pi \sigma_{\epsilon}^2)^{n/2} }\exp\big(-\frac{1}{2\sigma_{\epsilon}^2}(y-f(x;\theta_{f}))^\text T (y-f(x;\theta_{f}))\big)
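As a consistency check (a sketch with made-up residuals): with \Sigma_{\epsilon\epsilon}=\sigma_\epsilon^2 I, the multivariate expression above equals the product of n univariate Gaussian densities of the residuals.

```python
import math

sigma2 = 0.25                      # assumed common noise variance sigma_eps^2
r = [0.1, -0.3, 0.2]               # made-up residuals y - f(x; theta_f)
n = len(r)

# iid multivariate form: (2*pi*sigma^2)^(-n/2) * exp(-r^T r / (2*sigma^2))
joint = math.exp(-sum(ri * ri for ri in r) / (2 * sigma2)) / (2 * math.pi * sigma2) ** (n / 2)

# product of n univariate Gaussian densities
prod = 1.0
for ri in r:
    prod *= math.exp(-ri * ri / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

print(abs(joint - prod) < 1e-12)   # True: the two forms agree
```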

One of the simplest models is a straight line; it simulates 1D data, like temperatures measured at different moments. The model is

y_s=f(x;\theta_f):=\theta_1 x+\theta_2

To be reviewed and continued

Probability and Stats Cheat Sheet (1)

0- Conventions

1- r-Vectors in matrix operations are considered as column vectors (matrices):

x\in \mathbb{R}^r\equiv\begin{bmatrix}x_1\\x_2\\x_3\\ \vdots \\ x_r \end{bmatrix}\in\mathbb{R}^{r\times 1} \implies x^{\rm T}=\begin{bmatrix}x_1&x_2&x_3& \dots & x_r \end{bmatrix}

2- Column vector inner/dot product:

x\cdot y \equiv x^\text Ty=y^\text Tx\ \in \mathbb{R} \quad \forall x,y\in \mathbb{R}^{r\times 1}

1- Population expected value and covariance matrix

1- X a random r-vector [X_1, \dots, X_r]^\text{T} :

\mu_X= \text E[X]:=[\text E[X_1],\dots,\text E[X_r]]^\text T=[\mu_1,\dots,\mu_r]^\text T\\ \ \\ \text E[X^\text T]:=(\text E[X])^\text T\\ \ \\ \Sigma_{XX}:=\underset{\color{blue}{\text{or } Var(X)}}{\text{Cov}}(X,X):=\text E\big[(X-\mu_X)(X-\mu_X)^\text T\big]=[\sigma_{ij}^2] \in\mathbb{R}^{r\times r}

where, \sigma_{ij}^2=\text{var}(X_i)=\text E[(X_i-\mu_i)^2] \ \text { for }i=j, and \sigma_{ij}^2=\text{cov}(X_i,X_j)=\text E[(X_i-\mu_i) (X_j-\mu_j) ] \ \text { for } i\ne j.

\Sigma_{XX}=\text E[XX^\text T]-\mu_X\mu_X^\text T\\ \ \\ \Sigma_{XX}=\text E[XX^\text T] \iff \mu_X=0
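The identity above can be checked numerically (a sketch treating a small made-up sample as the entire, equally likely, population):

```python
import numpy as np

# Columns of X are 4 equally likely outcomes of a random 2-vector (made-up data).
X = np.array([[1., 2., 3., 4.],
              [2., 1., 0., 1.]])
mu = X.mean(axis=1, keepdims=True)                 # mu_X
Sigma = (X - mu) @ (X - mu).T / X.shape[1]         # E[(X - mu)(X - mu)^T]
Sigma_alt = (X @ X.T) / X.shape[1] - mu @ mu.T     # E[X X^T] - mu mu^T
print(np.allclose(Sigma, Sigma_alt))               # True
print(np.allclose(Sigma, Sigma.T))                 # True: Sigma_XX is symmetric
```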

2- For random vectors X\in \mathbb{R}^{r\times 1} and Y\in\mathbb{R}^{s\times 1}:

\Sigma_{XY}:=\text{Cov}(X,Y):=\text E\big [(X-\mu_X)(Y-\mu_Y)^\text T\big]=\Sigma_{YX}^\text T\ \in\mathbb{R}^{r\times s}\\ \ \\ \Sigma_{XY}=\text E\big [XY^\text T\big] \iff \mu_X=\mu_Y=0

Remark: \Sigma_{XX} is a symmetric matrix but \Sigma_{XY} is not necessarily symmetric.

3- Y \in \mathbb {R}^{s \times 1} linearly related to X \in \mathbb {R}^{r \times 1} as Y=AX+b with A\in \mathbb {R}^{s \times r} and b\in \mathbb {R}^{s \times 1} a constant:

\mu_Y=A\mu_X + b \ \in \mathbb{R}^{s \times 1}\\ \ \\ \Sigma_{YY}=A\Sigma_{XX} A^\text T\ \in \mathbb{R}^{s \times s}
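Item 3 can be verified exactly on a finite population (a sketch; the sample, A and b below are made up):

```python
import numpy as np

X = np.array([[1., 2., 3., 4.],        # r = 2 variables, 4 equally likely outcomes
              [0., 1., 1., 2.]])
A = np.array([[1., 2.],
              [0., 1.],
              [3., 1.]])               # s = 3, r = 2
b = np.array([[1.], [0.], [-1.]])      # constant shift

mu_X = X.mean(axis=1, keepdims=True)
S_XX = (X - mu_X) @ (X - mu_X).T / X.shape[1]

Y = A @ X + b                          # Y = AX + b, applied to every outcome
mu_Y = Y.mean(axis=1, keepdims=True)
S_YY = (Y - mu_Y) @ (Y - mu_Y).T / Y.shape[1]

print(np.allclose(mu_Y, A @ mu_X + b))       # True: mu_Y = A mu_X + b
print(np.allclose(S_YY, A @ S_XX @ A.T))     # True: Sigma_YY = A Sigma_XX A^T
```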

4- X,\ Y\in \mathbb {R}^{r \times 1} and Z\in \mathbb {R}^{s \times 1}:

\text {Cov}(X+Y,Z)=\Sigma_{XZ}+\Sigma_{YZ}\\ \ \\ \text {Cov}(X+Y,X+Y)=\Sigma_{XX}+\Sigma_{XY}+\Sigma_{YX}+\Sigma_{YY}

5- X \in \mathbb {R}^{r \times 1}, Y\in \mathbb {R}^{s \times 1}, A\in \mathbb {R}^{p\times r} and B\in \mathbb {R}^{q\times s}:

\text {Cov}(AX,BY)=A\Sigma_{XY}B^\text T\ \in \mathbb {R}^{p \times q}

For the proof, expand \text E[(AX-\text E [AX])(BY-\text E [BY])^\text T].

In a special case:

\text {Var}(AX)=\text {Cov}(AX,AX)=A\Sigma_{XX}A^\text T\ \in \mathbb {R}^{p \times p}

6- \Sigma_{XX} is positive semi-definite. Proof: show that \forall u\in \mathbb{R}^{r\times 1},\ u^\text T\Sigma_{XX}u \ge 0. Use 5 in the proof.

u^\text T\Sigma_{XX}u=\text{Cov}(u^\text TX, u^\text TX)=\text{Var}(Z)\ge 0 \quad \text{where } Z:=u^\text TX \in \mathbb{R}

7- A random vector X is centred iff \mu_X=0. Any random vector can be centred by the transformation X_c=X-\mu_x. The following expressions are useful (proof is by expansion/writing the terms and induction):

\forall X\in \mathbb{R}^{r\times 1},\ Y\in \mathbb{R}^{t\times 1}\\ \ \\ \Sigma_{XY}=\text E\big [(X-\mu_X)(Y-\mu_Y)^\text T\big]=\text E\big [X_cY_c^\text T\big]=\Sigma_{X_cY_c}\\ \ \\ \Sigma_{XX}=\text E\big [(X-\mu_X)(X-\mu_X)^\text T\big]=\text E\big [X_cX_c^\text T\big]=\Sigma_{X_cX_c}\\ \ \\ \text E[X_c^\text TX_c]=\text {tr}(\Sigma_{XX})\\ \ \\ \text E[X_c^\text TC Y_c]=\text {tr}(C\Sigma_{YX}) \qquad C\in \mathbb{R}^{r\times t}\\ \ \\ \text E[Y_c^\text T C^\text T C Y_c]=\text {tr}(C\Sigma_{YY}C^\text T)

The above expressions involving the trace can take different forms according to the properties of the trace.
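These trace identities can be checked on a finite population (a sketch; the data and the matrix C are made up):

```python
import numpy as np

# Treat the columns as equally likely outcomes of random vectors X (r=2) and Y (t=3).
Xs = np.array([[1., 2., 0., 3.],
               [0., 1., 2., 1.]])                    # r = 2
Ys = np.array([[1., 0., 2., 1.],
               [2., 1., 0., 1.],
               [0., 3., 1., 0.]])                    # t = 3
C = np.array([[1., 0., 2.],
              [0., 1., 1.]])                         # C in R^{r x t}
n = Xs.shape[1]
Xc = Xs - Xs.mean(axis=1, keepdims=True)             # centred X
Yc = Ys - Ys.mean(axis=1, keepdims=True)             # centred Y
S_XX = Xc @ Xc.T / n
S_YX = Yc @ Xc.T / n                                 # Sigma_YX = E[Y_c X_c^T]

lhs1 = np.mean([Xc[:, i] @ Xc[:, i] for i in range(n)])       # E[X_c^T X_c]
lhs2 = np.mean([Xc[:, i] @ C @ Yc[:, i] for i in range(n)])   # E[X_c^T C Y_c]
print(np.isclose(lhs1, np.trace(S_XX)))              # True
print(np.isclose(lhs2, np.trace(C @ S_YX)))          # True
```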

8- Conditional probability density function:

Miscellaneous Proofs and Theorems in Probability and Statistics (1)

1- A minimization problem regarding PCA:

Given that X,\mu \in \mathbb{R}^{r\times 1},\ A \in \mathbb{R}^{r\times t},\ B \in \mathbb{R}^{t\times r}, t\le r, and A and B are both of full rank t, prove that the minimizing parameters, \{\mu,A,B\} of

\text E\big [(X-\mu - ABX)^\text T(X-\mu - ABX)\big ]\ \in \mathbb{R}^+

are:
A=\begin{bmatrix}|&|& \dots & |\\v_1 &v_2 &\dots &v_t\\|&|& \dots & | \end{bmatrix}=B^\text T\\ \ \\ \mu=(I_r-AB)\mu_X

where v_1,\dots,v_t are the first t eigenvectors of \Sigma_{XX}, sorted such that \forall v_i,v_j:\ i\lt j \iff \lambda_i(\Sigma_{XX})\ge \lambda_j(\Sigma_{XX}) (this indicates how the eigenvalues and eigenvectors are sorted, i.e. in decreasing order). \lambda_i(\Sigma_{XX}) is the i-th eigenvalue of \Sigma_{XX}; and I_r\in \mathbb{R}^{r\times r} is the identity matrix.

Let’s write C:=AB and X=X_c+\mu_x s.t X_c is the centred X, and do the expansion as follows:

\begin{aligned} \text E\big [(X_c+\mu_x-\mu – C(X_c+\mu_x))^T(X_c+\mu_x-\mu – C(X_c+\mu_x))\big]\\ =\text E\big[ (X_c-CX_c)^\text T(X_c-CX_c)\big]+\text E\big [ (X_c-CX_c)^\text T\big](\mu_x-\mu-C\mu_x) \\+(\mu_x-\mu-C\mu_x)^\text T \text E\big [ (X_c-CX_c) \big] \\+\text E\big [(\mu_x-\mu-C\mu_x)^\text T(\mu_x-\mu-C\mu_x)\big] \end{aligned}

As \text E[X_c]=0 and, in the 4th term, the expected value acts on a constant, we can write:

=\text E\big[ (X_c-CX_c)^\text T(X_c-CX_c)\big]+ (\mu_x-\mu-C\mu_x)^\text T(\mu_x-\mu-C\mu_x)

The resulting expression contains non-negative terms. A necessary condition to minimize it is to make the 2nd term attain its minimum, i.e. zero. Therefore, \mu=\mu_x-C\mu_x=(I_r-C)\mu_x.

Now, we have to find C such that it minimizes the 1st term, i.e.:

E\big[ (X_c-CX_c)^\text T(X_c-CX_c)\big]

Expanding the expression, we get the above as:

\begin{aligned} =\text E\big[X_c^\text T X_c- X_c^\text T CX_c- X_c^\text T C^\text T X_c+ X_c^\text T C^\text T CX_c\big] \end{aligned}

By the relation to the covariance matrix (see this post), above becomes:

=\text {tr}(\Sigma_{XX}-C\Sigma_{XX}-C^\text T\Sigma_{XX}+C\Sigma_{XX}C^\text T)\ \in \mathbb{R}^+

By the properties of trace, i.e. \text {tr}(A+B)= \text {tr} (A)+ \text {tr} (B) and \text {tr}(AB)= \text {tr}(BA), we have:

=\text {tr}(\Sigma_{XX}-C\Sigma_{XX}-\Sigma_{XX}C^\text T+C\Sigma_{XX}C^\text T)

As \Sigma_{XX}=\Sigma_{XX}^\text T, and by the definition of the (symmetric) matrix square root, we can factor the above expression as:

=\text {tr}\big(\Sigma_{XX}-\Sigma_{XX}+(C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})(C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})^\text T\big)\\ \ \\ =\text {tr}\big((C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})(C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})^\text T\big)

The problem is now to find C that minimizes:

\text {tr}\big((\Sigma_{XX}^{\frac{1}{2}}-C\Sigma_{XX}^{\frac{1}{2}})(\Sigma_{XX}^{\frac{1}{2}}-C\Sigma_{XX}^{\frac{1}{2}})^\text T\big)

which is equal to:


\|\Sigma_{XX}^{\frac{1}{2}}-C\Sigma_{XX}^{\frac{1}{2}}\|^2_F

with \|.\|_F being the Frobenius norm of matrices.

Considering the following facts:

1- If A\in\mathbb{R}^{r\times r} is symmetric, it is diagonalizable, and for q\in \mathbb{Q} the eigenvalues of A and A^q are related as \lambda_i(A^q)=\lambda_i^q(A). The eigenvectors of A and A^q are the same.

2- \Sigma_{XX}^{\frac{1}{2}} is symmetric and has the same rank, r, as \Sigma_{XX}\in \mathbb{R}^{r\times r}.

3- C=AB has rank at most t\le r. Therefore, C\Sigma_{XX}^{\frac{1}{2}} has rank at most t.

4- A partial/truncated singular value decomposition (SVD) of a matrix A \in \mathbb{R}^{m\times n} is A_k=\sum_{i=1}^{k}\sigma_i u_i v_i^\text T s.t. u_i, v_i and \sigma_i come from the SVD of A, i.e. A=U\Sigma V^\text T. If A is symmetric positive semi-definite, then A=V \varLambda V^\text T and A_k=\sum_{i=1}^{k}\lambda_i v_i v_i^\text T s.t. \lambda_i is an eigenvalue of A. Note that \sigma_i (and hence \lambda_i) decreases with i.

5- As a corollary of Eckart-Young-Mirsky Theorem: Let A,B\ \in \mathbb{R}^{m\times n} with \text {rank}(B)=k\le \text {rank}(A) \le \text{min}(m,n), then A_k=\argmin_{B} \|A-B\|_F.

we can write:

\argmin_B \|\Sigma_{XX}^{\frac{1}{2}}-B\|_F=\sum_{i=1}^{t}\lambda_i^{\frac{1}{2}} v_i v_i^\text T\\ \ \\ \implies C\Sigma_{XX}^{\frac{1}{2}}=\sum_{i=1}^{t}\lambda_i^{\frac{1}{2}} v_i v_i^\text T\\ \ \\ \Sigma_{XX}\text{ being positive definite} \implies C=(\sum_{i=1}^{t}\lambda_i^{\frac{1}{2}} v_i v_i^\text T)\Sigma_{XX}^{-\frac{1}{2}}\tag{1}

where \lambda_i=\lambda_i(\Sigma_{XX}), and v_i is the associated (normalized) eigenvector of \Sigma_{XX}.

As \Sigma_{XX}^\frac{1}{2}=V\varLambda^\frac{1}{2}V^\text T with \varLambda the diagonal matrix of eigenvalues of \Sigma_{XX} and V the eigenvectors’ unitary matrix, we can write: \varLambda^{-\frac{1}{2}} V^\text T\Sigma_{XX}^\frac{1}{2}=V^\text T. Therefore:

v_i^\text T=\lambda_i^{-\frac{1}{2}}v_i^\text T\Sigma_{XX}^\frac{1}{2}

Substituting the above in Eq. 1, we’ll get:

C=\sum_{i=1}^{t} v_i v_i^\text T=\begin{bmatrix}|&|& \dots & |\\v_1 &v_2 &\dots &v_t\\|&|& \dots & | \end{bmatrix} \begin{bmatrix}\text{—}v_1\text{—}\\\text{—}v_2\text{—}\\ \dots\\ \text{—}v_t\text{—} \end{bmatrix}=V^*V^{*\text T}

As C=AB, we accomplish the proof by letting A=V^*.
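A numeric sanity check of this result (a sketch; the random SPD matrix below stands in for \Sigma_{XX}): with C=V^*V^{*\text T}, the minimized objective \text{tr}(\Sigma_{XX}-C\Sigma_{XX}-\Sigma_{XX}C^\text T+C\Sigma_{XX}C^\text T) equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
Sigma = M @ M.T                             # a random SPD matrix standing in for Sigma_XX

lam, V = np.linalg.eigh(Sigma)              # eigh returns ascending eigenvalues
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]            # sort eigenpairs by decreasing eigenvalue

t = 2
Vt = V[:, :t]                               # V*: first t eigenvectors as columns
C = Vt @ Vt.T                               # C = A B with A = V*, B = V*^T

# The minimized objective tr(S - CS - SC^T + CSC^T) = sum of discarded eigenvalues:
err = np.trace(Sigma - C @ Sigma - Sigma @ C.T + C @ Sigma @ C.T)
print(np.isclose(err, lam[t:].sum()))       # True
```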

2- Eckart-Young-Mirsky Theorem

Let A,B\ \in \mathbb{R}^{m\times n} with \text {rank}(B)=k\le \text {rank}(A) \le \text{min}(m,n); then \|A-A_k\|_F \le \|A-B\|_F s.t. A_k is the partial SVD of A, i.e. A_k=\sum_{i=1}^{k}\sigma_i u_i v_i^\text T.
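The theorem can be illustrated numerically (a sketch with random matrices): the truncated SVD A_k is at least as close to A in the Frobenius norm as an arbitrary rank-k matrix B.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]        # partial SVD: sum of the first k rank-1 terms

# Any other rank-k matrix B is no better (here B is a random rank-k product):
B = rng.standard_normal((5, k)) @ rng.standard_normal((k, 4))
print(np.linalg.norm(A - A_k, 'fro') <= np.linalg.norm(A - B, 'fro'))  # True
```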