Miscellaneous Proofs and Theorems in Probability and Statistics (1)

1- A minimization problem regarding PCA:

Given that X,\mu \in \mathbb{R}^{r\times 1},\ A \in \mathbb{R}^{r\times t},\ B \in \mathbb{R}^{t\times r},\ t\le r, and that A and B are both of full rank t, prove that the minimizing parameters \{\mu,A,B\} of

\text E\big[(X-\mu-ABX)^\text T(X-\mu-ABX)\big]\ \in \mathbb{R}^+

are:

A=\begin{bmatrix}|&|& \dots & |\\v_1 &v_2 &\dots &v_t\\|&|& \dots & | \end{bmatrix}=B^\text T\\ \ \\ \mu=(I_r-AB)\mu_x

where v_1,\dots, v_t are the first t eigenvectors of \Sigma_{XX} such that \forall v_i,v_j\ i\lt j \iff \lambda_i(\Sigma_{XX})\gt \lambda_j(\Sigma_{XX}) (i.e. the eigenvectors are sorted by decreasing eigenvalue). \lambda_i(\Sigma_{XX}) is the i-th eigenvalue of \Sigma_{XX}, and I_r\in \mathbb{R}^{r\times r} is the identity matrix.
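
Before the proof, here is a minimal numerical sketch of the claim (assuming NumPy; the synthetic data X_data, the dimensions, and all variable names are hypothetical): A collects the top-t eigenvectors of \Sigma_{XX}, B=A^\text T, \mu=(I_r-AB)\mu_x, and the attained minimum should be close to the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
r, t, n = 5, 2, 100_000                 # hypothetical dimensions and sample size
X_data = rng.normal(size=(n, r)) @ rng.normal(size=(r, r))   # rows are samples
mu_x = X_data.mean(axis=0)
Sigma = np.cov(X_data, rowvar=False)

# np.linalg.eigh returns eigenvalues in ascending order; flip to the
# decreasing order used in the statement above.
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

A = eigvecs[:, :t]                      # first t eigenvectors, r x t
B = A.T                                 # t x r
mu = (np.eye(r) - A @ B) @ mu_x         # mu = (I_r - AB) mu_x

# Sample estimate of E[(X - mu - ABX)^T (X - mu - ABX)]
res = X_data - mu - X_data @ (A @ B).T
print(np.mean(np.sum(res**2, axis=1)))  # ~ lambda_{t+1} + ... + lambda_r
print(eigvals[t:].sum())
```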

Let’s write C:=AB and X=X_c+\mu_x s.t. X_c is the centred X, and expand as follows:

\begin{aligned} \text E\big[(X_c+\mu_x-\mu-C(X_c+\mu_x))^\text T(X_c+\mu_x-\mu-C(X_c+\mu_x))\big]\\ =\text E\big[ (X_c-CX_c)^\text T(X_c-CX_c)\big]+\text E\big[ (X_c-CX_c)^\text T\big](\mu_x-\mu-C\mu_x) \\+(\mu_x-\mu-C\mu_x)^\text T \text E\big[ (X_c-CX_c) \big] \\+\text E\big[(\mu_x-\mu-C\mu_x)^\text T(\mu_x-\mu-C\mu_x)\big] \end{aligned}

As \text E[X_c]=0, the 2nd and 3rd terms vanish; and since the 4th term is the expected value of a constant, we can write:

=\text E\big[ (X_c-CX_c)^\text T(X_c-CX_c)\big]+ (\mu_x-\mu-C\mu_x)^\text T(\mu_x-\mu-C\mu_x)

The resultant expression is a sum of two nonnegative terms, and only the 2nd one depends on \mu. The minimum is therefore attained by making the 2nd term zero, i.e. \mu=\mu_x-C\mu_x=(I_r-C)\mu_x.

Now, we have to find C such that it minimizes the 1st term, i.e.:

\text E\big[ (X_c-CX_c)^\text T(X_c-CX_c)\big]

Expanding the expression, we get:

=\text E\big[X_c^\text T X_c- X_c^\text T CX_c- X_c^\text T C^\text T X_c+ X_c^\text T C^\text T CX_c\big]

By the relation \text E\big[X_c^\text T M X_c\big]=\text {tr}(M\Sigma_{XX}) between such quadratic forms and the covariance matrix (see this post), the above becomes:

=\text {tr}(\Sigma_{XX}-C\Sigma_{XX}-C^\text T\Sigma_{XX}+C\Sigma_{XX}C^\text T)\ \in \mathbb{R}^+
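
For completeness, this identity (for any constant matrix M) follows from writing the scalar quadratic form as its own trace, then using the cyclic property of the trace and the linearity of expectation:

\text E\big[X_c^\text T M X_c\big]=\text E\big[\text {tr}(X_c^\text T M X_c)\big]=\text E\big[\text {tr}(M X_c X_c^\text T)\big]=\text {tr}\big(M\,\text E[X_c X_c^\text T]\big)=\text {tr}(M\Sigma_{XX})

It is applied above with M=I_r, C, C^\text T and C^\text T C in turn, noting that \text {tr}(C^\text T C\Sigma_{XX})=\text {tr}(C\Sigma_{XX}C^\text T).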

By the properties of trace, i.e. \text {tr}(A+B)= \text {tr} (A)+ \text {tr} (B) and \text {tr}(AB)= \text {tr}(BA), we have:

=\text {tr}(\Sigma_{XX}-C\Sigma_{XX}-\Sigma_{XX}C^\text T+C\Sigma_{XX}C^\text T)

As \Sigma_{XX}=\Sigma_{XX}^\text T, and using the symmetric matrix square root \Sigma_{XX}^{\frac{1}{2}} (so that \Sigma_{XX}^{\frac{1}{2}}\Sigma_{XX}^{\frac{1}{2}}=\Sigma_{XX} and (\Sigma_{XX}^{\frac{1}{2}})^\text T=\Sigma_{XX}^{\frac{1}{2}}), we can factor the above expression as:

=\text {tr}\big(\Sigma_{XX}-\Sigma_{XX}+(C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})(C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})^\text T\big)\\ \ \\ =\text {tr}\big((C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})(C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})^\text T\big)
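
To verify the factorization, expand the product:

(C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})(C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})^\text T=C\Sigma_{XX}C^\text T-C\Sigma_{XX}-\Sigma_{XX}C^\text T+\Sigma_{XX}

which is exactly the expression inside the trace; the \Sigma_{XX}-\Sigma_{XX} pair written above simply cancels.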

The problem is now to find C that minimizes (negating both factors leaves the product unchanged):

\text {tr}\big((\Sigma_{XX}^{\frac{1}{2}}-C\Sigma_{XX}^{\frac{1}{2}})(\Sigma_{XX}^{\frac{1}{2}}-C\Sigma_{XX}^{\frac{1}{2}})^\text T\big)

which is equal to:

\|\Sigma_{XX}^{\frac{1}{2}}-C\Sigma_{XX}^{\frac{1}{2}}\|_F^2

with \|\cdot\|_F being the Frobenius norm of matrices; minimizing the squared norm is equivalent to minimizing the norm itself.
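
The equality holds because the trace of MM^\text T sums the squared entries of M:

\text {tr}(MM^\text T)=\sum_{i}(MM^\text T)_{ii}=\sum_{i,j}m_{ij}^2=\|M\|_F^2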

Considering the following facts:

1- If A\in\mathbb{R}^{r\times r} is symmetric, it is diagonalizable, and for q\in \mathbb{Q} (with A positive semidefinite so that fractional powers are well-defined) the eigenvalues of A and A^q are related as \lambda_i(A^q)=\lambda_i^q(A). The eigenvectors of A and A^q are the same.

2- \Sigma_{XX}^{\frac{1}{2}} is symmetric and, \Sigma_{XX} being positive definite, has the same rank, r, as \Sigma_{XX}\in \mathbb{R}^{r\times r}.

3- C=AB has rank at most t\le r. Therefore, C\Sigma_{XX}^{\frac{1}{2}} has rank at most t.

4- A partial/truncated singular value decomposition (SVD) of a matrix A \in \mathbb{R}^{m\times n} is A_k=\sum_{i=1}^{k}\sigma_i u_i v_i^\text T s.t. u_i, v_i and \sigma_i are from the SVD of A, i.e. A=U\Sigma V^\text T. If A is symmetric positive semidefinite, then A=V \varLambda V^\text T and A_k=\sum_{i=1}^{k}\lambda_i v_i v_i^\text T s.t. \lambda_i is an eigenvalue of A. Note that the \lambda_i decrease with i.

5- As a corollary of the Eckart-Young-Mirsky Theorem (stated in part 2 below): Let A,B\ \in \mathbb{R}^{m\times n} with \text {rank}(B)=k\le \text {rank}(A) \le \text{min}(m,n), then A_k=\argmin_{B} \|A-B\|_F.

and noting that, as \Sigma_{XX}^{\frac{1}{2}} is invertible, any matrix of rank at most t can be written as C\Sigma_{XX}^{\frac{1}{2}} for some C of rank at most t, we can write:

\argmin_{\text{rank}(B)\le t} \|\Sigma_{XX}^{\frac{1}{2}}-B\|_F=\sum_{i=1}^{t}\lambda_i^{\frac{1}{2}} v_i v_i^\text T\\ \ \\ \implies C\Sigma_{XX}^{\frac{1}{2}}=\sum_{i=1}^{t}\lambda_i^{\frac{1}{2}} v_i v_i^\text T\\ \ \\ \Sigma_{XX}\text{ being positive definite} \implies C=\Big(\sum_{i=1}^{t}\lambda_i^{\frac{1}{2}} v_i v_i^\text T\Big)\Sigma_{XX}^{-\frac{1}{2}}\tag{1}

where \lambda_i=\lambda_i(\Sigma_{XX}), and v_i is the associated (normalized) eigenvector of \Sigma_{XX}.
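
As a quick numerical sanity check of Eq. 1 (again a sketch assuming NumPy; all names are hypothetical), the best rank-t approximation of \Sigma_{XX}^{\frac{1}{2}} times \Sigma_{XX}^{-\frac{1}{2}} should recover C=\sum_{i=1}^{t} v_i v_i^\text T:

```python
import numpy as np

rng = np.random.default_rng(1)
r, t = 5, 2
M = rng.normal(size=(r, r))
Sigma = M @ M.T + np.eye(r)                 # a positive definite "covariance"

eigvals, V = np.linalg.eigh(Sigma)          # ascending order
eigvals, V = eigvals[::-1], V[:, ::-1]      # sort in decreasing order

# Sigma^{-1/2} via the eigendecomposition (fact 1)
inv_sqrt = V @ np.diag(eigvals**-0.5) @ V.T

# Best rank-t approximation of Sigma^{1/2} (facts 4 and 5)
best_rank_t = (V[:, :t] * eigvals[:t]**0.5) @ V[:, :t].T

# Eq. 1: C = (sum_i lambda_i^{1/2} v_i v_i^T) Sigma^{-1/2} = V* V*^T
C = best_rank_t @ inv_sqrt
print(np.allclose(C, V[:, :t] @ V[:, :t].T))  # True
```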

As \Sigma_{XX}^\frac{1}{2}=V\varLambda^\frac{1}{2}V^\text T with \varLambda the diagonal matrix of eigenvalues of \Sigma_{XX} and V the orthogonal matrix of its eigenvectors, we can write: \varLambda^{-\frac{1}{2}} V^\text T\Sigma_{XX}^\frac{1}{2}=V^\text T. Therefore:

v_i^\text T=\lambda_i^{-\frac{1}{2}}v_i^\text T\Sigma_{XX}^\frac{1}{2}

Substituting the above in Eq. 1, we’ll get:

C=\sum_{i=1}^{t} v_i v_i^\text T=\begin{bmatrix}|&|& \dots & |\\v_1 &v_2 &\dots &v_t\\|&|& \dots & | \end{bmatrix} \begin{bmatrix}\text{—}v_1\text{—}\\\text{—}v_2\text{—}\\ \dots\\ \text{—}v_t\text{—} \end{bmatrix}=V^*V^{*\text T}

As C=AB, we accomplish the proof by letting A=V^* and B=V^{*\text T}=A^\text T, both of full rank t.
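
To illustrate the minimality end to end (a sketch under the same NumPy assumptions; all names hypothetical), the objective at A=V^*, B=V^{*\text T} should be no larger than at any random full-rank pair (A,B):

```python
import numpy as np

rng = np.random.default_rng(2)
r, t, n = 5, 2, 100_000
X_data = rng.normal(size=(n, r)) @ rng.normal(size=(r, r))
mu_x = X_data.mean(axis=0)
Sigma = np.cov(X_data, rowvar=False)

def objective(A, B):
    """Sample estimate of E[(X - mu - ABX)^T (X - mu - ABX)], using the optimal mu."""
    mu = (np.eye(r) - A @ B) @ mu_x          # mu = (I_r - AB) mu_x
    res = X_data - mu - X_data @ (A @ B).T
    return np.mean(np.sum(res**2, axis=1))

V_star = np.linalg.eigh(Sigma)[1][:, ::-1][:, :t]   # top-t eigenvectors
best = objective(V_star, V_star.T)

for _ in range(5):                            # random full-rank competitors
    A, B = rng.normal(size=(r, t)), rng.normal(size=(t, r))
    assert objective(A, B) >= best
print("optimum:", best)
```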

2- Eckart-Young-Mirsky Theorem

Let A,B\ \in \mathbb{R}^{m\times n} with \text {rank}(B)=k\le \text {rank}(A) \le \text{min}(m,n), then \|A-A_k\|_F \le \|A-B\|_F s.t. A_k is the rank-k truncated SVD of A, i.e. A_k=\sum_{i=1}^{k}\sigma_i u_i v_i^\text T.
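
A minimal numerical illustration of the theorem (a sketch assuming NumPy; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 6, 4, 2
A = rng.normal(size=(m, n))

# Rank-k truncated SVD of A
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

err_best = np.linalg.norm(A - A_k, "fro")
for _ in range(1000):                       # random rank-k competitors B
    B = rng.normal(size=(m, k)) @ rng.normal(size=(k, n))
    assert err_best <= np.linalg.norm(A - B, "fro")

# The optimal error equals sqrt(sigma_{k+1}^2 + ... + sigma_min^2)
print(err_best, np.sqrt((s[k:]**2).sum()))
```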