 # Miscellaneous Proofs and Theorems in Probability and Statistics (1)

### 1- A minimization problem regarding PCA:

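Given that $X,\mu\in\mathbb{R}^{r}$ with $X$ a random vector, $A\in\mathbb{R}^{r\times t}$ and $B\in\mathbb{R}^{t\times r}$ with $t\le r$, and $A$ and $B$ are both of full rank t, prove that the minimizing parameters, $\{\mu, A, B\}$, of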

\text E\big [(X-\mu - ABX)^\text T(X-\mu - ABX)\big ]\ \in \mathbb{R}^+

are:

A=\begin{bmatrix}|&|& \dots & |\\v_1 &v_2 &\dots &v_t\\|&|& \dots & | \end{bmatrix}=B^\text T\\ \ \\ \mu=(I_r-AB)\mu_X

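where $v_1,\dots,v_t$ are the first $t$ eigenvectors of the covariance matrix $\Sigma_{XX}$ of $X$ such that $\lambda_1\ge\lambda_2\ge\dots\ge\lambda_r$ (this indicates how the eigenvalues and eigenvectors are sorted). $\lambda_i$ is the i-th eigenvalue of $\Sigma_{XX}$; $\mu_X=\text E[X]$; and $I_r$ is the $r\times r$ identity matrix.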

Let's write $C=AB$ and $X=X_c+\mu_x$ s.t. $X_c$ is the centred $X$ and $\mu_x=\text E[X]$ (written $\mu_X$ in the statement above), and do the expansion as follows:

\begin{aligned} \text E\big [(X_c+\mu_x-\mu - C(X_c+\mu_x))^\text T(X_c+\mu_x-\mu - C(X_c+\mu_x))\big]\\ =\text E\big[ (X_c-CX_c)^\text T(X_c-CX_c)\big]+\text E\big [ (X_c-CX_c)^\text T\big](\mu_x-\mu-C\mu_x) \\+(\mu_x-\mu-C\mu_x)^\text T \text E\big [ (X_c-CX_c) \big] \\+\text E\big [(\mu_x-\mu-C\mu_x)^\text T(\mu_x-\mu-C\mu_x)\big] \end{aligned}

As $\text E[X_c]=0$ (so the 2nd and 3rd terms vanish) and the expected value is on a constant term in the 4th term, we can write:

=\text E\big[ (X_c-CX_c)^\text T(X_c-CX_c)\big]+ (\mu_x-\mu-C\mu_x)^\text T(\mu_x-\mu-C\mu_x)

The resultant expression contains two non-negative terms, and only the 2nd one depends on $\mu$. A necessary condition to minimize this expression is therefore that the 2nd term attains its minimum, i.e. zero. Therefore, $\mu=\mu_x-C\mu_x=(I_r-AB)\mu_x$.

Now, we have to find $C$ such that it minimizes the 1st term, i.e.:

\text E\big[ (X_c-CX_c)^\text T(X_c-CX_c)\big]

Expanding the expression, we get the above as:

\begin{aligned} =\text E\big[X_c^\text T X_c- X_c^\text T CX_c- X_c^\text T C^\text T X_c+ X_c^\text T C^\text T CX_c\big] \end{aligned}

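By the relation $\text E\big[X_c^\text T M X_c\big]=\text{tr}(M\Sigma_{XX})$ between a quadratic form and the covariance matrix $\Sigma_{XX}=\text E\big[X_cX_c^\text T\big]$ (see this post), the above becomes: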

=\text {tr}(\Sigma_{XX}-C\Sigma_{XX}-C^\text T\Sigma_{XX}+C\Sigma_{XX}C^\text T)\ \in \mathbb{R}^+
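For readers without the linked post at hand, the relation used here can be sketched as follows. $M\in\mathbb{R}^{r\times r}$ is a placeholder symbol introduced for this sketch and stands for any constant matrix. The steps use only that a scalar equals its own trace, the cyclic property of the trace, and the linearity of expectation:

\begin{aligned} \text E\big[X_c^\text T M X_c\big]=\text E\big[\text{tr}(X_c^\text T M X_c)\big]=\text E\big[\text{tr}(M X_c X_c^\text T)\big]\\ =\text{tr}\big(M\,\text E\big[X_c X_c^\text T\big]\big)=\text{tr}(M\Sigma_{XX}) \end{aligned}

Taking $M=I_r$, $M=C$, $M=C^\text T$ and $M=C^\text TC$ in turn gives the four terms inside the trace, where the last one uses $\text{tr}(C^\text TC\Sigma_{XX})=\text{tr}(C\Sigma_{XX}C^\text T)$.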

By the properties of trace, i.e. $\text{tr}(M+N)=\text{tr}(M)+\text{tr}(N)$ and $\text{tr}(MN)=\text{tr}(NM)$, we have:

=\text {tr}(\Sigma_{XX}-C\Sigma_{XX}-\Sigma_{XX}C^\text T+C\Sigma_{XX}C^\text T)

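As $\Sigma_{XX}$ is symmetric positive semi-definite, it has a symmetric square root with $\Sigma_{XX}=\Sigma_{XX}^{\frac{1}{2}}\Sigma_{XX}^{\frac{1}{2}}$; by this definition of matrix power, we can factor the above expression as: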

=\text {tr}\big(\Sigma_{XX}-\Sigma_{XX}+(C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})(C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})^\text T\big)\\ \ \\ =\text {tr}\big((C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})(C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})^\text T\big)
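The factorization can be checked by multiplying the product back out. This is only a verification sketch; it relies on nothing beyond the symmetry of $\Sigma_{XX}^{\frac{1}{2}}$:

\begin{aligned} (C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})(C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})^\text T\\ =(C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})(\Sigma_{XX}^{\frac{1}{2}}C^\text T-\Sigma_{XX}^{\frac{1}{2}})\\ =C\Sigma_{XX}C^\text T-C\Sigma_{XX}-\Sigma_{XX}C^\text T+\Sigma_{XX} \end{aligned}

which is exactly the argument of the trace before factoring.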

The problem is now to find $C$ that minimizes:

\text {tr}\big((\Sigma_{XX}^{\frac{1}{2}}-C\Sigma_{XX}^{\frac{1}{2}})(\Sigma_{XX}^{\frac{1}{2}}-C\Sigma_{XX}^{\frac{1}{2}})^\text T\big)

which is equal to:

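\|\Sigma_{XX}^{\frac{1}{2}}-C\Sigma_{XX}^{\frac{1}{2}}\|_F^2

with $\|\cdot\|_F$ being the Frobenius norm of matrices, for which $\|M\|_F^2=\text{tr}(MM^\text T)$. Minimizing the squared norm and minimizing the norm itself give the same minimizer.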

Considering the following facts:

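1- If $M$ is symmetric, it is diagonalizable, and for a power $p$ for which $M^p$ is defined (e.g. $p=\frac{1}{2}$ when $M$ is positive semi-definite) the eigenvalues of $M^p$ and $M$ are related as $\lambda_i(M^p)=\lambda_i(M)^p$. The eigenvectors of $M^p$ and $M$ are the same.

2- $\Sigma_{XX}^{\frac{1}{2}}$ is symmetric and has the same rank, r, as $\Sigma_{XX}$.

3- $C=AB$ has the rank at most $t\le r$. Therefore, $C\Sigma_{XX}^{\frac{1}{2}}$ has the rank at most t.

4- A partial/truncated singular value decomposition (SVD) of a matrix $A\in\mathbb{R}^{m\times n}$ (a generic matrix here, not the $A$ of the problem) is $A_k=\sum_{i=1}^{k}\sigma_i u_i v_i^\text T$ s.t. $\sigma_i$, $u_i$ and $v_i$ are from the SVD of A, i.e. $A=\sum_{i=1}^{\text{rank}(A)}\sigma_i u_i v_i^\text T$. If $A$ is symmetric positive semi-definite, then $u_i=v_i$ and $\sigma_i=\lambda_i$ s.t. $\lambda_i$ is an eigenvalue of $A$. Note that $\sigma_i$ decreases with $i$.

5- As a corollary of Eckart-Young-Mirsky Theorem: Let $A,B\in\mathbb{R}^{m\times n}$ with $\text{rank}(B)\le k\le\text{rank}(A)$, then $A_k=\argmin_B\|A-B\|_F$.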

we can write:

\argmin_{B:\ \text{rank}(B)\le t} \|\Sigma_{XX}^{\frac{1}{2}}-B\|_F=\sum_{i=1}^{t}\lambda_i^{\frac{1}{2}} v_i v_i^\text T\\ \ \\ \implies C\Sigma_{XX}^{\frac{1}{2}}=\sum_{i=1}^{t}\lambda_i^{\frac{1}{2}} v_i v_i^\text T\\ \ \\ \Sigma_{XX}\text{ being positive definite} \implies C=(\sum_{i=1}^{t}\lambda_i^{\frac{1}{2}} v_i v_i^\text T)\Sigma_{XX}^{-\frac{1}{2}}\tag{1}

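where $\lambda_i$ is the i-th largest eigenvalue of $\Sigma_{XX}$, and $v_i$ is the associated (normalized) eigenvector of $\Sigma_{XX}$.

As $\Sigma_{XX}^{\frac{1}{2}}=V\Lambda^{\frac{1}{2}}V^\text T$ with $\Lambda$ the diagonal matrix of eigenvalues of $\Sigma_{XX}$ and $V$ the eigenvectors' unitary matrix, we can write: $v_i^\text T\Sigma_{XX}^{\frac{1}{2}}=\lambda_i^{\frac{1}{2}}v_i^\text T$. Therefore: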

v_i^\text T=\lambda_i^{-\frac{1}{2}}v_i^\text T\Sigma_{XX}^\frac{1}{2}

Substituting the above in Eq. 1, we’ll get:

C=\sum_{i=1}^{t} v_i v_i^\text T=\begin{bmatrix}|&|& \dots & |\\v_1 &v_2 &\dots &v_t\\|&|& \dots & | \end{bmatrix} \begin{bmatrix}\text{—}v_1\text{—}\\\text{—}v_2\text{—}\\ \dots\\ \text{—}v_t\text{—} \end{bmatrix}=V^*V^{*\text T}

As $C=AB$, we accomplish the proof by letting $A=V^*$ and $B=V^{*\text T}$.
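The result can be sanity-checked numerically. The following is a minimal sketch, not part of the proof, for the smallest non-trivial case $r=2$, $t=1$. It assumes a hypothetical covariance matrix built from a known rotation, so that the eigenvectors are known by construction and no eigen-solver is needed. All names (`objective`, `C_star`, `theta`, and so on) are illustrative choices, and plain Python lists are used only to keep the sketch dependency-free.

```python
import math
import random

def matmul(P, Q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(P):
    return [list(col) for col in zip(*P)]

def trace(P):
    return sum(P[i][i] for i in range(len(P)))

def objective(C, S):
    """tr((I - C) S (I - C)^T), i.e. E[(X_c - C X_c)^T (X_c - C X_c)]."""
    n = len(S)
    R = [[(1.0 if i == j else 0.0) - C[i][j] for j in range(n)]
         for i in range(n)]
    return trace(matmul(matmul(R, S), transpose(R)))

# Assumed covariance with known eigen-structure: S = Q diag(3, 1) Q^T,
# where Q is a rotation by theta (an arbitrary illustrative angle).
theta = 0.7
v1 = [math.cos(theta), math.sin(theta)]    # eigenvector for lambda_1 = 3
v2 = [-math.sin(theta), math.cos(theta)]   # eigenvector for lambda_2 = 1
lam = [3.0, 1.0]
S = [[lam[0] * v1[i] * v1[j] + lam[1] * v2[i] * v2[j] for j in range(2)]
     for i in range(2)]

# Candidate from the proof with t = 1: C = v1 v1^T.
C_star = [[v1[i] * v1[j] for j in range(2)] for i in range(2)]
best = objective(C_star, S)

# Random competitors C = a b^T, which have rank at most 1.
rng = random.Random(0)
others = []
for _ in range(2000):
    a = [rng.uniform(-2, 2) for _ in range(2)]
    b = [rng.uniform(-2, 2) for _ in range(2)]
    C = [[a[i] * b[j] for j in range(2)] for i in range(2)]
    others.append(objective(C, S))

print(best, min(others))
```

In this setup `best` equals the discarded eigenvalue $\lambda_2=1$ up to rounding, and none of the random rank-1 competitors should fall below it. A random search of this kind illustrates the claim but does not prove it.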

### 2- Eckart-Young-Mirsky Theorem

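Let $A,B\in\mathbb{R}^{m\times n}$ with $\text{rank}(B)=k\le\text{rank}(A)$, then $\|A-A_k\|_F\le\|A-B\|_F$ s.t. $A_k$ is the partial SVD of $A$, i.e. $A_k=\sum_{i=1}^{k}\sigma_i u_i v_i^\text T$.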
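As an illustration of the statement, here is a small numerical sketch for a $2\times 2$ matrix and $k=1$. It assumes a hypothetical matrix assembled from two rotations and chosen singular values, so its SVD is known by construction. The names (`rot`, `frob`, `A1`, `err_trunc`) and the specific angles and singular values are illustrative assumptions, not taken from the text.

```python
import math
import random

def rot(angle):
    """2x2 rotation matrix; its columns are orthonormal."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]

def frob(P):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in P for x in row))

def diff(P, Q):
    return [[P[i][j] - Q[i][j] for j in range(len(P[0]))]
            for i in range(len(P))]

# Assumed matrix with a known SVD: A = U diag(sigma) V^T,
# singular vectors are the columns of U and V, sigma is decreasing.
U, V = rot(0.4), rot(1.1)
sigma = [2.0, 0.5]
A = [[sum(sigma[k] * U[i][k] * V[j][k] for k in range(2)) for j in range(2)]
     for i in range(2)]

# Truncated SVD with k = 1: A_1 = sigma_1 u_1 v_1^T.
A1 = [[sigma[0] * U[i][0] * V[j][0] for j in range(2)] for i in range(2)]
err_trunc = frob(diff(A, A1))

# Random competitors B = a b^T, which have rank at most 1.
rng = random.Random(1)
errs = []
for _ in range(2000):
    a = [rng.uniform(-2, 2) for _ in range(2)]
    b = [rng.uniform(-2, 2) for _ in range(2)]
    B = [[a[i] * b[j] for j in range(2)] for i in range(2)]
    errs.append(frob(diff(A, B)))

print(err_trunc, min(errs))
```

Here `err_trunc` equals the dropped singular value $\sigma_2=0.5$ up to rounding, since $A-A_1=\sigma_2u_2v_2^\text T$, and the sampled rank-1 matrices should not achieve a smaller error.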