# Linear Algebra Cheat Sheet (1)

#### 0- Notations and convention

A variable/quantity/element can be scalar, vector, or a matrix; its type should be understood from the context if not explicitly declared.

0-1- A vector is considered as a column vector, i.e. a single-column matrix. Therefore:

z\in \mathbb{F}^{n}\equiv z\in \mathbb{F}^{n\times 1}

A row vector is therefore defined as .

0-2- Dot product of column vectors:

\text{For }x,y\in \mathbb{R}^{n\times 1}, \ x\cdot y:=x^\text Ty

0-3- The ij-th element of a matrix is denoted by . The i-th column and j-th row of the matrix are respectively denoted by and .

0-4- Columns and rows of a matrix are column and row vectors. A -matrix can then be represented as:

\begin{bmatrix} v_1& v_2& \dots & v_n\end{bmatrix}\ \text{ s.t } v_i=A_{,i}

or

\begin{bmatrix}u_1^\text T\\ u_2^\text T\\ \vdots\\ u_n^\text T\end{bmatrix}\ \text{ s.t } u_i^\text T=A_{i,}

#### 1- Orthogonal matrix

a square matrix is said (definition) to be orthogonal iff ; provided that , inverse of exists. As a result, .

The following are equivalent for :
a) is orthogonal.
b) the column vectors of are ortho-normal.
c) the row vectors of are ortho-normal.
d) is size preserving: , being the Euclidean norm, and .
e) is dot product preserving: .

#### 2- Some matrix multiplication identities

A\in \mathbb{F}^{m\times n},\ x\in \mathbb{F}^{n\times 1}\ \text{ then } Ax=\sum_{i=1}^n x_iA_{,i} A\in \mathbb{F}^{m\times n}\ , D\in \mathbb{F}^{n\times n}\ \text{ and diagonal }, B\in \mathbb{F}^{n\times k},\ \text{ s.t.}\\ \ \\ A=\begin{bmatrix} v_1& v_2& \dots & v_n\end{bmatrix}, \ B=\begin{bmatrix}u_1^\text T\\ \\ u_2^\text T\\ \vdots\\ u_n^\text T\end{bmatrix},\ D_{ij}= \begin{cases} \lambda_i &\text{, } i=j \\ 0 &\text{, } i\ne j \end{cases} \\ \ \\ \text{then }ADB=\sum_{i=1}^n \lambda_iv_i u_i^\text T

#### 3- Change of basis

let and be two basis of . Then the following holds for the coordinates of a vector with respect to the two bases:

[v]_{\beta’}=P_{\beta\to\beta’}[v]_{\beta}\quad\text{s.t} \\ \ \\ P_{\beta\to\beta’}=\begin{bmatrix} [b_1]_{\beta’}& [b_2]_{\beta’}& \dots & [b_n]_{\beta’}\end{bmatrix}

It can be proved that:

# Rank of a Matrix

Let (the results also holds for ). Then, the column rank/row rank of A is defined as the dimension of the column/row space of A, i.e. the dimension of the vector space spanned by the columns/rows of A; this is then equivalent to the number of linearly independent columns/rows (column/rows vector) of A.

Theorem: column rank of A = row rank of A.

Definition: the rank of a matrix, rank(A), is the dimension of either the column or row space of A; simply the number of linearly independent columns or rows of A.

Definition: for a linear map , the rank of the linear map is defined as the dimension of the image of . This definition is equivalent to the definition of the matrix rank as every linear map has a matrix by which it can be written as .

Proposition: . This leads to these definitions: A matrix is said to be full rank iff , i.e. the largest possible rank, and it is said to be rank deficient iff , i.e. not having full rank.

## Properties of rank

For :

1- only a zero matrix has rank zero.

2- If , then

3-

4-

5-

6- If , then for , . In addition, for , , i.e. has at most rank n.

7- A square matrix can be decomposed as where is a diagonal matrix containing the eigenvalues of . Then, , i.e. the number of non-zero eigenvalues of .

8- For a square matrix , then equivalently is full rank, is invertible, has non-zero determinant, and has n non-zero eigenvalues.

## Proofs

P6:

V=\begin{bmatrix}v_1\begin{bmatrix}v_1\\v_2\\ \vdots\\v_r \end{bmatrix}&v_2\begin{bmatrix}v_1\\v_2\\ \vdots\\v_r \end{bmatrix}&\dots &v_r\begin{bmatrix}v_1\\v_2\\ \vdots\\v_r \end{bmatrix}\end{bmatrix}

where are coordinates of the vector . This indicates that each column of is a scalar multiple of any other columns of ; therefore, the column space is one dimensional. Hence, .

For the second part, property 3 proves the statement.

# SSM Using Scalismo (4): Generalized Procrustes analyses

In this post, I am going to implement the partial and the full generalized Procrustes analyses (partial GPA and full GPA) with Scalismo. Before getting into the implementation, I may want read this post explaining the theory of GPA’s.

## Partial GPA

Through partial GPA, more than 2 configurations/meshes get fitted/superimposed onto a common (and initially unknown) mean configuration. The procedure is iterative and involves translation and rotation of the meshes.

GPA’s need meshes in correspondence; here, I load the femoral head meshes that I previously had their correspondence established through the non-rigid registration (this post). Initializing and loading the meshes:

This post contains code blocks; if your browser does not automatically display the code blocks, please click on the displayed dark narrow rectangular areas to display the containing code.

object GPA extends App {
import scalismo.geometry._
import scalismo.common._
import scalismo.ui.api._
import scalismo.mesh._
import scalismo.io.{StatisticalModelIO, MeshIO}
import scalismo.statisticalmodel._
import scalismo.registration._
import scalismo.statisticalmodel.dataset._
import java.awt.Color

scalismo.initialize()
implicit val rng = scalismo.utils.Random(42)

val ui = ScalismoUI()
val drive = "I:/"
val meshDir = drive + "registration/Registered/"
val meshFiles = new java.io.File(meshDir).listFiles()
meshFiles.foreach(file => println(s"\nfile")) println(meshFiles) Visualizing the meshes: val (meshes, meshViews) = meshFiles.map(meshFile => { val mesh = MeshIO.readMesh(meshFile).get val meshView = ui.show(dsGroup, mesh, "mesh") (mesh, meshView) // returns a tuple of the mesh and the associated view }) .unzip //unzip the tuple into two separate lists // setting some colors meshViews(0).color = new Color(85,235,40) meshViews(1).color = new Color(20,40,250) meshViews(2).color = new Color(250,150,20) Here, methods are defined to calculate the centroid (center of mass) of a mesh and centering the mesh. This is not necessary for GPA, as it involves translation. Also, GPA method in Scalismo, calculates the centroid of a given reference mesh, and uses it as the center of rotation of the meshes through the process. It does not harm if all the meshes are initially centered. Also, it is consistent with the GPA algorithm explained in the related post. def CM (mesh:TriangleMesh[_3D]): EuclideanVector3D = { val numOfPoints = mesh.pointSet.numberOfPoints var total = EuclideanVector3D(0,0,0) val meshPoints = mesh.pointSet.points //iterator while (meshPoints.hasNext){ //iterator total = total + (meshPoints.next()).toVector } total*(1d/numOfPoints) } val centeredMeshes = meshes.map(mesh => { val translationVec = CM(mesh)*(-1) val translation = TranslationTransform[_3D](translationVec) mesh.transform(translation) }) To do GPA’s with Scalismo, a DataCollection should be defined. A DataCollection needs a reference mesh and a sequence of meshes; all in correspondence. A GPA will be performed on the sequence of the meshes. The reference mesh is used as the domain of transformations. A mean mesh, in Scalismo, is defined as the mean deformations (vectors) of the meshes from the reference mesh, added to the reference mesh; it should be remembered that all the meshes are in correspondence. It does not mater which, but, one of the meshes should be selected as the reference mesh.  // General Procrustes Analysis: Aligns a set of meshes with respect to some unknown “mean” shape (independent of ordering of shapes) // setting a reference mesh. it is needed for data collection but no effect on the GPA val refMesh = centeredMeshes(2) val allMeshes = centeredMeshes val dataClctn = ((DataCollection.fromMeshSequence(refMesh,allMeshes))._1).get // to check how many meshes are in the data collection println(dataClctn.size) Performing the partial GPA (number of iterations = 10, precision of average point-wise distance between meshes (halt distance) = 1e-5): // doing the partial GPA val gpaDataClctn = DataCollection.gpa(dataClctn ,10,1e-5) Retrieving the mean mesh and visualization: // retrieving the mean mesh val meanMesh = gpaDataClctn.meanSurface //CM(meanMesh) // to check whether the mean mesh is centered or not val meanGroup = ui.createGroup("meanMesh") val meanMeshViewg = ui.show( meanGroup, meanMesh,"meanMesh") meanMeshViewg.color = new Color(245,20,50) The transformed versions of the original meshes onto the mean mesh are called the partial Procrustes fits. They can be retrieved and viewed: //aligning by translation and rotation all the meshes onto the mean mesh val meanMeshPoints = meanMesh.pointSet.points.toIndexedSeq val gpaTransforms = for (mesh <- allMeshes) yield { val meshPoints = mesh.pointSet.points.toIndexedSeq LandmarkRegistration.rigid3DLandmarkRegistration(meshPoints.zip(meanMeshPoints), center = Point3D(0, 0, 0)) } val gpaGroup = ui.createGroup("partial Gpa meshes") //partial GPA fits val transformedMeshes = (allMeshes.zip(gpaTransforms)).map(z => { val mesh=z._1 val transToMean = z._2 mesh.transform(transToMean) }) val transfrmdMeshViews = transformedMeshes.map(mesh => ui.show(gpaGroup, mesh, "trMesh")) The centroid size (see this post) of each mesh in this example can be calculated by this method: def centroidSize (mesh:TriangleMesh3D): Double ={ var total = 0d val meshPoints = mesh.pointSet.points //iterator val centroid = CM(mesh) // is a vector here while (meshPoints.hasNext){ //iterator total = total + scala.math.pow(((meshPoints.next().toVector - centroid).norm),2) } scala.math.sqrt(total) } For example:  centroidSize(TransformedMeshes(0)) centroidSize(allMeshes(0)) centroidSize(MeanMesh) The centroid sizes are (unit of length): original mesh 1: 487.222, mesh 1 partial-GPA fit: 487.222 original mesh 2: 692.725, mesh 2 partial- GPA fit : 692.725 original mesh 3: 566.668, mesh 3 partial- GPA fit : 566.668 mean mesh: 577.732 As expected, the original and their partial-GPA fits have similar sizes due to no-scaling in the partial GPA. However, the mean mesh must have a different size; as it has. ## Full GPA The full GPA entails scaling, translation and rotation of the meshes in order to superimpose each mesh onto the (unknown) mean mesh. Scalismo does not currently have a method performing the full GPA, but its partial gpa can be manipulated to do the full GPA. To this end, we should access the package scalismo.statisticalmodel.dataset in the file DataCollection.scala to copy the needed pieces of codes regarding the partial GPA method. This is already done as below. The copied lines of code are used to define new functions methods. The only difference between the method doing the full GPA and the partial one, is that similarity3DLandmarkRegistration method in the former one instead of rigid3DLandmarkRegistration in the latter one. import scalismo.utils.Random def fGpa(dc: DataCollection, maxIteration: Int = 3, haltDistance: Double = 1e-5)(implicit rng: Random): DataCollection = { fGpaComputation(dc, dc.meanSurface, maxIteration, haltDistance) } def fGpaComputation(dc: DataCollection, meanShape: TriangleMesh[_3D], maxIteration: Int, haltDistance: Double)(implicit rng: Random): DataCollection = { if (maxIteration == 0) return dc val referencePoints = dc.reference.pointSet.points.toIndexedSeq val numberOfPoints = referencePoints.size val referenceCenterOfMass = referencePoints.foldLeft(Point3D(0, 0, 0))((acc, pt) => acc + (pt.toVector / numberOfPoints)) val meanShapePoints = meanShape.pointSet.points.toIndexedSeq // align all shape to it and create a transformation from the mean to the aligned shape val dataItemsWithAlignedTransform = dc.dataItems.par.map { dataItem => val surface = dc.reference.transform(dataItem.transformation) val transform = LandmarkRegistration.similarity3DLandmarkRegistration(surface.pointSet.points.toIndexedSeq.zip(meanShapePoints), referenceCenterOfMass) DataItem("gpa -> " + dataItem.info, Transformation(transform.compose(dataItem.transformation))) } val newdc = DataCollection(dc.reference, dataItemsWithAlignedTransform.toIndexedSeq) val newMean = newdc.meanSurface if (MeshMetrics.procrustesDistance(meanShape, newMean) < haltDistance) { newdc } else { fGpaComputation(newdc, newMean, maxIteration - 1, haltDistance) } } Let’s compute the mean mesh and then superimpose (by the similarity transformations) the meshes onto the mean mesh; the code is: val fGpaDataClctn = fGpa(dataClctn ,20,1e-5) val fGpaMeanMesh = fGpaDataClctn.meanSurface //CM(meanMesh) // to check whether the mean mesh is centered or not val fGpaMeanGroup = ui.createGroup("fullGpaMeanMesh") val fGpaMeanMeshView = ui.show( fGpaMeanGroup,fGpaMeanMesh ,"fullGpaMeanMesh") fGpaMeanMeshView.color = new Color(245,20,50) val fGpaGroup = ui.createGroup("Full Gpa Meshes") val fGpaMeanMeshPoint = fGpaMeanMesh.pointSet.points.toIndexedSeq val fullGpaTransforms = for (mesh <-allMeshes) yield { val meshPoints = mesh.pointSet.points.toIndexedSeq // transformations include scaling LandmarkRegistration.similarity3DLandmarkRegistration(meshPoints.zip(fGpaMeanMeshPoint), center = Point3D(0, 0, 0)) } //full GPA fits val fGpaTransformedMeshes = allMeshes.zip(fullGpaTransforms).map (z=> z._1.transform(z._2)) val fGpaTransformedMesheViews = fGpaTransformedMeshes.map(mesh => ui.show(fGpaGroup, mesh, "trMesh")) fGpaTransformedMesheViews(0).color = new Color(85,235,40) fGpaTransformedMesheViews(1).color = new Color(20,40,250) fGpaTransformedMesheViews(2).color = new Color(250,150,20) Here are the views: Again, the centroid sizes can be calculated as: original mesh 1: 487.222, mesh 1 full-GPA-transformed: 418.329 original mesh 2: 692.725, mesh 2 full-GPA-transformed: 417.958 original mesh 3: 566.668, mesh 3 full-GPA-transformed: 419.082 mean mesh: 421.707 As it can be observed, the original meshes are scaled in the case of the full GPA analysis to fit the mean mesh. The sizes of all the transformed and the mean mesh are close to each other, because, scaling is part of the transformations fitting the meshes onto a common mean. Also it should be noted that the (relative) scales of the meshes are not preserved. Saving the mean meshes and the fits into files and closing the object: MeshIO.writeMesh(meanMesh, new java.io.File(drive + "registration/Partial GPA/" + "meanMesh.stl")) (0 until 3).foreach(i => MeshIO.writeMesh(transformedMeshes(i), new java.io.File(drive + "registration/Partial GPA/" + s"mesh{i}.stl")))

MeshIO.writeMesh(fGpaMeanMesh, new java.io.File(drive + "registration/Full GPA/" + "meanMesh.stl"))
(0 until 3).foreach(i => MeshIO.writeMesh(fGpaTransformedMeshes(i), new java.io.File(drive + "registration/Full GPA/" + s"mesh{i}.stl"))) } ## Full GPA and Shapes Until this moment, I did not refer to a mesh/surface as a shape, because a shape has its own technical definition. The shape of an object/configuration/mesh is all its geometrical information modulo its scale, location, and rotational effects. In other words, shape is an equivalence class of geometrical objects modulo the similarity transformation. The full GPA removes the scale, location, and rotational effects of objects/meshes (relative to a mean object/mesh) in order to fit them all onto the mean mesh, which is the sample full Procrustes mean shape . Therefore, whatever geometrical variations that remain between the mean shape and each of the full Procrustes fits (full-GPA transformed meshes) reflect the variations due to the differences/dissimilarities in their shapes. # Probability and Stats Cheat Sheet (1) ### 0- Conventions 1- r-Vectors in matrix operations are considered as column vectors (matrices): x\in \mathbb{R}^r\equiv\begin{bmatrix}x_1\\x_2\\x_3\\ \vdots \\ x_r \end{bmatrix}\in\mathbb{R}^{r\times 1} \implies x^{\rm T}=\begin{bmatrix}x_1&x_2&x_3& \dots & x_r \end{bmatrix} 2- Column vector inner/dot product: x.y \equiv (x)^\text T(y)=(y)^\text T(x)=x^\text Ty=y^\text Tx\ \in \mathbb{R} \ \forall x,y\in \mathbb{R}^{r\times 1} ### 1- Population expected value and covariance matrix 1- a random r-vector : \mu_X= \text{E[X]}:=[\text E[X_1],\dots,E[X_r]]^\text T=[\mu_1,\dots,\mu_r]^\text T\\ \ \\ \text{E}[X^\text T]:=(\text E[X])^\text T \Sigma_{XX}:=\underset{\color{blue}{\text or\ Var(X)}}{\text{Cov}}(X,X):=\text E\big[(X-\mu_X)(X-\mu_X)^\text T\big]\ =\sigma_{ij}^2 \in\mathbb{R}^{r\times r} where, , and . \Sigma_{XX}=\text E[XX^\text T]-\mu_X\mu_X^\text T \Sigma_{XX}=\text E[XX^\text T] \iff \mu_X=0 2- For random vectors and : \Sigma_{XY}:=\text{Cov}(X,Y):=\text E\big [(X-\mu_X)(Y-\mu_Y)^\text T\big]=\Sigma_{YX}^\text T\ \in\mathbb{R}^{r\times s} \Sigma_{XY}=\text E\big [XY^\text T\big] \iff \mu_X=\mu_Y=0 Remark: is a symmetric matrix but is not necessarily symmetric. 3- linearly related to as with and a constant: \mu_Y=A\mu_X + b \ \in \mathbb{R}^{s \times 1}\\ \ \\ \Sigma_{YY}=A\Sigma_{XX} A^\text T\ \in \mathbb{R}^{s \times s} 4- and : \text {Cov}(X+Y,Z)=\Sigma_{XZ}+\Sigma_{XY}\\ \ \\ \text {Cov}(X+Y,X+Y)=\Sigma_{XX}+\Sigma_{XY}+\Sigma_{YX}+\Sigma_{YY} 5- , , and : \text {Cov}(AX,BY)=A\Sigma_{XY}B^\text T\ \in \mathbb {R}^{r \times s} For the proof, expand . In a special case: \text {Var}(AX)=\text {Cov}(AX,AX)=A\Sigma_{XX}A^\text T\ \in \mathbb {R}^{r \times r} 6- is positive semi-definite. Proof: show that . Use 5 in the proof. u^\text T\Sigma_{XX}u=\text{Cov}(u^\text TX, u^\text TX)=\text{Cov}(u^\text TX, (u^\text TX)^\text T)=\text{Var}(Z)\ge 0\\ \ \\ \qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad Z:=u^\text TX \in \mathbb{R} 7- A random vector is centred iff . Any random vector can be centred by the transformation . The following expressions are useful (proof is by expansion/writing the terms and induction): \forall X\in \mathbb{R}^{r\times 1}, Y\in \mathbb{R}^{t\times 1}\\ \ \\ \Sigma_{XY}=\text E\big [(X-\mu_X)(Y-\mu_Y)^\text T\big]=\text E\big [X_cY_c^\text T\big]=\Sigma_{X_cY_c}\\ \ \\ \Sigma_{XX}=\text E\big [(X-\mu_X)(X-\mu_X)^\text T\big]=\text E\big [X_cX_c^\text T\big]=\Sigma_{X_cX_c} \text E[X_c^\text TX_c]=\text {tr}(\Sigma_{XX}) \text E[X_c^\text TC Y_c]=\text {tr}(C\Sigma_{XY}) \qquad C\in \mathbb{R}^{r\times t} \text E[Y_c^\text T C^\text T C Y_c]=\text {tr}(C\Sigma_{YY}C^\text T) Above expressions involving the trace can have different form according to the properties of the trace. 8- Conditional probability density function: # Miscellaneous Proofs and Theorems in Probability and Statistics (1) ### 1- A minimization problem regarding PCA: Given that , , and and are both of full rank t, prove that the minimizing parameters, of \text E\big [(X-\mu – ABX)^T(X-\mu – ABX)\big ]\ \in \mathbb{R}^+ are: A=\begin{bmatrix}|&|& \dots & |\\v_1 &v_2 &\dots &v_t\\|&|& \dots & | \end{bmatrix}=B^\text T\\ \ \\ \mu=(I_r-AB)\mu_X where are the first eigenvectors of such that (this indicates how the eigenvalues and eigenvectors are sorted). is the i-th eigenvalue of ; and is the identity matrix. Let’s write and s.t is the centred , and do the expansion as follows: \begin{aligned} \text E\big [(X_c+\mu_x-\mu – C(X_c+\mu_x))^T(X_c+\mu_x-\mu – C(X_c+\mu_x))\big]\\ =\text E\big[ (X_c-CX_c)^\text T(X_c-CX_c)\big]+\text E\big [ (X_c-CX_c)^\text T\big](\mu_x-\mu-C\mu_x) \\+(\mu_x-\mu-C\mu_x)^\text T \text E\big [ (X_c-CX_c) \big] \\+\text E\big [(\mu_x-\mu-C\mu_x)^\text T(\mu_x-\mu-C\mu_x)\big] \end{aligned} As and the expected value is on a constant term in the 4th term, we can write: =\text E\big[ (X_c-CX_c)^\text T(X_c-CX_c)\big]+ (\mu_x-\mu-C\mu_x)^\text T(\mu_x-\mu-C\mu_x) The resultant expression contains positive terms. A necessary condition to minimize this expression is to make the 2nd term attains its minimum, i.e. zero. Therefore, . Now, we have to find such that it minimizes the 1st term, i.e.: E\big[ (X_c-CX_c)^\text T(X_c-CX_c)\big] Expanding the expression, we get the above as: \begin{aligned} =\text E\big[(X_c^\text T X_c- X_c^\text T CX_c- X_c^\text T C^\text T X_c+ X_c^\text T C^\text T CX_c\big] \end{aligned} By the relation to the covariance matrix (see this post), above becomes: =\text {tr}(\Sigma_{XX}-C\Sigma_{XX}-C^\text T\Sigma_{XX}+C\Sigma_{XX}C^\text T)\ \in \mathbb{R}^+ By the properties of trace, i.e. and , we have: =\text {tr}(\Sigma_{XX}-C\Sigma_{XX}-\Sigma_{XX}C^\text T+C\Sigma_{XX}C^\text T) As , and by the definition of matrix power, we can factor the above expression as: =\text {tr}\big(\Sigma_{XX}-\Sigma_{XX}+(C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})(C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})^\text T\big)\\ \ \\ =\text {tr}\big((C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})(C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})^\text T\big) The problem is now to find that minimizes: \text {tr}\big((\Sigma_{XX}^{\frac{1}{2}}-C\Sigma_{XX}^{\frac{1}{2}})(\Sigma_{XX}^{\frac{1}{2}}-C\Sigma_{XX}^{\frac{1}{2}})^\text T\big) which is equal to: \|\Sigma_{XX}^{\frac{1}{2}}-C\Sigma_{XX}^{\frac{1}{2}}\|_F with being the Frobenius norm of matrices. Considering the following facts: 1- If is symmetric, it is diagonalizable and for the eignevalues of and are related as . The eigenvectors of and are the same. 2- is symmetric and has the same rank, r, as . 3- has the rank at most . therefore, has the rank at most t. 4- A partial/truncated singular value decomposition (SVD) of a matrix is s.t and are from the SVD of A, i.e. . If is symmetric, then and s.t is an eigenvalue of . Note that the decreases with . 5- As a corollary of Eckart-Young-Mirsky Theorem: Let with , then . we can write: \argmin_B \|\Sigma_{XX}^{\frac{1}{2}}-B\|_F=\sum_{i=1}^{t}\lambda_i^{\frac{1}{2}} v_i v_i^\text T\\ \ \\ \implies C\Sigma_{XX}^{\frac{1}{2}}=\sum_{i=1}^{t}\lambda_i^{\frac{1}{2}} v_i v_i^\text T\\ \ \\ \Sigma_{XX}\text{ being positive definite} \implies C=(\sum_{i=1}^{t}\lambda_i^{\frac{1}{2}} v_i v_i^\text T)\Sigma_{XX}^{-\frac{1}{2}}\tag{1} where , and is the associated (normalized) eigenvector of . As with the diagonal matrix of eigenvalues of and the eigenvectors’ unitary matrix, we can write: . Therefore: v_i^\text T=\lambda_i^{-\frac{1}{2}}v_i^\text T\Sigma_{XX}^\frac{1}{2} Substituting the above in Eq. 1, we’ll get: C=\sum_{i=1}^{t} v_i v_i^\text T=\begin{bmatrix}|&|& \dots & |\\v_1 &v_2 &\dots &v_t\\|&|& \dots & | \end{bmatrix} \begin{bmatrix}\text{—}v_1\text{—}\\\text{—}v_2\text{—}\\ \dots\\ \text{—}v_t\text{—} \end{bmatrix}=V^*V^{*\text T} As , we accomplish the proof by letting . ### 2- Eckart-Young-Mirsky Theorem Let with , then s.t is the partial SVD of i.e . # PCA: Introduction Different kind of inter-correlated data can be collected into a random vector. For example we would consider the total length, weight, diameter of nose holes, tail length, number of teeth, colour of eyes, the radius of acetabula, number of vertebrae of dinosaurs as data and collect them into a (random) vector. The principal component analysis (PCA) is a method for reducing the dimension of a random vector (data), and also for discovering outstanding features of data. Using PCA, one can replace a random vector with another vector having a smaller dimension without losing much information. In PCA, the measure of information loss is the total variation of the original variables collected in a random vector. See this cheat sheet for the mathematical identities used in this post. ## Least-squares optimality of PCA Assume random variables () collected in a random r-vector , i.e. . It has a mean and a covariance matrix . A linear projection of the random variables means: \xi=BX\text{ s.t. } \xi \in \mathbb{R}^{t\times 1}\text{ and }B \in \mathbb{R}^{t\times r} \\ t\le r Each rows of linearly combines the random variables to create a new random variable. Let’s project the r-vector onto a lower dimensional space and obtain the t-vector with . We want to find a linear map to reconstruct out of with lower dimension. This means finding a vector and a matrix such that . The approximation is in the least-squares sense of minimizing the mean error. We consider the following as the measure of how well the linear map can reconstruct out of its linear projection : \text E\big [(X-\mu – A\xi)^T(X-\mu – A\xi)\big ]\ \in \mathbb{R}^{+} Therefore, we should find , , and by minimizing the following mean: \text E\big [(X-\mu – ABX)^T(X-\mu – ABX)\big ]\ \in \mathbb{R}^+\tag{1} This expression is the mean of the squared norm/magnitude of the error vector which itself is a random vevtor. With the assumption that both and are of full rank t, the minimizing parameters will be (see the proof here): A=\begin{bmatrix}|&|& \dots & |\\v_1 &v_2 &\dots &v_t\\|&|& \dots & | \end{bmatrix}=B^\text T\\ \ \\ \mu=(I_r-AB)\mu_X \tag{2} where is a normalized eigenvector of the covariance matrix and it is associated with the i-th largest eigenvalue, , of . The eigenvectors are orthonormal to each other, i.e. ( is the Kronecker delta). is the identity matrix. From the structures of the matrices and , it can be observed that they have rank t. Therefore, the best rank-t approximation of the original random vector is : X\approx \tilde X= \mu_x+C(X-\mu_x)\qquad \text{s.t }C=AB=AA^\text T \in \mathbb{R}^{r\times r}\tag{3a} or: \tilde X= \mu_x+\sum_{i=1}^{t}(v_iv_i^\text T)(X_c)\qquad \text{s.t }X_c=X-\mu_x\tag{3b} Remark: , and . The covariance matrix is real and symmetric, therefore it has a spectral decomposition as s.t is an orthogonal matrix (a matrix with ortho-normal columns/rows) with columns as normalized eigenvectors of , and is a diagonal matrix having the eigenvalues, , of . The eigenvectors in are sorted such that the j-th column of is associated with the j-th largest eigenvalue. Having said that, we observe that the best rank-t approximation of the original random vector only uses the first t eigenvectors of random vector’s covariance matrix. The minimum of Eq. 1 is: \sum_{i=t+1}^{r} \lambda_i \tag{4} Principal components (PC’s) of X: the first t principal components (also known as Karhunen-Loeve transformation) of are defined by the following linear projections: \xi^{(t)}=A^{\text T}X\ \in \mathbb{R}^{t\times 1} \implies\xi_i^{(t)}=v_i^{\text T}X\qquad i=1,\dots ,t\le r\tag{5} The superscript is used to indicate the dimension of the PC vector, being equal to the number of PC’s. Note that PC’s are themselves random variables as they are linear combination of the data random variables. PC’s of X are uncorrelated: the covariance between each pair of the principal components is: \text {Cov}(\xi_i,\xi_j)=\text {Cov}(v_i^{\text T}X,v_j^{\text T}X)= v_i^{\text T}\Sigma_{XX}v_j=\lambda_j v_i^{\text T}v_j=\delta_{ij}\lambda_j\\ \tag{6} Interestingly . This means that principal components of are uncorrelated. Moreover, . A goodness-of-fit measure: equation 4 maps the r-dimensional data () into a lower t-dimensional data . The total variance of the original data, i.e. collected in is . Based on this, a measure on how well the first t principal components (in the lower r-D space) represent the r original data variables is considered as (also note Eq. 4 and 6): \frac{\sum_{i=t+1}^{r} \lambda_i}{\sum_{i=1}^{r} \lambda_i}=\frac{\text{total variance of }X_i\text{\textquoteright s}-\text{total variance of }\xi_i\text{\textquoteright s}}{\text{total variance of }X_i\text{\textquoteright s}}\tag{7} This implies that the measure is small if the first t principal components contain a large part of the total variation in ; hence, better approximation. Remark: if , Eq. 7 becomes zero and it means perfect match, i.e no approximation. In other words, means ( as in the SVD of ), then because is unitary, , hence (Eq. 3), . ## PCA as a variance-maximization method The main aim of PCA is to reduce the dimension of data. The way a dinosaur would do a sort of reduction on a data vector is keeping one component and eating the rest all. It is a perfectly nonsense reduction, but many pieces of valuable information are lost. Another way, is to do a simple average . This approach is yet undesirable because it treats all components as equal importance (equal weight). Another approach is to do a weighted average with some constraint on the weights, like: such that . In matrix notation and using the Euclidean vector norm: \sum_{i=1}^{r}w_iX_i=w^\text TX \quad \text{ s.t }\quad \sum_{i=1}^{r}w_i^2=w^\text T w=\|w\|^2=1\qquad w_i\in \mathbb{R} This constraint weighted average is called a standardised linear combination (SLC). However, the problem is how to uniquely and meaningfully determine the weights. PCA chooses the weights in order to maximize the variances of the SLC’s of the variables ‘s, or to maximize the variances of the linear projections of the random vector, i.e. . SLC, is in fact the projection of the vector onto the direction described by the weight vector. In a mathematical notation, we want to solve the following maximization problem: \max_{\{w:\|w\|=1\}}\text{Var}(w^\text TX)=\max_{\{w:\|w\|=1\}}w^\text T\text{Var}(X)w=\max_{\{w:\|w\|=1\}}w^\text T\Sigma_{XX}(X)w\\ \ \\ w\in \mathbb{R}^{r\times 1},\quad X\in \mathbb{R}^{r\times 1} \quad,\Sigma_{XX}\in \mathbb{R}^{r\times r} \ \tag{8a} In other words, v=\argmax_{\{w:\|w\|=1\}}\text{Var}(w^\text TX)\tag{8b} Principal components of X are defined based on this maximization. The 1st PC of X, denoted by is the SLC (of the variables) with the highest variance. This is obtained by the maximizing problem expressed by Eq. 8. then has the highest variance among all other SLC of the variables/projections of : v_1=\argmax_{\{w:\|w\|=1\}}\text{Var}(w^\text TX)\qquad v_1\in \mathbb{R}^{r\times 1} \\ \ \\ \text {PC}_1:=\xi_1:=v_1^\text TX\qquad \xi_1\in\mathbb{R}\ \tag{9} is called the 1st principal direction (PD) or principal axis. The 2nd PC of X, denoted by is the SLC (of the variables) with the highest variance such that the maximizing weight vector, is also subjected to the orthogonality constraint . In a mathematical notation, the 2nd PC is obtained by the following maximization: v_2=\argmax_{\{w:\|w\|=1 \text{ and }w^\text Tv_1=0\}}\text{Var}(w^\text TX) \\ \ \\ \text {PC}_2:=\xi_2:=v_2^\text TX\qquad \ \tag{10} by the same token, the i-th PC and i-th PD or principal axis are defined by the following: v_i=\argmax_{\{w:\|w\|=1\text{ and }w^\text T v_j=0 \text{ for all } j=1,\dots,i-1\}}\text{Var}(w^\text TX) \\ \ \\ \text {PC}_i:=\xi_i:=v_i^\text TX\qquad \ \tag{11} It can be proved (see the references) that the maximizing weight vectors are the normalized eigenvectors of associated with its eigenvalues sorted in the decreasing order, i.e.: \Sigma_{XX}=V\varLambda V^\text T\ \text{ s.t } \lambda_{i}\ge \lambda_{j} \implies i\lt j\\ \ \\ \text{i.e. }\lambda_1\ge \lambda_2\ge \dots \ge\lambda_r \ge 0 Therefore: \text{cov}(\xi_i,\xi_j)=\text {Cov}(v_i^\text TX,v_j^\text TX)=v_i^\text T\Sigma_{XX} v_j=\lambda_j v_i^{\text T}v_j=\delta_{ij}\lambda_j\\ \ \\ \implies \text{Var}(\xi_1)\ge \text{Var}(\xi_2)\ge\dots\ge\text{Var}(\xi_r)\ge0 which means PC’s not only have the maximum variances among all the possible SLC of the variables, or all the possible projections of the random vector, but also they are mutually uncorrelated. In a matrix form, we can write: \xi=V^\text TX\qquad \xi,X\in\mathbb{R}^{r\times 1},\ V^\text T\in\mathbb{R}^{r\times r}\tag{12} where and are the random vector and its linear projection vector containing the PC’s of respectively. It is readily concluded that: \text{Var}(\xi)=\text{Var}(V^\text TX)=V^\text T\Sigma_{XX}V=\delta_{ij}\lambda_j ## PCA in practice: Sample PCA In practice, the PC’s are to be estimated out of n independent observations of the random r-vector of data. This requires estimating of the mean and covariance matrix of the random vector of data; using sample mean and covariance. Here, we use the notation for the i-th independent trial to observe the random vector . Note: we denote an observation vector with a capital letter to indicate that it is still a random vector (out of the same population and through independent trials). To indicate a particular realization/value of we shall use small letter (by convention). We collect the sequence of n independent observations of the random vector in a set as: \{\bold X_i, i=1,2,\dots,n\}\ \text{ s.t }\bold X_i\in\mathbb{R}^{r\times 1}\tag{13} The estimate for is given by: \hat \mu_X=\bold{\bar X}:=\frac{1}{n}\sum_{i=1}^{n}\bold X_i\tag{14} let for n observations of the random vector; and let . Then, the estimate for the covariance matrix is: \hat\Sigma_{XX}=\frac{1}{n}\chi_c\chi_c^\text T\tag{15} Ordered eigenvalues and their associated eigenvectors of is then estimate the population parameters. This leads to writing: \hat\lambda_1\ge\hat\lambda_2\ge\dots\ge\hat\lambda_r\ge 0\\ \ \\ \hat A=(\hat v_1,\dots,\hat v_t)=\hat B^\text T\tag{16} In practice, the vectors of independent observations are collected as columns (or rows) of a matrix (just re-writing Eq. 13). Let’s call this matrix the matrix of observations. Here, each column represents an observation of : \bold X=\begin{bmatrix}|&|&\dots &|\\ \bold X_1 & \bold X_2 &\dots &\bold X_n \\|&|&\dots &| \end{bmatrix}\in \mathbb{R}^{r\times n}\tag{17} Recalling Eq. 5 or Eq. 12, PC’s can be calculated for each columns of the matrix of observations. In this regards, we obtain estimates of the PC’s of : \hat\xi_{ij}^{(t)}=\hat v_j^\text T\bold X_i\ \in\mathbb{R}\\ j=1,\dots,t\ \text{ and }\ i=1,\dots,n This is the estimate of the j-th PC of calculated based on the i-th observation . The term is called the score for the i-th observation on the j-th PC. For each observation, we can write the above in a more compact from: \hat\xi_{i}^{(t)}=\hat A^\text T\bold X_i\ \in\mathbb{R}^{t\times 1}\\ i=1,\dots,n\tag{18} ### References – Modern Multivariate Statistical Techniques, A. J. Izenman. – Applied Multivariate Statistical Analysis, W. Hardle, L. Simar. Although mostly paraphrased, this post might contain copyrighted sentences of the references. # Scala Fundamentals (1) ### 1- Block expression The intermediate results (vals/vars) in a block expression are local. Example: val firstName = "Polar" def name: String = { println("This is a block") val title = "Mr." val lastName = "Bear" title + " " + firstName + " " + lastName } println(name) println(title) Output: firstName: String = Polar name: String This is a block Mr. Polar Bear Error:(9, 17) not found: value title println(title) Block expression are useful for evaluating intermediate expressions and assigning the result to a val/var. ### 2- Objects and classes Everything is a kind of object in Scala; even primitive types, Int, Boolean, are objects. Also the program code to be run is an object and is defined in the form/pattern of an object. All objects can have methods and fields (attributes). Methods of an objects give functionality to the object to do task or computation on data. Fields are for storing data. Every object belongs to a class that have only one instance. A class is an abstract template for creating objects that have similar methods and fields. A class also defines/create a Type (of objects) and therefore objects of the same class have the same Type. A class is not a value (like functions and objects) and it lives in its own namespace. Scala has two namespace: namespace of values, and namespace of types. Classes live in the type namespace. The functionality of an object (if it has one) can be called like a function. A function is a first class value whereas a method belongs to an object. For example: object addNumbers{ def addEm (a: Int, b: Int) = println(a + b) } addNumbers.addEm(1,2) // compare with object addNumbers{ def apply (a: Int, b: Int) = println(a + b) } addNumbers(1,2) Function-like application of any object is available by using the apply method. An executable Scala program is an object with an entry point, i.e. the main method. An object can be made executable (or an executable Scala code) by either adding a main method, or making the object extends the type App: object MyProgram{ // this is a singleton objec import ... //import libraries def main(args : Array[String]) : Unit = { // Array[String] are used for the command line arguments // body of the main program to do stuff } } // or object MyProgram extends App { // body of the main program to do stuff } #### 1- Singleton object Is an objects which is unique and we cannot create other values of this object. Singleton object belongs to a class that have only one instance. As it can be seen the type of a singleton object defined as bellow is itself. The term Box@3ae78a4c indicated the type @ a unique (reference) identifier of the object. All types of objects have unique identifiers.

object Box
res0: Box.type = Box$@3ae78a4c #### 2- Object (normal) An object should belong to a predefined class. Example: class Animal(val kind: String, val location: String) { def loc = println(kind + " lives in " + location) } val polarbear_1 = new Animal("Bear", "north pole") polarbear_1.loc polarbear_1.kind // results defined class Animal polarbear_1: Animal = Animal@669cb6c7 Bear lives in north pole res2: String = Bear The class Animal creates a new type Animal that can be used like any other types. This is a type of an object. #### 3- Companion Objects A companion object of a class is a singleton object with the same name as a class (and defined in the same file as the class’s file). The class is also called the companion class of the object. These two live in different namespaces, i.e. value and type namespaces. It is then important to note that the companion object is not an instant of its companion class. One application is to have the (singleton) companion object manage and implement the constructors of an instant (object) of the companion class. In this case the companion object is an auxiliary tool. For example: class Line(val point_1:Double, val point_2:Double){ def printPoints = println(point_1,point_2) } val line_1 = new Line(0.0, 1.0) // Line is the name of the class line_1.printPoints // Compare with class Line(val point_1:Double, val point_2:Double){ def printPoints = println(point_1,point_2) } object Line{ def apply(point_1:Double, point_2:Double): Line = new Line(point_1,point_2) } val line_1 = Line(0.0,1.0) // Line is the name of the object line_1.printPoints #### 4- Case classes Case classes are useful shorthand for defining a class, its companion objects and several useful features at once. They are ideal for lightweight data-holding classes. Defining a case class is as of a class, however, with adding the literal case, and the val keyword for the constructors is optional. It also does not need a new keyword as there is a companion object with an apply method for the purpose of creating the case class. Example: case class Animal (kind:String, location:String){ def loc = println(kind + " lives in " + location) } // instantiate a case class using its class val polarbear_1 = new Animal("polar bear", "north pole") polarbear_1.loc // instantiate a case class using its companion object automatically built val polarbear_2 = Animal("polar bear", "north pole") polarbear_2.loc #### Features of case classes 1- Sensible equal and hashCode methods acting on the field values of the objects rather than the reference identity. Example: polarbear_1 == polarbear_2 // result res2: Boolean = true // but two different values polarbear_1.eq(polarbear_2) //eq method compares the reference ids //result res2: Boolean = false  2- Copy method val polarbear_3 = polarbear_1.copy() // it is a shallow copy 3- Pattern matching using case classes. Pattern match is an extended if. To match against the constructor values. Example: case class Animal (kind:String, location:String){ } object inspectAnimal{ def inspect(anAnimal: Animal): String = anAnimal match { // it reads: check if anAnimal matches a case and then output as insructed case Animal("bear", "north pole") => "this is a polar bear" case Animal("bear", "Banff") => "this is a Grizly or black bear" case Animal(_, "Banff") => "this might be a bear" case Animal(its_kind,its_location) => s"this is a/an$its_kind living in $its_location" //binds its_kind, its_location to the constructor values } } val animal_1 = Animal("bear", "north pole") val animal_2 = Animal("Marmot", "Banff") val animal_3 = Animal("cat", "my house") inspectAnimal.inspect(animal_1) // result: this is a polar bear inspectAnimal.inspect(animal_2) // result: this might be a bear inspectAnimal.inspect(animal_3) // result: this is a/an cat living in my house To mach against the type. Example: https://docs.scala-lang.org/tour/pattern-matching.html. #### 5- Abstract class ### 3- Tips with objects and classes #### 1- Overloading the methods (overloaded methods) Instead of creating several methods with different names and doing the same task but receiving different number of arguments and/or different types of parameters, we can create methods with the same name but different parameter and let Scala decide on the proper version of the method. Overloading methods is applicable to classes and singleton objects. Example: object doSomething{ def sumNumbers(a:Int, b:Int): Unit = println(a + b) def sumNumbers(a:Double, b:Double): Unit = println(a + b) def sumNumbers(a:Int, b:Int, c:Int): Unit = println(a + b + c) } doSomething.sumNumbers(2,3) // 5 doSomething.sumNumbers(12.345,34.567) // 46.912 doSomething.sumNumbers(1,2,7) // 10 #### 2- Auxiliary constructor Other than the primary constructors of a class, auxiliary constructors can also be utilized. Auxiliary constructors overload the constructors. To do this, the following should be considered: • Auxiliary constructors are defined by def keyword. • Like Method Overloading, all auxiliary constructors should use the same name this. • each auxiliary constructor must have a different different parameters list. • each auxiliary constructor must first call a previously defined primary constructor(s) or/and an auxiliary constructor(s). This is performed by name this. Nevertheless, constructor overloading can be implemented by setting default values for the primary constructors. ### 4- Traits (Subtyping) Classes are abstraction over objects while traits are abstraction over classes; in other words, traits are template for classes. Traits allow us to define a super type for several classes which are the subtypes. This super type is is outside of the Any super type that all classes share. Classes under a trait can share the implementation of the same operation. A trait creates interface that any subtype, i.e. classes, must implement (comply with). Compared with a class, a trait does not/cannot logically have a constructor. Classes (having constructors) can be created from a trait and objects from the classes. A trait can define abstract fields and methods that have names and type signature but no (concrete) implementation. An implementation is possible when defining an extending class under the trait. Implementing methods based on abstract methods in a trait is possible. Whatever abstract methods and fields are defined in a trait have to be then implemented in the extending class(es). Concrete (default) implementation of any method (or a field) in a trait is possible as well, however, it must be make sure that the method is valid in all the extending classes (subtypes). Otherwise, the method (or field) should be overridden in a subtype. Abstract fields in a trait can be defined either using val or def. However, it is recommended to use def because def is a generalization of val in Scala. If an abstract field is defined by def, then val will be used for its implementation in the extending class. Abstract methods (or fields) can have completely different implementations in different subtypes. Example: Animals can be wild or pet. Instead of creating two separate non-connected (in terms of sharing fields and methods) classes, we create two extending classes under a trait. trait Animal { def kind: String // can also be val kind: String def location: String def inspect: Unit = println(s"$kind lives in $location") def id: Unit } case class Wild (kind: String, location: String, idNumber: Int) extends Animal { def id: Unit = print(s"The wild animal id is:$idNumber")
}

case class Pet (kind: String, location: String, idNumber: Int) extends Animal{
def id: Unit = print(s"The pet id is: \$idNumber")

}

val bear_1 = Wild("bear", "north pole", 1212123)
val cat_1 = Pet ("Cat", "my house", 123456)
bear_1.inspect
bear_1.id
cat_1.inspect
cat_1.id
//results
/*
defined trait Animal

defined class Wild

defined class Pet

bear_1: Wild = Wild(bear,north pole,1212123)

cat_1: Pet = Pet(Cat,my house,123456)

bear lives in north pole
The wild animal id is: 1212123
Cat lives in my house
The pet id is: 123456
*/

# Scala with IntelliJ IDEA

### 1- Defining a package object

1. On the project side bar right click on Scala folder –> new –> package.
2. name the package custom.methods.
3. Right click on custom.methods –>new –> package object

The package object is now ready.

package custom
package object methods {
// define your custom classes, case classes, objects, methods here
}

# Gaussian process morphable models

Coming late.

But read the following article if you cannot wait for me writing this post:

[1] Gaussian Process Morphable Models, Marcel Luthi , Thomas Gerig, Christoph Jud, and Thomas Vetter. IEEE Transactions on Pattern Analysis and Machine Intelligence (Volume: 40 , Issue: 8 , Aug. 1 2018).

# Miscellaneous Proofs for SSM (1)

### 1) The sum of the squared of distances is minimized by the vector means

Prove . are vectors and is a vector norm coming from the inner product defined on the vector space; .

Proof:

\begin{aligned} \argmin_{\mu} \sum_{i=1}^{n} \|X_i-\mu\|^2=\argmin_{\mu}\sum_{i=1}^{n} \lang X_i-\mu , X_i-\mu \rang \\ =\argmin_{\mu}\sum_{i=1}^{n} \lang X_i,X_i \rang -2\lang X_i,\mu \rang + \lang\mu,\mu \rang \\ \text{as} \lang X_i,X_i\rang \text{ is a constant not depending on } \mu \\ =\argmin_{\mu}\sum_{i=1}^{n} -2\lang X_i,\mu\rang + \lang \mu,\mu \rang\\ =\argmin_{\mu}(n\lang \mu,\mu \rang-2 \sum_{i=1}^{n} \lang X_i,\mu \rang)\\ \text{by the linearity of inner product}\\ =\argmin_{\mu}(n\lang \mu,\mu \rang -2\lang\sum_{i=1}^{n} X_i,\mu\rang)\\ =\argmin_{\mu}(n\lang \mu,\mu \rang -2n\lang\frac{1}{n}\sum_{i=1}^{n} X_i,\mu\rang)\\ =\argmin_{\mu}(n\lang \mu,\mu \rang -2n\lang\ \overline{X},\mu\rang)\\ \text{dividing by a constant, n, won’t affect the }\argmin\\ =\argmin_{\mu}(\lang \mu,\mu \rang -2\lang\ \overline{X},\mu\rang)\\ \text{adding the constant} \lang \overline{X},\overline{X} \rang \text{ won’t affect the } \argmin \text{ therefore: }\\ =\argmin_{\mu}(\lang \mu,\mu \rang -2\lang\ \overline{X},\mu\rang +\lang \overline{X},\overline{X} \rang)\\ =\argmin_{\mu}(\lang \mu – \overline{X},\lang \mu – \overline{X}\rang )\\ =\argmin_{\mu}(\| \mu – \overline{X}\|^2 )\\ \end{aligned}

Because is larger than or equal zero, the minimum of the norm is zero, hence:

In a case that and are matrices, the norm is the Frobenius norm and the inner product is the Frobenius inner product.

### 2) Proof regarding the full GPA method

Proving the following:

\inf_{\beta_i,\Gamma_i,\gamma_i}\sum_{i=1}^{n} \|(\beta_i X_i\Gamma_i+1_k\gamma_i^\text{T}) – \frac{1}{n} \sum_{j=1}^{n} (\beta_j X_j\Gamma_j+1_k\gamma_j^\text{T})\|^2\\ =\\ \inf_{\beta_i,\Gamma_i,\gamma_i}\frac{1}{n}\sum_{i=1}^{n} \sum_{\underset{{j\lt n}}{j=i+1}}^{n}\|(\beta_i X_i\Gamma_i+1_k\gamma_i^\text{T}) – (\beta_j X_j\Gamma_j+1_k\gamma_j^\text{T})\|^2

The “inf” operator is not restated in the proof. Let . Therefore:

\sum_{i=1}^{n} \|A_i – \frac{1}{n} \sum_{j=1}^{n} A_j\|^2 =\frac{1}{n^2}\sum_{i=1}^{n} \|nA_i – \sum_{j=1}^{n} A_j\|^2\tag{I}

The Frobenius norm comes from the Frobenius inner product, i.e. . Therefore, above continues as:

=\frac{1}{n^2}\sum_{i=1}^{n} \lang nA_i – \sum_{j=1}^{n} A_j,nA_i – \sum_{j=1}^{n} A_j\rang

Inner product is a bi-linear form and hence by its linearity and commutative properties:

\begin{aligned} =\frac{1}{n^2}\sum_{i}( \lang nA_i – \sum_{j} A_j,nA_i\rang – \lang nA_i – \sum_{j} A_j,\sum_{j} A_j\rang)\\ =\frac{1}{n^2}\sum_{i}( \lang nA_i,nA_i\rang – \lang nA_i, \sum_{j} A_j\rang- \lang \sum_{j} A_j,nA_i\rang \\+ \lang \sum_{j} A_j,\sum_{j} A_j\rang)\\ =\frac{1}{n^2}\sum_{i}( n^2\lang A_i,A_i\rang – 2n\lang A_i, \sum_{j} A_j\rang + \lang \sum_{j} A_j,\sum_{j} A_j\rang) \end{aligned}

Again by applying the linearity property of inner product on the finite sums:

=\frac{1}{n^2}\sum_{i}( n^2\lang A_i,A_i\rang – 2n\sum_{j}\lang A_i, A_j\rang + \sum_{l}\sum_{j}\lang A_l, A_j\rang)

Applying the :

\begin{aligned} =\frac{1}{n^2}\sum_{i}( n^2\lang A_i,A_i\rang – 2n\sum_{j}\lang A_i, A_j\rang + \sum_{l}\sum_{j}\lang A_l, A_j\rang)\\ =\frac{1}{n^2}(n^2\sum_{i}\lang A_i,A_i\rang – 2n\sum_{i}\sum_{j}\lang A_i, A_j\rang\\ + \sum_{i}\sum_{l}\sum_{j}\lang A_l, A_j\rang)\\ =\frac{1}{n^2}(n^2\sum_{i}\lang A_i,A_i\rang – 2n\sum_{i}\sum_{j}\lang A_i, A_j\rang + n\sum_{l}\sum_{j}\lang A_l, A_j\rang)\\ \text{As the last two double-sigma terms are identical:}\\ =\frac{1}{n}(n\sum_{i}\lang A_i,A_i\rang – \sum_{i}\sum_{j}\lang A_i, A_j\rang )\tag{II} \end{aligned}

The rest can be done by induction, but for simplicity we continue with writing a number of terms to observe the pattern and reach the final result. With , and letting , the expansion of the above term in the parentheses (last equation) becomes:

(4-1)\lang A_1,A_1\rang+(4-1)\lang A_2,A_2\rang+(4-1)\lang A_3,A_3\rang+(4-1)\lang A_4,A_4\rang\\ -2\lang A_1,A_2\rang -2\lang A_1,A_3\rang-2\lang A_1,A_4\rang -2\lang A_2,A_3\rang -2\lang A_2,A_4\rang\\ -2\lang A_3,A_4\rang\\ =\lang A_1,A_1\rang -2\lang A_1,A_2\rang +\lang A_2,A_2\rang\\ +\lang A_1,A_1\rang -2\lang A_1,A_3\rang +\lang A_3,A_3\rang\\ +\lang A_1,A_1\rang -2\lang A_1,A_4\rang +\lang A_4,A_4\rang\\ +\lang A_2,A_2\rang -2\lang A_2,A_3\rang +\lang A_3,A_3\rang\\ +\lang A_2,A_2\rang -2\lang A_2,A_4\rang +\lang A_4,A_4\rang\\ +\lang A_3,A_3\rang -2\lang A_3,A_4\rang +\lang A_4,A_4\rang\\ \begin{aligned} =\|A_1-A_2\|^2+\|A_1-A_3\|^2+\|A_1-A_4\|^2+\|A_2-A_3\|^2\\ +\|A_2-A_4\|^2+\|A_3-A_4\|^2 \\ =\sum_{i=1}^{4}\sum_{\underset{{j\lt 4}}{j=i+1}}^{4}\|A_i-A_j\|^2 \end{aligned}

Consequently, for any number of terms Eq. II becomes . Hence, recalling Eq. I, we complete the proof:

\sum_{i=1}^{n} \|A_i – \frac{1}{n} \sum_{j=1}^{n} A_j\|^2=\frac{1}{n}\sum_{i=1}^{n}\sum_{\underset{{j\lt n}}{j=i+1}}^{n}\|A_i-A_j\|^2\qquad\checkmark