Linear Algebra Cheat Sheet (1)

0- Notations and convention

A variable/quantity/element can be a scalar, a vector, or a matrix; its type should be understood from the context if not explicitly declared.

0-1- A vector z\in \mathbb{F}^{n} is considered as a column vector, i.e. a single-column matrix. Therefore:

z\in \mathbb{F}^{n}\equiv z\in \mathbb{F}^{n\times 1}

A row vector is therefore defined as  w^\text T\in \mathbb{F}^{1\times n}.

0-2- Dot product of column vectors:

\text{For }x,y\in \mathbb{R}^{n\times 1}, \ x\cdot y:=x^\text Ty

0-3- The ij-th element of a matrix A is denoted by A_{ij}. The i-th column and j-th row of the matrix are respectively denoted by A_{,i} and A_{j,} .

0-4- Columns and rows of a matrix are column and row vectors. An m\times n matrix A can then be represented as:

\begin{bmatrix} v_1& v_2& \dots & v_n\end{bmatrix}\ \text{ s.t } v_i=A_{,i}

or

\begin{bmatrix}u_1^\text T\\ u_2^\text T\\ \vdots\\ u_m^\text T\end{bmatrix}\ \text{ s.t } u_i^\text T=A_{i,}


1- Orthogonal matrix

A square matrix A\in \mathbb{R}^{n\times n} is said (by definition) to be orthogonal iff A^{-1}=A^\text T, provided that A^{-1}, the inverse of A, exists. As a result, A^\text TA=AA^\text T = I_n.

The following are equivalent for A:
a) A is orthogonal.
b) the column vectors of A are ortho-normal.
c) the row vectors of A are ortho-normal.
d) A is size preserving: \|Ax\|=\|x\| for all x\in \mathbb{R}^{n\times 1}, \|.\| being the Euclidean norm.
e) A is dot-product preserving: Ax\cdot Ay=x\cdot y for all x,y\in \mathbb{R}^{n\times 1}.
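
A small added example: for any angle \theta, the 2D rotation matrix is orthogonal,

R(\theta)=\begin{bmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{bmatrix},\qquad R^\text T(\theta)R(\theta)=\begin{bmatrix}\cos^2\theta+\sin^2\theta & 0\\ 0 & \sin^2\theta+\cos^2\theta\end{bmatrix}=I_2

so its columns are orthonormal and it preserves lengths: \|R(\theta)x\|=\|x\| for all x\in \mathbb{R}^{2\times 1}.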


2- Some matrix multiplication identities

A\in \mathbb{F}^{m\times n},\ x\in \mathbb{F}^{n\times 1}\ \text{ then } Ax=\sum_{i=1}^n x_iA_{,i}

A\in \mathbb{F}^{m\times n},\ D\in \mathbb{F}^{n\times n}\ \text{ and diagonal},\ B\in \mathbb{F}^{n\times k},\ \text{ s.t.}\\ \ \\ A=\begin{bmatrix} v_1& v_2& \dots & v_n\end{bmatrix}, \ B=\begin{bmatrix}u_1^\text T\\ u_2^\text T\\ \vdots\\ u_n^\text T\end{bmatrix},\ D_{ij}= \begin{cases} \lambda_i &\text{, } i=j \\ 0 &\text{, } i\ne j \end{cases} \\ \ \\ \text{then }ADB=\sum_{i=1}^n \lambda_iv_i u_i^\text T
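
As a quick added illustration of the first identity, for a 2\times 2 case:

A=\begin{bmatrix}1 & 3\\ 2 & 4\end{bmatrix},\ x=\begin{bmatrix}x_1\\ x_2\end{bmatrix} \implies Ax=\begin{bmatrix}x_1+3x_2\\ 2x_1+4x_2\end{bmatrix}=x_1\begin{bmatrix}1\\2\end{bmatrix}+x_2\begin{bmatrix}3\\4\end{bmatrix}=x_1A_{,1}+x_2A_{,2}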

3- Change of basis

Let \beta:=\{b_1,\dots,b_n\} and \beta':=\{b'_1,\dots,b'_n\} be two bases of \mathbb R^n. Then the following holds for the coordinates of a vector v\in\mathbb R^n with respect to the two bases:

[v]_{\beta'}=P_{\beta\to\beta'}[v]_{\beta}\quad\text{s.t} \\ \ \\ P_{\beta\to\beta'}=\begin{bmatrix} [b_1]_{\beta'}& [b_2]_{\beta'}& \dots & [b_n]_{\beta'}\end{bmatrix}

It can be proved that:

[v]_{\beta}=P_{\beta'\to\beta}[v]_{\beta'}\quad\text{s.t} \\ \ \\ P_{\beta'\to\beta}=P^{-1}_{\beta\to\beta'}
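
A short added example in \mathbb R^2: let \beta=\{e_1,e_2\} be the standard basis and \beta'=\{b'_1=(1,1)^\text T,\ b'_2=(-1,1)^\text T\}. Expressing e_1 and e_2 in \beta' coordinates gives

P_{\beta\to\beta'}=\begin{bmatrix}[e_1]_{\beta'} & [e_2]_{\beta'}\end{bmatrix}=\frac{1}{2}\begin{bmatrix}1 & 1\\ -1 & 1\end{bmatrix},\qquad P_{\beta'\to\beta}=P^{-1}_{\beta\to\beta'}=\begin{bmatrix}1 & -1\\ 1 & 1\end{bmatrix}

For v=(2,0)^\text T, [v]_{\beta'}=P_{\beta\to\beta'}[v]_{\beta}=(1,-1)^\text T, and indeed 1\cdot b'_1+(-1)\cdot b'_2=(2,0)^\text T.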

4- Rank of a matrix

Let A\in  \mathbb{R}^{m\times n} (the results also hold for \mathbb{C}^{m\times n}). Then, the column rank/row rank of A is defined as the dimension of the column/row space of A, i.e. the dimension of the vector space spanned by the columns/rows of A; this is equivalent to the number of linearly independent columns/rows (column/row vectors) of A.

Theorem: column rank of A = row rank of A.

Definition: the rank of a matrix, rank(A), is the dimension of either the column or row space of A; simply the number of linearly independent columns or rows of A.

Definition: for a linear map f: \mathbb{R}^n\longrightarrow  \mathbb{R}^m, the rank of the linear map is defined as the dimension of the image of f. This definition is equivalent to the definition of the matrix rank as every linear map f:\mathbb{R}^n\longrightarrow  \mathbb{R}^m has a matrix A\in  \mathbb{R}^{m\times n} by which it can be written as f(x)=Ax.

Proposition: \text{rank}(A)\le \min(m,n). This leads to these definitions: A matrix A is said to be full rank iff \text{rank}(A)= \min(m,n), i.e. the largest possible rank, and it is said to be rank deficient iff \text{rank}(A)\lt \min(m,n), i.e. not having full rank.
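
For example (a small added illustration):

A=\begin{bmatrix}1 & 2 & 3\\ 2 & 4 & 6\end{bmatrix}\implies \text{rank}(A)=1\lt \min(2,3)=2

since the second row is twice the first; A is therefore rank deficient.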

Properties of rank

For  A\in  \mathbb{R}^{m\times n}:

1- only a zero matrix has rank zero.

2- If B\in  \mathbb{R}^{n\times k}, then  \text{rank}(AB)\le \min( \text{rank}(A) , \text{rank}(B))

3- If B\in  \mathbb{R}^{m\times n}, then \text{rank}(A+B)\le \text{rank}(A)+ \text{rank}(B)

4-  \text{rank}(A)=\text{rank}(A^{\text T})

5-  \text{rank}(A)= \text{rank}(A^{\text T}) =\text{rank}(AA^{\text T})= \text{rank}(A^{\text T}A)

6- If v\in \mathbb{R}^{r\times 1} is non-zero, then V=vv^\text T \in \mathbb{R}^{r\times r} has \text{rank}(V)=1. In addition, for v_i \in \mathbb{R}^{r\times 1}, \text{rank}\big(S=\sum_{i=1}^{n} v_iv_i^\text T\big)\le n, i.e. S has rank at most n.

7- A diagonalizable square matrix C\in \mathbb{R}^{n\times n} can be decomposed as C=U\Gamma U^{-1} where \Gamma is a diagonal matrix containing the eigenvalues of C. Then, \text{rank}(C)=\text{rank}(\Gamma), i.e. the number of non-zero eigenvalues of C.

8- For a square matrix C\in \mathbb{R}^{n\times n}, the following are equivalent: C is full rank, C is invertible, C has non-zero determinant, and C has n non-zero eigenvalues.

Proofs

P6:

V=\begin{bmatrix}v_1\begin{bmatrix}v_1\\v_2\\ \vdots\\v_r \end{bmatrix}&v_2\begin{bmatrix}v_1\\v_2\\ \vdots\\v_r \end{bmatrix}&\dots &v_r\begin{bmatrix}v_1\\v_2\\ \vdots\\v_r \end{bmatrix}\end{bmatrix}

where v_i\in \mathbb{R} are the coordinates of the vector v. This indicates that each column of V is a scalar multiple of the vector v; therefore, the column space is one-dimensional (for v\ne 0). Hence, \text{rank}(V)=1.

For the second part, applying property 3 repeatedly to the sum of the rank-one terms v_iv_i^\text T proves the statement.

SSM Using Scalismo (4): Generalized Procrustes analyses

In this post, I am going to implement the partial and the full generalized Procrustes analyses (partial GPA and full GPA) with Scalismo. Before getting into the implementation, you may want to read this post explaining the theory of GPA’s.

Partial GPA

Through partial GPA, more than 2 configurations/meshes get fitted/superimposed onto a common (and initially unknown) mean configuration. The procedure is iterative and involves translation and rotation of the meshes.

GPA’s need meshes in correspondence; here, I load the femoral head meshes whose correspondence was previously established through non-rigid registration (this post). Initializing and loading the meshes:


object GPA extends App {
  import scalismo.geometry._
  import scalismo.common._
  import scalismo.ui.api._
  import scalismo.mesh._
  import scalismo.io.{StatisticalModelIO, MeshIO}
  import scalismo.statisticalmodel._
  import scalismo.registration._
  import scalismo.statisticalmodel.dataset._
  import java.awt.Color

  scalismo.initialize()
  implicit val rng = scalismo.utils.Random(42)

  val ui = ScalismoUI()
 // Loading data
  val drive = "I:/"
  val meshDir = drive + "registration/Registered/"
  val meshFiles = new java.io.File(meshDir).listFiles()
  meshFiles.foreach(file => println(s"\n$file"))
  println(meshFiles)

Visualizing the meshes:

val dsGroup = ui.createGroup("dataset") // UI group for the loaded meshes, used by ui.show below (the group name is arbitrary)
  val (meshes, meshViews) = meshFiles.map(meshFile => {
    val mesh = MeshIO.readMesh(meshFile).get
    val meshView = ui.show(dsGroup, mesh, "mesh")
    (mesh, meshView) // returns a tuple of the mesh and the associated view
  }) .unzip  //unzip the tuple into two separate lists

  // setting some colors
  meshViews(0).color = new Color(85,235,40)
  meshViews(1).color = new Color(20,40,250)
  meshViews(2).color = new Color(250,150,20)
Fig. 1. Meshes and views of their cross-sectional contours.

Here, methods are defined to calculate the centroid (center of mass) of a mesh and to center the mesh. This is not strictly necessary for GPA, since GPA itself involves translation. Also, the GPA method in Scalismo calculates the centroid of a given reference mesh and uses it as the center of rotation of the meshes throughout the process. It does no harm if all the meshes are initially centered; moreover, it is consistent with the GPA algorithm explained in the related post.

def CM (mesh:TriangleMesh[_3D]): EuclideanVector3D = {
    val numOfPoints = mesh.pointSet.numberOfPoints
    var total = EuclideanVector3D(0,0,0)
    val meshPoints = mesh.pointSet.points //iterator
    while (meshPoints.hasNext){ //iterator
      total = total + (meshPoints.next()).toVector
    }
    total*(1d/numOfPoints)
  }

  val centeredMeshes = meshes.map(mesh => {
    val translationVec = CM(mesh)*(-1)
    val translation = TranslationTransform[_3D](translationVec)
    mesh.transform(translation)
  })

To do GPA’s with Scalismo, a DataCollection should be defined. A DataCollection needs a reference mesh and a sequence of meshes, all in correspondence. The GPA will be performed on the sequence of meshes. The reference mesh is used as the domain of the transformations. A mean mesh, in Scalismo, is defined by adding the mean deformations (vectors) of the meshes from the reference mesh to the reference mesh; it should be remembered that all the meshes are in correspondence. It does not matter which one, but one of the meshes should be selected as the reference mesh.

  // General Procrustes Analysis: Aligns a set of meshes with respect to some unknown “mean” shape (independent of ordering of shapes)
  // setting a reference mesh; it is needed for the data collection but has no effect on the GPA
  val refMesh = centeredMeshes(2)
  val allMeshes = centeredMeshes
  val dataClctn = ((DataCollection.fromMeshSequence(refMesh,allMeshes))._1).get
  // to check how many meshes are in the data collection
  println(dataClctn.size)

Performing the partial GPA (number of iterations = 10, precision of average point-wise distance between meshes (halt distance) = 1e-5):

// doing the partial GPA
  val gpaDataClctn = DataCollection.gpa(dataClctn ,10,1e-5)

Retrieving the mean mesh and visualization:

// retrieving the mean mesh
  val meanMesh = gpaDataClctn.meanSurface
  //CM(meanMesh) // to check whether the mean mesh is centered or not
  val meanGroup = ui.createGroup("meanMesh")
  val meanMeshViewg = ui.show( meanGroup, meanMesh,"meanMesh")
  meanMeshViewg.color = new Color(245,20,50)
Fig. 2. Mean mesh (red) and views of cross-sectional contours of the mean and the original meshes.

The original meshes transformed onto the mean mesh are called the partial Procrustes fits. They can be retrieved and viewed:

//aligning by translation and rotation all the meshes onto the mean mesh
  val meanMeshPoints = meanMesh.pointSet.points.toIndexedSeq

  val gpaTransforms = for (mesh <- allMeshes) yield {
    val meshPoints = mesh.pointSet.points.toIndexedSeq
    LandmarkRegistration.rigid3DLandmarkRegistration(meshPoints.zip(meanMeshPoints), center = Point3D(0, 0, 0))
  }
  val gpaGroup = ui.createGroup("partial Gpa meshes")
 //partial GPA fits
  val transformedMeshes = (allMeshes.zip(gpaTransforms)).map(z => {
    val mesh=z._1
    val transToMean = z._2
    mesh.transform(transToMean)
  })
  val  transfrmdMeshViews = transformedMeshes.map(mesh => ui.show(gpaGroup, mesh, "trMesh"))
Fig. 3. The original and the partial-GPA fits.

The centroid size (see this post) of each mesh in this example can be calculated by this method:

def centroidSize (mesh:TriangleMesh3D): Double ={
    var total = 0d
    val meshPoints = mesh.pointSet.points //iterator
    val centroid = CM(mesh) // is a vector here
    while (meshPoints.hasNext){ //iterator
      total = total + scala.math.pow(((meshPoints.next().toVector - centroid).norm),2)
    }
    scala.math.sqrt(total)
  }

For example:

  centroidSize(transformedMeshes(0))
  centroidSize(allMeshes(0))
  centroidSize(meanMesh)

The centroid sizes are (in units of length):
original mesh 1: 487.222, mesh 1 partial-GPA fit: 487.222
original mesh 2: 692.725, mesh 2 partial-GPA fit: 692.725
original mesh 3: 566.668, mesh 3 partial-GPA fit: 566.668
mean mesh: 577.732

As expected, each original mesh and its partial-GPA fit have the same size, since the partial GPA involves no scaling. The mean mesh, however, generally has a different size, as observed.

Full GPA

The full GPA entails scaling, translation and rotation of the meshes in order to superimpose each mesh onto the (unknown) mean mesh. Scalismo does not currently have a method performing the full GPA, but its partial GPA can be adapted to do the full GPA. To this end, we can access the package scalismo.statisticalmodel.dataset in the file DataCollection.scala and copy the pieces of code implementing the partial GPA method. This is already done below. The copied lines of code are used to define new methods. The only difference between the full-GPA method and the partial one is that the former uses similarity3DLandmarkRegistration instead of rigid3DLandmarkRegistration.

import scalismo.utils.Random
  def fGpa(dc: DataCollection, maxIteration: Int = 3, haltDistance: Double = 1e-5)(implicit rng: Random): DataCollection = {
    fGpaComputation(dc, dc.meanSurface, maxIteration, haltDistance)
  }
  def fGpaComputation(dc: DataCollection, meanShape: TriangleMesh[_3D], maxIteration: Int, haltDistance: Double)(implicit rng: Random): DataCollection = {

    if (maxIteration == 0) return dc

    val referencePoints = dc.reference.pointSet.points.toIndexedSeq
    val numberOfPoints = referencePoints.size
    val referenceCenterOfMass = referencePoints.foldLeft(Point3D(0, 0, 0))((acc, pt) => acc + (pt.toVector / numberOfPoints))

    val meanShapePoints = meanShape.pointSet.points.toIndexedSeq

    // align each shape to the current mean and create a transformation from the reference to the aligned shape
    val dataItemsWithAlignedTransform = dc.dataItems.par.map { dataItem =>
      val surface = dc.reference.transform(dataItem.transformation)
      val transform = LandmarkRegistration.similarity3DLandmarkRegistration(surface.pointSet.points.toIndexedSeq.zip(meanShapePoints), referenceCenterOfMass)

      DataItem("gpa -> " + dataItem.info, Transformation(transform.compose(dataItem.transformation)))
    }

    val newdc = DataCollection(dc.reference, dataItemsWithAlignedTransform.toIndexedSeq)
    val newMean = newdc.meanSurface

    if (MeshMetrics.procrustesDistance(meanShape, newMean) < haltDistance) {
      newdc
    } else {
      fGpaComputation(newdc, newMean, maxIteration - 1, haltDistance)
    }
  }

Let’s compute the mean mesh and then superimpose (by the similarity transformations) the meshes onto the mean mesh; the code is:

val fGpaDataClctn = fGpa(dataClctn ,20,1e-5)
  val fGpaMeanMesh = fGpaDataClctn.meanSurface
  //CM(meanMesh) // to check whether the mean mesh is centered or not
  val fGpaMeanGroup = ui.createGroup("fullGpaMeanMesh")
  val fGpaMeanMeshView = ui.show( fGpaMeanGroup,fGpaMeanMesh ,"fullGpaMeanMesh")
  fGpaMeanMeshView.color = new Color(245,20,50)

  val fGpaGroup = ui.createGroup("Full Gpa Meshes")
  val fGpaMeanMeshPoint = fGpaMeanMesh.pointSet.points.toIndexedSeq

  val fullGpaTransforms = for (mesh <-allMeshes) yield {
    val meshPoints = mesh.pointSet.points.toIndexedSeq
    // transformations include scaling
    LandmarkRegistration.similarity3DLandmarkRegistration(meshPoints.zip(fGpaMeanMeshPoint), center = Point3D(0, 0, 0))
  }
  //full GPA fits
  val fGpaTransformedMeshes = allMeshes.zip(fullGpaTransforms).map (z=> z._1.transform(z._2))
  val fGpaTransformedMesheViews = fGpaTransformedMeshes.map(mesh => ui.show(fGpaGroup, mesh, "trMesh"))

  fGpaTransformedMesheViews(0).color = new Color(85,235,40)
  fGpaTransformedMesheViews(1).color = new Color(20,40,250)
  fGpaTransformedMesheViews(2).color = new Color(250,150,20)

Here are the views:

Fig. 4. full-GPA Mean mesh (red) and views of cross-sectional contours of the mean and the original meshes.
Fig. 5. The original meshes and their full-GPA fits.

Again, the centroid sizes can be calculated as:
original mesh 1: 487.222, mesh 1 full-GPA-transformed: 418.329
original mesh 2: 692.725, mesh 2 full-GPA-transformed: 417.958
original mesh 3: 566.668, mesh 3 full-GPA-transformed: 419.082
mean mesh: 421.707

As can be observed, in the full GPA the original meshes are scaled to fit the mean mesh. The sizes of the transformed meshes and of the mean mesh are close to each other, because scaling is part of the transformations fitting the meshes onto a common mean. It should also be noted that the (relative) scales of the meshes are not preserved.

Saving the mean meshes and the fits into files and closing the object:

MeshIO.writeMesh(meanMesh, new java.io.File(drive + "registration/Partial GPA/" + "meanMesh.stl"))
  (0 until 3).foreach(i => MeshIO.writeMesh(transformedMeshes(i), new java.io.File(drive + "registration/Partial GPA/" + s"mesh${i}.stl")))

  MeshIO.writeMesh(fGpaMeanMesh, new java.io.File(drive + "registration/Full GPA/" + "meanMesh.stl"))
  (0 until 3).foreach(i => MeshIO.writeMesh(fGpaTransformedMeshes(i), new java.io.File(drive + "registration/Full GPA/" + s"mesh${i}.stl")))

}

Full GPA and Shapes

Until now, I have not referred to a mesh/surface as a shape, because a shape has its own technical definition. The shape of an object/configuration/mesh is all of its geometrical information modulo its scale, location, and rotational effects. In other words, a shape is an equivalence class of geometrical objects modulo similarity transformations. The full GPA removes the scale, location, and rotational effects of the objects/meshes (relative to a mean object/mesh) in order to fit them all onto the mean mesh, which is the sample full Procrustes mean shape \hat\mu_F. Therefore, whatever geometrical variations remain between the mean shape and each of the full Procrustes fits (full-GPA-transformed meshes) reflect the variations due to differences/dissimilarities in their shapes.

Probability and Stats Cheat Sheet (1)

0- Conventions

1- r-Vectors in matrix operations are considered as column vectors (matrices):

x\in \mathbb{R}^r\equiv\begin{bmatrix}x_1\\x_2\\x_3\\ \vdots \\ x_r \end{bmatrix}\in\mathbb{R}^{r\times 1} \implies x^{\rm T}=\begin{bmatrix}x_1&x_2&x_3& \dots & x_r \end{bmatrix}

2- Column vector inner/dot product:

x\cdot y \equiv (x)^\text T(y)=(y)^\text T(x)=x^\text Ty=y^\text Tx\ \in \mathbb{R} \ \forall x,y\in \mathbb{R}^{r\times 1}

1- Population expected value and covariance matrix

1- Let X be a random r-vector [X_1, \dots, X_r]^\text{T}:

\mu_X= \text E[X]:=[\text E[X_1],\dots,\text E[X_r]]^\text T=[\mu_1,\dots,\mu_r]^\text T\\ \ \\ \text E[X^\text T]:=(\text E[X])^\text T\\ \ \\ \Sigma_{XX}:=\underset{\color{blue}{\text{or } \text{Var}(X)}}{\text{Cov}}(X,X):=\text E\big[(X-\mu_X)(X-\mu_X)^\text T\big]=[\sigma_{ij}^2] \in\mathbb{R}^{r\times r}

where, \sigma_{ij}^2=\text{var}(X_i)=\text E[(X_i-\mu_i)^2] \  \text { for }i=j, and \sigma_{ij}^2=\text{cov}(X_i,X_j)=\text E[(X_i-\mu_i) (X_j-\mu_j) ] \  \text { for } i\ne j.

\Sigma_{XX}=\text E[XX^\text T]-\mu_X\mu_X^\text T\\ \ \\ \Sigma_{XX}=\text E[XX^\text T] \iff \mu_X=0

2- For random vectors X\in \mathbb{R}^{r\times 1} and Y\in\mathbb{R}^{s\times 1}:

\Sigma_{XY}:=\text{Cov}(X,Y):=\text E\big [(X-\mu_X)(Y-\mu_Y)^\text T\big]=\Sigma_{YX}^\text T\ \in\mathbb{R}^{r\times s}\\ \ \\ \Sigma_{XY}=\text E\big [XY^\text T\big] \iff \mu_X=\mu_Y=0

Remark: \Sigma_{XX} is a symmetric matrix but \Sigma_{XY} is not necessarily symmetric.
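
For instance (an added illustration), for a random 2-vector X=[X_1,X_2]^\text T:

\Sigma_{XX}=\begin{bmatrix}\text{var}(X_1) & \text{cov}(X_1,X_2)\\ \text{cov}(X_2,X_1) & \text{var}(X_2)\end{bmatrix}

which is symmetric since \text{cov}(X_1,X_2)=\text{cov}(X_2,X_1).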


3- Y \in \mathbb {R}^{s \times 1} linearly related to X  \in \mathbb {R}^{r \times 1} as  Y=AX+b with A\in \mathbb {R}^{s \times r} and b\in \mathbb {R}^{s \times 1} a constant:

\mu_Y=A\mu_X + b \ \in \mathbb{R}^{s \times 1}\\ \ \\ \Sigma_{YY}=A\Sigma_{XX} A^\text T\ \in \mathbb{R}^{s \times s}

4- X,\ Y\in  \mathbb {R}^{r \times 1} and Z\in  \mathbb {R}^{s \times 1}:

\text {Cov}(X+Y,Z)=\Sigma_{XZ}+\Sigma_{YZ}\\ \ \\ \text {Cov}(X+Y,X+Y)=\Sigma_{XX}+\Sigma_{XY}+\Sigma_{YX}+\Sigma_{YY}

5- X \in  \mathbb {R}^{r \times 1}, Y\in  \mathbb {R}^{s \times 1}, A\in  \mathbb {R}^{p \times r} and B\in  \mathbb {R}^{q \times s}:

\text {Cov}(AX,BY)=A\Sigma_{XY}B^\text T\ \in \mathbb {R}^{p \times q}

For the proof, expand \text E[(AX-\text E [AX])(BY-\text E [BY])^\text T].

In a special case:

\text {Var}(AX)=\text {Cov}(AX,AX)=A\Sigma_{XX}A^\text T\ \in \mathbb {R}^{p \times p}

6- \Sigma_{XX} is positive semi-definite. Proof: show that \forall u\in \mathbb{R}^{r\times 1},\  u^\text T\Sigma_{XX}u \ge 0. Use 5 in the proof.

u^\text T\Sigma_{XX}u=\text{Cov}(u^\text TX, u^\text TX)=\text{Var}(Z)\ge 0,\qquad Z:=u^\text TX \in \mathbb{R}

7- A random vector X is centred iff \mu_X=0. Any random vector can be centred by the transformation X_c=X-\mu_x. The following expressions are useful (proof is by expansion/writing the terms and induction):

\forall X\in \mathbb{R}^{r\times 1},\ Y\in \mathbb{R}^{t\times 1},\ C\in \mathbb{R}^{r\times t}:\\ \ \\ \Sigma_{XY}=\text E\big [(X-\mu_X)(Y-\mu_Y)^\text T\big]=\text E\big [X_cY_c^\text T\big]=\Sigma_{X_cY_c}\\ \ \\ \Sigma_{XX}=\text E\big [(X-\mu_X)(X-\mu_X)^\text T\big]=\text E\big [X_cX_c^\text T\big]=\Sigma_{X_cX_c}\\ \ \\ \text E[X_c^\text TX_c]=\text {tr}(\Sigma_{XX})\\ \ \\ \text E[X_c^\text TC Y_c]=\text {tr}(C\Sigma_{YX})\\ \ \\ \text E[Y_c^\text T C^\text T C Y_c]=\text {tr}(C\Sigma_{YY}C^\text T)

The above expressions involving the trace can take different forms according to the properties of the trace (e.g. \text{tr}(AB)=\text{tr}(BA)).

8- Conditional probability density function:

Miscellaneous Proofs and Theorems in Probability and Statistics (1)

1- A minimization problem regarding PCA:

Given that X,\mu \in \mathbb{R}^{r\times 1},\ A \in \mathbb{R}^{r\times t},\ B \in \mathbb{R}^{t\times r}, t\le r, and A and B are both of full rank t, prove that the minimizing parameters, \{\mu,A,B\} of

\text E\big [(X-\mu - ABX)^T(X-\mu - ABX)\big ]\ \in \mathbb{R}^+

are:

A=\begin{bmatrix}|&|& \dots & |\\v_1 &v_2 &\dots &v_t\\|&|& \dots & | \end{bmatrix}=B^\text T\\ \ \\ \mu=(I_r-AB)\mu_X

where v_1,\dots,v_t are the first t eigenvectors of \Sigma_{XX}, sorted so that the associated eigenvalues are in decreasing order, i.e. i\lt j \implies \lambda_i(\Sigma_{XX})\ge \lambda_j(\Sigma_{XX}). \lambda_i(\Sigma_{XX}) is the i-th eigenvalue of \Sigma_{XX}, and I_r\in \mathbb{R}^{r\times r} is the identity matrix.

Let’s write C:=AB and X=X_c+\mu_x s.t X_c is the centred X, and do the expansion as follows:

\begin{aligned} \text E\big [(X_c+\mu_x-\mu - C(X_c+\mu_x))^T(X_c+\mu_x-\mu - C(X_c+\mu_x))\big]\\ =\text E\big[ (X_c-CX_c)^\text T(X_c-CX_c)\big]+\text E\big [ (X_c-CX_c)^\text T\big](\mu_x-\mu-C\mu_x) \\+(\mu_x-\mu-C\mu_x)^\text T \text E\big [ (X_c-CX_c) \big] \\+\text E\big [(\mu_x-\mu-C\mu_x)^\text T(\mu_x-\mu-C\mu_x)\big] \end{aligned}

As \text E[X_c]=0, and the 4th term is the expectation of a constant, we can write:

=\text E\big[ (X_c-CX_c)^\text T(X_c-CX_c)\big]+ (\mu_x-\mu-C\mu_x)^\text T(\mu_x-\mu-C\mu_x)

The resultant expression contains non-negative terms. A necessary condition for minimizing the expression is that the 2nd term attains its minimum, i.e. zero. Therefore, \mu=\mu_x-C\mu_x=(I_r-C)\mu_x.

Now, we have to find C such that it minimizes the 1st term, i.e.:

E\big[ (X_c-CX_c)^\text T(X_c-CX_c)\big]

Expanding the expression, we get the above as:

\begin{aligned} =\text E\big[X_c^\text T X_c- X_c^\text T CX_c- X_c^\text T C^\text T X_c+ X_c^\text T C^\text T CX_c\big] \end{aligned}

By the relation to the covariance matrix (see this post), the above becomes:

=\text {tr}(\Sigma_{XX}-C\Sigma_{XX}-C^\text T\Sigma_{XX}+C\Sigma_{XX}C^\text T)\ \in \mathbb{R}^+

By the properties of trace, i.e. \text {tr}(A+B)= \text {tr} (A)+ \text {tr} (B) and \text {tr}(AB)= \text {tr}(BA), we have:

=\text {tr}(\Sigma_{XX}-C\Sigma_{XX}-\Sigma_{XX}C^\text T+C\Sigma_{XX}C^\text T)

As \Sigma_{XX}=\Sigma_{XX}^\text T, and by the definition of matrix power, we can factor the above expression as:

=\text {tr}\big(\Sigma_{XX}-\Sigma_{XX}+(C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})(C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})^\text T\big)\\ \ \\ =\text {tr}\big((C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})(C\Sigma_{XX}^{\frac{1}{2}}-\Sigma_{XX}^{\frac{1}{2}})^\text T\big)

The problem is now to find C that minimizes:

\text {tr}\big((\Sigma_{XX}^{\frac{1}{2}}-C\Sigma_{XX}^{\frac{1}{2}})(\Sigma_{XX}^{\frac{1}{2}}-C\Sigma_{XX}^{\frac{1}{2}})^\text T\big)

which is equal to the squared Frobenius norm:

\|\Sigma_{XX}^{\frac{1}{2}}-C\Sigma_{XX}^{\frac{1}{2}}\|_F^2

with \|.\|_F being the Frobenius norm of matrices; minimizing the squared norm is equivalent to minimizing the norm itself.

Considering the following facts:

1- If A\in\mathbb{R}^{r\times r} is symmetric, it is diagonalizable, and for q\in \mathbb{Q} (with A positive semi-definite when q is fractional) the eigenvalues of A and A^q are related as \lambda_i(A^q)=\lambda_i^q(A). The eigenvectors of A and A^q are the same.

2- \Sigma_{XX}^{\frac{1}{2}} is symmetric and has the same rank, r, as \Sigma_{XX}\in \mathbb{R}^{r\times r}.

3- C=AB has rank at most t\le r; therefore, C\Sigma_{XX}^{\frac{1}{2}} has rank at most t.

4- A partial/truncated singular value decomposition (SVD) of a matrix A \in \mathbb{R}^{m\times n} is A_k=\sum_{i=1}^{k}\sigma_i u_i v_i^\text T s.t. u_i, v_i and \sigma_i are from the SVD of A, i.e. A=U\Sigma V^\text T. If A is symmetric positive semi-definite, then A=V \varLambda V^\text T and A_k=\sum_{i=1}^{k}\lambda_i v_i v_i^\text T s.t. \lambda_i is an eigenvalue of A. Note that \lambda_i decreases with i (the eigenvalues are sorted in decreasing order).

5- As a corollary of Eckart-Young-Mirsky Theorem: Let A,B\ \in \mathbb{R}^{m\times n} with \text {rank}(B)=k\le  \text {rank}(A) \le \text{min}(m,n), then A_k=\argmin_{B} \|A-B\|_F.

we can write:

\argmin_B \|\Sigma_{XX}^{\frac{1}{2}}-B\|_F=\sum_{i=1}^{t}\lambda_i^{\frac{1}{2}} v_i v_i^\text T\\ \ \\ \implies C\Sigma_{XX}^{\frac{1}{2}}=\sum_{i=1}^{t}\lambda_i^{\frac{1}{2}} v_i v_i^\text T\\ \ \\ \Sigma_{XX}\text{ being positive definite} \implies C=(\sum_{i=1}^{t}\lambda_i^{\frac{1}{2}} v_i v_i^\text T)\Sigma_{XX}^{-\frac{1}{2}}\tag{1}

where \lambda_i=\lambda_i(\Sigma_{XX}), and v_i is the associated (normalized) eigenvector of \Sigma_{XX}.

As \Sigma_{XX}^\frac{1}{2}=V\varLambda^\frac{1}{2}V^\text T with \varLambda the diagonal matrix of eigenvalues of \Sigma_{XX} and V the eigenvectors’ unitary matrix, we can write: \varLambda^{-\frac{1}{2}} V^\text T\Sigma_{XX}^\frac{1}{2}=V^\text T. Therefore:

v_i^\text T=\lambda_i^{-\frac{1}{2}}v_i^\text T\Sigma_{XX}^\frac{1}{2}

Substituting the above in Eq. 1, we’ll get:

C=\sum_{i=1}^{t} v_i v_i^\text T=\begin{bmatrix}|&|& \dots & |\\v_1 &v_2 &\dots &v_t\\|&|& \dots & | \end{bmatrix} \begin{bmatrix}\text{—}v_1\text{—}\\\text{—}v_2\text{—}\\ \dots\\ \text{—}v_t\text{—} \end{bmatrix}=V^*V^{*\text T}

As C=AB, we accomplish the proof by letting A=V^*.

2- Eckart-Young-Mirsky Theorem

Let A,B\ \in \mathbb{R}^{m\times n} with \text {rank}(B)=k\le  \text {rank}(A) \le \text{min}(m,n), then \|A-A_k\|_F \le  \|A-B\|_F s.t. A_k is the partial SVD of A, i.e. A_k=\sum_{i=1}^{k}\sigma_i u_i v_i^\text T.

PCA: Introduction

Different kinds of inter-correlated data can be collected into a random vector. For example, we could consider the total length, weight, diameter of the nose holes, tail length, number of teeth, colour of the eyes, radius of the acetabula, and number of vertebrae of dinosaurs as data and collect them into a (random) vector. Principal component analysis (PCA) is a method for reducing the dimension of a random vector (data), and also for discovering outstanding features of the data. Using PCA, one can replace a random vector with another vector having a smaller dimension without losing much information. In PCA, the measure of information loss is the total variation of the original variables collected in the random vector.

See this cheat sheet for the mathematical identities used in this post.

Least-squares optimality of PCA

Assume random variables (X_i) collected in a random r-vector X=[X_1,...,X_r]^T , i.e. X\in \mathbb{R}^{r\times 1}. It has a mean \mu_X \in  \mathbb{R}^{r\times 1} and a covariance matrix \Sigma_{XX} \in  \mathbb{R}^{r\times r}. A linear projection of the random variables means:

\xi=BX\text{ s.t. } \xi \in \mathbb{R}^{t\times 1}\text{ and }B \in \mathbb{R}^{t\times r} \\ t\le r

Each row of B linearly combines the random variables to create a new random variable.

Let’s project the r-vector X onto a lower dimensional space and obtain the t-vector \xi with t\lt r. We then want to find a linear map that reconstructs X out of the lower-dimensional \xi. This means finding a vector \mu \in \mathbb{R}^{r\times 1} and a matrix A \in \mathbb{R}^{r\times t} such that  X\approx \mu + A\xi. The approximation is in the least-squares sense of minimizing the mean error. We consider the following as the measure of how well the linear map can reconstruct X out of its linear projection \xi:

\text E\big [(X-\mu - A\xi)^T(X-\mu - A\xi)\big ]\ \in \mathbb{R}^{+}

Therefore, we should find \mu, A, and B by minimizing the following mean:

\text E\big [(X-\mu - ABX)^T(X-\mu - ABX)\big ]\ \in \mathbb{R}^+\tag{1}

This expression is the mean of the squared norm/magnitude of the error vector, which is itself a random vector.

With the assumption that both A and B are of full rank t, the minimizing parameters will be (see the proof here):

A=\begin{bmatrix}|&|& \dots & |\\v_1 &v_2 &\dots &v_t\\|&|& \dots & | \end{bmatrix}=B^\text T\\ \ \\ \mu=(I_r-AB)\mu_X \tag{2}

where v_i \in \mathbb{R}^{r \times 1} is a normalized eigenvector of the covariance matrix \Sigma_{XX} and it is associated with the i-th largest eigenvalue, \lambda_i, of \Sigma_{XX}. The eigenvectors are orthonormal to each other, i.e. v_i^\text{T}v_j=\delta_{ij} (\delta_{ij} is the Kronecker delta). I_r is the identity matrix.

From the structures of the matrices A and B, it can be observed that they have rank t. Therefore, the best rank-t approximation of the original random vector X is \tilde X:

X\approx \tilde X= \mu_x+C(X-\mu_x)\qquad \text{s.t }C=AB=AA^\text T \in \mathbb{R}^{r\times r}\tag{3a}

or:

\tilde X= \mu_x+\sum_{i=1}^{t}(v_iv_i^\text T)(X_c)\qquad \text{s.t }X_c=X-\mu_x\tag{3b}

Remark: \text E\big[X\big]=\text E\big[\tilde X\big]=\mu_x, and \text{Var}(\tilde X)=C\Sigma_{XX}C^\text T= C\text{Var}(X) C^\text T \in \mathbb{R}^{r\times r}.

The covariance matrix \Sigma_{XX} \in \mathbb{R}^{r\times r} is real and symmetric, therefore it has a spectral decomposition \Sigma_{XX}=V \varLambda V^\text T s.t. V \in \mathbb{R}^{r\times r} is an orthogonal matrix (a matrix with ortho-normal columns/rows) whose columns are the normalized eigenvectors of \Sigma_{XX}, and \varLambda is a diagonal matrix having the eigenvalues \lambda_i of \Sigma_{XX} on its diagonal. The eigenvectors in V are sorted such that the j-th column of V is associated with the j-th largest eigenvalue. Having said that, we observe that the best rank-t approximation of the original random vector only uses the first t eigenvectors of the random vector’s covariance matrix.

The minimum of Eq. 1 is:

\sum_{i=t+1}^{r} \lambda_i \tag{4}

Principal components (PC’s) of X: the first t principal components (also known as Karhunen-Loeve transformation) of X are defined by the following linear projections:

\xi^{(t)}=A^{\text T}X\ \in \mathbb{R}^{t\times 1} \implies\xi_i^{(t)}=v_i^{\text T}X\qquad i=1,\dots ,t\le r\tag{5}

The superscript (t) is used to indicate the dimension of the PC vector, being equal to the number of PC’s.

Note that PC’s are themselves random variables as they are linear combination of the data random variables.

PC’s of X are uncorrelated: the covariance between each pair of the principal components is:

\text {Cov}(\xi_i,\xi_j)=\text {Cov}(v_i^{\text T}X,v_j^{\text T}X)= v_i^{\text T}\Sigma_{XX}v_j=\lambda_j v_i^{\text T}v_j=\delta_{ij}\lambda_j\\ \tag{6}

Interestingly, \text {Cov}(\xi_i,\xi_j)=0\ \text{ for } i\neq j. This means that the principal components of X are uncorrelated. Moreover, \text {Cov}(\xi_i,\xi_i)= \text {Var}(\xi_i)=\lambda_i, so \text {Var}(\xi_i)\ge  \text {Var}(\xi_j) for i\lt j.

A goodness-of-fit measure: Eq. 5 maps the r-dimensional data (X) into the lower, t-dimensional data \xi. The total variance of the original data, i.e. the X_i collected in X, is \sum_{i=1}^{r}\text{Var}(X_i)=\text{Trace}(\Sigma_{XX})= \sum_{i=1}^{r}\lambda_i. Based on this, a measure of how well the first t principal components (in the lower t-D space) represent the r original data variables is considered as (also note Eq. 4 and 6):

\frac{\sum_{i=t+1}^{r} \lambda_i}{\sum_{i=1}^{r} \lambda_i}=\frac{\text{total variance of }X_i\text{\textquoteright s}-\text{total variance of }\xi_i\text{\textquoteright s}}{\text{total variance of }X_i\text{\textquoteright s}}\tag{7}

This implies that the measure is small if the first t principal components contain a large part of the total variation in X; hence, better approximation.
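
As a quick added illustration of Eq. 7, suppose r=3 and the eigenvalues of \Sigma_{XX} are \lambda_1=5,\ \lambda_2=3,\ \lambda_3=2 (total variance 10). Keeping t=2 principal components gives

\frac{\sum_{i=3}^{3}\lambda_i}{\sum_{i=1}^{3}\lambda_i}=\frac{2}{10}=0.2

i.e. the first two PC’s retain 80\% of the total variation in X.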

Remark: if t=r, Eq. 7 becomes zero, which means a perfect match, i.e. no approximation. In other words, t=r means A=V (V as in the decomposition \Sigma_{XX}= V \varLambda V^\text T); then, because V is orthogonal, C=VV^\text T=I_r, hence (Eq. 3) \tilde X= \mu_x+I_r(X-\mu_x)=X.

PCA as a variance-maximization method

The main aim of PCA is to reduce the dimension of the data. The way a dinosaur would do a sort of reduction on a data vector is to keep one component and eat all the rest. It is, of course, a nonsense reduction, as many pieces of valuable information are lost. Another way is to take a simple average r^{-1}\sum_{i=1}^{r}X_i. This approach is still undesirable because it treats all components as equally important (equal weights). Another approach is to do a weighted average with some constraint on the weights, like \sum_{i=1}^{r}w_iX_i such that \sum_{i=1}^{r}w_i^2=1. In matrix notation and using the Euclidean vector norm:

\sum_{i=1}^{r}w_iX_i=w^\text TX \quad \text{ s.t }\quad \sum_{i=1}^{r}w_i^2=w^\text T w=\|w\|^2=1\qquad w_i\in \mathbb{R}

This constrained weighted average is called a standardised linear combination (SLC). However, the problem is how to uniquely and meaningfully determine the weights. PCA chooses the weights in order to maximize the variance of the SLC of the variables X_i, or, equivalently, to maximize the variance of the linear projection of the random vector, i.e. w^\text T X\in\mathbb{R}. The SLC is in fact the projection of the vector X onto the direction described by the weight vector. In mathematical notation, we want to solve the following maximization problem:

\max_{\{w:\|w\|=1\}}\text{Var}(w^\text TX)=\max_{\{w:\|w\|=1\}}w^\text T\text{Var}(X)w=\max_{\{w:\|w\|=1\}}w^\text T\Sigma_{XX}w\\ \ \\ w\in \mathbb{R}^{r\times 1},\quad X\in \mathbb{R}^{r\times 1},\quad \Sigma_{XX}\in \mathbb{R}^{r\times r} \ \tag{8a}

In other words,

v=\argmax_{\{w:\|w\|=1\}}\text{Var}(w^\text TX)\tag{8b}

Principal components of X are defined based on this maximization. The 1st PC of X, denoted by \xi_1, is the SLC (of the variables) with the highest variance. This is obtained by the maximization problem expressed by Eq. 8. \text{PC}_1 then has the highest variance among all SLC’s of the variables/projections of X:

v_1=\argmax_{\{w:\|w\|=1\}}\text{Var}(w^\text TX)\qquad v_1\in \mathbb{R}^{r\times 1} \\ \ \\ \text {PC}_1:=\xi_1:=v_1^\text TX\qquad \xi_1\in\mathbb{R}\ \tag{9}

v_1 is called the 1st principal direction (PD) or principal axis.

The 2nd PC of X, denoted by \xi_2 is the SLC (of the variables) with the highest variance such that the maximizing weight vector,v_2 is also subjected to the orthogonality constraint v_2^\text T v_1=0. In a mathematical notation, the 2nd PC is obtained by the following maximization:

v_2=\argmax_{\{w:\|w\|=1 \text{ and }w^\text Tv_1=0\}}\text{Var}(w^\text TX) \\ \ \\ \text {PC}_2:=\xi_2:=v_2^\text TX\qquad \ \tag{10}

by the same token, the i-th PC and i-th PD or principal axis are defined by the following:

v_i=\argmax_{\{w:\|w\|=1\text{ and }w^\text T v_j=0 \text{ for all } j=1,\dots,i-1\}}\text{Var}(w^\text TX) \\ \ \\ \text {PC}_i:=\xi_i:=v_i^\text TX\qquad \ \tag{11}

It can be proved (see the references) that the maximizing weight vectors are the normalized eigenvectors of \Sigma_{XX} associated with its eigenvalues sorted in the decreasing order, i.e.:

\Sigma_{XX}=V\varLambda V^\text T\ \text{ s.t. } i\lt j \implies \lambda_{i}\ge \lambda_{j}\\ \ \\ \text{i.e. }\lambda_1\ge \lambda_2\ge \dots \ge\lambda_r \ge 0

Therefore:

\text{cov}(\xi_i,\xi_j)=\text {Cov}(v_i^\text TX,v_j^\text TX)=v_i^\text T\Sigma_{XX} v_j=\lambda_j v_i^{\text T}v_j=\delta_{ij}\lambda_j\\ \ \\ \implies \text{Var}(\xi_1)\ge \text{Var}(\xi_2)\ge\dots\ge\text{Var}(\xi_r)\ge0

which means that the PC’s not only have the maximum variances among all possible SLC’s of the variables (all possible projections of the random vector), but are also mutually uncorrelated.

In a matrix form, we can write:

\xi=V^\text TX\qquad \xi,X\in\mathbb{R}^{r\times 1},\ V^\text T\in\mathbb{R}^{r\times r}\tag{12}

where X and \xi are the random vector and its linear projection vector containing the PC’s of X respectively.

It is readily concluded that:

\text{Var}(\xi)=\text{Var}(V^\text TX)=V^\text T\Sigma_{XX}V=\varLambda,\quad \text{i.e. } (\text{Var}(\xi))_{ij}=\delta_{ij}\lambda_j

PCA in practice: Sample PCA

In practice, the PC’s are to be estimated out of n independent observations of the random r-vector X of data. This requires estimating the mean and covariance matrix of the random vector of data, using the sample mean and sample covariance.

Here, we use the notation \bold X_i\in \mathbb{R}^{r\times 1} for the i-th independent trial to observe the random vector X.

Note: we denote an observation vector with a capital letter \bold X_i to indicate that it is still a random vector (out of the same population and through independent trials). To indicate a particular realization/value of \bold X_i we shall use small letter (by convention).

We collect the sequence of n independent observations of the random vector X in a set as:

\{\bold X_i, i=1,2,\dots,n\}\ \text{ s.t }\bold X_i\in\mathbb{R}^{r\times 1}\tag{13}

The estimate for \mu_X is given by:

\hat \mu_X=\bold{\bar X}:=\frac{1}{n}\sum_{i=1}^{n}\bold X_i\tag{14}

Let \bold X_{ci}=\bold X_i-\bold{\bar X} for the n observations of the random vector, and let \chi_c=[\bold X_{c1},\dots,\bold X_{cn}]\in \mathbb{R}^{r\times n} be the matrix whose columns are the centred observations. Then, the estimate for the covariance matrix is:

\hat\Sigma_{XX}=\frac{1}{n}\chi_c\chi_c^\text T\tag{15}

The ordered eigenvalues and their associated eigenvectors of \hat\Sigma_{XX} then estimate the corresponding population parameters. This leads to writing:

\hat\lambda_1\ge\hat\lambda_2\ge\dots\ge\hat\lambda_r\ge 0\\ \ \\ \hat A=(\hat v_1,\dots,\hat v_t)=\hat B^\text T\tag{16}

In practice, the vectors of independent observations are collected as columns (or rows) of a matrix (just re-writing Eq. 13). Let’s call this matrix the matrix of observations. Here, each column represents an observation of X:

\bold X=\begin{bmatrix}|&|&\dots &|\\ \bold X_1 & \bold X_2 &\dots &\bold X_n \\|&|&\dots &| \end{bmatrix}\in \mathbb{R}^{r\times n}\tag{17}

Recalling Eq. 5 or Eq. 12, the PC’s can be calculated for each column of the matrix of observations. In this regard, we obtain estimates of the PC’s of X:

\hat\xi_{ij}^{(t)}=\hat v_j^\text T\bold X_i\ \in\mathbb{R}\\ j=1,\dots,t\ \text{ and }\ i=1,\dots,n

This is the estimate of the j-th PC of X calculated based on the i-th observation \bold X_i. The term \hat\xi_{ij}^{(t)} is called the score for the i-th observation on the j-th PC.

For each observation, we can write the above in a more compact form:

\hat\xi_{i}^{(t)}=\hat A^\text T\bold X_i\ \in\mathbb{R}^{t\times 1}\\ i=1,\dots,n\tag{18}
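
The sample PCA above can be sketched directly with a general linear-algebra library. The following is a minimal, illustrative sketch of Eqs. 14-18 assuming the Breeze library (breeze.linalg); the function name samplePCA and the variable names are only placeholders, and the observation matrix follows Eq. 17 (one observation per column).

import breeze.linalg._

// X: r x n matrix of observations (each column is one observation), t: number of PC's to keep
def samplePCA(X: DenseMatrix[Double], t: Int)
  : (DenseVector[Double], DenseMatrix[Double], DenseMatrix[Double]) = {
  val n = X.cols
  val xBar = sum(X(*, ::)) / n.toDouble                 // sample mean vector, Eq. 14
  val Xc = X(::, *) - xBar                              // centred observations (mean subtracted from each column)
  val sigmaHat = (Xc * Xc.t) / n.toDouble               // sample covariance matrix, Eq. 15
  val es = eigSym(sigmaHat)                             // eigen-decomposition; eigenvalues in ascending order
  val top = (sigmaHat.rows - 1) to (sigmaHat.rows - t) by -1 // indices of the t largest eigenvalues
  val lambdaHat = DenseVector(top.map(es.eigenvalues(_)).toArray)
  val aHat = DenseMatrix.horzcat(top.map(i => es.eigenvectors(::, i).toDenseMatrix.t): _*) // Eq. 16
  val scores = aHat.t * X                               // t x n matrix of scores, Eq. 18 (one column per observation)
  (lambdaHat, aHat, scores)
}

// usage sketch: val (lambdaHat, aHat, scores) = samplePCA(observationMatrix, 2)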

References

– Modern Multivariate Statistical Techniques, A. J. Izenman.
– Applied Multivariate Statistical Analysis, W. Hardle, L. Simar.

Although mostly paraphrased, this post might contain copyrighted sentences of the references.

Scala Fundamentals (1)

1- Block expression

The intermediate results (vals/vars) in a block expression are local. Example:

val firstName = "Polar"
def name: String = {
  println("This is a block")
  val title = "Mr."
  val lastName = "Bear"
  title + " " + firstName + " " + lastName
}
println(name)
println(title)

Output:
firstName: String = Polar
name: String
This is a block
Mr. Polar Bear
Error:(9, 17) not found: value title
println(title)

Block expressions are useful for evaluating intermediate expressions and assigning the result to a val/var.

2- Objects and classes

Everything is a kind of object in Scala; even primitive types such as Int and Boolean are objects. Also, the program code to be run is an object and is defined in the form/pattern of an object. All objects can have methods and fields (attributes). Methods of an object give it functionality to do tasks or computations on data. Fields are for storing data. Every object is an instance of a class.

A class is an abstract template for creating objects that have similar methods and fields. A class also defines/creates a type (of objects), and therefore objects of the same class have the same type. A class is not a value (unlike functions and objects, which are values) and it lives in its own namespace. Scala has two namespaces: the namespace of values and the namespace of types. Classes live in the type namespace.

The functionality of an object (if it has one) can be called like a function. A function is a first class value whereas a method belongs to an object. For example:

object addNumbers{
  def addEm (a: Int, b: Int) = println(a + b)
}
addNumbers.addEm(1,2)

// compare with

object addNumbers{
  def apply (a: Int, b: Int) = println(a + b)
}
addNumbers(1,2)

Function-like application of any object is available by using the apply method.

An executable Scala program is an object with an entry point, i.e. the main method. An object can be made executable by either adding a main method or making the object extend the App trait:

object MyProgram{  // this is a singleton object
  import ... //import libraries
   def main(args : Array[String]) : Unit = {
     // Array[String] are used for the command line arguments
     // body of the main program to do stuff
   }
}

// or

object MyProgram extends App {
   // body of the main program to do stuff
}

1- Singleton object

A singleton object is unique; we cannot create other values (instances) of it. A singleton object belongs to a class that has only one instance. As can be seen below, the type of a singleton object is the object itself. The term Box$@3ae78a4c indicates the type @ a unique (reference) identifier of the object. All objects have unique reference identifiers.

object Box
res0: Box.type = Box$@3ae78a4c

2- Object (normal)

An object should belong to a predefined class. Example:

class Animal(val kind: String, val location: String) {
  def loc = println(kind + " lives in " + location)
}

val polarbear_1 = new Animal("Bear", "north pole")
polarbear_1.loc
polarbear_1.kind

// results
defined class Animal
polarbear_1: Animal = Animal@669cb6c7

Bear lives in north pole
res2: String = Bear

The class Animal creates a new type Animal that can be used like any other type. It is the type of the objects instantiated from the class.

3- Companion Objects

A companion object of a class is a singleton object with the same name as the class (and defined in the same file as the class). The class is also called the companion class of the object. These two live in different namespaces, i.e. the value and type namespaces. It is then important to note that the companion object is not an instance of its companion class. One application is to have the (singleton) companion object manage and implement the constructors of an instance (object) of the companion class. In this case the companion object is an auxiliary tool. For example:

class Line(val point_1:Double, val point_2:Double){
  def printPoints = println(point_1,point_2)
}
val line_1 = new Line(0.0, 1.0) // Line is the name of the class
line_1.printPoints

// Compare with

class Line(val point_1:Double, val point_2:Double){
  def printPoints = println(point_1,point_2)
}

object Line{
  def apply(point_1:Double, point_2:Double): Line = new Line(point_1,point_2)
}

val line_1 = Line(0.0,1.0)  // Line is the name of the object 
line_1.printPoints

4- Case classes

Case classes are a useful shorthand for defining a class, its companion object, and several useful features at once. They are ideal for lightweight data-holding classes. Defining a case class is like defining a class, but with the keyword case added; the val keyword for the constructor parameters is optional. Instantiating a case class also does not need the new keyword, as a companion object with an apply method is automatically generated for creating instances of the case class. Example:

case class Animal (kind:String, location:String){
  def loc = println(kind + " lives in " + location)
}

// instantiate a case class using its class
val polarbear_1 = new Animal("polar bear", "north pole")
polarbear_1.loc

// instantiate a case class using its companion object automatically built
val polarbear_2 = Animal("polar bear", "north pole")
polarbear_2.loc

Features of case classes

1- Sensible equals and hashCode methods acting on the field values of the objects rather than on the reference identity. Example:

polarbear_1 == polarbear_2
// result
res2: Boolean = true // but two different values

polarbear_1.eq(polarbear_2)  //eq method compares the reference ids
//result
res2: Boolean = false

2- Copy method

val polarbear_3 = polarbear_1.copy()  // it is a shallow copy
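
The copy method also accepts named arguments to change selected fields while keeping the rest unchanged; a small added example with the case class defined above:

val polarbear_4 = polarbear_1.copy(location = "a zoo")
// result
polarbear_4: Animal = Animal(polar bear,a zoo)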

3- Pattern matching using case classes. A pattern match is like an extended if.

To match against the constructor values. Example:

case class Animal (kind:String, location:String){
}

object inspectAnimal{
  def inspect(anAnimal: Animal): String =
    anAnimal match {  // it reads: check if anAnimal matches a case and then output as instructed
      case Animal("bear", "north pole") => "this is a polar bear"
      case Animal("bear", "Banff") => "this is a Grizly or black bear"
      case Animal(_, "Banff") => "this might be a bear" 
      case Animal(its_kind,its_location) => s"this is a/an $its_kind living in $its_location"
		//binds its_kind, its_location to the constructor values    
    }
  
}

val animal_1 = Animal("bear", "north pole")
val animal_2 = Animal("Marmot", "Banff")
val animal_3 = Animal("cat", "my house")

inspectAnimal.inspect(animal_1) // result: this is a polar bear
inspectAnimal.inspect(animal_2) // result: this might be a bear
inspectAnimal.inspect(animal_3) // result: this is a/an cat living in my house

To match against the type, see https://docs.scala-lang.org/tour/pattern-matching.html; a brief sketch is also given below.
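
A brief, illustrative sketch of matching against the type (the Shape trait and its subclasses here are hypothetical, in the spirit of the linked page):

sealed trait Shape
case class Circle(radius: Double) extends Shape
case class Square(side: Double) extends Shape

def describe(shape: Shape): String =
  shape match {
    case c: Circle => s"a circle of radius ${c.radius}"  // matches on the runtime type and binds c
    case sq: Square => s"a square of side ${sq.side}"
  }

describe(Circle(1.0)) // result: a circle of radius 1.0
describe(Square(2.0)) // result: a square of side 2.0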

5- Abstract class

3- Tips with objects and classes

1- Overloading the methods (overloaded methods)

Instead of creating several methods with different names that do the same task but receive different numbers of arguments and/or different types of parameters, we can create methods with the same name but different parameter lists and let Scala decide on the proper version of the method. Method overloading is applicable to classes and singleton objects. Example:

object doSomething{
  def sumNumbers(a:Int, b:Int): Unit = println(a + b)
  def sumNumbers(a:Double, b:Double): Unit = println(a + b)
  def sumNumbers(a:Int, b:Int, c:Int): Unit = println(a + b + c)
}

doSomething.sumNumbers(2,3) // 5
doSomething.sumNumbers(12.345,34.567) // 46.912
doSomething.sumNumbers(1,2,7) // 10

2- Auxiliary constructor

Other than the primary constructor of a class, auxiliary constructors can also be utilized. Auxiliary constructors overload the constructor. To do this, the following should be considered:

  • Auxiliary constructors are defined by def keyword.
  • Like Method Overloading, all auxiliary constructors should use the same name this.
  • Each auxiliary constructor must have a different parameter list.
  • Each auxiliary constructor must first call a previously defined constructor: the primary constructor or an earlier auxiliary constructor. This is performed using the name this.

Nevertheless, constructor overloading can also be implemented by setting default values for the primary constructor parameters. A short sketch of auxiliary constructors is given below.
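
As a small illustrative sketch (the Line class mirrors the earlier companion-object example; the chosen default values are just assumptions for demonstration):

class Line(val point_1: Double, val point_2: Double) {
  // auxiliary constructor: only one point given, the other defaults to 0.0
  def this(point_2: Double) = this(0.0, point_2)  // must call the primary (or an earlier auxiliary) constructor via this
  // auxiliary constructor: no points given
  def this() = this(0.0)                          // calls the previous auxiliary constructor
  def printPoints = println(point_1, point_2)
}

val line_a = new Line(0.0, 1.0) // primary constructor
val line_b = new Line(1.0)      // auxiliary constructor: Line(0.0, 1.0)
val line_c = new Line()         // auxiliary constructor: Line(0.0, 0.0)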

4- Traits (Subtyping)

Classes are abstractions over objects while traits are abstractions over classes; in other words, traits are templates for classes. Traits allow us to define a supertype for several classes, which are its subtypes. This supertype is in addition to the Any supertype that all classes share. Classes under a trait can share the implementation of the same operations. A trait creates an interface that any subtype, i.e. class, must implement (comply with).

Compared with a class, a trait does not/cannot logically have a constructor. Classes (having constructors) can be created from a trait, and objects from the classes. A trait can define abstract fields and methods that have names and type signatures but no (concrete) implementation. The implementation is provided when defining an extending class under the trait. Implementing concrete methods in a trait based on its abstract methods is possible. Whatever abstract methods and fields are defined in a trait have to be implemented in the extending class(es). A concrete (default) implementation of any method (or field) in a trait is possible as well; however, it must be made sure that the method is valid in all the extending classes (subtypes). Otherwise, the method (or field) should be overridden in the subtype.

Abstract fields in a trait can be defined using either val or def. However, it is recommended to use def because def is a generalization of val in Scala. If an abstract field is defined by def, a val can be used for its implementation in the extending class.

Abstract methods (or fields) can have completely different implementations in different subtypes.

Example: Animals can be wild or pet. Instead of creating two separate non-connected (in terms of sharing fields and methods) classes, we create two extending classes under a trait.

trait Animal {
  def kind: String  // can also be val kind: String
  def location: String
  def inspect: Unit = println(s"$kind lives in $location")
  def id: Unit
}

case class Wild (kind: String, location: String, idNumber: Int) extends Animal {
  def id: Unit = print(s"The wild animal id is: $idNumber")
}

case class Pet (kind: String, location: String, idNumber: Int) extends Animal{
  def id: Unit = print(s"The pet id is: $idNumber")
  
}

val bear_1 = Wild("bear", "north pole", 1212123)
val cat_1 = Pet ("Cat", "my house", 123456)
bear_1.inspect
bear_1.id
cat_1.inspect
cat_1.id
//results
/*
defined trait Animal

defined class Wild

defined class Pet

bear_1: Wild = Wild(bear,north pole,1212123)

cat_1: Pet = Pet(Cat,my house,123456)

bear lives in north pole
The wild animal id is: 1212123
Cat lives in my house
The pet id is: 123456
*/

Scala with IntelliJ IDEA

1- Defining a package object

  1. On the project side bar right click on Scala folder –> new –> package.
  2. name the package custom.methods.
  3. Right click on custom.methods –>new –> package object

The package object is now ready.

package custom
package object methods {
// define your custom classes, case classes, objects, methods here
}
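
As a small sketch of how the package object can then be used (the greet method here is just a hypothetical placeholder):

// in custom/methods/package.scala
package custom
package object methods {
  def greet(name: String): String = s"Hello, $name"
}

// anywhere else in the project
import custom.methods._
println(greet("Polar Bear")) // prints: Hello, Polar Bear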

Miscellaneous Proofs for SSM (1)

1) The sum of the squared of distances is minimized by the vector means

Prove \hat\mu=\frac{1}{n}\sum_{i=1}^{n}X_i=\argmin_{\mu}\sum_{i=1}^{n} \|X_i-\mu\|^2. X_i \text{ and } \mu are vectors and \|.\| is a vector norm coming from the inner product defined on the vector space; \|v\|^2=\lang v,v \rang.

Proof:

\begin{aligned} \argmin_{\mu} \sum_{i=1}^{n} \|X_i-\mu\|^2=\argmin_{\mu}\sum_{i=1}^{n} \lang X_i-\mu , X_i-\mu \rang \\ =\argmin_{\mu}\sum_{i=1}^{n} \big(\lang X_i,X_i \rang -2\lang X_i,\mu \rang + \lang\mu,\mu \rang\big) \\ \text{as } \lang X_i,X_i\rang \text{ is a constant not depending on } \mu \\ =\argmin_{\mu}\sum_{i=1}^{n} \big(-2\lang X_i,\mu\rang + \lang \mu,\mu \rang\big)\\ =\argmin_{\mu}\big(n\lang \mu,\mu \rang-2 \sum_{i=1}^{n} \lang X_i,\mu \rang\big)\\ \text{by the linearity of the inner product}\\ =\argmin_{\mu}\big(n\lang \mu,\mu \rang -2\lang\sum_{i=1}^{n} X_i,\mu\rang\big)\\ =\argmin_{\mu}\big(n\lang \mu,\mu \rang -2n\lang\frac{1}{n}\sum_{i=1}^{n} X_i,\mu\rang\big)\\ =\argmin_{\mu}\big(n\lang \mu,\mu \rang -2n\lang \overline{X},\mu\rang\big)\\ \text{dividing by the constant } n \text{ won’t affect the }\argmin\\ =\argmin_{\mu}\big(\lang \mu,\mu \rang -2\lang \overline{X},\mu\rang\big)\\ \text{adding the constant } \lang \overline{X},\overline{X} \rang \text{ won’t affect the } \argmin\text{, therefore:}\\ =\argmin_{\mu}\big(\lang \mu,\mu \rang -2\lang \overline{X},\mu\rang +\lang \overline{X},\overline{X} \rang\big)\\ =\argmin_{\mu}\big(\lang \mu - \overline{X}, \mu - \overline{X}\rang \big)\\ =\argmin_{\mu}\big(\| \mu - \overline{X}\|^2 \big) \end{aligned}

Because \|.\| is greater than or equal to zero, the minimum of \| \mu - \overline{X}\|^2 is zero, attained at \mu=\overline{X}; hence:

\hat\mu=\argmin_{\mu}(\| \mu - \overline{X}\|^2 )=\overline{X}\qquad\checkmark

In a case that X_i and \mu are matrices, the norm is the Frobenius norm and the inner product is the Frobenius inner product.

2) Proof regarding the full GPA method

Proving the following:

\inf_{\beta_i,\Gamma_i,\gamma_i}\sum_{i=1}^{n} \|(\beta_i X_i\Gamma_i+1_k\gamma_i^\text{T}) - \frac{1}{n} \sum_{j=1}^{n} (\beta_j X_j\Gamma_j+1_k\gamma_j^\text{T})\|^2\\ =\\ \inf_{\beta_i,\Gamma_i,\gamma_i}\frac{1}{n}\sum_{i=1}^{n} \sum_{\underset{{j\lt n}}{j=i+1}}^{n}\|(\beta_i X_i\Gamma_i+1_k\gamma_i^\text{T}) - (\beta_j X_j\Gamma_j+1_k\gamma_j^\text{T})\|^2

The “inf” operator is not restated in the proof. Let A_i:=\beta_i X_i\Gamma_i+1_k\gamma_i^\text{T}. Therefore:

\sum_{i=1}^{n} \|A_i - \frac{1}{n} \sum_{j=1}^{n} A_j\|^2 =\frac{1}{n^2}\sum_{i=1}^{n} \|nA_i - \sum_{j=1}^{n} A_j\|^2\tag{I}

The Frobenius norm comes from the Frobenius inner product, i.e.  \forall A\in \mathbb{R}^{k\times m}\  ,\|A\|^2=\lang A,A \rang. Therefore, above continues as:

=\frac{1}{n^2}\sum_{i=1}^{n} \lang nA_i - \sum_{j=1}^{n} A_j,nA_i - \sum_{j=1}^{n} A_j\rang

Inner product is a bi-linear form and hence by its linearity and commutative properties:

\begin{aligned} =\frac{1}{n^2}\sum_{i}( \lang nA_i - \sum_{j} A_j,nA_i\rang - \lang nA_i - \sum_{j} A_j,\sum_{j} A_j\rang)\\ =\frac{1}{n^2}\sum_{i}( \lang nA_i,nA_i\rang - \lang nA_i, \sum_{j} A_j\rang- \lang \sum_{j} A_j,nA_i\rang \\+ \lang \sum_{j} A_j,\sum_{j} A_j\rang)\\ =\frac{1}{n^2}\sum_{i}( n^2\lang A_i,A_i\rang - 2n\lang A_i, \sum_{j} A_j\rang + \lang \sum_{j} A_j,\sum_{j} A_j\rang) \end{aligned}

Again by applying the linearity property of inner product on the finite sums:

=\frac{1}{n^2}\sum_{i}( n^2\lang A_i,A_i\rang - 2n\sum_{j}\lang A_i, A_j\rang + \sum_{l}\sum_{j}\lang A_l, A_j\rang)

Applying the \sum_{i}:

\begin{aligned} =\frac{1}{n^2}\sum_{i}( n^2\lang A_i,A_i\rang - 2n\sum_{j}\lang A_i, A_j\rang + \sum_{l}\sum_{j}\lang A_l, A_j\rang)\\ =\frac{1}{n^2}(n^2\sum_{i}\lang A_i,A_i\rang - 2n\sum_{i}\sum_{j}\lang A_i, A_j\rang\\ + \sum_{i}\sum_{l}\sum_{j}\lang A_l, A_j\rang)\\ =\frac{1}{n^2}(n^2\sum_{i}\lang A_i,A_i\rang - 2n\sum_{i}\sum_{j}\lang A_i, A_j\rang + n\sum_{l}\sum_{j}\lang A_l, A_j\rang)\\ \text{As the last two double-sigma terms are identical:}\\ =\frac{1}{n}(n\sum_{i}\lang A_i,A_i\rang - \sum_{i}\sum_{j}\lang A_i, A_j\rang )\tag{II} \end{aligned}

The rest can be done by induction, but for simplicity we continue by writing out a number of terms to observe the pattern and reach the final result. With \|A-B\|^2=\lang A,A\rang -2\lang A,B\rang + \lang B,B\rang, and letting n=4, the expansion of the term in the parentheses of the last equation becomes:

(4-1)\lang A_1,A_1\rang+(4-1)\lang A_2,A_2\rang+(4-1)\lang A_3,A_3\rang+(4-1)\lang A_4,A_4\rang\\ -2\lang A_1,A_2\rang -2\lang A_1,A_3\rang-2\lang A_1,A_4\rang -2\lang A_2,A_3\rang -2\lang A_2,A_4\rang\\ -2\lang A_3,A_4\rang\\ =\lang A_1,A_1\rang -2\lang A_1,A_2\rang +\lang A_2,A_2\rang\\ +\lang A_1,A_1\rang -2\lang A_1,A_3\rang +\lang A_3,A_3\rang\\ +\lang A_1,A_1\rang -2\lang A_1,A_4\rang +\lang A_4,A_4\rang\\ +\lang A_2,A_2\rang -2\lang A_2,A_3\rang +\lang A_3,A_3\rang\\ +\lang A_2,A_2\rang -2\lang A_2,A_4\rang +\lang A_4,A_4\rang\\ +\lang A_3,A_3\rang -2\lang A_3,A_4\rang +\lang A_4,A_4\rang\\ \begin{aligned} =\|A_1-A_2\|^2+\|A_1-A_3\|^2+\|A_1-A_4\|^2+\|A_2-A_3\|^2\\ +\|A_2-A_4\|^2+\|A_3-A_4\|^2 \\ =\sum_{i=1}^{4}\sum_{\underset{{j\lt 4}}{j=i+1}}^{4}\|A_i-A_j\|^2 \end{aligned}

Consequently, for any number of terms Eq. II becomes \frac{1}{n}\sum_{i=1}^{n}\sum_{\underset{{j\lt n}}{j=i+1}}^{n}\|A_i-A_j\|^2. Hence, recalling Eq. I, we complete the proof:

\sum_{i=1}^{n} \|A_i - \frac{1}{n} \sum_{j=1}^{n} A_j\|^2=\frac{1}{n}\sum_{i=1}^{n}\sum_{\underset{{j\lt n}}{j=i+1}}^{n}\|A_i-A_j\|^2\qquad\checkmark