# Tensors 1

#### Notations

Einstein summation convention is used here. A matrix is denoted as $M$ and its $ij$-th element is referred to by $M_{ij}$. Quantities or coefficients are indexed, for example, as $a_i$, $b^j$, or $T^i{}_j$. These indices do not automatically pertain to the row and column indices of a matrix, but the quantities can be represented by matrices through isomorphisms once their indices are interpreted as rows and columns of matrices.

## Coordinates of a vector

Let $V$ be an $n$-dimensional vector space and $\{e_i\}$ with $i=1,\dots,n$ be a basis for $V$. Then, we define the coordinate function $[\,\cdot\,]:V\to\mathbb{R}^n$ such that for a vector $v$ written by its components (with respect to $\{e_i\}$) as $v = v^i e_i$, the function acts as,

[v] = (v^1, v^2, \dots, v^n)^\text T

The coordinate function is a linear map.

## Change of basis for vectors

Let $\{e_i\}$ and $\{\tilde e_i\}$ be two bases for $V$; then,

\tilde e_j = F^i{}_j\, e_i

and

e_j = B^i{}_j\, \tilde e_i

where the indices of the scalar terms $F^i{}_j$ and $B^i{}_j$ are intentionally set this way. So, if all $F^i{}_j$ are collected into a matrix $F$ (with $i$ the row index and $j$ the column index), then the sum is over the rows of the matrix for a particular column. In other words, we can utilize the rule of matrix multiplication and write,

(\tilde e_1\ \cdots\ \tilde e_n) = (e_1\ \cdots\ e_n)\, F

The same is true for $B$. In the above formulations, note that $i$ is a dummy index (i.e. we can equivalently write $\tilde e_j = F^k{}_j\, e_k$).

Setting $\{e_i\}$ as the initial (old) basis and writing the current (new) basis $\{\tilde e_i\}$ in terms of $\{e_i\}$ is referred to as the forward transform, denoted by $F$. Relatively, $B$ is called the backward transform.

The relation between the forward and backward transforms is obtained as follows,

e_j = B^i{}_j\, \tilde e_i = B^i{}_j F^k{}_i\, e_k \implies F^k{}_i B^i{}_j = \delta^k_j \iff FB = BF = I \iff B = F^{-1}

We now find how vector coordinates are transformed relative to different bases. A particular $v\in V$ can be expressed by its components according to any of the $\{e_i\}$ or $\{\tilde e_i\}$ bases; therefore,

v = v^i e_i = \tilde v^i \tilde e_i

To find the relation between $v^i$ and $\tilde v^i$ we write,

v = v^j e_j = v^j B^i{}_j\, \tilde e_i = \tilde v^i \tilde e_i \implies \tilde v^i = B^i{}_j\, v^j

As can be observed, the old basis is transformed to the new basis by the forward transform $F$, while the old coordinates are transformed to the new ones, $\tilde v^i$, by the backward transform $B$. Because the coordinates of $v$ behave contrary to the basis vectors in transformation, the coordinates or the scalar components are said to be contravariant. A vector can be called a contravariant object because its scalar components (coordinates) transform differently from the basis vectors whose linear combination equals the vector. Briefly,

\tilde e_i = F^j{}_i\, e_j \qquad \text{and} \qquad \tilde v^i = B^i{}_j\, v^j

Proposition: Let $v = v^i e_i = \tilde v^i \tilde e_i$. Then, the scalar components/coordinates are transformed by $B$, i.e. $\tilde v^i = B^i{}_j v^j$, if and only if the basis vectors are transformed by $F$, i.e. $\tilde e_i = F^j{}_i e_j$, such that $B = F^{-1}$.

Later, a vector will be called a contravariant tensor. For the sake of notation, and to distinguish between the transformations of the basis and of the coordinates of a vector, the index of a coordinate is written as a superscript to show it is contravariant. Therefore, $v = v^i e_i$.
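These two transformation rules can be checked numerically. The following is a minimal numpy sketch (the basis and coordinates are arbitrary choices, not from the text): the columns of $F$ hold the new basis vectors expressed in the old basis, and the coordinates transform with $B=F^{-1}$ while the vector itself stays put.

```python
import numpy as np

# Forward transform: columns of F are the new basis vectors written in
# the old (here: standard) basis of R^2.
F = np.array([[2.0, 1.0],
              [0.0, 1.0]])
B = np.linalg.inv(F)          # backward transform, B = F^{-1}

v_old = np.array([3.0, 4.0])  # coordinates of v in the old basis

# Coordinates transform contravariantly, i.e. with the backward transform.
v_new = B @ v_old

# The vector itself is invariant: rebuilding it from the new coordinates
# and the new basis vectors recovers the same element of R^2.
assert np.allclose(F @ v_new, v_old)
```
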

## Linear maps and linear functionals

Definition: $L(V, W)$ is defined as the space of all linear maps $T: V \to W$, where the domain $V$ and codomain $W$ are vector spaces.

It can be proved that $L(V, W)$ is a vector space; hence, for $T, S \in L(V, W)$ and $v \in V$,

(T + S)(v) = T(v) + S(v)

Note that the addition on the LHS is an operator in $L(V, W)$ and the addition on the RHS is an operator in $W$.

Proposition 1: Let $T \in L(V, W)$, i.e. a linear map from a vector space $V$ to another one, $W$. If $\{e_i\}$ is a basis for $V$, and $T(e_i) = w_i$ for $w_i \in W$ and $i = 1,\dots,n$, then $T$ is uniquely defined over $V$.

This proposition says a linear map over a space is uniquely determined by its action on the basis vectors of that space. In other words, if $T(e_i) = S(e_i)$ for all $i$, then $T = S$. Proof: let $T(e_i) = w_i$ (given by the nature of $T$); then for $v \in V$ such that $v = v^i e_i$, we can write $T(v) = T(v^i e_i) = v^i T(e_i) = v^i w_i$; therefore, $T(v)$ is determined by the $w_i$'s. Because the $v^i$'s are unique for (a particular) $v$, then $v^i w_i$ is unique for $v$, and hence $T(v)$ must be unique for any $v \in V$. In other words, there is only one $T$ over $V$ such that $T(e_i) = w_i$.

As a side remark, if $\{e_i\}$ is a basis for $V$, hence spanning $V$, then $\{T(e_i)\}$ spans the range of $T$; the range of $T$ is a subspace of $W$.

By this proposition, a matrix completely determining a linear map can be obtained for the linear map. Let $V$ be $n$-dimensional with a basis $\{e_i\}$, and $W$ be $m$-dimensional with a basis $\{f_i\}$. Then there are $mn$ coefficients $T^i{}_j$ such that,

T(e_j) = T^i{}_j\, f_i

In the notation $T^i{}_j$, the index $i$ is a superscript because for a fixed $j$, and hence a fixed $T(e_j) \in W$, the term $T^i{}_j$ is the $i$-th coordinate of $T(e_j)$, and it is contravariant.

For $v \in V$ with $v = v^j e_j$, and $w = T(v) \in W$ with the coordinates $w^i$, we write,

w = T(v) = v^j\, T(e_j) = v^j T^i{}_j\, f_i \implies w^i = T^i{}_j\, v^j

This expression can be written as a matrix multiplication $[w] = [T][v]$, where $[T]$ is presented by its elements as $([T])_{ij} = T^i{}_j$.

As a remark, the $[T(e_j)]$ above can be viewed as columns of the matrix and written as,

[T] = \big(\,[T(e_1)]\ \cdots\ [T(e_n)]\,\big)

### Linear functional (linear form or covector)

Definition: a linear functional on $V$ is a linear map $\alpha: V \to \mathbb{R}$. The space $V^* := L(V, \mathbb{R})$ is called the dual space of $V$.

Proposition: Let $\{e_i\}$ be a basis for $V$ and $\epsilon^i \in V^*$ be defined as $\epsilon^i(e_j) = \delta^i_j$. Then $\{\epsilon^i\}$, called the dual basis of $\{e_i\}$, is a basis of $V^*$, and hence $\dim V^* = \dim V = n$.

Proof: first we show that the $\epsilon^i$'s are linearly independent, i.e. $c_i \epsilon^i = 0 \implies c_i = 0$ for all $i$. Note that on the RHS, $0$ is the zero functional. For a $v \in V$ we can write $v = v^j e_j$ and assume $c_i \epsilon^i = 0$. Then,

0 = c_i\, \epsilon^i(v) = c_i\, \epsilon^i(v^j e_j) = c_i v^j \delta^i_j = c_i v^i

Since $v$ is arbitrary (choose $v = e_k$ to get $c_k = 0$), the claim follows. ■

Now we prove that $\{\epsilon^i\}$ spans $V^*$, i.e. for every $\alpha \in V^*$ there are scalars $\alpha_i$ such that $\alpha = \alpha_i \epsilon^i$. To this end, we apply both sides to a basis vector of $V$ and write $\alpha(e_j) = \alpha_i \epsilon^i(e_j) = \alpha_i \delta^i_j = \alpha_j$, which implies the coefficient is explicitly found as $\alpha_j = \alpha(e_j)$. Consequently, $\alpha = \alpha(e_i)\,\epsilon^i$. ■
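Numerically, the dual basis is easy to construct: if the columns of a matrix $E$ hold the basis vectors (in standard coordinates), the rows of $E^{-1}$ are the matrices of the dual basis covectors, since $E^{-1}E = I$ encodes $\epsilon^i(e_j) = \delta^i_j$. A small sketch (the basis here is an arbitrary choice):

```python
import numpy as np

# Columns of E are a (non-orthogonal) basis of R^2.
E = np.array([[1.0, 1.0],
              [0.0, 2.0]])

# Rows of E^{-1} are the dual basis covectors: row i applied to column j
# of E gives the Kronecker delta.
E_inv = np.linalg.inv(E)
assert np.allclose(E_inv @ E, np.eye(2))

# The i-th dual covector extracts the i-th coordinate w.r.t. the basis:
v = 3.0 * E[:, 0] + 4.0 * E[:, 1]   # v has coordinates (3, 4) in this basis
coords = E_inv @ v
```
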

Consider $\alpha \in V^*$ and $v = v^i e_i \in V$. If $\alpha = \alpha_i \epsilon^i$, then the matrix of the linear functional/map is the single-row matrix

[\alpha] = (\alpha_1\ \alpha_2\ \cdots\ \alpha_n)

So, for $v \in V$ as $v = v^i e_i$ we can write,

\alpha(v) = \alpha_i\, \epsilon^i(v^j e_j) = \alpha_i v^j \delta^i_j = \alpha_i v^i = [\alpha][v]

Result: if the coordinates of a vector are shown by a column vector or single-column matrix (which is a vector in the space $\mathbb{R}^{n\times 1}$), then a row vector or a single-row matrix represents the matrix of a linear functional.

Definition: a linear functional $\alpha \in V^*$, which can be identified with a row vector as its matrix, is also called a covector.

Like vectors, a covector (and any linear map) is a mathematical object that is independent of a basis (i.e. invariant). The geometric representation of a vector in a 2D space (or, by an isomorphism, in $\mathbb{R}^2$) is an arrow. For a covector on $\mathbb{R}^2$, the representation is a set (stack) of parallel iso lines; a covector on $\mathbb{R}^3$ can be represented by iso surfaces in $\mathbb{R}^3$.

Example: Let $\{e_1, e_2\}$ be a basis of $\mathbb{R}^2$ and $(\alpha_1\ \alpha_2)$ be the matrix of a covector $\alpha$ in this basis. Then, if $v = v^1 e_1 + v^2 e_2$, we can write,

\alpha(v) = \alpha_1 v^1 + \alpha_2 v^2 = c

which, for different values of $c$, is a set of (iso) lines in a Cartesian CS whose two axes $v^1$ and $v^2$ lie along the geometric representations of $e_1$ and $e_2$. The Cartesian axes are not necessarily orthogonal.

If we choose any other basis for $\mathbb{R}^2$, then the matrix of the covector changes. Also, the geometric representations of the new basis vectors are different from those of $e_1$ and $e_2$, so the axes change; the geometric representation of the covector itself stays the same.

Example: Let $\{e_1, e_2\}$ be a basis of $V$ and $\{\epsilon^1, \epsilon^2\}$ be the dual basis for $V^*$. This means $\epsilon^i \in L(V, \mathbb{R})$ and $\epsilon^i(e_j) = \delta^i_j$. Then, the matrix of each dual basis vector is as,

[\epsilon^1] = (1\ \ 0), \qquad [\epsilon^2] = (0\ \ 1)

### Change of basis for covectors

Let $\{e_i\}$ and $\{\tilde e_i\}$ be two bases for $V$, and hence, $\{\epsilon^i\}$ and $\{\tilde\epsilon^i\}$ be the corresponding two bases for $V^*$. Each new dual basis vector can be written in terms of the (old) dual basis vectors by using a linear transformation as $\tilde\epsilon^i = Q^i{}_j\, \epsilon^j$. Now, the coefficients $Q^i{}_j$ are to be determined as follows,

\delta^i_k = \tilde\epsilon^i(\tilde e_k) = Q^i{}_j\, \epsilon^j(\tilde e_k)

Using the formula regarding the change of basis of vectors, $\tilde e_k = F^l{}_k\, e_l$, the above continues as,

\delta^i_k = Q^i{}_j\, \epsilon^j(F^l{}_k\, e_l) = Q^i{}_j F^l{}_k\, \delta^j_l = Q^i{}_j F^j{}_k \implies Q = F^{-1} = B

This indicates that the dual basis vectors are transformed by the backward transformation. Referring to the index convention, we use a superscript for components that are transformed through a backward transformation. Therefore,

\tilde\epsilon^i = B^i{}_j\, \epsilon^j

meaning that dual basis vectors are contravariant because they behave contrary to the basis vectors in the transformation from $\{e_i\}$ to $\{\tilde e_i\}$.

Now let $\alpha \in V^*$. Writing $\alpha = \alpha_i \epsilon^i = \tilde\alpha_i \tilde\epsilon^i$ and using the above relation, we get,

\tilde\alpha_i = \alpha(\tilde e_i) = \alpha(F^j{}_i\, e_j) = F^j{}_i\, \alpha(e_j) = F^j{}_i\, \alpha_j

meaning that the components of covectors transform in a covariant manner when the basis of the vector space changes from $\{e_i\}$ to $\{\tilde e_i\}$.

Briefly, the following relations have been shown.

\tilde e_i = F^j{}_i\, e_j, \qquad \tilde v^i = B^i{}_j\, v^j, \qquad \tilde\epsilon^i = B^i{}_j\, \epsilon^j, \qquad \tilde\alpha_i = F^j{}_i\, \alpha_j

### Basis and change of basis for the space of linear maps

As can be proved, $L(V, W)$ is a linear vector space and any linear map in it is a vector. Therefore, we should be able to find a basis for this space. If $V$ is $n$-dimensional and $W$ is $m$-dimensional, then $L(V, W)$ is $mn$-dimensional and hence its basis should have $mn$ vectors, i.e. $mn$ linear maps. Let's enumerate the basis vectors of $L(V, W)$ as $\tau_i{}^j$ for $i = 1,\dots,m$ and $j = 1,\dots,n$; then any linear map $T$ can be written as,

T = c^i{}_j\, \tau_i{}^j

By proposition 1, any linear map is uniquely determined by its action on the basis vectors of its domain. If $\{e_k\}$ is a basis for $V$, then for any basis vector $e_k$,

T(e_k) = c^i{}_j\, \tau_i{}^j(e_k)

Setting a basis for $W$ as $\{f_i\}$ and recalling $T(e_k) = T^i{}_k\, f_i$, the above equation becomes,

T^i{}_k\, f_i = c^i{}_j\, \tau_i{}^j(e_k)

Note that the repeated indices are dummy indices. This equation holds if,

\tau_i{}^j(e_k) = \delta^j_k\, f_i \qquad \text{and} \qquad c^i{}_j = T^i{}_j

Therefore, we can choose the maps $\tau_i{}^j$ defined above as a set of basis vectors for $L(V, W)$. By recruiting the dual basis $\{\epsilon^j\}$ of $V^*$, the above can be written as,

\tau_i{}^j(v) = \epsilon^j(v)\, f_i

The term $\epsilon^j(\cdot)\, f_i$ is obviously a linear map $V \to W$. It can be readily shown that the $\tau_i{}^j$'s are linearly independent, i.e. $c^i{}_j\, \tau_i{}^j = 0 \implies c^i{}_j = 0$ for all $i, j$ (here, note that $0$ is the zero map).

So, a linear map can be written as a linear combination $T = T^i{}_j\, \tau_i{}^j$. Here, it is necessary to use the index level convention. To this end, we observe that for a fixed $i$ the term $T^i{}_j$ couples with $\epsilon^j$ and represents the coordinates of a covector. As coordinates of a covector are covariant, the index $j$ is written as a subscript. For a fixed $j$ though, the term $T^i{}_j$ couples with $f_i$ and represents the coordinates of a vector. As coordinates of a vector are contravariant, the index $i$ should rise. Therefore, we write,

T = T^i{}_j\, f_i\, \epsilon^j \qquad \text{i.e.} \qquad T(v) = T^i{}_j\, \epsilon^j(v)\, f_i

The coefficients can be determined as,

T(e_k) = T^i{}_j\, \epsilon^j(e_k)\, f_i = T^i{}_k\, f_i

stating that the $T^i{}_k$'s are the coordinates of $T(e_k)$ with respect to the basis $\{f_i\}$ of $W$. Comparing with what was derived as the matrix of a linear map, we can conclude that these coefficients are exactly the elements of $[T]$. Therefore,

([T])_{ij} = T^i{}_j

The above result can also be derived directly by expanding $T(v)$ with $v = \epsilon^j(v)\, e_j$ and using linearity.

Change of basis of $L(V, W)$ is as follows.

For $T \in L(V, W)$, let $\{e_i\}$ and $\{\tilde e_i\}$ be bases for $V$, and $\{f_i\}$ and $\{\tilde f_i\}$ be bases for $W$. Also, $\{\epsilon^i\}$ and $\{\tilde\epsilon^i\}$ are the corresponding dual bases in $V^*$. Forward and backward transformation pairs in $V$ and $W$ are denoted as $(F, B)$ and $(G, H)$, with $B = F^{-1}$ and $H = G^{-1}$. Then,

\tilde T^i{}_j = H^i{}_k\, T^k{}_l\, F^l{}_j

Note that the coordinates of a linear map need two transformations, such that the covariant index of $T^i{}_j$ (the subscript) pertains to the forward transformation and the contravariant index (the superscript) pertains to the backward transformation.

Example: let $T \in L(V, V)$; then,

\tilde T^i{}_j = B^i{}_k\, T^k{}_l\, F^l{}_j

If the matrices $[T]$, $[\tilde T]$, $F$, and $B = F^{-1}$ are considered, we can write,

[\tilde T] = F^{-1}\, [T]\, F
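This similarity-transformation rule can be verified numerically. The following sketch uses a random invertible change of basis and a random map (both are placeholders, not from the text), and checks that acting on a vector commutes with the change of basis:

```python
import numpy as np

rng = np.random.default_rng(0)

T_old = rng.normal(size=(3, 3))   # matrix of T in the old basis
F = rng.normal(size=(3, 3))       # forward transform (invertible w.h.p.)
B = np.linalg.inv(F)

# Matrix of the same linear map in the new basis.
T_new = B @ T_old @ F

# Consistency check: transform a vector's coordinates, apply T_new, and
# compare with applying T_old first and transforming the result.
v_old = rng.normal(size=3)
v_new = B @ v_old
assert np.allclose(T_new @ v_new, B @ (T_old @ v_old))
```
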

## Bilinear forms

A bilinear form is a bilinear map defined as $\mathcal{B}: V \times V \to \mathbb{R}$, linear in each argument. Setting a basis for $V$, a bilinear form can be represented by matrix multiplications on the coordinates of the input vectors. If $\{e_i\}$ is a basis for $V$, then

\mathcal{B}(u, v) = \mathcal{B}(u^i e_i,\ v^j e_j) = u^i v^j\, \mathcal{B}(e_i, e_j)

which can be written as,

\mathcal{B}(u, v) = [u]^\text T\, [\mathcal{B}]\, [v]

where $([\mathcal{B}])_{ij} = \mathcal{B}_{ij}$ with $\mathcal{B}_{ij} := \mathcal{B}(e_i, e_j)$.

The expression indicates that a bilinear form is uniquely defined by its action on the basis vectors. This is the same as what was shown for linear maps by proposition 1, and comes from the fact that a bilinear form is a linear map with respect to one of its arguments at a time.

Now we seek a basis for the space of bilinear forms on $V$. This is a vector space with addition and scalar multiplication defined pointwise, e.g. $(\mathcal{B} + \mathcal{C})(u, v) = \mathcal{B}(u, v) + \mathcal{C}(u, v)$.

The dimension of this space is $n^2$; therefore, for any bilinear form $\mathcal{B}$ there are $n^2$ basis bilinear forms such that,

\mathcal{B} = \mathcal{B}_{ij}\ \epsilon^i \epsilon^j \qquad \text{with} \qquad (\epsilon^i \epsilon^j)(u, v) := \epsilon^i(u)\, \epsilon^j(v)

From the result $\mathcal{B}(e_k, e_l) = \mathcal{B}_{ij}\, \delta^i_k \delta^j_l = \mathcal{B}_{kl}$ we can conclude that the coefficients are exactly the matrix elements derived above.

Following the index level convention, the indices of $\mathcal{B}_{ij}$ should stay as subscripts because each index pertains to the covariant coordinates of a covector after fixing the other index.

If $\{e_i\}$ and $\{\tilde e_i\}$ are two bases for $V$, then the change of basis in the space of bilinear forms is as follows.

\tilde{\mathcal{B}}_{ij} = F^k{}_i\, F^l{}_j\, \mathcal{B}_{kl} \qquad \text{i.e.} \qquad [\tilde{\mathcal{B}}] = F^\text T\, [\mathcal{B}]\, F

Example: the metric bilinear map (metric tensor)

Dot/inner product on the vector space $V$ over $\mathbb{R}$ is defined as a bilinear map $\langle\cdot,\cdot\rangle: V \times V \to \mathbb{R}$ that is symmetric, $\langle u, v\rangle = \langle v, u\rangle$, and positive definite, $\langle v, v\rangle \ge 0$ with equality iff $v = 0$. With this regard, two objects (that can have geometric interpretations for Euclidean spaces) are defined as,

1- Length of a vector: $\|v\| := \sqrt{\langle v, v\rangle}$
2- Angle between two vectors: $\cos\theta = \langle u, v\rangle / (\|u\|\,\|v\|)$

Let's see how the dot product is expressed through the coordinates of vectors. With $\{e_i\}$ being a basis for $V$, we can write,

\langle u, v\rangle = \langle u^i e_i,\ v^j e_j\rangle = u^i v^j\, \langle e_i, e_j\rangle = u^i v^j\, g_{ij}

The term $g_{ij} := \langle e_i, e_j\rangle$ is called the metric tensor, and its components can be presented by an n-by-n matrix as $([g])_{ij} = g_{ij}$.

If the basis is an orthonormal basis, i.e. $\langle e_i, e_j\rangle = \delta_{ij}$, then $[g]$ is the identity matrix. Therefore, $\langle u, v\rangle = u^i v^i = [u]^\text T [v]$ and $\|v\|^2 = [v]^\text T [v]$.
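The relation $\langle u, v\rangle = u^i v^j g_{ij}$ can be checked numerically. A sketch with an arbitrary non-orthonormal basis (chosen here for illustration): the coordinate formula with the metric must agree with the ordinary dot product of the reconstructed vectors.

```python
import numpy as np

# A non-orthonormal basis of R^2, as columns of E.
E = np.array([[1.0, 1.0],
              [0.0, 1.0]])

# Metric tensor components g_ij = <e_i, e_j> under the standard dot product.
g = E.T @ E

# Two vectors given by their coordinates u^i, v^i w.r.t. this basis.
u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])

# <u, v> = u^i v^j g_ij must agree with the standard dot product of the
# reconstructed vectors E @ u and E @ v.
inner = u @ g @ v
assert np.isclose(inner, (E @ u) @ (E @ v))
```
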

## Multilinear forms

A general multilinear form is a multilinear map defined as $T: V_1 \times \dots \times V_k \to \mathbb{R}$, where each $V_i$ is a vector space and $T$ is linear in each argument separately. Particularly, setting $V_1 = \dots = V_k = V$ leads to a simpler multilinear form $T: V \times \dots \times V \to \mathbb{R}$.

Following the same steps as shown for a bilinear map, a multilinear form can be written as,

T(v_1, \dots, v_k) = T_{i_1 \dots i_k}\, v_1^{i_1} \cdots v_k^{i_k} \qquad \text{with} \qquad T_{i_1 \dots i_k} := T(e_{i_1}, \dots, e_{i_k})

which implies,

T = T_{i_1 \dots i_k}\ \epsilon^{i_1} \cdots \epsilon^{i_k}

showing that a multilinear form can be written as a linear combination of (products of) covectors.

## Multilinear maps

A multilinear map is a map $T: V_1 \times \dots \times V_k \to W$, linear in each argument. A multilinear map can be written in terms of vector and covector bases. For example, consider $T: V \times V \to V$ with $\{e_i\}$ being a basis for $V$ and $\{\epsilon^i\}$ the dual basis. We can write,

T(u, v) = u^i v^j\, T(e_i, e_j)

Because for each $i$ and $j$ we have $T(e_i, e_j) \in V$, we can write,

T(e_i, e_j) = T^k{}_{ij}\, e_k

We write the indices $i$ and $j$ in $T^k{}_{ij}$ as subscripts in accordance with their position on the LHS; indeed, $T^k{}_{ij}$ is a coordinate of a covector with respect to $i$ or $j$ once the other indices are fixed. Combining the above, we can write,

T = T^k{}_{ij}\ e_k\, \epsilon^i\, \epsilon^j \qquad \text{i.e.} \qquad T(u, v) = T^k{}_{ij}\, u^i v^j\, e_k

The term $T^k{}_{ij}$ collects $n^3$ coefficients and uniquely defines the multilinear map. We can imagine this term as a 3-dimensional array/matrix. The above also shows that the multilinear map can be written as a linear combination of basis vectors and basis covectors.

## Definition of a tensor

Defining the following terms,

• Vector space $V$ with basis $\{e_i\}$ and another basis $\{\tilde e_i\}$.
• Basis transformation as $\tilde e_i = F^j{}_i\, e_j$, and therefore $e_i = B^j{}_i\, \tilde e_j$ with $B = F^{-1}$.
• The dual vector space of $V$ as $V^*$ with dual bases $\{\epsilon^i\}$ and $\{\tilde\epsilon^i\}$.
• Vector space $W$ with basis $\{f_i\}$ and another basis $\{\tilde f_i\}$.
• Basis transformation as $\tilde f_i = G^j{}_i\, f_j$, and therefore $f_i = H^j{}_i\, \tilde f_j$ with $H = G^{-1}$.
• Linear map $T \in L(V, W)$.
• Bilinear form $\mathcal{B}: V \times V \to \mathbb{R}$.

we concluded that,

\tilde v^i = B^i{}_j\, v^j, \qquad \tilde\epsilon^i = B^i{}_j\, \epsilon^j, \qquad \tilde\alpha_i = F^j{}_i\, \alpha_j, \qquad \tilde T^i{}_j = H^i{}_k\, T^k{}_l\, F^l{}_j, \qquad \tilde{\mathcal{B}}_{ij} = F^k{}_i\, F^l{}_j\, \mathcal{B}_{kl}

It is observed that if a vector is written as a single sum/linear combination of basis vectors of $V$, then the components of the vector change contravariantly with respect to a change of basis. Covectors were then considered, and it was observed that their components change covariantly upon a change of basis of $V$. A linear map can be written as a linear combination built from basis vectors and covectors; the coefficients of this combination change both contra- and covariantly when the bases (of $V$ and $W$) change. A bilinear form, though, can be written as a linear combination of (products of) covectors, and the corresponding coefficients change covariantly with a change of basis. These results can be generalized toward an abstract definition of a mathematical object called a tensor. There are two approaches to algebraically defining a tensor.

### Tensor as a multilinear form

Motivated by how linear maps, bilinear forms, and multilinear forms and maps can be written by combining basis vectors and covectors, a generalized combination of these vectors can be considered. For example,

T = T^{ij}{}_k\ e_i\, e_j\, \epsilon^k

This object consists of a linear combination of a unified (merged) set of basis vectors and covectors by the scalar coefficients $T^{ij}{}_k$. According to the type of each basis vector, the corresponding index becomes a sub- or superscript, and hence it determines the type of the transformation regarding that index. By reordering the basis vectors and covectors, we can always collect the vector-type factors first and the covector-type factors last.

Recalling that vector components can be written as $v^i = \epsilon^i(v)$ implies that there is a map, for a particular vector $v$, receiving a covector and returning a scalar:

v(\epsilon^i) := \epsilon^i(v) = v^i

And also the components of a covector determined as $\alpha_i = \alpha(e_i)$,

motivate defining $T^{ij}{}_k$ as the array (collection of the coefficients) of a multilinear form,

T^{ij}{}_k = T(\epsilon^i, \epsilon^j, e_k)

Therefore, the object, or the multi-dimensional array, which is dependent on the chosen bases and whose transformation rules are based on the types of the bases (or indices), can be intrinsically related to an underlying multilinear map $T: V^* \times V^* \times V \to \mathbb{R}$. By virtue of this observation, an object called a tensor is defined as the following.

Definition: A tensor of type (r,s) on a vector space $V$ over a field $\mathbb{R}$ (or $\mathbb{C}$) is a multilinear map as,

T: \underbrace{V \times \dots \times V}_{r} \times \underbrace{V^* \times \dots \times V^*}_{s} \to \mathbb{R}

The coordinates or the (scalar) components of a tensor can then be determined once a basis $\{e_i\}$ for $V$ and the dual basis $\{\epsilon^i\}$ for $V^*$ are fixed. Therefore,

T^{j_1 \dots j_s}{}_{i_1 \dots i_r} = T(e_{i_1}, \dots, e_{i_r}, \epsilon^{j_1}, \dots, \epsilon^{j_s})

Note that r is the number of covariant indices and s is the number of contravariant indices. A tensor of type (r,s) can be imagined as an (r+s)-dimensional array of data containing $n^{r+s}$ elements. Each index corresponds to a dimension of the data array.
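Evaluating a tensor as a multilinear map on vectors and covectors is a plain index contraction of its component array, which `numpy.einsum` expresses directly. A sketch (the (2,1) type and the random components are illustrative assumptions): each index of the array is contracted with the components of the corresponding argument, and multilinearity can be checked.

```python
import numpy as np

rng = np.random.default_rng(1)

# Component array of a type-(2,1) tensor on R^3: one contravariant index k
# and two covariant indices i, j (think T^k_{ij}, shape (3, 3, 3)).
T = rng.normal(size=(3, 3, 3))

u, v = rng.normal(size=3), rng.normal(size=3)   # two input vectors
alpha = rng.normal(size=3)                      # one input covector

# Evaluating the multilinear map: contract each index with its argument.
value = np.einsum('kij,k,i,j->', T, alpha, u, v)

# Multilinearity in the first vector argument:
a, b = 2.0, -3.0
w = rng.normal(size=3)
lhs = np.einsum('kij,k,i,j->', T, alpha, a * u + b * w, v)
rhs = a * value + b * np.einsum('kij,k,i,j->', T, alpha, w, v)
assert np.isclose(lhs, rhs)
```
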

By this definition, a vector is a (0,1) tensor, as it can be viewed as a map $v: V^* \to \mathbb{R}$ with $v(\alpha) := \alpha(v)$.

This implies that for each $v$ there is a (multilinear, with one input) map receiving a basis covector and returning the corresponding scalar component of the vector ($v(\epsilon^i) = v^i$). This corresponding map is unique for each vector, and it is called a tensor.

A covector (dual vector or a linear form) is a (1,0) tensor, because a covector is a linear map $\alpha: V \to \mathbb{R}$.

A linear map $T \in L(V, V)$ is a (1,1) tensor because the term $T^i{}_j$ in $T = T^i{}_j\, e_i\, \epsilon^j$ pertains to a covector's components for a fixed $i$ and to a vector's components for a fixed $j$. Therefore,

T^i{}_j = \hat T(e_j, \epsilon^i)

in other words, a linear map is a tensor viewed as a multilinear map $\hat T: V \times V^* \to \mathbb{R}$. Here, the multilinear form can be considered as $\hat T(v, \alpha) := \alpha(T(v))$, which gives $\hat T(e_j, \epsilon^i) = \epsilon^i(T(e_j)) = T^i{}_j$. Note that $T(e_j)$ returns a vector and $\epsilon^i$ extracts its $i$-th coordinate, and hence, the array/matrix of the linear map is retrieved.

A bilinear form is then a (2,0) tensor, where

\mathcal{B}_{ij} = \mathcal{B}(e_i, e_j)

A multilinear map $T: V \times V \to V$ is a (2,1) tensor, where the associated multilinear form can be considered as $\hat T(u, v, \alpha) := \alpha(T(u, v))$.

As an example, the cross product of two vectors, defined as the multilinear map $\times: \mathbb{R}^3 \times \mathbb{R}^3 \to \mathbb{R}^3$, is a (2,1) tensor.

By convention, scalars are (0,0) tensors.

Remark: for a tensor $T$ of type (r,s) we can write,

T = T^{j_1 \dots j_s}{}_{i_1 \dots i_r}\ e_{j_1} \cdots e_{j_s}\ \epsilon^{i_1} \cdots \epsilon^{i_r}

Example: Stress tensor. The Cauchy stress tensor in mechanics is a linear map (sending a surface normal vector to a traction vector) and hence a (1,1) tensor.

#### Rank of a tensor

The rank of an (r,s)-type tensor is defined as r+s. In this regard, tensors of different types can have the same rank; for example, tensors of types (1,1), (2,0), and (0,2) all have rank 2. Here we compare these tensors with each other.

A (1,1) tensor representing a linear map is $T = T^i{}_j\, e_i\, \epsilon^j$ with $T^i{}_j = \hat T(e_j, \epsilon^i)$, transforming as $\tilde T^i{}_j = B^i{}_k\, T^k{}_l\, F^l{}_j$.

A (2,0) tensor representing a bilinear form is $\mathcal{B} = \mathcal{B}_{ij}\, \epsilon^i \epsilon^j$ with $\mathcal{B}_{ij} = \mathcal{B}(e_i, e_j)$, transforming as $\tilde{\mathcal{B}}_{ij} = F^k{}_i\, F^l{}_j\, \mathcal{B}_{kl}$.

A (0,2) tensor is $S = S^{ij}\, e_i\, e_j$ with $S^{ij} = S(\epsilon^i, \epsilon^j)$, transforming as $\tilde S^{ij} = B^i{}_k\, B^j{}_l\, S^{kl}$.

The coefficients of each of the above tensors are collected in a 2-dimensional array/matrix; however, they follow different transformation rules based on their types.

# PCA: Introduction

Different kinds of inter-correlated data can be collected into a random vector. For example, we could consider the total length, weight, diameter of nose holes, tail length, number of teeth, colour of eyes, radius of acetabula, and number of vertebrae of dinosaurs as data and collect them into a (random) vector. Principal component analysis (PCA) is a method for reducing the dimension of a random vector (data), and also for discovering outstanding features of data. Using PCA, one can replace a random vector with another vector having a smaller dimension without losing much information. In PCA, the measure of information loss is the total variation of the original variables collected in a random vector.

See this cheat sheet for the mathematical identities used in this post.

## Least-squares optimality of PCA

Assume random variables $X_1, \dots, X_r$ collected in a random r-vector $X$, i.e. $X = (X_1, \dots, X_r)^\text T$. It has a mean $\mu_X := \text E[X]$ and a covariance matrix $\Sigma_{XX}$. A linear projection of the random variables means:

\xi=BX\text{ s.t. } \xi \in \mathbb{R}^{t\times 1}\text{ and }B \in \mathbb{R}^{t\times r} \\ t\le r

Each row of $B$ linearly combines the random variables to create a new random variable.

Let's project the r-vector $X$ onto a lower dimensional space and obtain the t-vector $\xi = BX$ with $t \le r$. We want to find a linear map to reconstruct $X$ out of $\xi$, which has lower dimension. This means finding a vector $\mu$ and a matrix $A \in \mathbb{R}^{r\times t}$ such that $X \approx \mu + A\xi$. The approximation is in the least-squares sense of minimizing the mean error. We consider the following as the measure of how well the linear map can reconstruct $X$ out of its linear projection $\xi$:

\text E\big [(X-\mu - A\xi)^T(X-\mu - A\xi)\big ]\ \in \mathbb{R}^{+}

Therefore, we should find $\mu$, $A$, and $B$ by minimizing the following mean:

\text E\big [(X-\mu - ABX)^T(X-\mu - ABX)\big ]\ \in \mathbb{R}^+\tag{1}

This expression is the mean of the squared norm/magnitude of the error vector, which is itself a random vector.

With the assumption that both $A$ and $B$ are of full rank t, the minimizing parameters will be (see the proof here):

A=\begin{bmatrix}|&|& \dots & |\\v_1 &v_2 &\dots &v_t\\|&|& \dots & | \end{bmatrix}=B^\text T\\ \ \\ \mu=(I_r-AB)\mu_X \tag{2}

where $v_i$ is a normalized eigenvector of the covariance matrix $\Sigma_{XX}$, associated with the i-th largest eigenvalue, $\lambda_i$, of $\Sigma_{XX}$. The eigenvectors are orthonormal to each other, i.e. $v_i^\text T v_j = \delta_{ij}$ ($\delta_{ij}$ is the Kronecker delta). $I_r$ is the identity matrix.

From the structures of the matrices $A$ and $B$, it can be observed that they have rank t. Therefore, the best rank-t approximation of the original random vector is:

X\approx \tilde X= \mu_x+C(X-\mu_x)\qquad \text{s.t }C=AB=AA^\text T \in \mathbb{R}^{r\times r}\tag{3a}

or:

\tilde X= \mu_x+\sum_{i=1}^{t}(v_iv_i^\text T)(X_c)\qquad \text{s.t }X_c=X-\mu_x\tag{3b}

Remark: $C = AA^\text T$ is symmetric and idempotent ($C^\text T = C$, $C^2 = C$), i.e. an orthogonal projection matrix of rank t.

The covariance matrix is real and symmetric; therefore it has a spectral decomposition $\Sigma_{XX} = V\varLambda V^\text T$ s.t. $V$ is an orthogonal matrix (a matrix with orthonormal columns/rows) whose columns are normalized eigenvectors of $\Sigma_{XX}$, and $\varLambda$ is a diagonal matrix holding the eigenvalues, $\lambda_i$, of $\Sigma_{XX}$. The eigenvectors in $V$ are sorted such that the j-th column of $V$ is associated with the j-th largest eigenvalue. Having said that, we observe that the best rank-t approximation of the original random vector only uses the first t eigenvectors of the random vector's covariance matrix.
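Eqs. 2-3 can be sketched numerically with a toy covariance matrix (the numbers below are illustrative placeholders, not from the text). The projection $C = AA^\text T$ built from the top eigenvectors is symmetric, idempotent, and of rank t:

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy covariance matrix (r = 4), built to be symmetric positive definite.
M = rng.normal(size=(4, 4))
Sigma = M @ M.T

# Spectral decomposition; eigh returns ascending order, so flip to get
# eigenpairs sorted by decreasing eigenvalue.
lam, V = np.linalg.eigh(Sigma)
lam, V = lam[::-1], V[:, ::-1]

t = 2
A = V[:, :t]            # first t eigenvectors, A = B^T as in Eq. 2
C = A @ A.T             # rank-t projection of Eq. 3a

# C is a symmetric, idempotent projection whose trace equals its rank t.
assert np.allclose(C, C.T)
assert np.allclose(C @ C, C)
assert np.isclose(np.trace(C), t)
```
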

The minimum of Eq. 1 is:

\sum_{i=t+1}^{r} \lambda_i \tag{4}

Principal components (PC's) of X: the first t principal components (also known as the Karhunen-Loève transformation) of $X$ are defined by the following linear projections:

\xi^{(t)}=A^{\text T}X\ \in \mathbb{R}^{t\times 1} \implies\xi_i^{(t)}=v_i^{\text T}X\qquad i=1,\dots ,t\le r\tag{5}

The superscript $(t)$ is used to indicate the dimension of the PC vector, being equal to the number of PC's.

Note that PC's are themselves random variables, as they are linear combinations of the data random variables.

PC’s of X are uncorrelated: the covariance between each pair of the principal components is:

\text {Cov}(\xi_i,\xi_j)=\text {Cov}(v_i^{\text T}X,v_j^{\text T}X)= v_i^{\text T}\Sigma_{XX}v_j=\lambda_j v_i^{\text T}v_j=\delta_{ij}\lambda_j\\ \tag{6}

Interestingly, $\text{Cov}(\xi_i, \xi_j) = 0$ for $i \ne j$. This means that the principal components of $X$ are uncorrelated. Moreover, $\text{Var}(\xi_i) = \lambda_i$.

A goodness-of-fit measure: Eq. 5 maps the r-dimensional data ($X$) into the lower t-dimensional data $\xi^{(t)}$. The total variance of the original data, i.e. the variables collected in $X$, is $\sum_{i=1}^{r}\text{Var}(X_i) = \text{tr}(\Sigma_{XX}) = \sum_{i=1}^{r}\lambda_i$. Based on this, a measure of how well the first t principal components (in the lower t-D space) represent the r original data variables is considered as (also note Eq. 4 and 6):

\frac{\sum_{i=t+1}^{r} \lambda_i}{\sum_{i=1}^{r} \lambda_i}=\frac{\text{total variance of }X_i\text{\textquoteright s}-\text{total variance of }\xi_i\text{\textquoteright s}}{\text{total variance of }X_i\text{\textquoteright s}}\tag{7}

This implies that the measure is small if the first t principal components contain a large part of the total variation in $X$; hence, a better approximation.

Remark: if $t = r$, Eq. 7 becomes zero, and it means a perfect match, i.e. no approximation. In other words, $t = r$ means $A = V$ ($V$ as in the spectral decomposition of $\Sigma_{XX}$); then, because $V$ is orthogonal, $C = AA^\text T = VV^\text T = I_r$, hence (Eq. 3), $\tilde X = X$.

## PCA as a variance-maximization method

The main aim of PCA is to reduce the dimension of data. The way a dinosaur would do a sort of reduction on a data vector is to keep one component and eat all the rest. It is a perfectly nonsense reduction, as many pieces of valuable information are lost. Another way is to take a simple average of the components. This approach is yet undesirable because it treats all components as having equal importance (equal weight). Another approach is a weighted average with some constraint on the weights, like $\sum_i w_i X_i$ such that the weight vector has unit length. In matrix notation, and using the Euclidean vector norm: $w^\text T X$ with $\|w\| = 1$.

This constrained weighted average is called a standardised linear combination (SLC). However, the problem is how to uniquely and meaningfully determine the weights. PCA chooses the weights in order to maximize the variances of the SLC's of the variables $X_i$, i.e. the variances of the linear projections of the random vector, $w^\text T X$. The SLC is in fact the projection of the vector $X$ onto the direction described by the weight vector $w$. In mathematical notation, we want to solve the following maximization problem:

\max_{\{w:\|w\|=1\}}\text{Var}(w^\text TX)=\max_{\{w:\|w\|=1\}}w^\text T\text{Var}(X)w=\max_{\{w:\|w\|=1\}}w^\text T\Sigma_{XX}w\\ \ \\ w\in \mathbb{R}^{r\times 1},\quad X\in \mathbb{R}^{r\times 1},\quad \Sigma_{XX}\in \mathbb{R}^{r\times r} \tag{8a}

In other words,

v=\argmax_{\{w:\|w\|=1\}}\text{Var}(w^\text TX)\tag{8b}

Principal components of X are defined based on this maximization. The 1st PC of X, denoted by $\xi_1$, is the SLC (of the variables) with the highest variance. This is obtained by the maximization problem expressed by Eq. 8. $\xi_1$ then has the highest variance among all other SLCs of the variables/projections of $X$:

v_1=\argmax_{\{w:\|w\|=1\}}\text{Var}(w^\text TX)\qquad v_1\in \mathbb{R}^{r\times 1} \\ \ \\ \text {PC}_1:=\xi_1:=v_1^\text TX\qquad \xi_1\in\mathbb{R}\ \tag{9}

$v_1$ is called the 1st principal direction (PD) or principal axis.

The 2nd PC of X, denoted by $\xi_2$, is the SLC (of the variables) with the highest variance such that the maximizing weight vector, $v_2$, is also subject to the orthogonality constraint $w^\text T v_1 = 0$. In mathematical notation, the 2nd PC is obtained by the following maximization:

v_2=\argmax_{\{w:\|w\|=1 \text{ and }w^\text Tv_1=0\}}\text{Var}(w^\text TX) \\ \ \\ \text {PC}_2:=\xi_2:=v_2^\text TX\qquad \ \tag{10}

By the same token, the i-th PC and the i-th PD or principal axis are defined by the following:

v_i=\argmax_{\{w:\|w\|=1\text{ and }w^\text T v_j=0 \text{ for all } j=1,\dots,i-1\}}\text{Var}(w^\text TX) \\ \ \\ \text {PC}_i:=\xi_i:=v_i^\text TX\qquad \ \tag{11}

It can be proved (see the references) that the maximizing weight vectors are the normalized eigenvectors of $\Sigma_{XX}$ associated with its eigenvalues sorted in decreasing order, i.e.:

\Sigma_{XX}=V\varLambda V^\text T\ \text{ s.t } i\lt j \implies \lambda_{i}\ge \lambda_{j}\\ \ \\ \text{i.e. }\lambda_1\ge \lambda_2\ge \dots \ge\lambda_r \ge 0

Therefore:

\text{cov}(\xi_i,\xi_j)=\text {Cov}(v_i^\text TX,v_j^\text TX)=v_i^\text T\Sigma_{XX} v_j=\lambda_j v_i^{\text T}v_j=\delta_{ij}\lambda_j\\ \ \\ \implies \text{Var}(\xi_1)\ge \text{Var}(\xi_2)\ge\dots\ge\text{Var}(\xi_r)\ge0

which means the PC's not only have the maximum variances among all possible SLCs of the variables (all possible projections of the random vector), but are also mutually uncorrelated.

In a matrix form, we can write:

\xi=V^\text TX\qquad \xi,X\in\mathbb{R}^{r\times 1},\ V^\text T\in\mathbb{R}^{r\times r}\tag{12}

where $X$ and $\xi$ are the random vector and its linear projection vector containing the PC's of $X$, respectively.

\text{Var}(\xi)=\text{Var}(V^\text TX)=V^\text T\Sigma_{XX}V=\varLambda
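This diagonalization can be verified on simulated data (a sketch; the mixing matrix below is an arbitrary choice used to create correlated variables):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: n samples of a correlated 3-dimensional random vector.
n = 20000
L = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.3, 0.2]])
X = rng.normal(size=(n, 3)) @ L.T       # rows are observations

Sigma = np.cov(X, rowvar=False)
lam, V = np.linalg.eigh(Sigma)
lam, V = lam[::-1], V[:, ::-1]          # decreasing eigenvalues

xi = X @ V                              # principal components of each row

# Var(xi) = V^T Sigma V = Lambda: the PC covariance matrix is diagonal,
# with variances equal to the eigenvalues in decreasing order.
cov_xi = np.cov(xi, rowvar=False)
assert np.allclose(cov_xi, np.diag(lam), atol=1e-8)
```
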

## PCA in practice: Sample PCA

In practice, the PC's are to be estimated from n independent observations of the random r-vector of data. This requires estimating the mean and covariance matrix of the random vector of data, using the sample mean and sample covariance.

Here, we use the notation $\bold X_i$ for the i-th independent trial to observe the random vector $X$.

Note: we denote an observation vector with a capital letter to indicate that it is still a random vector (out of the same population and through independent trials). To indicate a particular realization/value of $\bold X_i$ we shall use a small letter (by convention).

We collect the sequence of n independent observations of the random vector in a set as:

\{\bold X_i, i=1,2,\dots,n\}\ \text{ s.t }\bold X_i\in\mathbb{R}^{r\times 1}\tag{13}

The estimate for is given by:

\hat \mu_X=\bold{\bar X}:=\frac{1}{n}\sum_{i=1}^{n}\bold X_i\tag{14}

Let $\chi = (\bold X_1, \dots, \bold X_n) \in \mathbb{R}^{r\times n}$ collect the n observations of the random vector, and let $\chi_c := \chi - \bold{\bar X}\,1_n^\text T$ be its centered version. Then, the estimate for the covariance matrix is:

\hat\Sigma_{XX}=\frac{1}{n}\chi_c\chi_c^\text T\tag{15}

The ordered eigenvalues and their associated eigenvectors of $\hat\Sigma_{XX}$ then estimate the population parameters. This leads to writing:

\hat\lambda_1\ge\hat\lambda_2\ge\dots\ge\hat\lambda_r\ge 0\\ \ \\ \hat A=(\hat v_1,\dots,\hat v_t)=\hat B^\text T\tag{16}

In practice, the vectors of independent observations are collected as columns (or rows) of a matrix (just re-writing Eq. 13). Let's call this matrix the matrix of observations. Here, each column represents an observation of $X$:

\bold X=\begin{bmatrix}|&|&\dots &|\\ \bold X_1 & \bold X_2 &\dots &\bold X_n \\|&|&\dots &| \end{bmatrix}\in \mathbb{R}^{r\times n}\tag{17}

Recalling Eq. 5 or Eq. 12, PC's can be calculated for each column of the matrix of observations. In this regard, we obtain estimates of the PC's of $X$:

\hat\xi_{ij}^{(t)}=\hat v_j^\text T\bold X_i\ \in\mathbb{R}\\ j=1,\dots,t\ \text{ and }\ i=1,\dots,n

This is the estimate of the j-th PC of $X$ calculated based on the i-th observation $\bold X_i$. The term $\hat\xi_{ij}^{(t)}$ is called the score for the i-th observation on the j-th PC.

For each observation, we can write the above in a more compact form:

\hat\xi_{i}^{(t)}=\hat A^\text T\bold X_i\ \in\mathbb{R}^{t\times 1}\\ i=1,\dots,n\tag{18}
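The whole sample-PCA pipeline (Eqs. 14-18) fits in a few lines. A sketch on simulated observations (the dimensions and the per-variable scales are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Matrix of observations (Eq. 17): r = 3 variables, n = 500 columns.
r, n, t = 3, 500, 2
X = rng.normal(size=(r, n)) * np.array([[3.0], [1.0], [0.3]])

x_bar = X.mean(axis=1, keepdims=True)         # sample mean, Eq. 14
Xc = X - x_bar                                # centered observations
Sigma_hat = (Xc @ Xc.T) / n                   # sample covariance, Eq. 15

lam_hat, V_hat = np.linalg.eigh(Sigma_hat)
lam_hat, V_hat = lam_hat[::-1], V_hat[:, ::-1]  # decreasing eigenvalues

A_hat = V_hat[:, :t]                          # Eq. 16
scores = A_hat.T @ X                          # Eq. 18: t-by-n score matrix
assert scores.shape == (t, n)
```
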

### References

– Modern Multivariate Statistical Techniques, A. J. Izenman.
– Applied Multivariate Statistical Analysis, W. Hardle, L. Simar.

Although mostly paraphrased, this post might contain copyrighted sentences of the references.

# Procrustes Analyses

## Some definitions*

1) The shape of an object is all the geometrical information that remains when location, scale, and rotational effects are removed from an object. In other words, the shape of an object is invariant under Euclidean similarity transformations, which are translation, isotropic scaling, and rotation.

2) The form of an object is all the geometrical information that remains when the location and rotational effects of an object are removed. Form is also called size-and-shape. Form is invariant under rigid-body transformations. Here, we can say shape is the form with size removed (the size/scale information is removed).

3) Same form and same shape: two objects have the same form (size-and-shape) if they can be translated and rotated relative to each other so that they exactly match. Two objects have the same shape if they can be translated, rotated, and scaled relative to each other so that they exactly match.

The words matching, registration, superimposing and fitting are equivalent.

4) A landmark is a point of correspondence on each object, e.g. the tip of the nose of all the dinosaurs in a population/sample.

5) Geometric shape analysis: an object can be (discretely) represented by a collection of landmarks (a point cloud); therefore, the landmark coordinates retain (continue to have) the geometry of the point-cloud configuration. This approach to shape analysis is called geometric shape analysis.

6) The configuration is a collection (set) of landmarks on/of a particular object. The configuration of an m-D object with k landmarks can be represented by a $k \times m$ matrix called the configuration matrix. For a 3D object ($m = 3$) we have:

\begin{bmatrix} x_1&y_1&z_1\\ x_2&y_2&z_2\\ x_3&y_3&z_3\\ \vdots& \vdots&\vdots\\ x_k&y_k&z_k\\ \end{bmatrix}

7) A size measure of a configuration is any positive real-valued function $g$ of the configuration matrix, X, such that $g(aX) = a\,g(X)$ for any positive scalar $a$.

8) The centroid size is defined as the square root of the sum of squared Euclidean distances from each landmark to the centroid of a configuration:

S(X):=\sqrt{\sum_{i=1}^{k}{\|X_{i,}-\overline{X}\|^2}}\tag{1}

where $X_{i,}$ is the i-th row of X, $\overline{X} = \frac{1}{k}\sum_{i=1}^{k} X_{i,}$ is the centroid, and $\|\cdot\|$ is the Euclidean vector norm.

The centroid size can be re-written using matrix algebra:

S(X)=\|CX\|\tag{2}

where $C := I_k - \frac{1}{k} 1_k 1_k^\text T$, with $I_k$ being the $k \times k$ identity matrix, is called the centering matrix, and $\|\cdot\|$ is the Frobenius norm, sometimes also called the Euclidean norm, of a matrix, i.e. $\|A\| = \sqrt{\text{tr}(A^\text T A)}$. If the centroid of $X$ is already located at the origin, then $S(X) = \|X\|$.
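Both forms of the centroid size can be checked on a small configuration (a unit-style square; a sketch, with the landmarks chosen arbitrarily):

```python
import numpy as np

# A configuration of k = 4 landmarks in m = 2 dimensions (a square).
X = np.array([[0.0, 0.0],
              [2.0, 0.0],
              [2.0, 2.0],
              [0.0, 2.0]])
k = X.shape[0]

# Definition (Eq. 1): root of the sum of squared distances to the centroid.
centroid = X.mean(axis=0)
S_def = np.sqrt(((X - centroid) ** 2).sum())

# Matrix form (Eq. 2): S(X) = ||C X|| with the centering matrix C.
C = np.eye(k) - np.ones((k, k)) / k
S_mat = np.linalg.norm(C @ X)                 # Frobenius norm

assert np.isclose(S_def, S_mat)
```
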

9) The Euclidean similarity transformations of a configuration matrix X are the set of translated, rotated, and isotropically scaled X. In a mathematical notation:

\{\beta X \Gamma + 1_k\gamma^{\text T}:\beta\in \mathbb{R}^+, \Gamma\in SO(m),\gamma\in \mathbb{R}^m\}\tag{3}

where $\beta \in \mathbb{R}^+$ is the scale (factor), $SO(m)$ is the space of rotation matrices, and $\gamma$ is the translation vector. Vectors are considered as column vectors. $1_k$ is a column vector of k ones. $m = 3$ for 3D.

10) Rigid-body transformations of a configuration matrix X are the set of translated and rotated X. In a mathematical notation:

\{X \Gamma + 1_k\gamma^{\text T}:\Gamma\in SO(m),\gamma\in \mathbb{R}^m\}\tag{4}

11) The centered pre-shape of a configuration matrix X is obtained (defined) by removing the location and scale/size information of the configuration. Location can be removed by centering the configuration, i.e. translating the configuration by moving its centroid to the origin. Therefore, the translated configuration is $X_C = CX$. The size is removed by scaling the configuration by its size, defined as in Eq. 2. Therefore, the pre-shape of X is obtained as:

Z_C:=\frac{CX}{\|CX\|}=\frac{CX}{\|X_C\|}\tag{5}

Removing size or scale means scaling to unit size. Thereby, two configurations with different sizes lose their size information and also their relative scale. The scale is a quantity that expresses the ratio between the sizes of two configurations, either two with different shapes or two with the same shape (one being a scaled version of the other). In another scenario (as in the full Procrustes analyses), if two or more configurations are scaled with different scale factors, they also lose their scale/size information relative to each other, although their sizes may not be unit.
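Computing the pre-shape of Eq. 5 takes two lines. A sketch on a random configuration (the landmark values are placeholders): after centering and scaling, the result is centered and has unit centroid size.

```python
import numpy as np

# An arbitrary configuration of k = 5 landmarks in m = 2 dimensions.
rng = np.random.default_rng(5)
X = rng.normal(size=(5, 2)) * 4.0 + 10.0

k = X.shape[0]
C = np.eye(k) - np.ones((k, k)) / k

Xc = C @ X                                  # location removed (centered)
Z = Xc / np.linalg.norm(Xc)                 # size removed (Eq. 5): pre-shape

# The pre-shape is centered and has unit centroid size.
assert np.allclose(Z.mean(axis=0), 0.0)
assert np.isclose(np.linalg.norm(Z), 1.0)
```
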

12) Reworded definition of a shape: A shape is an equivalence class of geometrical figures/objects/configurations modulo (what remains after) the similarity transformations. The shape of X is denoted by [X]. In order to visualise the shape of a configuration, a representative of the equivalence class is considered, called an icon, also denoted by [X].

Two configurations $X$ and $X'$ have the same shape, i.e. $[X]=[X']$, iff there exist $\beta\in\mathbb{R}^+$, $\Gamma\in SO(m)$, and $\gamma\in\mathbb{R}^m$ such that:

X'=\beta X \Gamma + 1_k\gamma^{\text T}

In other words, two configurations have the same shape if they perfectly match after the removal of their location, rotation, and size/scale.

13) Shape space is the space/set of all shapes, i.e. equivalence classes. All shapes have their locations, rotations, and size removed.

14) Reworded definition of form (size-and-shape): form is an equivalence class of geometrical configurations modulo translation and rotation. The form/size-and-shape of a configuration/object X is denoted by an icon for the class. Two configurations $X$ and $X'$ have the same form iff there exist $\Gamma\in SO(m)$ and $\gamma\in\mathbb{R}^m$ such that:

X'=X\Gamma + 1_k\gamma^{\text T}

15) Size-and-Shape/Form space is the space/set of all forms, i.e. equivalence classes. All forms have their locations and rotations removed.

16) By removing the size of a form (scaling to unit centroid size) of a configuration, the shape of the configuration is obtained.

17) Clarification on the term “removing”: let’s imagine a set of n configurations $X_1,\dots,X_n$. We want to remove the scale, location, and rotation(al) information of the configurations. Removal of scale/size is to scale all the configurations to unit size (each divided by, for example, its centroid size).

Removal of location is to translate all the configurations to a particular target point/location in the space by selecting the same single landmark or point of reference (like the centroid) of each configuration and (rigidly) translating the configuration along the vector from the point of reference to the target point. The target point can be the origin of the coordinate system/frame of reference.

Removal of rotation of the configurations is through optimising over rotations by minimising some criterion. For instance, to remove the rotations (rotational information) of two configurations with their locations (and maybe scales/sizes) already removed, it is needed to apply a particular relative rotation to them (e.g. keep one stationary and rotate the other one) so that they get as close as possible with respect to some definition of distance. Once the rotation is removed, any deviation of one configuration from the other originates from their locations, scales/sizes, and their shapes. To compare shapes, we also have to remove the locations and scales of the configurations; then any deviation is due to the shapes of the configurations. If only location and rotational information are removed, then any relative deviation originates from the sizes and shapes (forms) of the configurations.

## Matching two configurations and measure of distance between shapes

Consider a collection/set of shapes, as defined above, all having the same number of landmarks, with the landmarks in one-to-one correspondence. An important question is how to define a distance between two shapes in the set; it is natural to define a measure between the members of a set. The notion of shape distance indicates the difference between two shapes, i.e. how far apart two shapes are in the shape space.

A natural approach is to match/fit/superimpose/register a configuration X to a configuration Y (same number of landmarks and in correspondence) using the similarity transformations applied to X; the magnitude of difference between the fitted configuration of X and Y then indicates the magnitude of difference in shape between the two configurations X and Y.

So how to match two configurations? The full ordinary Procrustes analysis (FOPA) is the method to match/register/superimpose/fit two (m-D) configurations with the same number of landmarks as closely as possible onto each other. This method applies the similarity transformations to one of the configurations to closely match the other one. But what is the measure of closeness? The definition of the method makes it perspicuous:

The method of FOPA matches two configurations through transforming one onto the other one (target configuration) up to the similarity transformations and minimizing the squared Frobenius/Euclidean norm of the difference between the transforming and the target configurations. This means to minimize the following expression:

D_{FOPA}^2(X_1,X_2):=\|X_2-(\beta X_1\Gamma+1_k\gamma^\text{T})\|^2\tag{6}

The minimum value of $D_{FOPA}^2$ is a measure of similarity/dissimilarity. Observe that $X_1$ is the transforming configuration (input) and $X_2$ is the target configuration. Also, all the similarity transformations are involved in the registration; this is why the term full is used. In order to find the matching configuration, the similarity parameters should be determined, i.e.:

(\hat\beta, \hat\Gamma, \hat\gamma)=\argmin_{\beta,\Gamma,\gamma}\|X_2-\beta X_1\Gamma-1_k\gamma^\text{T}\|^2\tag{7}

It is good to know that $\hat\gamma=0$ if the configurations are already centered.

The full Procrustes fit of $X_1$ onto $X_2$ is then:

X_1^{FP}=\hat\beta X_1\hat\Gamma+1_k\hat\gamma^\text{T}\tag{8}

It is tempting to choose the minimum of Eq. 6 as a measure of distance between the shapes of the two configurations. However, $\min D_{FOPA}^2(X_1,X_2)\neq\min D_{FOPA}^2(X_2,X_1)$ in general, i.e. it is not symmetric, unless the configurations both have the same size; for example having the unit size by getting pre-scaled. Therefore, the full Procrustes distance $d_F$ between two shapes (of two configurations) is defined as:

d_F^2(X_1,X_2):=\min\ D_{FOPA}^2(\frac{X_1}{\|X_1\|},\frac{X_2}{\|X_2\|})\tag{9}

FOPA involves isotropic scaling of the input configuration, in addition to the other two similarity transformations, to register it onto the target configuration. This means that the fitted configuration does not have the same size/scale as its original configuration. Therefore, FOPA removes the size of the input configuration.
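The minimization in Eq. 7 has a well-known closed-form solution via the singular value decomposition (a standard result, not derived in this text); below is a numpy sketch, with function names of my own choosing:

```python
import numpy as np

def fopa(X1, X2):
    """Full ordinary Procrustes fit of X1 onto X2 (Eq. 6): translation,
    rotation, and isotropic scaling, via the standard SVD solution."""
    A = X1 - X1.mean(axis=0)                    # centered input
    B = X2 - X2.mean(axis=0)                    # centered target
    U, S, Vt = np.linalg.svd(A.T @ B)
    d = np.sign(np.linalg.det(U @ Vt))          # enforce det(Gamma) = +1
    S[-1] *= d
    U[:, -1] *= d
    Gamma = U @ Vt                              # optimal rotation
    beta = S.sum() / np.linalg.norm(A) ** 2     # optimal scale
    fit = beta * A @ Gamma + X2.mean(axis=0)    # optimal translation: match centroids
    return fit, np.linalg.norm(B - beta * A @ Gamma) ** 2

def full_procrustes_distance_sq(X1, X2):
    """d_F^2 of Eq. 9: FOPA between the unit-size pre-shapes."""
    def pre(X):
        Xc = X - X.mean(axis=0)
        return Xc / np.linalg.norm(Xc)
    return fopa(pre(X1), pre(X2))[1]

# a shape and a similarity-transformed copy: d_F should be (numerically) zero
X = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0], [0.5, 1.0]])
c, s = np.cos(0.7), np.sin(0.7)
Y = 3.0 * X @ np.array([[c, -s], [s, c]]) + np.array([5.0, -2.0])
print(full_procrustes_distance_sq(X, Y))  # ~0
```

The sign correction `d` keeps the optimal matrix in SO(m), i.e. a proper rotation rather than a reflection.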

Another scenario of fitting the input configuration to the target one is to register while preserving the size/scale of the input, i.e. not to scale the input (the scale of the target is also preserved). In this regard, partial ordinary Procrustes analysis (POPA) is a method of registration only over translation and rotation to match two configurations. This involves the minimization of:

D_{POPA}^2(X_1,X_2):=\|X_2-(X_1\Gamma+1_k\gamma^\text{T})\|^2\tag{10}

The partial Procrustes fit of $X_1$ onto $X_2$ is then (see Eq. 7 with $\beta=1$):

X_1^{PP}=X_1\hat\Gamma+1_k\hat\gamma^\text{T}\tag{11}

Based on POPA, the partial Procrustes distance can be defined. If the two configurations have the same size, for example by getting pre-scaled to the unit size, i.e. the size is removed, then the following is defined as the partial Procrustes distance between two shapes (of two configurations):

d_P^2(X_1,X_2):=\min\ D_{POPA}^2(\frac{X_1}{\|X_1\|},\frac{X_2}{\|X_2\|})\tag{12}

Calculating the POPA distance, unlike the FOPA distance, does not involve additional scaling of the (pre-scaled) unit-sized configurations, i.e. no scaling parameter appears in the minimization of Eq. 12. Nevertheless, obtaining the POPA distance still entails removing the scale of the configurations.
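Assuming the optimal rotation is the same SVD solution as in FOPA, only without the scaling step, POPA (Eq. 10) can be sketched as follows (function name mine):

```python
import numpy as np

def popa(X1, X2):
    """Partial ordinary Procrustes fit of X1 onto X2 (Eq. 10):
    translation and rotation only; the size of X1 is preserved."""
    A = X1 - X1.mean(axis=0)
    B = X2 - X2.mean(axis=0)
    U, _, Vt = np.linalg.svd(A.T @ B)
    U[:, -1] *= np.sign(np.linalg.det(U @ Vt))   # keep det(Gamma) = +1
    Gamma = U @ Vt
    fit = A @ Gamma + X2.mean(axis=0)            # no scaling applied
    return fit, np.linalg.norm(B - A @ Gamma) ** 2

# a rigid-body (rotation + translation) copy matches exactly; size is kept
X1 = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
c, s = np.cos(1.1), np.sin(1.1)
X2 = X1 @ np.array([[c, -s], [s, c]]) + np.array([1.0, 4.0])
fit, d2 = popa(X1, X2)
print(d2)  # ~0
```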

Remark: the two defined distances have different values.

## Mean shape

First some motivation from the arithmetic mean of numbers. Every dinosaur knows that the arithmetic mean/average of n real numbers $x_1,\dots,x_n$ is defined as $\bar x:=\frac{1}{n}\sum_{i=1}^{n}x_i$. Let’s define a function $g$ of the variable $\mu$ as:

g(\mu):=\frac{1}{n} \sum_{i=1}^{n} (x_i-\mu)

It is now obvious that if we let $\mu=\bar x$ then $g(\mu)=0$. This is what we want out of defining the function. However, zero is not the minimum of $g$, as this function is a strictly decreasing function of $\mu$, hence no local or global minimum (just take the derivative w.r.t. $\mu$). Let’s define a function $f$ as:

f(\mu):=\frac{1}{n} \sum_{i=1}^{n} (x_i-\mu)^2

What $\mu$ minimizes $f$? Solving $\frac{\mathrm d}{\mathrm d \mu}f(\mu)=0$, we get:

\frac{\mathrm d}{\mathrm d \mu}f(\mu)=\frac{1}{n} \sum_{i=1}^{n} -2(x_i-\mu)=0 \newline \implies \frac{1}{n} \sum_{i=1}^{n} (x_i-\mu)=0

This is just $g(\mu)=0$, which has the answer $\hat\mu=\bar x$. The second derivative of $f$ is $2>0$, which indicates that $\hat\mu=\bar x$ minimizes $f$.

Using the Euclidean distance/metric/norm of real numbers, i.e. $d_E(x,y):=|x-y|$ with $|\cdot|$ being the absolute value, we can re-define $f$ as:

f(\mu):=\frac{1}{n} \sum_{i=1}^{n} |x_i-\mu|^2=\frac{1}{n} \sum_{i=1}^{n} d_{E}^2(x_i,\mu)

This results in (it can be easily proved by equating the derivative with zero):

\hat\mu=\argmin_{\mu}\frac{1}{n} \sum_{i=1}^{n} d_{E}^2(x_i,\mu)
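This argmin characterization of the arithmetic mean can also be checked numerically; a quick numpy sketch with arbitrary sample values, comparing the mean against a brute-force grid search:

```python
import numpy as np

# check that the arithmetic mean minimizes f(mu) = (1/n) * sum (x_i - mu)^2
x = np.array([1.0, 4.0, 6.0, 9.0])
grid = np.linspace(-10.0, 20.0, 30001)   # candidate mu values, step 0.001
f_vals = ((x[None, :] - grid[:, None]) ** 2).mean(axis=1)
print(x.mean(), grid[np.argmin(f_vals)])  # both are (numerically) 5.0
```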

This motivation suggests that for a set of objects with a defined distance function/metric, a mean/average object can be imagined. The average object is the object that minimizes the average sum of squared distances from all objects to it. The mean shape of a set of shapes sampled from a population of shapes can be defined using the distance functions defined by Eq. 9 and Eq. 12:

The sample full Procrustes mean shape is defined as:

\hat\mu_F:=\arg\inf_{\mu}\frac{1}{n} \sum_{i=1}^{n} d_{F}^2(X_i,\mu)\tag{13}

The sample partial Procrustes mean shape is defined as:

\hat\mu_P:=\arg\inf_{\mu}\frac{1}{n} \sum_{i=1}^{n} d_{P}^2(X_i,\mu)\tag{14}

In words, the mean shape is a shape with minimum distance from each configuration’s shape.

Note that in both definitions, the configurations will be pre-scaled to the unit size (due to the definition of the metrics).

Remark: the mean shape refers to the mean of the shapes not the mean of configurations.

## Matching more than two configurations: generalized Procrustes analyses

The term FOPA/POPA refers to Procrustes matching of one configuration onto another. If there are more than two configurations, it is natural to register/match them all to a mean shape or mean form. To do so, generalized Procrustes analysis (GPA) is utilized to superimpose two or more configurations onto a common unknown mean shape/form. GPA has two variants: full and partial GPA. The GPA provides a practical method of computing the sample mean shape defined by Eq. 13 or 14. Unlike the full/partial OPA, the output of GPA (the Procrustes fits) does not depend on the order of superimposition.

### Full GPA

The method of full GPA transforms configurations up to the similarity transformations (scaling + rotation + translation) with respect to a common unknown mean shape in order to minimise the following total sum of squares with respect to the transformation parameters $\beta_i$, $\Gamma_i$, and $\gamma_i$, while imposing a size constraint to avoid a trivial solution. A trivial solution can be all $\beta_i$ close to zero (in order not to make a configuration vanish, the scale is not allowed to be zero) and no translation; hence the mean shape would be something with size close to zero, i.e. almost vanished.

G(X_1,\dots,X_n):=\sum_{i=1}^{n} \|\beta_iX_i\Gamma_i+1_k\gamma_i^\text{T}-\mu\|^2\tag{15} \newline \sum_{i=1}^{n}S^2(\beta_iX_i\Gamma_i+1_k\gamma_i^\text{T})=\sum_{i=1}^{n}S^2(X_i)

In the above expression, $\beta_i$, $\Gamma_i$, $\gamma_i$, and $\mu$ are unknown. Once the minimization is solved, the estimates of the optimum/minimizing parameters, $\hat\beta_i$, $\hat\Gamma_i$, $\hat\gamma_i$, and the estimate of the mean shape, $\hat\mu$, are found. Note that if we already knew the population mean shape, we would just align each configuration separately to that mean shape.

Once the above minimization is solved, the full (generalized) Procrustes fit of each configuration (onto the estimated mean) is:

X_i^{FP}=\hat\beta_iX_i\hat\Gamma_i+1_k\hat\gamma_i^\text{T}\tag{16}

Note that the scale of each transformed configuration (full Procrustes fit) is not necessarily the same. Therefore, full GPA does not preserve the scale of the configurations through their optimal transformations. This means that if one configuration is $c$ times larger than another one, the two need not remain $c$ times of each other after the optimal transformations fitting them to the mean shape.

Assuming that $\hat\beta_i$, $\hat\Gamma_i$, $\hat\gamma_i$, and $\hat\mu$ minimize expression 15, we can write:

\hat\mu=\argmin_{\mu}\sum_{i=1}^{n} \|\hat\beta_iX_i\hat\Gamma_i+1_k\hat\gamma_i^\text{T}-\mu\|^2\text{ or}\tag{17} \newline \hat\mu=\argmin_{\mu}\sum_{i=1}^{n} \|X_i^{FP}-\mu\|^2

The solution is then (see the proof here):

\hat\mu=\frac{1}{n}\sum_{i=1}^{n} X_i^{FP}\tag{18}

which is the arithmetic mean of the Procrustes fits. This means that once the configurations are registered to the estimate of the unknown mean, their arithmetic mean/average equals the estimated mean.

Now we want to show that the estimated mean shape has the same shape as the sample full Procrustes mean shape defined by Eq. 13. Without loss of generality, we pre-scale the configurations to the unit size, with the size constraint adjusted accordingly.

Since Eq. 15 is a sum of nonnegative terms, the optimum transformation parameters should also minimize each term in the sum. This means that $(\hat\beta_i,\hat\Gamma_i,\hat\gamma_i)$ minimizes each $\|\beta_iX_i\Gamma_i+1_k\gamma_i^\text{T}-\hat\mu\|^2$. Therefore, following Eq. 13, the sample full Procrustes mean shape satisfies:

\hat\mu_F:=\arg\inf_{\mu}\frac{1}{n} \sum_{i=1}^{n}\min_{\beta_i,\Gamma_i,\gamma_i}\|(\beta_i X_i\Gamma_i+1_k\gamma_i^\text{T})-\mu\|^2 \newline =\arg\inf_{\mu}\frac{1}{n} \sum_{i=1}^{n}\|(\hat\beta_i X_i\hat\Gamma_i+1_k\hat\gamma_i^\text{T})-\mu\|^2 \newline=\arg\inf_{\mu}\sum_{i=1}^{n}\|X_i^{FP}-\mu\|^2

Therefore, by Eq. 16 and 17, $\hat\mu_F$ and $\hat\mu$ have the same shape, which means there is a similarity transformation to perfectly superimpose one on the other; here, at most a scaling is needed.

Estimating the transformation parameters of full GPA method in order to estimate the sample full Procrustes mean shape and registering all configurations to the estimated mean shape, is then equivalent to the following minimization problem:

\inf_{\beta_i,\Gamma_i,\gamma_i}\sum_{i=1}^{n} \|(\beta_i X_i\Gamma_i+1_k\gamma_i^\text{T}) - \frac{1}{n} \sum_{j=1}^{n} (\beta_j X_j\Gamma_j+1_k\gamma_j^\text{T})\|^2\\ \sum_{i=1}^{n}S^2(\beta_iX_i\Gamma_i+1_k\gamma_i^\text{T})=\sum_{i=1}^{n}S^2(X_i)

It can be proved that the above is equal to (see the proof here):

\inf_{\beta_i,\Gamma_i,\gamma_i}\frac{1}{n}\sum_{i=1}^{n-1} \sum_{j=i+1}^{n}\|(\beta_i X_i\Gamma_i+1_k\gamma_i^\text{T}) - (\beta_j X_j\Gamma_j+1_k\gamma_j^\text{T})\|^2\\ \newline \sum_{i=1}^{n}S^2(\beta_iX_i\Gamma_i+1_k\gamma_i^\text{T})=\sum_{i=1}^{n}S^2(X_i)

The above means that the full GPA minimizes the sum of pairwise squared Frobenius distances between all the transformed versions of the configurations (full Procrustes fits) in the sample. It should be noted that the output of the full GPA is configurations that are translated, rotated, and scaled versions of the original configurations. Hence, the scale, location, and rotational information of each configuration is removed. The output configurations are registered onto each other such that the sum of pairwise squared Frobenius distances between all of them is minimized. If the scale of each output configuration is removed by re-scaling it to unit size, then the full Procrustes distance between each two shapes (of the output configurations), and also the estimated mean shape (once re-scaled to unit size), can be calculated.

Remark: the full GPA does not lead to the minimized (squared) Frobenius distance between each pair of full Procrustes fits; rather, it leads to the minimized sum of the squared Frobenius distances between each full Procrustes fit and the rest of them in the set. It does minimize the (squared) Frobenius distance between each full Procrustes fit and the (full Procrustes) mean shape though.

Remark: The full GPA removes the scale of the configurations by scaling them in order to make them closer to the mean shape. The scales are not the same for all the configurations, and hence the full GPA does not preserve the (original) scale (relative size) of the configurations. The full Procrustes fits and the estimated mean shape may not have the unit size. However, they are relatively scaled in a way to minimize the sum of squared distances. This is the notion of removing the scale information (of the original configurations). Therefore, the Procrustes fits represent the shape (information) of the configurations, and the differences between each of them (between corresponding landmarks) and the mean shape are purely due to their shape variations.

In summary, the full GPA of configurations leads to a mean shape and full Procrustes fits such that each full Procrustes fit is superimposed onto the mean shape through the similarity transformations; hence, whatever geometrical variations/discrepancies remain between each full Procrustes fit and the mean shape are purely due to their shapes.

An algorithm for full GPA is as follows:

1- Translations/Removing locations: remove the location of each configuration by centring it (each configuration is translated so that its centroid is at the origin) and initially let each Procrustes fit be:

X_i^{FP}= CX_i\text{ for } i=1,\dots,n

2- Estimate the mean as $\hat\mu=\frac{1}{n}\sum_{i=1}^{n}X_i^{FP}$.

3- Rotations and scales: perform a FOPA on each centred configuration to rotate and scale it to $\hat\mu$. Now each recently transformed configuration, ${X}_i^{*FP}$, is fitted to the estimated mean:

{X}_i^{*FP}=\hat\beta_i X_i^{FP} \hat\Gamma_i

$\hat\beta_i$ and $\hat\Gamma_i$ are the transformation parameters out of the FOPA. Note that the rotation is about the origin, and no translation is needed as the locations are already removed.

4- Update the mean shape: $\hat\mu=\frac{1}{n}\sum_{i=1}^{n}X_i^{*FP}$.

5- Let $X_i^{FP}=X_i^{*FP}$. Re-centring is not necessary because the centroid of each configuration at this stage is already located at the origin and the rotations are also about the origin. Re-scaling won’t change the location of the centroid because it is isotropic.

6- Repeat 3 to 5 until everyone’s happy, i.e. the change in the Procrustes sum of squares ($G$) from one iteration to the next becomes less than a particular tolerance.

Remark: The configurations can be pre-scaled to have the unit size but it is not necessary because their size is removed once they get scaled to better fit the estimated mean at each iteration.
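The six steps above can be sketched in numpy (function name mine; as one simple way of imposing the size constraint, this sketch re-normalizes the estimated mean to unit size at each iteration):

```python
import numpy as np

def full_gpa(configs, tol=1e-10, max_iter=200):
    """Iterative full GPA sketch following steps 1-6 above.
    The size constraint is imposed by keeping the mean at unit size."""
    fits = [X - X.mean(axis=0) for X in configs]   # step 1: remove locations
    mu = sum(fits) / len(fits)                     # step 2: initial mean estimate
    mu /= np.linalg.norm(mu)
    prev = np.inf
    for _ in range(max_iter):
        new_fits = []
        for A in fits:                             # step 3: FOPA of each fit onto mu
            U, S, Vt = np.linalg.svd(A.T @ mu)     # (rotation about the origin)
            d = np.sign(np.linalg.det(U @ Vt))
            S[-1] *= d
            U[:, -1] *= d                          # keep det = +1 (a rotation)
            beta = S.sum() / np.linalg.norm(A) ** 2
            new_fits.append(beta * A @ (U @ Vt))
        fits = new_fits                            # step 5
        mu = sum(fits) / len(fits)                 # step 4: update the mean (Eq. 18)
        mu /= np.linalg.norm(mu)
        G = sum(np.linalg.norm(F - mu) ** 2 for F in fits)
        if abs(prev - G) < tol:                    # step 6: stop when G stabilizes
            break
        prev = G
    return fits, mu

# three similarity-transformed copies of one shape: the residual G should vanish
base = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0], [0.5, 1.0]])
configs = []
for angle, scale, shift in [(0.0, 1.0, (0, 0)), (0.9, 2.5, (4, -1)), (-1.3, 0.4, (-2, 3))]:
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    configs.append(scale * base @ R + np.array(shift, dtype=float))
fits, mu = full_gpa(configs)
```

Since the inputs here all share one shape, every Procrustes fit converges onto the (unit-size) estimated mean.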

### Partial GPA

Partial GPA involves merely registration of a set of configurations by translation and rotation (not scaling). Partial GPA preserves the scale/size of the configurations, as opposed to full GPA. This method is appropriate for the analysis of forms, i.e. size-and-shapes. Partial GPA is through minimization of the following sum with respect to the transformation parameters $\Gamma_i$ and $\gamma_i$ and the unknown form $\mu$. A size constraint is not needed as there is no scaling involved.

G_P(X_1,\dots,X_n):=\sum_{i=1}^{n} \|X_i\Gamma_i+1_k\gamma_i^\text{T}-\mu\|^2\tag{19}

Once the above minimization is solved, the partial Procrustes fit of each configuration onto the common form is:

X_i^{PP}=X_i\hat\Gamma_i+1_k\hat\gamma_i^\text{T}\tag{20}

Similar to Eq. 17 and 18, we can write:

\hat\mu=\frac{1}{n}\sum_{i=1}^{n} X_i^{PP}\tag{21}

However, $\hat\mu$ is not going to have the same shape as the mean shapes previously defined (sample full/partial Procrustes mean shape). $\hat\mu$ is then a form whose summed squared Frobenius distance to the partial Procrustes fits is minimum. I and some dinosaur think that this partial GPA is appropriate to group-register a set of configurations onto each other; however, we do not know its application at this moment.

The algorithm for partial GPA is the same as that of the full GPA except that no scaling is applied, i.e. $\beta_i$ is set to 1.

* The definitions follow the book “Statistical Shape Analysis” by Ian L. Dryden and Kanti V. Mardia, and the article “Procrustes Methods in the Statistical Analysis of Shape” by Colin Goodall. I may not have paraphrased them all; hence, some sentences may be subject to copyright and they are not intended to be reused in official publications.