Tensors 1


Einstein summation convention is used here. A matrix M is denoted as [M] and its ij-th element is referred to by [M]_{ij}. Quantities or coefficients are indexed as for example u^i, A_{ij} or A_i^j. These indices do not automatically pertain to row and column indices of a matrix, but the quantities can be presented by matrices through isomorphisms once their indices are freely interpreted as rows and columns of matrices.

Coordinates of a vector

Let V be a n-dimensional vector space and \mathcal B=\{e_1,\cdots, e_n\} with e_i\in V be a basis for V. Then, we define the coordinate function as,

    \[[\cdot]_{\mathcal B}:V\to \mathbb M^{n\times 1}\]

such that for a vector v\in V written by its components (with respect to \mathcal B) as v=v_ie_i the function acts as,

    \[[v]_{\mathcal B}=\begin{bmatrix}v_1 \\ \vdots \\ v_n \end{bmatrix}\]

The coordinate function is a linear map.

Change of basis for vectors

Let \mathcal B and \tilde {\mathcal B} be two basis for V, then,

\tilde e_i=F_{ji}e_j and e_i=B_{ji}\tilde e_j

where the indices of the scalar terms F_{ji} and B_{ji} are intentionally set this way. So, if all F_{mn} are collected into a matrix [F], then the sum F_{ji}e_j is over the rows of the matrix for a particular column. In other words, we can utilize the rule of matrix multiplication and write,

    \[\tilde e_i=F_{ji}e_j=[F]^{\rm T}\begin{bmatrix} e_1\\ \vdots \\ e_n \end{bmatrix}\]

The same is true for [B]:=B_{ji}. In above formulations, note that j is a dummy index (i.e. we can equivalently write \tilde e_j=F_{ij}e_i=F_{ki}e_k)

Setting \mathcal B as the initial (old) basis and writing the current (new) basis \tilde {\mathcal B} in terms of \mathcal B is referred to as forward transform denoted by F_{ij}. Relatively, B_{ij} is called backward transform.

The relation between that forward and backward transforms is obtained as follows,

    \[\begin{split}e_i &= B_{ji} \tilde e_j=B_{ji}F_{kj}e_k\\&\implies B_{ji}F_{kj}=\delta_{ik}\\&\therefore [F]=[B]^{-1} \ , [B] = [F]^{-1}\end{split}\]

We now find how vector coordinates are transformed relative to different bases. A particular v\in V can be expressed by its components according to any of \mathcal B or \tilde{\mathcal B} basis, therefore,

    \[v=v_ie_i=\tilde v_i \tilde e_i\]

To find the relation between [v]_{\tilde{\mathcal B}} and [v]_{\mathcal B} we write,

    \[\begin{split}v&=v_ie_i=\tilde v_i\tilde e_i \implies v_ie_i = v_i B_{ji} \tilde e_j\equiv C_j\tilde e_j\implies C_j=\tilde v_j\\&\therefore \tilde v_i = B_{ij}v_j\equiv [B][v]_{\mathcal B}\\&\implies v_i = F_{ij}\tilde v_j\equiv [F][v]_{\tilde {\mathcal B}}\end{split}\]

As it can be observed, the old basis to new basis is transformed by the forward transform F_{ij} while the old coordinates v_i are transformed to the new ones, \tilde v_i, by the backward transform B_{ij}. Because the coordinates of v behave contrary to the basis vectors in transformation, the coordinates or the scalar components are said to be contravariant. A vector can be called a contravariant object because its scalar components (coordinates) transforms differently from the basis vectors whose their linearly combination equals to the vector. Briefly,

Proposition: Let v=v_ie_i. Then, the scalar components/coordinates v_i are transformed by B_{ij} if and only if the basis vectors e_i are transformed by F_{ij}, such that B_{ji}F_{kj}=\delta_{ik}.

Later, a vector is called a contravariant tensor. For the sake of notation and to distinguish between the transformations of the basis and the coordinates of a vector, in index of a coordinate is written as superscript to show it is contravariant. Therefore,

    \[v = v^ie_i=\tilde v^i\tilde e_i\]

Linear maps and linear functionals

Definition: \mathcal L(V, W) is defined as the space of all linear maps V\to W where the domain and codomain are vectors spaces.

It can be proved that \mathcal L is a vector space (\mathcal L, +, \cdot), hence, for T_1, T_2\in \mathcal L(V,W) and \alpha\in \mathbb R

    \[\alpha\cdot T_1=\alpha T_1\quad , (T_1+T_2)()=T_1()+T_2()\]

Note that the addition on the LHS is an operator in \mathcal L and the addition on the RHS is an operator in W.

Proposition 1: Let T\in \mathcal L(V, W), i.e a linear map from a vector space V to another one W. If \mathcal B=\{e_1, \cdots, e_n\} is a basis for V, and T(e_i)=w_i for w_i\in W and i=1,\cdots , n, then T is uniquely defined over V.

This proposition says a linear map over a space is uniquely determined by its action on the basis vectors of that space. In other words, if T(e_i)=w_i and T^*(e_i)=w_i then \forall v\in V, \ T(v)=T^*(v). proof: let T(e_i)=w_i (given by the nature of T), then for v\in V such that v=v^ie_i, we can write v^iT(e_i)=v^iw_i, therefore, T(v^ie_i)=T(v)=v^iw_i. Because, v^i‘s are unique for (a particular) v then v^iw_i is unique for v and hence T(v) must be unique for any v\in V. In other word, there is only one unique T over V such that T(e_i)=w_i.

As a side remark, if \mathcal B=\{e_1, \cdots, e_n\} is a basis for V, hence spanning V, then \{T(e_i)| i=1,\cdots n\} spans the range of T; The range of T is a subset of W.

By this proposition, a matrix completely determining a linear can be obtained for the linear map. let V be n-dimensional with a basis \mathcal B=\{e_i\}_1^n, and W be m-dimensional with a basis \mathcal B'=\{e_i'\}_1^m. Then there are coefficients T_i^j such that,

    \[T(e_i)= T_i^j e_j'\]

In the notation T_i^j, the index j is superscript because for a fixed e_i and hence a fixed i, the term T_i^j is the coordinate of T(e_i)\in W and it is a contravariant (e.g T(e_3)=T_3^j e_j'\equiv v^je_j').

For v\in V, and w=T(v), with the coordinates [v]_{\mathcal B} and [w]_{\mathcal B'}, we write,

    \[\begin{split}w=T(v^ie_i)&=v^iT(e_i)=v^iT_i^je_j' \\&\implies w^j = v^iT_i^j \equiv  T_i^j v^i \end{split}\]

This expression can be written as a matrix multiplication of [w]_{\mathcal B'}=[M][v]_{\mathcal B}, where [T]:=\mathcal M(T)\in \mathbb M^{m\times n} presented by its elements as,

    \[\begin{bmatrix} T_1^1 && T_2^1 && \cdots && T_n^1 \\T_1^2 && T_2^2 && \cdots && T_n^2 \\\vdots && \vdots && \cdots && \vdots\\T_1^m && T_2^m && \cdots && T_n^m \end{bmatrix}\]

As a remark, above can be viewed as columns of the matrix and written as,

    \[[T]=\begin{bmatrix} [T(e_1)]_{\mathcal B'} && [T(e_2)]_{\mathcal B'} && \cdots && [T(e_n)]_{\mathcal B'} \end{bmatrix}\]

Linear functional (linear form or covector)

Definition: a linear functional on V is a linear map f\in V^* :=\mathcal L(V,\mathbb F). The space V^* is called the dual space of V.

Proposition: Let \mathcal B=\{e_1, \cdots, e_n\} and \varepsilon_i \in V^* be defined as \varepsilon_i(e_j):=\delta_{ij}. Then, \{\varepsilon_i\}_1^n called dual basis of \mathcal B, is a basis of V^*, and hence \dim V = \dim V^*.

Proof: first we show that \varepsilon_i‘s are linearly independent, i.e. c_i\varepsilon_i=0 \implies c_i=0 \forall i=1, \cdots, n. Note that on the RHS, 0\in V^*. For a v\in V we can write c_i\varepsilon_i(v) and assume c_i\varepsilon_i(v)=0. Then,

    \[c_i\varepsilon_i(v)=c_i\varepsilon_i(v^je_j)=0\implies c_iv^j\varepsilon_i(e_j)=0\implies c_iv^j\delta_{ij}=0\implies c_iv_i=0\]

Since v is arbitrary, c_i=0 ■ .

Now we prove that \{\varepsilon_i\}_1^n spans V^*. I.e \forall f \in V^* \exists \{c_1, \cdots, c_n\} such that f=c_i\varepsilon_i. To this end, we apply both sides to a basis vector of V and write f(e_j)=c_i\varepsilon_i(e_j) which implies f(e_j)=c_j or explicitly c_j is found as c_j=f(e_j). Consequently, f=f(e_i)\varepsilon_i ■.

Consider V and \mathcal B. If f\in V^*, then the matrix of the linear functional/map f is

    \[[M]=\mathcal M(f)=\begin{bmatrix} f(e_1) && \cdots && f(e_n)\end{bmatrix}\in \mathbb M^{1\times n}\]

So, for v\in V as v=v^ie_i we can write,

    \[f(v)=[M][v]_\mathcal B\quad \in \mathbb R\]

Result: if the coordinates of a vector is shown by a column vector or single-column matrix (which is a vector in the space of \mathbb M^{n\times 1}), then a row vector or a single-row matrix represents the matrix of a linear functional.

Definition: a linear functional f\in V^*, which can be identified with a row vector as its matrix, is also called a covector.

Like vectors, a covector (and any linear map) is a mathematical object that is independent of a basis (i.e. invariant). The geometric representation of a vector in (or by an isomorphism in) \mathbb R^3 is an arrow in \mathbb E^3. For a covector isomorphic to \mathbb R^2, the representation is a set (stack) of planes in \mathbb E^3 that can be represented by iso lines in \mathbb E^2. A covector that is isomorphic to \mathbb R^3 can be represented by iso surfaces in \mathbb E^3.

Example: Let \mathcal B = \{e_1, e_2\} be a basis of V and [2,1] be the matrix of a covector f in some V^*. Then, if [x]_{\mathcal B} = [x_1,x_2]^{\rm T}, we can write,

    \[y=[2,1]\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \implies y = 2x_1 + x_2\]

which, for different values of y, is a set of (iso) lines in a Cartesian CS defined by two axes x_1 and x_2 along e_{1g} and e_{2g} that are the geometric representations of e_1 and e_2. The Cartesian axes are not necessarily orthogonal.

If we chose any other basis \tilde {\mathcal B} = \{\tilde e_1, \tilde e_2\} for V, then the matrix of the covector f changes. Also, the geometric representations of \{\tilde e_1, \tilde e_2\} are different from e_{1g} and e_{2g} and hence the geometric representation of the covector stays the same shape.

Example: Let \mathcal B = \{e_1, e_2\} be a basis of V and \mathcal B^* = \{\varepsilon_1, \varepsilon_2\} be a basis for V^*. This means \varepsilon_i\in V^* and \varepsilon_i(e_j):=\delta_{ij}. Then, the matrix of each dual basis vector is as,

    \[\mathcal M(\varepsilon_i)=\begin{bmatrix}\varepsilon_i(e_1) && \varepsilon_i(e_2)\end{bmatrix} = \begin{bmatrix}\delta_{i1} && \delta_{i2}\end{bmatrix}\]

Change of basis for covectors

Let \mathcal B = \{e_i\}_1^n and \tilde{\mathcal B} = \{\tilde e_i\}_1^n be two bases for V, and hence, \mathcal B^* = \{\varepsilon_i\}_1^n and \tilde{\mathcal B^*} = \{\tilde \varepsilon_i\}_1^n be two bases for V^*. Each dual basis vector \tilde \epsilon_i can be written in terms of the (old) dual basis vectors by using a linear transformation as \tilde \varepsilon_i = Q_{ij}\varepsilon_j. Now, the coefficients Q_{ij} are to be determined as follows,

    \[\begin{split}\tilde \varepsilon_i(e_k) &= Q_{ij}\varepsilon_j(e_k)=Q_{ij}\delta_{jk}=Q_{ik}\\&\implies \tilde \varepsilon_i(e_k) = Q_{ik}\\&\therefore Q_{ij}=\tilde \varepsilon_i(e_j)\end{split}\]

Using the formula e_i=B_{ji}\tilde e_j​​​ regarding the change of basis of vectors, the above continues as,

    \[\begin{split}Q_{ij}&=\tilde \varepsilon_i(e_j)=\tilde \varepsilon_i(B_{kj}\tilde e_k)\\&\text{by linearity of covectors}= B_{kj}\tilde \varepsilon_i(\tilde e_k)=B_{kj}\delta_{ik}=B_{ij}\\\therefore Q_{ij}&=B_{ij}\end{split}\]

This indicates that the dual basis are transformed by the backward transformation. Referring to the index convention, we use superscript for components that are transformed trough a backward transformation. Therefore,

    \[\tilde \varepsilon^i=B_{ij}\varepsilon^j\]

meaning that dual basis vectors are contravariant because they behave contrary to the basis vectors in transformation from e_i to \tilde e_i.

Now let f\in V^*. Writing f=c_i\varepsilon^i=\tilde c_j \tilde \varepsilon^j and using the above relation, we get,

    \[\begin{split}c_i\varepsilon^i&=\tilde c_j \tilde \varepsilon^j \implies c_i F_{ij}\tilde \varepsilon^j=\tilde c_j \tilde \varepsilon^j\\\tilde c_j &= F_{ij}c_i\end{split}\]

meaning that they are transforming in a covariant manner when the basis of the vector space changes from e_i to \tilde e_i.

Briefly the following relations have been shown.

Basis and change of basis for the space of linear maps \mathcal L(V, W)

As can be proved \mathcal L(V,W) is a linear vector space and any linear map is a vector. Therefore, we should be able to find a basis for this space. If V is n-dimensional and W is m-dimensional, the \mathcal L(V,W) is mn-dimensional and hence its basis should have m\times n vectors, i.e. linear maps. Let’s enumerate the basis vectors of \mathcal L as \varphi_{ij}\in \mathcal L (V,W) for i=1, \cdots , m and j=1, \cdots , n, then any linear map T can be written as,

    \[ T = c_{ij}\varphi_{ij}\]

By proposition 1, any linear map is uniquely determined by its action on the basis vectors of its codomain. If \mathcal B = \{e_i\}_1^n be a basis for V, then for any basis vector e_k,


Setting a basis for W as \mathcal B' = \{e_i'\}_1^m, the above equation becomes,


Note that is are dummy indices. This equation holds if,

    \[\begin{matrix}c_{ij}=a_{ik} \text{ and }  \varphi_{ij}(e_k)=e_i' && \text{ if } k=j \\c_{ij} = 0 \text{ and } \varphi_{ij}(e_k)=0 && \text{ if } k\ne j\end{matrix}\]

Therefore, we can choose a set of m\times n basis vectors \varphi_{ij} for \mathcal L (V,W) as,

    \[\varphi_{ij}(e_k) =  \begin{cases}e_i'\text{ if }k=j\\0\text{ if } k\neq  j\end{cases}\]

By recruiting the basis of V^*, the above can be written as,

    \[\{\varphi_i^j}=e_i'\varepsilon^j | i=1,\cdots, m \text{ and } j=1,\cdots, n\}\]

The term e_i'\varepsilon^j is obviously a linear map V\to W. It can be readily shown that c_{ij}\varphi_i^j=c_{ij}e_i'\varepsilon^j being a linear combinations of the derived basis vectors is linearly independent, i.e. c_{ij}e'_i\varepsilon^j(v)=0(v) for any v\in V (here, note that 0\in \mathcal L).

So, a linear map T can be written as a linear combination T = c_{ij}e_i'\varepsilon^j. Here, it is necessary to use the index level convention. To this end, we observe that for a fixed i the term c_{ij} couples with \varepsilon^j and represents the coordinates of a covector. As coordinates of a covector are covariant, index j is written as subscript. For a fixed j though, the term c_{ij} couples with e_i' and represents the coordinates of a vector. As coordinates of a vector are contravariant, index i should rise. Therefore, we write,

    \[T = c_j^ie_i'\varepsilon^j\]

The coefficients c_j^i can be determined as,

    \[\forall e_k\in \mathcal B\quad T(e_k) = c_j^ie_i'\varepsilon^j(e_k)=c_k^ie_i'\]

Stating that c_k^i are the coordinates of T(e_k) with respect to the basis of W, i.e. \mathcal B'. Comparing with what was derived as T(e_k)=T_k^ie_i', we can conclude that c_k^i = T_k^i. Therefore,

    \[T=T_j^i e_i'\varepsilon^j\]

The above result can also be derived from w^i = T_j^iv^j as follows.

    \[\begin{split} w^i &= T_j^iv^j \implies w = (T_j^iv^j)e_i' = T_j^i\varepsilon_j(v)e_i'\\&\therefore T(v) = T_j^i\varepsilon_j(v)e_i' \text{ or } T = T_j^i e_i'\varepsilon^j \end{split}\]

Change of basis of \mathcal L(V,W) is as follows.

For \mathcal L(V,W), let \mathcal B=\{e_i\}_1^n and \tilde {\mathcal B}=\{\tilde e_i\}_1^n be bases for V, and \mathcal B'=\{e_i'\}_1^m and \tilde {\mathcal  B}'=\{\tilde e_i'\}_1^m be bases for W. Also, \mathcal B^*=\{\varepsilon^i\}_1^n and \tilde {\mathcal B}^*=\{\tilde \varepsilon^i\}_1^n are corresponding bases of V^*. Forward and backward transformation pairs in V and W are denoted as (F_{ij}, B_{ij}) and (F'_{ij}, B'_{ij}).

    \[\begin{split} T&=T_j^ie_i'\varepsilon^j = \tilde {T_j^i} \tilde e_i'\tilde \varepsilon^j \\&\implies T_j^ie_i'\varepsilon^j = \tilde {T_j^i} F'_{ki} e_k' B_{js}\varepsilon^s \implies T_s^k = \tilde {T_j^i} F'_{ki} B_{js}\\&(\text{ by } B_{nl}F_{lm}=\delta_{nm}) \ \implies B'_{lk}F_{sr}T_s^k = \tilde {T_j^i} \delta_{li}\delta_{jr}=\tilde T_r^l\\&\therefore \tilde {T_j^i} = B'_{ik}F_{sj}T_s^k\end {split}\]

Note that the coordinates T_s^k of a linear map need two transformations such that the covariant index s of T_s^k pertains to the forward transformation and the contravariant index k pertains to the backward transformation.

Example: let T\in \mathcal L(V,V), then,

    \[T = T_i^je_j\varepsilon^i \quad \tilde {T_s^t} = B_{tj}F_{is}T_i^j\]

If the matrices, [F], [F]^{-1}=[B], and [T] are considered, we can write,

    \[ [\tilde T] =  [F]^{-1}[T][F]\]

Bilinear forms

A bilinear form is a bilinear map defined as T:V\times V\to \mathbb R. Setting a basis for V, a bilinear form can be represented by matrix multiplications on the coordinates of the input vectors. If \{e_i\}_1^n is a basis for V, then

    \[\begin{split} T(u,v)&=T(u^ie_i,v^je_j)=u^iv^jT(e_i, e_j)\\&\implies B(u,v)=u^iv^jT_{ij} \quad \text{ with } T_{ij}:=T(e_i, e_j) \end{split}\]

which can be written as,

    \[[u]^{\rm T}[T][v]\]

where [T]\in\mathbb M^{n\times n} with [T]_{ij}=T_{ij}.

The expression u^iv^jT(e_i, e_j) indicates that a bilinear form is uniquely defined by its action on the basis vectors. This is the same as what was shown for linear maps by proposition 1. This comes from the fact that a bilinear form is a linear map with respect to one of its arguments at a time.

Now we seek a basis for the space of bilinear forms, i.e. \mathcal L_b(V\times V, \mathbb R). This is a vector space with the following defined operations.

    \[\begin{split}\forall B_1, B_2 \in \mathcal B\quad (B_1+B_2)(u,v) &= B_1(u,v) + B_2(u,v)\\\forall \alpha\in \mathbb R\quad \alpha B(u,v) &= B(\alpha u, v)= B(u, \alpha v)\end{split}\]

The dimension of this space is n\times n, therefore, for any bilinear form T there are bilinear forms \rho_{ij}\in \mathcal B_b such that,


From the result T(u,v)=u^iv^jT(e_i, e_j)=u^iv^jT_{i,j} we can conclude that

    \[\begin{split}T(u,v)&=u^iv^jT_{ij}= \varepsilon^i(u)\varepsilon^j(v)T_{ij}\implies T(.,.) = T_{ij}\varepsilon^i(.)\varepsilon^j(.)\\\therefore c_{ij}&=T_{ij}, \quad \rho_{ij}= \varepsilon^i\varepsilon^j\quad \text {or } \rho_{ij}(e_s,e_t)=\begin{cases} 1& s=i \text { and } t=j\ 0& \text {otherwise}\end{cases}\end{split}\]

Following the index level convention, the indices of T_{ij} should stay as subscripts because each index pertains to the covariant coordinates of a covector after fixing the other index.

If \mathcal B and \tilde {\mathcal B} are two bases for V, then the change of basis of the space of bilinear forms are as follows.

    \[\begin{split} T&=T_{ij}\varepsilon^i\varepsilon^j = \tilde T_{ij} \tilde \varepsilon^i\tilde \varepsilon^j = \tilde T_{ij}B_{is}\varepsilon^sB_{jt}\varepsilon^t\\&\implies T_{st}=\tilde T_{ij}B_{is}B_{jt}\\&\therefore \tilde T_{kl} = F_{sk}F_{tl}T_{st}\end {split}\]

Example: the metric bilinear map (metric tensor)

Dot/inner product on the vector space V over \mathbb R is defined as a bilinear map \langle \cdot , \cdot \rangle : V\times V \to \mathbb R such that, \langle u , v \rangle = \langle v , u \rangle and v\ne 0 \iff \langle v , v \rangle > 0. With this regard two objects (that can have geometric interpretations for Euclidean spaces) are defined as,

1- Length of a vector \|v\|^2:=\langle v,v\rangle
2- Angle between two vectors \cos \theta :=\langle u/\|u\|,v/|v\|\rangle

Let see how the dot product is expressed through the coordinates of vectors. With \{e_i\}_1^n being a basis for V, we can write,

    \[u\cdot v :=g\langle u, v \rangle = u^iv^jg_{ij} \quad \text {s.t}\quad g_{ij}=e_i\cdot e_j\]

The term g_{ij} is called the metric tensor and its components can be presented by an n-by-n matrix as [g]_{ij}=e_i\cdot e_j.

If the basis is an orthonormal basis, i.e. e_i\cdot e_j=0 \forall i\ne j, then g_{ij}=\delta_{ij} and [g] is the identity matrix. Therefore, v\cdot u= u^iv^i and \|v\|^2 = v^iv^i.

Multilinear forms

A general multilinear form is a multilinear map defined as T:V_1\times V_2\times \cdots \times V_n\to \mathbb R, where V_i is a vector space. Particularly setting V_i=V leads to a simpler multilinear form as T:V\times V\times \cdots \times V\to \mathbb R.

Following the same steps as shown for a bilinear map, a multilinear form T:V\times V\times \cdots \times V\to \mathbb R can be written as,

    \[\begin{split} T(u,v,\cdots , z)&=T(u^ie_i,v^je_j, \cdots, z^ke_k)= u^iv^j\cdots z^k T(e_i,e_j, \cdots, e_k)\\&\implies T(u,v,\cdots , z)=u^iv^j\cdots z^kT_{ij\cdots k} \quad \text{with}\quad T_{ij\cdots k}:=T(e_i, e_j, \cdots e_k) \end{split}\]

which implies,

    \[T = T_{ij\cdots k}\varepsilon^i \varepsilon^j\cdots \varepsilon^k\]

showing that a multilinear form can be written as a linear combination of covectors.

Multilinear maps

A multilinear map is a map T:V_1\times \cdots \times V_n \to W. A multilinear map can be written in terms of vector and covector basis. For example, consider T:V\times V \to W as T(u,v)=w with \{e_i\}_1^n and \{e'_i\}_1^m being bases for V and W. We can write,

    \[ T(u,v)=T(u^ie_i,v^je_j)=u^iv^jT(e_i,e_j)\]

Because for each i and j we have T(e_i,e_j)\in W, we can write,

    \[ T(e_i,e_j)=T_{ij}^ke'_k\]

We write the indices i and j in T_{ij}^k as subscripts in accordance with their position on the LHS; however, we’ll see that T_{ij}^k is a coordinate of a covector for each i or j when k is fixed. Combining above, we can write,

    \[\begin{split}T(u,v)&=u^iv^jT(e_i,e_j)=u^iv^jT_{ij}^ke'_k=\varepsilon^i(u)\varepsilon^j(v)T_{ij}^ke'_k\\&\therefore T = \varepsilon^i\varepsilon^jT_{ij}^ke'_k\end{split}\]

The term T_{ij}^k collects n\times n \times m coefficients and uniqully defines the multilinear map. We can imagine this term as a 3-dimensional array/matrix. Above also shows that the multilinear map can be written as a linear combination of basis covectors and basis vectors.

Definition of a tensor

Defining the following terms,

  • Vector space V and basis \{e_i\}_1^n and another basis \{\tilde e_i\}_1^n.
  • Basis transformation as \tilde e_j=F_{ij}e_{i}, and therefore e_j=B_{ij}\tilde e_{i}.
  • The dual vector space of V as V^*.
  • Vector space V' and basis \{e_i'\}_1^m and another basis \{\tilde e_i'\}_1^m.
  • Basis transformation as \tilde e_j'=F'_{ij}e'_i, and therefore e_j'=B'_{ij}\tilde e'_i
  • Linear map T\in \mathcal L(V,V').
  • Bilinear form \mathfrak B \in \mathcal L(V,V; \mathbb R).

we concluded that,

    \[\begin{split}v=v^ie_i=\tilde v^i\tilde e_i &\implies \tilde v^i = B_{ij}v^j\quad \text{contravariantly}\\\varepsilon^i, \tilde \varepsilon^i \in V^*, \varepsilon^i(e_j)=\tilde \varepsilon^i(\tilde e_j)=\delta_{ij} &\implies \tilde \varepsilon^i=B_{ij}\varepsilon^j \quad \text{contravariantly}\\f=c_i\varepsilon^i=\tilde c_j \tilde \varepsilon^j&\implies \tilde c_j = F_{ij}c_{i}\quad \text{covariantly}\\T= T_j^ie_i'\varepsilon^j = \tilde {T_j^i} \tilde e_i'\tilde \varepsilon^j &\implies \tilde {T_j^i} = B'_{il}F_{kj}T_k^l\quad \text{contravariant- and covariantly}\\\mathfrak B=\mathfrak B_{ij}\varepsilon^i\varepsilon^j = \tilde {\mathfrak B}_{ij} \tilde \varepsilon^i\tilde \varepsilon^j &\implies \tilde {\mathfrak B}_{ij} = F_{ki}F_{lj}{\mathfrak B}_{kl} \quad \text{covariantly and covariantly}\end{split}\]

It is observed that if a vector v\in V is written in terms a single sum/linear combination of basis vectors of V, then the components of the vectors change contravariantely with respect to a change of basis. Then, the covectors are considered and it is observed that their components change covariently upon change of basis of V^* or V. A linear map can be written as a linear combination of vectors and covectors. The coefficients of this combination is seen to change both contra- and covariantely when the bases (of V and V') change. A bilinear form though can be written in terms of a linear combination of covectors. The corresponding coefficients change covariantly with change of basis. These results can be generalized toward an abstract definition of a mathematical object called a tensor. There are two following approaches for algebraically denfining a tensor.

Tensor as a multilinear form

Motivated by how linear maps, bilinear forms, and multilinear forms and maps can be written by combining basis vectors and covectors, a generalized combination of these vectors can considered. For example,

    \[\mathcal T:=\mathcal T^{ij}_k^l_s^t e_i e_j\varepsilon^k e_l\varepsilon^s e_t\]

This object \mathcal T consists of a linear combination of a unified (merged) set of basis vector and covectors e_i e_j\varepsilon^k e_l\varepsilon^s e_t (of V and V^*) by scalar coefficients \mathcal T^{ij}_k^l_s^t. According to the type of the basis vectors, the indices become sub- or superscript, and hence it determines the type of the transformation regarding that index. By reordering the basis vectors and covectors, we can write,

    \[\mathcal T:=\mathcal T_{ks}^{ijlt} \varepsilon^k \varepsilon^s e_i e_j e_l  e_t\]

Recalling that vector components can be written as v^i=\varepsilon^i (v) implines that there is a map T_{(v)} for a particular vector v such that,

    \[v^i=T_{(v)}(\varepsilon^i) \quad\text{s.t}\quad T_{(v)}:V^* \to \mathbb R\]

And also the components of a covector f\in V^* determined as,

    \[f_i=f(e_i)\equiv T_{(f)}(e_i)\]

motivates defining \mathcal T_{ks}^{ijlt} as the array (collection of the coefficients) of a multilinear form,

    \[T:V\times V\times V^*\times V^*\times V^*\times V^*\to \mathbb R \iff \mathcal T_{ks}^{ijlt}\equiv T_{ks}^{ijlt}=T(e_k,e_s,\varepsilon^i,\varepsilon^j,\varepsilon^l,\varepsilon^t)\]

Therefore, the object or the multi-dimensional array \mathcal T_{ks}^{ijlt} which is dependent on chosen bases of V and V^* and its transformation rules are based on the types of the beases (or indices) can be intrinsically related to an underlying multilineat map T. By virtue of this observation, an object called a tensor is defined as the following.

Definition: A tensor of type (r,s) on a vector space V over a field \mathbb R (or \mathbb C) is a multilinear map as,

    \[\mathcal T: \underbrace{V\times \cdots\times V}_{\text{r times}}\times \underbrace{V^*\times \cdots\times V^*}_{\text{s times}} \to \mathbb R \text{ or } \mathbb C \]

The coordinates or the (scalar) components of a tensor \mathcal T can then be determined once a basis \{e_i\}_1^n for V and a basis \{\varepsilon^i\}_1^n for V^* are fixed. Therefore,

    \[\mathcal T_{i_1\cdots i_r}^{j_1\cdots j_s}\equiv \mathcal T (e_{i_1}, \cdots, e_{i_r}, \varepsilon^{j_1}, \cdots, \varepsilon^{j_s})\]

Note that r is the number of covariant indices and s is the number of contravariant indices. A tensor of type (r,s) can be imagined as an r+s-dimensional array of data containing (\text {dim} V)^{r+s} elements. Each index corresponds to a dimention of the data array.

By this definition, a vector v=v^ie_i is a (0,1) tensor as it can be viewed as,

    \[\begin{split}v&=v^ie_i = \varepsilon^i(v)e_i \implies v^i = \varepsilon^i(v)\\\mathcal T^i&:=v^i\implies \exists \mathcal T:V^*\to \mathbb R\quad \text {s.t}\quad \mathcal T(\varepsilon^i) = v^i\end{split}\]

This implies that for each v\in V there is a (multilinear with one input) map \mathcal T receiving a basis covector and returning the corresponding scalar component of the vector (\mathcal T:V^*\to \mathbb R). This corresponding map is unique for each vector and it is called a tensor.

A covector (dual vector or a linear form) f\in V^* is a (1,0) tensor because a covector f is a linear map \mathcal T:V\to \mathbb R

َ A linear map T:V\toV (or V\to W) is a (1,1) tensor because the term T_j^i in T=T_j^ie_i\varepsilon^j pertains to a covector’s components for a fixed i and to a vector’s components for a fixed j. Therefore,

    \[T_j^i=\mathcal T_j^i=\mathcal T(e_j, \varepsilon^i)\]

in other words, a linear map is a tensor viewed as a multilinear map \mathcal T: V\times V^* \to \mahbb R. Here, the multilibear form can be considered as \mathcal T(v,f) = f\circ T(v) by which gives T_j^i=\mathcal T_j^i=\mathcal(e_j, \varepsilon^i). Note that T(e_i) returns a vector and f being \varepsilon^j extract its j-th coordinate, and hencem the array/matrix of the linear map is retrieved.

A bilinear form \mathfrak B=\mathfrak B_{ij}\varepsilon^i\varepsilon^j is then a (2,0) tensor, where \mathfrak B_{ij}=\mathfrak B(e_i,e_j)

A multilinear form T = T_{ij}^k \varepsilon^i\varepsilon^j e'_k is a (2,1) tensor where T_{ij}^k=\mathcal T_{ij}^k=\mathcal T(e_i,e_j,\varepsilon^k) where \mathcal T:V\times V \times V^* \to \mathbb  R can be considered as \mathcal T (u,v,f)= f\circ T(u,v).

As an example, the cross product of two vectors u,v\in \mathbb R^3 defined as w=u\times v is a multilinear map \mathbb R^3 \times \mathbb R^3 \to \mathbb R^3 is a (2,1) tensor.

By convension, scalars are (0,0) tensors.

Remark: for a tensor \mathcal T_{i_1\cdots i_r}^{j_1\cdots j_s} we can write,

    \[\mathcal T = \mathcal T_{i_1\cdots i_r}^{j_1\cdots j_s} \varepsilon^{i_1}\cdots \varepsilon^{i_r} e_{j_1}\cdots e_{j_s} = \mathcal T (e_{i_1}, \cdots, e_{i_r}, \varepsilon^{j_1}, \cdots, \varepsilon^{j_s})\varepsilon^{i_1}\cdots \varepsilon^{i_r} e_{j_1}\cdots e_{j_s}\]

Example: Stress tensor. The Cauchy stress tensor in mechanics is a linear map and hence a (1,1) tensor.

Rank of a tensor

The rank of a (r,s)-type tensor is defined as r+s. In this regard, tensors of different types can have the same rank. For example tensors of types (1,1), (2,0), (0,2) have the same rank being 2. Here we compare these tensors with eachother.

A (1,1) tensor representing a linear map is \mathcal T_j^i where \mathcal T_j^i=\mathcal T(e_j, \varepsilon^i) with \mathcal T:V\times V^* \to \mathbb R.

A (2,0) tensor representing a bilinear form is \mathcal T_{ij} where \mathcal T_{ij}=\mathcal T(e_i, e_j) with \mathcal T:V\times V \to \mathbb R.

A (0,2) tensor is \mathcal T^{ij} where \mathcal T^{ij}=\mathcal T(\varepsilon^i, \varepsilon^j) with \mathcal T:V^*\times V^* \to \mathbb R.

The coeffieints of the above tensor are collected in 2-dimensional array/matrix; however, they follow different transformation rules based on their types.

PCA: Introduction

Different kind of inter-correlated data can be collected into a random vector. For example we would consider the total length, weight, diameter of nose holes, tail length, number of teeth, colour of eyes, the radius of acetabula, number of vertebrae of dinosaurs as data and collect them into a (random) vector. The principal component analysis (PCA) is a method for reducing the dimension of a random vector (data), and also for discovering outstanding features of data. Using PCA, one can replace a random vector with another vector having a smaller dimension without losing much information. In PCA, the measure of information loss is the total variation of the original variables collected in a random vector.

See this cheat sheet for the mathematical identities used in this post.

Least-squares optimality of PCA

Assume random variables (X_i) collected in a random r-vector X=[X_1,...,X_r]^T , i.e. X\in \mathbb{R}^{r\times 1}. It has a mean \mu_X \in  \mathbb{R}^{r\times 1} and a covariance matrix \Sigma_{XX} \in  \mathbb{R}^{r\times r}. A linear projection of the random variables means:

\xi=BX\text{ s.t. } \xi \in \mathbb{R}^{t\times 1}\text{ and }B \in \mathbb{R}^{t\times r} \\ t\le r

Each rows of B linearly combines the random variables to create a new random variable.

Let’s project the r-vector X onto a lower dimensional space and obtain the t-vector \xi with t\lt r. We want to find a linear map to reconstruct X out of \xi with lower dimension. This means finding a vector \mu \in  {R}^{r\times 1} and a matrix A \in \mathbb{R}^{r\times t} such that  X\approx \mu + A\xi. The approximation is in the least-squares sense of minimizing the mean error. We consider the following as the measure of how well the linear map can reconstruct X out of its linear projection \xi:

\text E\big [(X-\mu – A\xi)^T(X-\mu – A\xi)\big ]\ \in \mathbb{R}^{+}

Therefore, we should find \mu, A, and B by minimizing the following mean:

\text E\big [(X-\mu – ABX)^T(X-\mu – ABX)\big ]\ \in \mathbb{R}^+\tag{1}

This expression is the mean of the squared norm/magnitude of the error vector which itself is a random vevtor.

With the assumption that both A and B are of full rank t, the minimizing parameters will be (see the proof here):

A=\begin{bmatrix}|&|& \dots & |\\v_1 &v_2 &\dots &v_t\\|&|& \dots & | \end{bmatrix}=B^\text T\\ \ \\ \mu=(I_r-AB)\mu_X \tag{2}

where v_i \in \mathbb{R}^{r \times 1} is a normalized eigenvector of the covariance matrix \Sigma_{XX} and it is associated with the i-th largest eigenvalue, \lambda_i, of \Sigma_{XX}. The eigenvectors are orthonormal to each other, i.e. v_i^\text{T}v_j=\delta_{ij} (\delta_{ij} is the Kronecker delta). I_r is the identity matrix.

From the structures of the matrices A and B, it can be observed that they have rank t. Therefore, the best rank-t approximation of the original random vector X is \tilde X:

X\approx \tilde X= \mu_x+C(X-\mu_x)\qquad \text{s.t }C=AB=AA^\text T \in \mathbb{R}^{r\times r}\tag{3a}


\tilde X= \mu_x+\sum_{i=1}^{t}(v_iv_i^\text T)(X_c)\qquad \text{s.t }X_c=X-\mu_x\tag{3b}

Remark: \text E\big[X\big]=\text E\big[\tilde X\big]=\mu_x, and \text{Var}(\tilde X)=C\Sigma_{XX}C^\text T= C\text{Var}(X) C^\text T \in \mathbb{R}^{r\times r}.

The covariance matrix \Sigma_{XX} \in \mathbb{R}^{r\times r} is real and symmetric, therefore it has a spectral decomposition as \Sigma_{XX}=V \varLambda V^\text T s.t V \in \mathbb{R}^{r\times r} is an orthogonal matrix (a matrix with ortho-normal columns/rows) with columns as normalized eigenvectors of \Sigma_{XX} , and  \varLambda is a diagonal matrix having the eigenvalues, \lambda_i, of \Sigma_{XX} . The eigenvectors in V are sorted such that the j-th column of V is associated with the j-th largest eigenvalue. Having said that, we observe that the best rank-t approximation of the original random vector only uses the first t eigenvectors of random vector’s covariance matrix.

The minimum of Eq. 1 is:

\sum_{i=t+1}^{r} \lambda_i \tag{4}

Principal components (PC’s) of X: the first t principal components (also known as Karhunen-Loeve transformation) of X are defined by the following linear projections:

\xi^{(t)}=A^{\text T}X\ \in \mathbb{R}^{t\times 1} \implies\xi_i^{(t)}=v_i^{\text T}X\qquad i=1,\dots ,t\le r\tag{5}

The superscript (t) is used to indicate the dimension of the PC vector, being equal to the number of PC’s.

Note that PC’s are themselves random variables as they are linear combination of the data random variables.

PC’s of X are uncorrelated: the covariance between each pair of the principal components is:

\text {Cov}(\xi_i,\xi_j)=\text {Cov}(v_i^{\text T}X,v_j^{\text T}X)= v_i^{\text T}\Sigma_{XX}v_j=\lambda_j v_i^{\text T}v_j=\delta_{ij}\lambda_j\\ \tag{6}

Interestingly \text {Cov}(\xi_i,\xi_j)=0\ \text{ for } i\neq j. This means that principal components of X are uncorrelated. Moreover, \text {Cov}(\xi_i,\xi_i)= \text {Var}(\xi_i)\lt  \text {Var}(\xi_j)\iff  i\lt j.

A goodness-of-fit measure: equation 4 maps the r-dimensional data (X) into a lower t-dimensional data \xi. The total variance of the original data, i.e. X_i collected in X is \sum_{i=1}^{r}\text{Var}(X_i)=\text{Trace}(\Sigma_{XX})= \sum_{i=1}^{r}\lambda_i. Based on this, a measure on how well the first t principal components (in the lower r-D space) represent the r original data variables is considered as (also note Eq. 4 and 6):

\frac{\sum_{i=t+1}^{r} \lambda_i}{\sum_{i=1}^{r} \lambda_i}=\frac{\text{total variance of }X_i\text{\textquoteright s}-\text{total variance of }\xi_i\text{\textquoteright s}}{\text{total variance of }X_i\text{\textquoteright s}}\tag{7}

This implies that the measure is small if the first t principal components contain a large part of the total variation in X; hence, better approximation.

Remark: if t=r, Eq. 7 becomes zero and it means perfect match, i.e no approximation. In other words, t=r means A=V (V as in the SVD of \Sigma_{XX}= V \varLambda V^\text T), then because V is unitary, C=VV^\text T=I_r, hence (Eq. 3),  \tilde X= \mu_x+I(X-\mu_x)=X.

PCA as a variance-maximization method

The main aim of PCA is to reduce the dimension of data. The way a dinosaur would do a sort of reduction on a data vector is keeping one component and eating the rest all. It is a perfectly nonsense reduction, but many pieces of valuable information are lost. Another way, is to do a simple average r^{-1}\sum_{i=1}^{r}X_i. This approach is yet undesirable because it treats all components as equal importance (equal weight). Another approach is to do a weighted average with some constraint on the weights, like: \sum_{i=1}^{r}w_iX_i such that \sum_{i=1}^{r}w_i^2=1. In matrix notation and using the Euclidean vector norm:

\sum_{i=1}^{r}w_iX_i=w^\text TX \quad \text{ s.t }\quad \sum_{i=1}^{r}w_i^2=w^\text T w=\|w\|^2=1\qquad w_i\in \mathbb{R}

This constraint weighted average is called a standardised linear combination (SLC). However, the problem is how to uniquely and meaningfully determine the weights. PCA chooses the weights in order to maximize the variances of the SLC’s of the variables X_i‘s, or to maximize the variances of the linear projections of the random vector, i.e. w^\text T X,\ \in\mathbb{R}. SLC, is in fact the projection of the vector X onto the direction described by the weight vector. In a mathematical notation, we want to solve the following maximization problem:

\max_{\{w:\|w\|=1\}}\text{Var}(w^\text TX)=\max_{\{w:\|w\|=1\}}w^\text T\text{Var}(X)w=\max_{\{w:\|w\|=1\}}w^\text T\Sigma_{XX}(X)w\\ \ \\ w\in \mathbb{R}^{r\times 1},\quad X\in \mathbb{R}^{r\times 1} \quad,\Sigma_{XX}\in \mathbb{R}^{r\times r} \ \tag{8a}

In other words,

v=\argmax_{\{w:\|w\|=1\}}\text{Var}(w^\text TX)\tag{8b}

Principal components of X are defined based on this maximization. The 1st PC of X, denoted by \xi_1 is the SLC (of the variables) with the highest variance. This is obtained by the maximizing problem expressed by Eq. 8. \text{PC}_1 then has the highest variance among all other SLC of the variables/projections of X:

v_1=\argmax_{\{w:\|w\|=1\}}\text{Var}(w^\text TX)\qquad v_1\in \mathbb{R}^{r\times 1} \\ \ \\ \text {PC}_1:=\xi_1:=v_1^\text TX\qquad \xi_1\in\mathbb{R}\ \tag{9}

v_1 is called the 1st principal direction (PD) or principal axis.

The 2nd PC of X, denoted by \xi_2 is the SLC (of the variables) with the highest variance such that the maximizing weight vector,v_2 is also subjected to the orthogonality constraint v_2^\text T v_1=0. In a mathematical notation, the 2nd PC is obtained by the following maximization:

v_2=\argmax_{\{w:\|w\|=1 \text{ and }w^\text Tv_1=0\}}\text{Var}(w^\text TX) \\ \ \\ \text {PC}_2:=\xi_2:=v_2^\text TX\qquad \ \tag{10}

by the same token, the i-th PC and i-th PD or principal axis are defined by the following:

v_i=\argmax_{\{w:\|w\|=1\text{ and }w^\text T v_j=0 \text{ for all } j=1,\dots,i-1\}}\text{Var}(w^\text TX) \\ \ \\ \text {PC}_i:=\xi_i:=v_i^\text TX\qquad \ \tag{11}

It can be proved (see the references) that the maximizing weight vectors are the normalized eigenvectors of \Sigma_{XX} associated with its eigenvalues sorted in the decreasing order, i.e.:

\Sigma_{XX}=V\varLambda V^\text T\ \text{ s.t } \lambda_{i}\ge \lambda_{j} \implies i\lt j\\ \ \\ \text{i.e. }\lambda_1\ge \lambda_2\ge \dots \ge\lambda_r \ge 0


\text{cov}(\xi_i,\xi_j)=\text {Cov}(v_i^\text TX,v_j^\text TX)=v_i^\text T\Sigma_{XX} v_j=\lambda_j v_i^{\text T}v_j=\delta_{ij}\lambda_j\\ \ \\ \implies \text{Var}(\xi_1)\ge \text{Var}(\xi_2)\ge\dots\ge\text{Var}(\xi_r)\ge0

which means PC’s not only have the maximum variances among all the possible SLC of the variables, or all the possible projections of the random vector, but also they are mutually uncorrelated.

In a matrix form, we can write:

\xi=V^\text TX\qquad \xi,X\in\mathbb{R}^{r\times 1},\ V^\text T\in\mathbb{R}^{r\times r}\tag{12}

where X and \xi are the random vector and its linear projection vector containing the PC’s of X respectively.

It is readily concluded that:

\text{Var}(\xi)=\text{Var}(V^\text TX)=V^\text T\Sigma_{XX}V=\delta_{ij}\lambda_j

PCA in practice: Sample PCA

In practice, the PC’s are to be estimated out of n independent observations of the random r-vector X of data. This requires estimating of the mean and covariance matrix of the random vector of data; using sample mean and covariance.

Here, we use the notation \bold X_i\in \mathbb{R}^{r\times 1} for the i-th independent trial to observe the random vector X.

Note: we denote an observation vector with a capital letter \bold X_i to indicate that it is still a random vector (out of the same population and through independent trials). To indicate a particular realization/value of \bold X_i we shall use small letter (by convention).

We collect the sequence of n independent observations of the random vector X in a set as:

\{\bold X_i, i=1,2,\dots,n\}\ \text{ s.t }\bold X_i\in\mathbb{R}^{r\times 1}\tag{13}

The estimate for \mu_X is given by:

\hat \mu_X=\bold{\bar X}:=\frac{1}{n}\sum_{i=1}^{n}\bold X_i\tag{14}

let \bold X_{ci}=\bold X_i-\bold{\bar X} for n observations of the random vector; and let \chi_c={\bold X_{c1},\dots,\bold X_{cn}}\in \mathbb{R}^{r\times n}. Then, the estimate for the covariance matrix is:

\hat\Sigma_{XX}=\frac{1}{n}\chi_c\chi_c^\text T\tag{15}

Ordered eigenvalues and their associated eigenvectors of \hat\Sigma_{XX} is then estimate the population parameters. This leads to writing:

\hat\lambda_1\ge\hat\lambda_2\ge\dots\ge\hat\lambda_r\ge 0\\ \ \\ \hat A=(\hat v_1,\dots,\hat v_t)=\hat B^\text T\tag{16}

In practice, the vectors of independent observations are collected as columns (or rows) of a matrix (just re-writing Eq. 13). Let’s call this matrix the matrix of observations. Here, each column represents an observation of X:

\bold X=\begin{bmatrix}|&|&\dots &|\\ \bold X_1 & \bold X_2 &\dots &\bold X_n \\|&|&\dots &| \end{bmatrix}\in \mathbb{R}^{r\times n}\tag{17}

Recalling Eq. 5 or Eq. 12, PC’s can be calculated for each columns of the matrix of observations. In this regards, we obtain estimates of the PC’s of X:

\hat\xi_{ij}^{(t)}=\hat v_j^\text T\bold X_i\ \in\mathbb{R}\\ j=1,\dots,t\ \text{ and }\ i=1,\dots,n

This is the estimate of the j-th PC of X calculated based on the i-th observation \bold X_i. The term \hat\xi_{ij}^{(t)} is called the score for the i-th observation on the j-th PC.

For each observation, we can write the above in a more compact from:

\hat\xi_{i}^{(t)}=\hat A^\text T\bold X_i\ \in\mathbb{R}^{t\times 1}\\ i=1,\dots,n\tag{18}


– Modern Multivariate Statistical Techniques, A. J. Izenman.
– Applied Multivariate Statistical Analysis, W. Hardle, L. Simar.

Although mostly paraphrased, this post might contain copyrighted sentences of the references.


Procrustes Analyses

Some definitions*

1) The Shape of an object is all the geometrical information that remains when, location, scale, and rotational effects are removed from and object. In other words, shape of an object is invariant under Euclidean similarity transformations, which are translation, isotropic scaling, and rotation.

2) The Form of an object is all the geometrical information that remains when location and rotational effect of an objects are removed. Form is also called size-and-shape. Form is invariant under rigid-body transformations. Here, we can say shape is the form with size removed (the size/scale information is removed).

3) Same form and same shape: two objects have the same form, size-and-shape, if they can be translated and rotated relative to each other so that they exactly match. Two objects have the same shape, if they can be translated, rotated and scaled relative to each other so that they exactly match.

The words matching, registration, superimposing and fitting are equivalent.

Fig. 1. Shape vs form.

4) A landmark is a point of correspondence on each object. e.g tip of the nose of all the dinosaurs in a population/sample.

5) Geometric shape analysis: An object can be (discretely) represented by a collection of landmark (point cloud), therefore, the landmarks coordinates retain (continue to have) the geometry of the point-cloud configuration. This approach to shape analysis is called the geometric shape analysis.

6) The Configuration is a collection (set) of landmarks on/of a particular object. The configuration of a m-D object with k landmarks can be represented by a k\times m matrix called the configuration matrix. For a 3D object we have:

\begin{bmatrix} x_1&y_1&z_1\\ x_2&y_2&z_2\\ x_3&y_3&z_3\\ \vdots& \vdots&\vdots\\ x_k&y_k&z_k\\ \end{bmatrix}

7) A size measure of a configuration is any positive real-valued function of the configuration matrix, X, such that \forall a\in \mathbb{R}\ s(aX)=as(X).

8) The centroid size is defined as the square root of the sum of squared Euclidean distances from each landmark to the centroid of a configuration:


where X_{i,} is the i-th row of X, and \overline{X}:=\frac{1}{k}\sum_{i=1}^{k}{X_{i,}}\in \mathbb{R}^m,\ m=3\text{ for 3D}. \|.\| is the Euclidean vector norm.

The centroid size can be re-written using matrix algebra:


where C:=I_k-\frac{1}{k} 1_k 1_k^T with I_k being the k\times k identity matrix and 1_k\in\mathbb{R}^{k\times 1} is called the centering matrix, and \|.\| is the Frobenius norm, sometimes also called the Euclidean norm, of a matrix, i.e.
\|A\|:=\sqrt{trace(A^\text{T}A)}. If the centroid of X is already located at the origin, then C=I_k.

9) The Euclidean similarity transformations of a configuration matrix X are the set of translated, rotated, and isotropically scaled X. In a mathematical notation:

\{\beta X \Gamma + 1_k\gamma^{\text T}:\beta\in \mathbb{R}^+, \Gamma\in SO(m),\gamma\in \mathbb{R}^m\}\tag{3}

where, \beta \in\mathbb{R}^{+} is the scale (factor), SO(m) is the space of rotational matrices, \gamma\in\mathbb{R}^m is the translation vector. Vectors are considered as column vectors. 1_k\in\mathbb{R}^k is a column vector of ones. m=3 for 3D.

10) Rigid-body transformations of a configuration matrix X are the set of translated and rotated X. In a mathematical notation:

\{X \Gamma + 1_k\gamma^{\text T}:\beta\in \mathbb{R}^+, \Gamma\in SO(m),\gamma\in \mathbb{R}^m\}\tag{4}

11) The centered pre-shape of a configuration matrix X is obtained (defined) by removing the location and scale/size information of the configuration. Location can be removed by centering the configuration, i.e. translating the configuration by moving its centroid to the origin. Therefore, the translated configuration is X_C:=CX. The size is removed by scaling the configuration by its size defined as Eq. 2. Therefore, the pre-shape of X is obtained as:


Removing size or scale means scaling to the unit size. Thereby, two configurations with different sizes lose their size information and also their scale. The scale is a quantity that expresses the ratio between sizes of two configurations, either two with different shapes or two with the same shape (one is the scaled version of the other one). In another scenario (like the full Procrustes analyses), if two or more configuration are scaled with different scale (factors), they also lose their scale/size information (relative to each other) although their sizes may not be unit.

12) Reworded definition of a shape: A shape is an equivalence class of geometrical figures/objects/configurations modulo (what remains after) the similarity transformation. The shape of X is denoted by [X]. In order to visualise the shape of an configuration, a representative of the equivalence class is considered and called icon, denoted by [X].

Two configurations X\in\mathbb{R}^{k\times m} and X'\in\mathbb{R}^{k\times m} have the same shape, i.e. [X]=[X'] iff there exist \beta,\Gamma,\text{ and }\gamma such that:

X’=\beta X \Gamma + 1_k\gamma^{\text T}

In other words, two configurations have the same shape if they perfectly match after the removal of their location, rotation, and size/scale.

Fig. 2. An equivalence class of configurations/objects modulo the similarity transformation and its icon. All configurations with the same shape are in the same class/set.

13) Shape space is the space/set of all shapes, i.e. equivalence classes. For example \{[X_1], [X_2] , [X_3],...\}. All have their locations, rotations, and size removed.

14) Reworded definition of form (size-and-shape): form is an equivalence class of geometrical configurations modulo translation and rotation. The form/size-and-shape of a configuration/object X is denoted by [X]_S, i.e an icon for the class. Two configuration X\in\mathbb{R}^{k\times m} and X'\in\mathbb{R}^{k\times m} have the same form, i.e. [X]_S=[X']_S iff there exist \Gamma\text{ and }\gamma such that:

X’=\Gamma X + 1_k\gamma^{\text T}
Fig. 3. An equivalence class of configurations/objects modulo rotation and translation and its icon. All configurations with the same form are in the same class/set.

15) Size-and-Shape/Form space is the space/set of all forms, i.e. equivalence classes. For example \{[X_1]_S, [X_2]_S , [X_3]_S,...\}.

16) By removing the size of a form (scaling to unit centroid size) of a configuration, the shape of the configuration is obtained.

17) Clarification on the term “removing”: let’s imagine a set of n configurations (n\ge 1). We want to remove the scale, location, and rotation(al) information of the configurations. Removal of scale/size is to scale all the configurations to unit size (each divided by for example its centroid size).

Removal of location is to translate all the configurations to a particular target point/location in the space by selecting the same single landmark or point of reference (like the centroid) of each configuration and (rigidly) translating the configuration along the vector from the point of reference to the target point. The target point can be the origin of the coordinate system/frame of reference.

Removal of rotation of the configurations is through optimising over rotations by minimising some criterion. For instance, to remove the rotations (rotational information) of two configurations with their locations (and maybe scales/sizes) already removed, it is needed to apply a particular relative rotation on them (e.g. keep one stationary and rotate the other one) so that they get closest, with respect to some definition of distance, as possible. Once the rotation is removed, then any deviation of one configuration from the other one originates from their locations, scales/sizes, and their shapes. To compare shapes, we also have to remove the locations and scales of the configurations. Then any deviation is due to the shape of the configurations. If only location and rotational information are removed, then any relative deviation originates from the sizes and shapes (form) of the configurations.

Matching two configurations and measure of distance between shapes

Consider a collection/set of shapes, as defined above, all having the same number of landmarks and the landmarks are in one-to-one correspondence. An important question is how to define a distance between two shapes in the set. This is natural to define a measure between the members of a set. The notion of shape distance indicates the difference between two shapes, i.e. how far two shapes are far from each other in the shape space.

A natural approach is to match/fit/superimpose/register a configuration X to a configuration Y (same number of landmarks and in correspondence) using the similarity transformations applied on X, and then the magnitude of difference between the fitted configuration of X, i.e. \widehat{X} and Y indicates the magnitude of difference in shape between the two configurations X and Y.

Fig. 5. Configuration X transforms up to similarity transformation to get fitted onto configuration Y.

So how to match two configuration? The full ordinary Procrustes analysis (FOPA) is the method to match/register/superimpose/fit two (m-D) configurations with the same number of landmarks as closely as possible onto each other. This method applies the similarity transformations to one of the configurations to closely match the other one. But what is the measure of closeness? The definition of the method makes it perspicuous:

The method of FOPA matches two configurations through transforming one onto the other one (target configuration) up to the similarity transformations and minimizing the squared Frobenius/Euclidean norm of the difference between the transforming and the target configurations. This means to minimize the following expression:

D_{FOPA}^2(X_1,X_2):=\|X_2-(\beta X_1\Gamma+1_k\gamma^\text{T})\|^2\tag{6}

The minimum value of D_{FOPA} is a measure of similarity/dissimilarity. Observe that X_1 is the transforming configuration (input) and X_2 is the target configuration. Also, all the similarity transformations are involved in the registration; this is why the term full is used. In order to find the matching configuration, the similarity parameters should be determined, i.e.:

(\hat\beta, \hat\Gamma, \hat\gamma)=\argmin_{\beta,\Gamma,\gamma}\|X_2-\beta X_1\Gamma-1_k\gamma^\text{T}\|^2\tag{7}

It is good to know that \hat\gamma=0 if the configurations are already centered.

The full Procrustes fit of X_1 onto X_2 is then:

X_1^{FP}=\hat\beta X_1\hat\Gamma+1_k\hat\gamma^\text{T}\tag{8}

It is tempting to chose the minimum of Eq. 6 as a measure of distance between the shapes of the two configurations. However,  min\ D_{FOPA}^2(X_1,X_2)\neq min\ D_{FOPA}^2(X_2,X_1), i.e. not commutative, unless the configurations both have the same size; for example having the unit size by getting pre-scaled. Therefore, the full Procrustes distance, d_F between two shapes (of two configurations) is defined as:

d_F^2(X_1,X_2):=\min\ D_{FOPA}^2(\frac{X_1}{\|X_1\|},\frac{X_2}{\|X_2\|})\tag{9}

FOPA involves isotropic scaling of the input configuration, in addition to the other two similarity transformations, to register it onto the target configuration. This means that the fitted configuration does not have the same size/scale as its original configuration. Therefore, FOPA removes the size of the input configuration.

Another scenario of fitting the input configuration to the target one is to register while preserving the size/scale of the input, i.e. not to scale the input (the scale of the target is also preserved). In this regard, partial ordinary Procrustes analysis (POPA) is a method of registration only over translation and rotation to match two configurations. This involves the minimization of:


The partial Procrustes fit of X_1 onto X_2 is then: (see Eq. 7 with \beta=1):


Based on POPA, the partial Procrustes distance can be defined. If the two configurations have the same size, like by getting pre-scaled to the unit size, i.e. the size is removed, then the following is defined as the the partial Procrustes distance between two shapes (of two configurations) d_P(X_1,X_2):

d_P^2(X_1,X_2):=\min\ D_{POPA}^2(\frac{X_1}{\|X_1\|},\frac{X_2}{\|X_2\|})\tag{12}

Calculating POPA distance, unlike the FOPA distance, does not involve additional scaling of the the (pre-scaled) unit sized configurations, i.e. \frac{X_1}{\|X_1\|},\frac{X_2}{\|X_2\|} in the minimization of D_{POPA}^2. Nevertheless, obtaining POPA distance still entails removing the scale of the configurations.

Remark: the two defined distances have different values.

Mean shape

First some motivation from the arithmetic mean of numbers. Every dinosaur knows that the arithmetic mean/average of n real numbers is defined as: \hat\mu:=\frac{1}{n} \sum_{1}^{n} x_i. Let’s define a function as g(\mu):=(\frac{1}{n} \sum_{1}^{n} x_i)-\mu. In other words:

g(\mu):=\frac{1}{n} \sum_{i=1}^{n} (x_i-\mu)

It is now obvious that if we let \mu=\hat\mu then g(\mu)=0. This is what we want out of defining the function. However, zero is not the minimum of f(\mu) as this function is an absolutely decreasing function, hence no local or global minimum (just take the derivative w.r.t \mu). Let’s define a function as:

f(\mu):=\frac{1}{n} \sum_{i=1}^{n} (x_i-\mu)^2

What minimizes f(\mu)? Solving \frac{\mathrm d}{\mathrm d \mu}f(\mu)=0, we get:

\frac{\mathrm d}{\mathrm d \mu}f(\mu)=\frac{1}{n} \sum_{i=1}^{n} -2(x_i-\mu)=0 \newline \implies \frac{1}{n} \sum_{i=1}^{n} (x_i-\mu)=0

This is just g(\mu)=0 which has the answer \hat\mu. The second derivative of f(\mu) is \frac{\mathrm d^2}{\mathrm d \mu^2}f(\mu)=2>0 which indicates that \hat\mu minimizes f(\mu).

Using the Euclidean distance/metric/norm of real numbers, i.e. d_E(x_1,x_2):=|x_1-x_2| with |.| being the absolute value, we can re-define f(\mu) as:

f(\mu):=\frac{1}{n} \sum_{i=1}^{n} |x_i-\mu|^2=\frac{1}{n} \sum_{1}^{n} d_{E}^2(x_i,\mu)

This results in (it can be easily proved by equating the derivative with zero):

\hat\mu=\argmin_{\mu}\frac{1}{n} \sum_{i=1}^{n} d_{E}^2(x_i,\mu)

This motivation suggests that for a set of objects with a defined distance function/metric, a mean/average object can be imagined. This average object is such an object that minimizes the average sum of squared distances from all objects to the average object. The Mean shape of a set of shapes sampled from a population of shapes can be defined using the distance functions as defined by Eq. 9 and Eq. 12:

The sample full Procrustes mean shape is defined as:

\hat\mu_F:=\arg\inf_{\mu}\frac{1}{n} \sum_{i=1}^{n} d_{F}^2(X_i,\mu)\tag{13}

The sample partial Procrustes mean shape is defined as:

\hat\mu_P:=\arg\inf_{\mu}\frac{1}{n} \sum_{i=1}^{n} d_{P}^2(X_i,\mu)\tag{14}

In words, the mean shape is a shape with minimum distance from each configuration’s shape.

Note that in both definitions, the configurations will be pre-scaled to the unit size (due to the definition of the metrics).

Remark: the mean shape refers to the mean of the shapes not the mean of configurations.

Matching more than two configurations: generalized Procrustes analyses

The term FOPA/POPA refers to Procrustes matching of one configuration
onto another. If there are more than two configurations, it is natural to register/match them all to a mean shape or mean form. To do so, generalized Procrustes analysis (GPA) is utilized to superimpose two or more configurations onto a common unknown mean shape/form. GPA has two methods: full and partial GPA’s. The GPA provides a practical method of computing the sample mean shape defined by Eq. 13 or 14. Unlike the full/partial OPA, the output of GPA (the Procrustes fits) does not depend on the order of superimposition.

Full GPA

The method of full GPA transforms n\geq 2 configurations up to the similarity transformations (scaling + rotation + translation) with respect to a common unknown mean shape [\mu] in order to minimise the following total sum of squares with respect to the transformation parameters and \mu, while imposing a size constraint to avoid a trivial solution. A trivial solution can be all \beta_i close to zero (in order not to make a configuration vanished, scale is not allowed to be zero) and no translation; hence the mean shape will be something with size close to zero, i.e almost vanished.

G(X_1,…X_n):=\sum_{i=1}^{n} \|\beta_iX_i\Gamma_i+1_k\gamma_i^\text{T}-\mu\|^2\tag{15} \newline \sum_{i=1}^{n}S^2(\beta_iX_i\Gamma_i+1_k\gamma_i^\text{T})=\sum_{i=1}^{n}S^2(X_i)

In the above expression, \beta_i,\Gamma_i,\gamma_i, \text{ and } \mu are unknown. Once the expression is solved, the estimate of the optimum/minimizing parameters, \hat\beta_i,\hat\Gamma_i\,\hat\gamma_i, and estimate of the mean shape, \hat\mu are found. Note that if we already knew the population mean shape, we would just align each configuration separately to that mean shape.

Once the above minimization is solved, the full (generalized) Procrustes fit of each configuration (onto the estimated mean) is:


Note that the scale \beta_i of each transformed configuration (full Procrustes fit) is not necessarily the same. Therefore, full GPA does not preserve the scale of the configurations through their optimal transformations. This means that if one configuration is \alpha times larger than the other one, they will be \beta times of each other after the optimal transformation and getting fitted to the mean shape.

Assuming that \hat\beta_i,\hat\Gamma_i\,\hat\gamma_i,\hat\mu minimize expression 15, we can write:

\hat\mu=\argmin_{\mu}\sum_{i=1}^{n} \|\hat\beta_iX_i\hat\Gamma_i+1_k\hat\gamma_i^\text{T}-\mu\|^2\text{ or}\tag{17} \newline \hat\mu=\argmin_{\mu}\sum_{i=1}^{n} \|X_i^{FP}-\mu\|^2

The solution is then (see the proof here):

\hat\mu=\frac{1}{n}\sum_{i=1}^{n} X_i^{FP}\tag{18}

which is the arithmetic mean of the Procrustes fits. This means that once the configurations are registered to the estimate of the unknown mean, their arithmetic mean/average equals the estimated mean.

Now we want to show that the estimated mean shape \hat\mu=\frac{1}{n}\sum_{i=1}^{n} X_i^{FP} has the same shape as the sample full Procrustes mean shape defined by Eq. 13. Without loss of generality, we pre-scale the configurations X_i to the unit size, and replace the size constraint with S^2(\mu)=1.

Since Eq. 15 is some of positive reals, the optimum transformation parameters should also minimize each term in the sum. This means that \hat\beta_i,\hat\Gamma_i\,\hat\gamma_i minimizes each \|\beta_iX_i\Gamma_i+1_k\gamma_i^\text{T}-\mu\|^2. Therefore, following Eq. 13, the sample full Procrustes mean shape of the full Procrustes fit is:

\hat\mu_F:=\arg\inf_{\mu}\frac{1}{n} \sum_{i=1}^{n}\min_{\beta_i,\Gamma_i,\gamma_i}\|(\beta_i X_i\Gamma_i+1_k\gamma_i^\text{T})-\mu\|^2 \newline =\arg\inf_{\mu}\frac{1}{n} \sum_{i=1}^{n}\|(\hat\beta_i X_i\hat\Gamma_i+1_k\hat\gamma_i^\text{T})-\mu\|^2 \newline=(n^{-1})\arg\inf_{\mu}\sum_{i=1}^{n}\|X_i^{FP}-\mu\|^2

Therefore, (by Eq. 16 and 17), \hat\mu_F= n^{-2}\frac{1}{n}\sum_{i=1}^{n} X_i^{FP}. These two have the same shape, which means there is a similarity transformation to perfectly superimpose them on each other. Here, it is obvious that only scaling by the factor of n^{-2} is needed.

Estimating the transformation parameters of full GPA method in order to estimate the sample full Procrustes mean shape and registering all configurations to the estimated mean shape, is then equivalent to the following minimization problem:

\inf_{\beta_i,\Gamma_i,\gamma_i}\sum_{i=1}^{n} \|(\beta_i X_i\Gamma_i+1_k\gamma_i^\text{T}) – \frac{1}{n} \sum_{j=1}^{n} (\beta_j X_j\Gamma_j+1_k\gamma_j^\text{T})\|^2\\ \sum_{i=1}^{n}S^2(\beta_iX_i\Gamma_i+1_k\gamma_i^\text{T})=\sum_{i=1}^{n}S^2(X_i)

It can be proved that above is equal to (See the proof here):

\inf_{\beta_i,\Gamma_i,\gamma_i}\frac{1}{n}\sum_{i=1}^{n} \sum_{\underset{{j\lt n}}{j=i+1}}^{n}\|(\beta_i X_i\Gamma_i+1_k\gamma_i^\text{T}) – (\beta_j X_j\Gamma_j+1_k\gamma_j^\text{T})\|^2\\ \newline \sum_{i=1}^{n}S^2(\beta_iX_i\Gamma_i+1_k\gamma_i^\text{T})=\sum_{i=1}^{n}S^2(X_i)

Above means that the full GPA minimizes the sum of pairwise squared Frobenius distances between all the transformed versions of the configurations ( (full Procrustes fits) in the sample. It should be noted that, the output of the full GPA is configurations that are translated, rotated, and scaled versions of the original configurations. Hence, the scale, location, and rotational information of each configuration is removed. The output configurations are registered onto each other such that the sum of pairwise squared Frobenius distances between all of them is minimized. If the scale of each output configuration is removed by re-scaling it to unit size, then full Procrustes distance between each two shapes (of the output configurations), and also the estimated mean shape (once re-scaled to unit size) can be calculated.

Remark: the full GPA does not lead to the minimized (squared) Frobenius distance between two full Procrustes fits , but it leads to the minimized sum of the Frobenius distances of each full Procrustes fit and the rest of them in the set. It does minimize the (squared) Frobenius distance between each full Procrustes fit and the (full Procrustes) mean shape though.

Remark: The full GPA removes the scale of the configurations by scaling them in order to make them closer to the mean shape. The scales are not the same for all the configurations and hence the full GPA does not preserve the (original) scale (relative size) of the configurations. The full Procrustes fits and the estimated mean shape may not have the unit size. However, they are relatively scaled in a way to minimize the sum of squared distances. This is the notion of removing the scale information (of the original configurations). Therefore, the Procrustes fits represent the shape (information) of the configurations and the differences between each of them (between corresponding landmarks) and the mean shape is purely due to their shape variations.

In summary, the full GPA of configurations leads to a mean shape and full Procrustes fits such that each full Procrustes fit is superimposed onto the mean shape through the similarity transformations; hence, whatever geometrical variations/discrepancies remains between each full Procrustes fit and the mean shape is purely due to their shapes.

An algorithm for full GPA is as follows:

1- Translations/Removing locations: remove the location of each configuration by centring each configuration (each configuration is translated to the origin by its centroid) and initially let each Procrustes fits be:

X_i^{FP}= CX_i\text{ for } i=1,…,n

2- estimate \hat\mu as \frac{1}{n}\sum_{i=1}^{n} X_i^{FP}.

3- Rotations and scales: perform a FOPA on each centred configuration X_i^{FP} to rotate and scale it to \hat\mu. Now each recently transformed configuration, X_i^{*FP} , is fitted to the estimated mean.

{X}_i^{*FP}=\hat\beta_i X_i^{FP} \hat\Gamma_i

\hat\beta_i and \hat\Gamma_i are transformation parameters out of the FOPA. Note that the rotation is about the origin, and no translation is needed as the locations are already removed.

4- Update the mean shape: \hat\mu=\frac{1}{n}\sum_{i=1}^{n} {X}_i^{*FP}.

5- let X_i^{FP}={X}_i^{*FP}. Re-centring is not necessary because the centroid of each configuration at this stage is already located at the origin and the rotations are also about the origin. Re-scaling won’t change the location of the centroid because it is isotropic.

6- Repeat 3 to 5 until everyone’s happy, i.e. the change in the Procrustes sum of squares (\sum_{i=1}^{n} \|\beta_iX_i\Gamma_i+1_k\gamma_i^\text{T}-\mu\|^2) from one iteration to the next one becomes less than a particular tolerance.

Remark: The configurations can be pre-scaled to have the unit size but it is not necessary because their size is removed once they get scaled to better fit the estimated mean at each iteration.

Partial GPA

Partial GPA involves merely registration of a set of configurations by translation and rotation (not scaling). Partial GPA preserves the scale/size of the configurations as opposed to full GPA. This method is appropriate for analysis of forms, i.e. size-and-shapes. Partial GPA is through minimization of the following sum with respect to the transformation parameters (\Gamma_i, \gamma_i) and and unknown form \mu. Size constraint is not needed as there is no scaling involved.

G_P(X_1,…X_n):=\sum_{i=1}^{n} \|X_i\Gamma_i+1_k\gamma_i^\text{T}-\mu\|^2\tag{19}

Once the above minimization is solved, the partial Procrustes fit of each configuration onto the common form is:


Similar to Eq. 17 and 18, we can write:

\hat\mu=\frac{1}{n}\sum_{i=1}^{n} X_i^{PP}\tag{21}

However, \hat\mu is not going to have the same shape as the mean shapes previously defined (sample full/partial Procrustes mean shape). \hat\mu is then a form that its Frobenius distance to each partial Procrustes fit is minimum. I and some dinosaur think that this partial GPA is appropriate to group-register a set of configurations onto each other, however, we don not know its application at this moment.

The algorithm for partial GPA is the same as that of the full GPA except that the scale is removed i.e. set as 1.

* The definitions follow the book “statistical shape analysis, Ian L. Dryden, Kanti V. Mardia”, and the article “Procrustes Methods in the Statistical Analysis of Shape, Colin Goodall”. I may not have paraphrased them all, hence, some sentences may be subjected to copy right and they are not intended to be reused in official publications.