Point Distribution Models of Shapes and Forms

A point distribution model (PDM) is a model that locates (or determines the geometrical (spatial) distribution of) the points of the shapes or forms (size-and-shapes) in a particular shape or form population. The word distribution refers to the spatial distribution of the points, not to a probability distribution. This is my best understanding of the definition of PDMs so far.

A PDM approximates the geometrical distribution of the points of shapes/forms in a population. This means that a PDM is intended to model/express all the possible shapes/forms in the population.

A PDM entails shapes or forms. This means we are working with superimposed configurations modulo their location, rotation, and scale information in the case of shapes, and modulo their location and rotation information in the case of forms. Therefore, we assume that shapes or forms are available through performing a full or partial generalized Procrustes analysis (GPA) on the given configurations. In other words, a shape is an instance of the full Procrustes fits, and a form is an instance of the partial Procrustes fits. Henceforward, whatever involves the concept of a shape in this article also holds for a form unless declared otherwise.

As explained, a shape can be (approximately) represented using landmarks, i.e. a discrete set of points. By choosing a 3D Cartesian coordinate system (CS) with orthogonal basis vectors and x, y, and z coordinates (axes), the landmark coordinates can then be collected in a matrix. One can readily think of collecting the coordinates in a column vector instead (the space of N\times 3 matrices is isomorphic to \mathbb{R}^{3N}). Therefore, a shape in a set of shapes, all having the same number N of landmarks, is represented as a vector called the shape vector [1]:

X_i=\big[x_{1x}^i,x_{1y}^i,x_{1z}^i,\dots,x_{Nx}^i,x_{Ny}^i,x_{Nz}^i\big]^\text T \in\mathbb{R}^{3N\times 1}\tag{1}
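The stacking in Eq. 1 can be sketched in a few lines of NumPy. This is only an illustration: the helper names `to_shape_vector` and `to_landmarks` and the toy landmark coordinates are assumptions made for this example, not part of any particular library.

```python
import numpy as np

def to_shape_vector(landmarks):
    """Flatten an (N, 3) landmark matrix row by row into a (3N,) shape vector."""
    landmarks = np.asarray(landmarks, dtype=float)
    if landmarks.ndim != 2 or landmarks.shape[1] != 3:
        raise ValueError("expected an (N, 3) array of landmark coordinates")
    # Row-major flattening gives the order x1, y1, z1, x2, y2, z2, ... of Eq. 1.
    return landmarks.reshape(-1)

def to_landmarks(shape_vector):
    """Inverse of to_shape_vector: recover the (N, 3) landmark matrix."""
    return np.asarray(shape_vector, dtype=float).reshape(-1, 3)

# Hypothetical toy configuration with N = 4 landmarks.
landmarks = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 2.0, 0.0],
                      [0.0, 0.0, 3.0]])
x = to_shape_vector(landmarks)
```

Keeping both directions of the conversion is convenient because the statistics are done on shape vectors, while visualisation and Procrustes fitting work on landmark matrices.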

See this post for linear algebra conventions and identities.

A shape vector X is a member of the vector space \mathbb{R}^{3N\times 1}. Therefore, there are bases, and X can be written as a linear combination of the basis vectors. By default, the 3N-D X vector is born in the vector space spanned by the standard basis. Indeed, other bases are possible.

Using a basis \{v_1,\dots,v_{3N}\} for the vector space, we can write any shape vector as a deviation from an arbitrary constant (shape) vector a\in \mathbb{R}^{3N\times 1} :

X=a+\sum_{i=1}^{3N}b_iv_i\ ,b_i\in \mathbb{R}\\ \text {or}\\ X=a+Vb\quad ,\quad \text{s.t. }\ V\in\mathbb{R}^{3N\times 3N},\quad V_{,i}=v_i,\quad v_i^\text Tv_j=\delta_{ij}

It should be noted that a is a point (position vector) and Vb is a free vector in the sense of geometrical modelling.

On the other hand, X is a random vector with a mean \mu_X and a covariance \Sigma_{XX}. The constant vector a can be replaced with \mu_X. Any other shape vector, like a known arbitrary shape vector in the population, can also be substituted for a. The basis vectors can be chosen such that the linear projection of X onto them generates random variables each having the maximum possible variance. These vectors are obtained by PCA as a variance-maximization method. Because \Sigma_{XX} is symmetric, it has 3N linearly independent eigenvectors that can be chosen mutually orthonormal, and hence any shape vector can be written as:

\begin{gathered} X=\mu_X+Vb=\mu_X+\sum_{i=1}^{3N}b_iv_i \tag{2} \\ \ \\ \text{s.t. }\ \Sigma_{XX}=V\varLambda V^\text T\\ \varLambda\text{ is diagonal and }\varLambda_{ii}=\lambda_i\quad\text{s.t. } \lambda_1\ge\lambda_2\ge \dots \ge \lambda_{3N}\ge 0 \end{gathered}

where the term V\varLambda V^\text T is the singular value decomposition (SVD) of \Sigma_{XX} such that the eigenvalues \lambda_i, collected in the diagonal matrix \varLambda, are sorted in decreasing order. As the covariance matrix is symmetric and positive semi-definite, its SVD coincides with its spectral/eigenvalue decomposition.

So far we showed that a shape instance (represented by a shape vector) in the population can be generated by deforming the mean shape through a deformation field (vectors) which is determined by linear combinations of the eigenvectors of \Sigma_{XX}. The eigenvectors of \Sigma_{XX} are principal (3N-dimensional) directions along which the variances of the linearly projected (random) shape vector are maximized. The principal directions are called deformation modes (deformations from the mean). The deformation modes are also called eigenshapes. The combining coefficients b_i are called the shape model parameters. The deformation modes weighted by the shape model parameters add up to produce a deformation vector that is added to the mean shape vector to generate a shape. Such a representation is called the modal representation; this is the idea behind a PDM.
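As a minimal numerical sketch of the modal representation in Eq. 2, the snippet below builds a stand-in covariance matrix, extracts its sorted eigenpairs, and generates one shape vector from the mean and the modes. The covariance and mean are random placeholders (an assumption of this example), since no real shape data is given here.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 6                                   # 3N for a toy case with N = 2 landmarks
A = rng.normal(size=(dim, dim))
sigma = A @ A.T                           # symmetric PSD stand-in for Sigma_XX
mu = rng.normal(size=dim)                 # stand-in for the mean shape vector

# eigh returns eigenvalues in ascending order; Eq. 2 wants them descending.
eigvals, eigvecs = np.linalg.eigh(sigma)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]                      # lambda_1 >= ... >= lambda_3N
V = eigvecs[:, order]                     # columns are the deformation modes v_i

# Pick shape parameters b and deform the mean shape: X = mu + V b.
b = rng.normal(size=dim) * np.sqrt(np.clip(lam, 0.0, None))
X = mu + V @ b

# Projecting the deviation back onto the modes recovers b.
b_back = V.T @ (X - mu)
```

`np.linalg.eigh` is used instead of a general SVD routine because the matrix is symmetric; for a symmetric positive semi-definite matrix both give the same decomposition up to ordering and signs.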

By choosing proper shape parameters, the model generates shapes in the shape population. It should be noted that there are generally no restrictions on the values of the b_i; in other words, the deformation modes can be combined arbitrarily. However, a shape generated from arbitrary values of the b_i may not fall into the shape population; this is not difficult to imagine.

If we let U:=X-\mu_X, then U is a random vector expressing the deviation of a shape from the mean shape of the population. I call this vector the deviation or the deformation shape vector. This vector contains deformation components (in three directions of the CS) of the mean shape at each of its points. The mean of the deformation shape vector is zero, \mu_U=0. In fact,  \Sigma_{XX}=\Sigma_{UU}. This means that the elements of  \Sigma_{XX} are variances (or covariances) of the deformation components. U can be linearly projected on the directions (unit vectors) defined by the eigenvectors of  \Sigma_{XX}:

U_p=V^\text T U\ \in\mathbb{R}^{3N\times 1}

and by considering Eq. 2, we can conclude (it is simply the result of the linear projection):

U=Vb\implies U_p=V^\text T Vb=I_{3N}b=b

Just as a side note: \mu_U=\mu_{U_p}=\mu_b=0.

On the other hand, since U_p=V^\text T U, we have \Sigma_{bb}=\Sigma_{U_pU_p}=V^\text T \Sigma_{UU}V = \varLambda. This is what we can conclude:

\begin{gathered}X=\mu_X+Vb=\mu_X+\sum_{i=1}^{3N}b_iv_i \tag{3} \\ \ \\ \text{s.t. }\ \Sigma_{bb}=\varLambda= \begin{bmatrix}\lambda_1&0&\dots\ &0\\ 0 &\lambda_2\ &\dots\ &0\\ \vdots&\vdots &\ddots &\vdots\\ 0&0&0&\lambda_{3N} \end{bmatrix}\quad \text{and}\quad \mu_b=0\\ \ \\ \text{with}\\ \ \\ \|b_iv_i\|=|b_i|\end{gathered}

Note that \mu_b=0 because \mu_U=0.

It is obvious that the magnitude of b_i controls the magnitude of deformation of the mean shape by each mode. From above, the standard deviation of b_i is \sqrt{\lambda_i}. It is assumed that a PDM generates plausible shapes, or well behaved shapes, if the shape parameters are limited as:

-3\sqrt{\lambda_i}\le b_i\le +3\sqrt{\lambda_i}\tag{4}

A plausible shape is created by adding a plausible deformation shape vector to the mean shape vector; a plausible deformation shape vector is a deformation vector that generates shapes that still belong to the shape family under consideration.
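The box constraint of Eq. 4 amounts to clipping each shape parameter to three standard deviations. A small sketch, where the function name `clamp_shape_parameters`, the multiplier argument `k`, and the sample numbers are all illustrative assumptions:

```python
import numpy as np

def clamp_shape_parameters(b, eigenvalues, k=3.0):
    """Clip each b_i to the interval [-k*sqrt(lambda_i), +k*sqrt(lambda_i)]."""
    b = np.asarray(b, dtype=float)
    limit = k * np.sqrt(np.asarray(eigenvalues, dtype=float))
    return np.clip(b, -limit, limit)

lam = np.array([4.0, 1.0, 0.25])          # hypothetical eigenvalues
b = np.array([10.0, -1.0, -2.0])          # hypothetical shape parameters
b_plausible = clamp_shape_parameters(b, lam)
# The limits are 6, 3 and 1.5, so only the first and third entries change.
```

Clipping treats every mode independently, which matches Eq. 4 but is only one possible way of restricting the parameters.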

The covariance matrix \Sigma_{XX} (and also \Sigma_{bb}) is positive semi-definite; therefore, \lambda_i\ge 0. This implies that the variance of some of the b_i can be zero, i.e. the zero-variance b_i are constant. In that case, b_i=\mu_{b_i}=0, simply because there is no variation about the mean. Therefore, a deformation mode associated with a shape parameter b_i with zero variance (\lambda_i=0) vanishes from the modal representation. Assuming M zero eigenvalues (counted with algebraic multiplicity), we can write the modal representation of a shape (vector) as:

X=\mu_X+\sum_{i=1}^{3N-M}b_iv_i \tag{5}

Shape approximation

A PDM is based on the modal representation of shapes (Eq. 2 or 3). This kind of representation is useful when it comes to the approximation of a shape. In this regard, only the deformation modes that are relatively more effective in forming a shape are kept, and the rest of the modes are ignored. We already showed (Eq. 5) that a zero eigenvalue of the covariance matrix leads to zero variation of the corresponding shape parameter and consequently to the vanishing of the associated mode in the modal representation. This motivates us to discard the shape parameters with relatively small variances, and hence their associated modes, thereby approximating a shape vector by a t-term compact modal representation:

\begin{gathered} X\approx \tilde X=\mu_X+\tilde Vb=\mu_X+\sum_{i=1}^{t}b_iv_i\\ \ \\ \text{s.t. }\ b\in \mathbb{R}^{t\times 1}\ , \tilde V=[v_1\quad v_2\ \dots\quad v_t] \in \mathbb{R}^{3N\times t}\ ,\text{ and } t<3N \tag{6} \end{gathered}

where the v_i are as in Eq. 2.

Moreover, by the least-squares optimality of PCA, we can approximate the shape vector as (see Eq. 3a in this post):

\begin{gathered} X\approx\tilde X =\mu_X + \tilde V\tilde V^\text T(X-\mu_X)= \mu_X +\sum_{i=1}^{t}v_iv_i^\text T(X-\mu_X) \tag{7} \\ \ \\ \text{s.t. }\ b\in \mathbb{R}^{t\times 1}\ , \tilde V=[v_1\quad v_2\ \dots\quad v_t] \in \mathbb{R}^{3N\times t}\ ,\text{ and } t<3N \end{gathered}

This indicates that for approximating a particular shape vector, the shape model parameters should be chosen as:

b=\tilde V^\text T(X-\mu_X)\qquad \text{or}\qquad b_i=v_i^\text T(X-\mu_X)\tag{8}
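Eq. 6 to 8 can be checked numerically: project the deviation from the mean onto the first t modes, rebuild the approximation, and confirm that the residual lives entirely in the discarded modes. The covariance, mean, and test vector below are random stand-ins (assumptions of this sketch).

```python
import numpy as np

rng = np.random.default_rng(1)
dim, t = 9, 3                              # 3N = 9, keep t = 3 modes
A = rng.normal(size=(dim, dim))
sigma = A @ A.T                            # stand-in for Sigma_XX
mu = rng.normal(size=dim)                  # stand-in for mu_X

eigvals, eigvecs = np.linalg.eigh(sigma)
order = np.argsort(eigvals)[::-1]
V = eigvecs[:, order]
V_t = V[:, :t]                             # tilde V in Eq. 6

X = mu + rng.normal(size=dim)              # some shape vector to approximate
b = V_t.T @ (X - mu)                       # Eq. 8
X_tilde = mu + V_t @ b                     # Eq. 6 / Eq. 7

residual = X - X_tilde
b_rest = V[:, t:].T @ (X - mu)             # coefficients of the discarded modes
```

The norm of the residual equals the norm of the discarded coefficients, which is why dropping modes with small variance costs little on average.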

But how many modes, i.e. which t, do we need? This can be determined based on the goodness-of-fit measure defined for PCA as a dimension reduction tool (Eq. 7 in this post):

g(t) = \frac{\sum_{i=t+1}^{3N} \lambda_i}{\text{Trace}(\Sigma_{XX})=\sum_{i=1}^{3N} \lambda_i}\tag{9}

Therefore, we only need to keep the first t modes having large associated eigenvalues. Actually, we hope that the first t eigenvalues are substantially larger than the rest of the eigenvalues. As a rule of thumb, g(t)\le 0.05 is acceptable, i.e. the retained modes account for at least 95% of the total variance.
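The selection rule based on Eq. 9 can be sketched as follows. The function name `choose_num_modes`, the default threshold, and the eigenvalues are illustrative assumptions.

```python
import numpy as np

def choose_num_modes(eigenvalues, threshold=0.05):
    """Return the smallest t with g(t) <= threshold, plus all values g(1..3N)."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    total = lam.sum()
    if total <= 0:
        raise ValueError("the eigenvalues must have a positive sum")
    # g[t-1] is the fraction of total variance discarded when t modes are kept.
    g = 1.0 - np.cumsum(lam) / total
    t = int(np.argmax(g <= threshold)) + 1
    return t, g

lam = np.array([60.0, 25.0, 12.0, 2.0, 0.7, 0.3])   # hypothetical spectrum
t, g = choose_num_modes(lam)
# Keeping 2 modes discards 15% of the variance, keeping 3 discards 3%.
```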

Sample shapes (data set)

In practice, the mean and covariance of the shape (random) vector X are unknown and should be estimated. Let S=\{S_1,\dots,S_n\} be a set of sample shapes (each represented by a shape vector S_i) from a particular shape family population. This is, in fact, a set of independent observations of identically distributed random variables. This set is called the training set. As a remark, it should be noted that we generally start with a set of configurations, X_c=\{C_1,\dots ,C_n\} out of a configuration population; for example the set of all dinosaurs. Then, we perform the full or the partial GPA to obtain a set of shapes/forms already denoted by S. Having said that, the following estimates are considered [2]:

\begin{gathered} \hat \mu_X=\bar S:=\frac{1}{n}\sum_{i=1}^{n} S_i \ \in \mathbb{R}^{3N\times 1}\tag{10}\\ \ \\ \hat\Sigma_{XX}= \frac{1}{n-1}\sum_{i=1}^n (S_i-\bar S)(S_i-\bar S)^\text T\ \in \mathbb{R}^{3N\times 3N} \end{gathered}

where n is the number of training shapes. The covariance estimate is a symmetric matrix; therefore, it has 3N mutually orthogonal eigenvectors. However, its rank is at most \min(n-1,3N) (properties of rank in this post), because the n centered vectors S_i-\bar S sum to zero; hence, at most n-1 of its eigenvalues are non-zero. Therefore, assuming n\le 3N, the modal representation of the training shapes, and of other shapes in the population predicted by the model built from the sample (training) shapes, is the following, in which the n-th term vanishes because its eigenvalue is zero:

X=\bar S+\sum_{i=1}^{n}b_i\hat v_i \tag{11}

where \hat v_i is an eigenvector of \hat\Sigma_{XX}.
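The estimates in Eq. 10 and the limited number of usable modes can be illustrated with synthetic training data. The random matrix `S` below merely stands in for Procrustes-aligned shape vectors (an assumption of this sketch); the relative tolerance used to count non-zero eigenvalues is also a choice made for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 5, 4                                # 5 training shapes, 4 landmarks each
S = rng.normal(size=(n, 3 * N))            # rows are shape vectors S_i (synthetic)

S_bar = S.mean(axis=0)                     # sample mean shape, Eq. 10
D = S - S_bar                              # centred training shapes
sigma_hat = D.T @ D / (n - 1)              # sample covariance, Eq. 10

eigvals, eigvecs = np.linalg.eigh(sigma_hat)
order = np.argsort(eigvals)[::-1]
lam_hat = np.clip(eigvals[order], 0.0, None)
V_hat = eigvecs[:, order]

# Count eigenvalues that are non-zero up to round-off.
num_modes = int(np.sum(lam_hat > 1e-10 * lam_hat[0]))

# Every training shape is reproduced exactly by the non-zero modes.
B = D @ V_hat[:, :num_modes]               # shape parameters of each S_i
S_rec = S_bar + B @ V_hat[:, :num_modes].T
```

With 5 training shapes in a 12-dimensional space, only a handful of eigenvalues are non-zero, so the model cannot represent deformations outside the span of the centred training shapes.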

When n is large enough, then we can use the following criterion to select the first t modes with relatively large eigenvalues:

g(t) = \frac{\sum_{i=t+1}^{n} \lambda_i}{\text{Trace}(\hat \Sigma_{XX})=\sum_{i=1}^{n} \lambda_i}\tag{12}

This criterion is similar to Eq. 9.

PDM’s with normal distributions

It can be assumed that a shape vector X, being a random vector, has a multivariate normal distribution i.e. [1]:

X\sim \mathscr N(\mu_X,\Sigma_{XX})\tag{13}

This assumption can be manipulated toward a more practical form. Let’s consider the following proposition:

If Y\in \R^{r\times 1} has a multivariate normal distribution Y\sim \mathscr N(\mu_Y,\Sigma_{YY}), A\in \R^{s\times r}, and c\in \R^{s\times 1} being a constant vector, then, the linear transformation X=AY+c is distributed as  X\sim \mathscr N(A\mu_Y+c,A\Sigma_{YY}A^\text T) .

Using the proposition and writing the shape vector in its modal form as X=\mu_X+Vb, we can conclude:

X=\mu_X+Vb\sim \mathscr N(\mu_X,\Sigma_{XX})\\ \ \\ \iff X\sim \mathscr N(V\mu_b+\mu_X,V\Sigma_{bb}V^\text T)\\ \ \\ \iff b\sim \mathscr N(0,\Sigma_{bb})

where \Sigma_{bb} is the diagonal matrix of the eigenvalues as in Eq. 3.

By the marginalization property of normal distributions:

b_i\sim \mathscr N(0,\lambda_i)

and, we can write:

\alpha_i:=\frac{b_i}{\sqrt{\lambda_i}}\implies \alpha_i\sim \mathscr N(0,1)

Therefore, the modal representation of the shape vector becomes:

X=\mu_X+\sum_{i=1}^{3N}\alpha_i\sqrt{\lambda_i}v_i\qquad \text{s.t }\ \alpha_i\sim \mathscr N(0,1) \tag{14}

For shape vector approximation and sample shape vector, the following equations (based on Eq. 6 and 11) are readily obtainable:

\tilde X=\mu_X+\sum_{i=1}^{t}\alpha_i\sqrt{\lambda_i}v_i\qquad \text{s.t }\ \alpha_i\sim \mathscr N(0,1) \tag{15}

X=\bar S+\sum_{i=1}^{n}\alpha_i\sqrt{\lambda_i}\hat v_i\qquad \text{s.t }\ \alpha_i\sim \mathscr N(0,1) \tag{16}
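Under the normality assumption, new shape vectors can be drawn by sampling standard normal coefficients and scaling them by the square roots of the eigenvalues. The sketch below uses two made-up orthonormal modes and eigenvalues; the function name `sample_shapes` is hypothetical.

```python
import numpy as np

def sample_shapes(mu, V, lam, num_samples, rng):
    """Draw shape vectors X = mu + sum_i alpha_i sqrt(lam_i) v_i, alpha_i ~ N(0, 1)."""
    alpha = rng.standard_normal(size=(num_samples, len(lam)))
    b = alpha * np.sqrt(lam)               # b_i ~ N(0, lam_i)
    return mu + b @ V.T                    # one sampled shape vector per row

rng = np.random.default_rng(3)
mu = np.zeros(6)                                       # stand-in mean shape
V = np.linalg.qr(rng.normal(size=(6, 6)))[0][:, :2]    # two orthonormal modes
lam = np.array([4.0, 1.0])                             # their variances

X = sample_shapes(mu, V, lam, 5000, rng)
b_emp = (X - mu) @ V                       # recover the sampled parameters
var_emp = b_emp.var(axis=0, ddof=1)        # should be close to lam
```

All samples lie in the two-dimensional affine subspace through the mean spanned by the two modes, regardless of how many samples are drawn.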

Shape space spanned by the PDM

A shape vector X\in \R^{3N} can be represented by any of the equations 6, 11, 15, or 16. In each of these equations, a 3N-dimensional shape vector is represented by a linear combination of basis vectors whose number is less than the dimension of the shape vector. This means that a PDM spans a space (of shape vectors) which is a proper subset of \R^{3N}, namely the affine subspace through the mean shape spanned by the retained modes. Therefore, a PDM established by training shapes can only produce shapes that arise from linear combinations of the shape modes \hat v_i. Such a model then misses the shapes (in the population) which are not captured by the training set of shapes or cannot be represented by linear combinations of the shape modes.


[1] Gaussian Process Morphable Models, Marcel Luthi, Thomas Gerig, Christoph Jud, and Thomas Vetter. IEEE Transactions on Pattern Analysis and Machine Intelligence (Volume: 40, Issue: 8, Aug. 1 2018).

[2] Image Processing and Analysis (Chapter 7: Model-Based Methods in Analysis of Biomedical Images). R. Baldock and J. Graham, Oxford University Press, 2000.


Procrustes Analyses

Some definitions*

1) The Shape of an object is all the geometrical information that remains when location, scale, and rotational effects are removed from an object. In other words, the shape of an object is invariant under Euclidean similarity transformations, which are translation, isotropic scaling, and rotation.

2) The Form of an object is all the geometrical information that remains when the location and rotational effects of an object are removed. Form is also called size-and-shape. Form is invariant under rigid-body transformations. Here, we can say that shape is the form with the size/scale information removed.

3) Same form and same shape: two objects have the same form, size-and-shape, if they can be translated and rotated relative to each other so that they exactly match. Two objects have the same shape, if they can be translated, rotated and scaled relative to each other so that they exactly match.

The words matching, registration, superimposing and fitting are equivalent.

Fig. 1. Shape vs form.

4) A landmark is a point of correspondence on each object, e.g. the tip of the nose of all the dinosaurs in a population/sample.

5) Geometric shape analysis: An object can be (discretely) represented by a collection of landmarks (a point cloud); therefore, the landmark coordinates retain (continue to have) the geometry of the point-cloud configuration. This approach to shape analysis is called geometric shape analysis.

6) The Configuration is a collection (set) of landmarks on/of a particular object. The configuration of an m-D object with k landmarks can be represented by a k\times m matrix called the configuration matrix. For a 3D object we have:

\begin{bmatrix} x_1&y_1&z_1\\ x_2&y_2&z_2\\ x_3&y_3&z_3\\ \vdots& \vdots&\vdots\\ x_k&y_k&z_k\\ \end{bmatrix}

7) A size measure of a configuration is any positive real-valued function s of the configuration matrix X such that s(aX)=as(X) for every positive scalar a\in \mathbb{R}^+.

8) The centroid size is defined as the square root of the sum of squared Euclidean distances from each landmark to the centroid of a configuration:

S(X):=\sqrt{\sum_{i=1}^{k}\|X_{i,}-\overline{X}\|^2}\tag{1}

where X_{i,} is the i-th row of X, and \overline{X}:=\frac{1}{k}\sum_{i=1}^{k}{X_{i,}}\in \mathbb{R}^m,\ m=3\text{ for 3D}. \|.\| is the Euclidean vector norm.

The centroid size can be re-written using matrix algebra:

S(X)=\|CX\|\tag{2}

where C:=I_k-\frac{1}{k} 1_k 1_k^\text{T} is called the centering matrix, with I_k being the k\times k identity matrix and 1_k\in\mathbb{R}^{k\times 1} a column vector of ones, and \|.\| is the Frobenius norm, sometimes also called the Euclidean norm, of a matrix, i.e. \|A\|:=\sqrt{\text{trace}(A^\text{T}A)}. If the centroid of X is already located at the origin, then CX=X.
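Both ways of computing the centroid size, directly from the landmark-to-centroid distances and through the centering matrix with the Frobenius norm, can be compared on a small example. The function names and the toy configuration are assumptions of this sketch.

```python
import numpy as np

def centroid_size(X):
    """Square root of the summed squared distances of the landmarks to their centroid."""
    X = np.asarray(X, dtype=float)
    return np.sqrt(np.sum((X - X.mean(axis=0)) ** 2))

def centering_matrix(k):
    """C = I_k - (1/k) 1_k 1_k^T."""
    return np.eye(k) - np.ones((k, k)) / k

# Hypothetical configuration with k = 4 landmarks in 3D.
X = np.array([[0.0, 0.0, 0.0],
              [2.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 2.0]])
C = centering_matrix(4)
s_direct = centroid_size(X)
s_matrix = np.linalg.norm(C @ X)           # Frobenius norm is NumPy's default for matrices
```

For this configuration the centroid is (0.5, 0.5, 0.5) and the squared distances sum to 9, so both computations give a centroid size of 3.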

9) The Euclidean similarity transformations of a configuration matrix X are the set of translated, rotated, and isotropically scaled X. In a mathematical notation:

\{\beta X \Gamma + 1_k\gamma^{\text T}:\beta\in \mathbb{R}^+, \Gamma\in SO(m),\gamma\in \mathbb{R}^m\}\tag{3}

where, \beta \in\mathbb{R}^{+} is the scale (factor), SO(m) is the space of rotational matrices, \gamma\in\mathbb{R}^m is the translation vector. Vectors are considered as column vectors. 1_k\in\mathbb{R}^k is a column vector of ones. m=3 for 3D.

10) Rigid-body transformations of a configuration matrix X are the set of translated and rotated X. In a mathematical notation:

\{X \Gamma + 1_k\gamma^{\text T}: \Gamma\in SO(m),\gamma\in \mathbb{R}^m\}\tag{4}

11) The centered pre-shape of a configuration matrix X is obtained (defined) by removing the location and scale/size information of the configuration. Location can be removed by centering the configuration, i.e. translating the configuration by moving its centroid to the origin. Therefore, the translated configuration is X_C:=CX. The size is removed by scaling the configuration by its size as defined in Eq. 2. Therefore, the pre-shape of X is obtained as:

Z_C:=\frac{CX}{\|CX\|}=\frac{X_C}{S(X)}\tag{5}

Removing size or scale means scaling to the unit size. Thereby, two configurations with different sizes lose their size information and also their scale. The scale is a quantity that expresses the ratio between the sizes of two configurations, either two with different shapes or two with the same shape (one is the scaled version of the other one). In another scenario (like the full Procrustes analyses), if two or more configurations are scaled with different scale factors, they also lose their scale/size information (relative to each other), although their sizes may not be unit.
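The pre-shape construction, centering followed by division by the centroid size, is short enough to write out. The function name `pre_shape` and the random test configuration are assumptions of this sketch.

```python
import numpy as np

def pre_shape(X):
    """Centre a (k, m) configuration and scale it to unit centroid size."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                # same result as multiplying by the centering matrix
    size = np.linalg.norm(Xc)              # centroid size
    if size == 0:
        raise ValueError("all landmarks coincide; the size is zero")
    return Xc / size

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 3))                # synthetic configuration
Z = pre_shape(X)
# Translating and isotropically scaling X leaves the pre-shape unchanged.
Z_moved = pre_shape(3.0 * X + np.array([1.0, -2.0, 0.5]))
```

The pre-shape still carries rotation information, so two rotated copies of the same object have different pre-shapes but the same shape.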

12) Reworded definition of a shape: A shape is an equivalence class of geometrical figures/objects/configurations modulo the similarity transformations (i.e. what remains after they are removed). The shape of X is denoted by [X]. In order to visualise the shape of a configuration, a representative of the equivalence class is considered and called the icon, also denoted by [X].

Two configurations X\in\mathbb{R}^{k\times m} and X'\in\mathbb{R}^{k\times m} have the same shape, i.e. [X]=[X'] iff there exist \beta,\Gamma,\text{ and }\gamma such that:

X'=\beta X \Gamma + 1_k\gamma^{\text T}

In other words, two configurations have the same shape if they perfectly match after the removal of their location, rotation, and size/scale.

Fig. 2. An equivalence class of configurations/objects modulo the similarity transformation and its icon. All configurations with the same shape are in the same class/set.

13) Shape space is the space/set of all shapes, i.e. equivalence classes. For example \{[X_1], [X_2] , [X_3],...\}. All have their locations, rotations, and size removed.

14) Reworded definition of form (size-and-shape): form is an equivalence class of geometrical configurations modulo translation and rotation. The form/size-and-shape of a configuration/object X is denoted by [X]_S, which also denotes an icon for the class. Two configurations X\in\mathbb{R}^{k\times m} and X'\in\mathbb{R}^{k\times m} have the same form, i.e. [X]_S=[X']_S iff there exist \Gamma\text{ and }\gamma such that:

X'=X\Gamma + 1_k\gamma^{\text T}

Fig. 3. An equivalence class of configurations/objects modulo rotation and translation and its icon. All configurations with the same form are in the same class/set.

15) Size-and-Shape/Form space is the space/set of all forms, i.e. equivalence classes. For example \{[X_1]_S, [X_2]_S , [X_3]_S,...\}.

16) By removing the size of a form (scaling to unit centroid size) of a configuration, the shape of the configuration is obtained.

17) Clarification on the term “removing”: let’s imagine a set of n configurations (n\ge 1). We want to remove the scale, location, and rotation(al) information of the configurations. Removal of scale/size is to scale all the configurations to unit size (each divided by for example its centroid size).

Removal of location is to translate all the configurations to a particular target point/location in the space by selecting the same single landmark or point of reference (like the centroid) of each configuration and (rigidly) translating the configuration along the vector from the point of reference to the target point. The target point can be the origin of the coordinate system/frame of reference.

Removal of rotation of the configurations is done by optimising over rotations, i.e. by minimising some criterion. For instance, to remove the rotations (rotational information) of two configurations with their locations (and maybe scales/sizes) already removed, a particular relative rotation has to be applied to them (e.g. keep one stationary and rotate the other one) so that they get as close as possible with respect to some definition of distance. Once the rotation is removed, any deviation of one configuration from the other one originates from their locations, scales/sizes, and their shapes. To compare shapes, we also have to remove the locations and scales of the configurations. Then any deviation is due to the shape of the configurations. If only location and rotational information are removed, then any relative deviation originates from the sizes and shapes (form) of the configurations.

Matching two configurations and measure of distance between shapes

Consider a collection/set of shapes, as defined above, all having the same number of landmarks, with the landmarks in one-to-one correspondence. An important question is how to define a distance between two shapes in the set; it is natural to define such a measure between the members of a set. The notion of shape distance indicates the difference between two shapes, i.e. how far two shapes are from each other in the shape space.

A natural approach is to match/fit/superimpose/register a configuration X to a configuration Y (same number of landmarks and in correspondence) using the similarity transformations applied on X, and then the magnitude of difference between the fitted configuration of X, i.e. \widehat{X} and Y indicates the magnitude of difference in shape between the two configurations X and Y.

Fig. 5. Configuration X transforms up to similarity transformation to get fitted onto configuration Y.

So how do we match two configurations? The full ordinary Procrustes analysis (FOPA) is the method to match/register/superimpose/fit two (m-D) configurations with the same number of landmarks as closely as possible onto each other. This method applies the similarity transformations to one of the configurations to closely match the other one. But what is the measure of closeness? The definition of the method makes it clear:

The method of FOPA matches two configurations through transforming one onto the other one (target configuration) up to the similarity transformations and minimizing the squared Frobenius/Euclidean norm of the difference between the transforming and the target configurations. This means to minimize the following expression:

D_{FOPA}^2(X_1,X_2):=\|X_2-(\beta X_1\Gamma+1_k\gamma^\text{T})\|^2\tag{6}

The minimum value of D_{FOPA} is a measure of similarity/dissimilarity. Observe that X_1 is the transforming configuration (input) and X_2 is the target configuration. Also, all the similarity transformations are involved in the registration; this is why the term full is used. In order to find the matching configuration, the similarity parameters should be determined, i.e.:

(\hat\beta, \hat\Gamma, \hat\gamma)=\argmin_{\beta,\Gamma,\gamma}\|X_2-\beta X_1\Gamma-1_k\gamma^\text{T}\|^2\tag{7}

It is good to know that \hat\gamma=0 if the configurations are already centered.

The full Procrustes fit of X_1 onto X_2 is then:

X_1^{FP}=\hat\beta X_1\hat\Gamma+1_k\hat\gamma^\text{T}\tag{8}
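The minimisation in Eq. 7 has a well-known closed-form solution based on the singular value decomposition of the cross-product of the centred configurations. The sketch below follows that standard approach; the function name `fopa` and the synthetic test data are assumptions of this example, and the sign correction keeps the result a proper rotation (determinant +1) rather than a reflection.

```python
import numpy as np

def fopa(X1, X2):
    """Full ordinary Procrustes fit of X1 onto X2 (rows are landmarks)."""
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    A, B = X1 - m1, X2 - m2                 # centred configurations
    U, s, Vt = np.linalg.svd(A.T @ B)
    d = np.ones(len(s))
    if np.linalg.det(U @ Vt) < 0:           # avoid a reflection
        d[-1] = -1.0
    Gamma = U @ np.diag(d) @ Vt             # optimal rotation
    beta = np.sum(s * d) / np.sum(A ** 2)   # optimal isotropic scale
    gamma = m2 - beta * m1 @ Gamma          # optimal translation
    fit = beta * X1 @ Gamma + gamma         # Procrustes fit, Eq. 8
    return beta, Gamma, gamma, fit

# Synthetic check: X2 is an exact similarity transform of X1.
rng = np.random.default_rng(5)
X1 = rng.normal(size=(6, 3))
theta = 0.7
R = np.array([[np.cos(theta), np.sin(theta), 0.0],
              [-np.sin(theta), np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
shift = np.array([1.0, -2.0, 3.0])
X2 = 2.0 * X1 @ R + shift
beta, Gamma, gamma, fit = fopa(X1, X2)
```

Because X2 was built from X1 by a known scale, rotation, and translation, the recovered parameters should match them and the fit should coincide with X2.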

It is tempting to choose the minimum of Eq. 6 as a measure of distance between the shapes of the two configurations. However, \min\ D_{FOPA}^2(X_1,X_2)\neq \min\ D_{FOPA}^2(X_2,X_1) in general, i.e. the minimum is not symmetric in its arguments, unless both configurations have the same size, for example the unit size obtained by pre-scaling. Therefore, the full Procrustes distance d_F between two shapes (of two configurations) is defined as:

d_F^2(X_1,X_2):=\min\ D_{FOPA}^2(\frac{X_1}{\|X_1\|},\frac{X_2}{\|X_2\|})\tag{9}

FOPA involves isotropic scaling of the input configuration, in addition to the other two similarity transformations, to register it onto the target configuration. This means that the fitted configuration does not have the same size/scale as its original configuration. Therefore, FOPA removes the size of the input configuration.

Another scenario of fitting the input configuration to the target one is to register while preserving the size/scale of the input, i.e. not to scale the input (the scale of the target is also preserved). In this regard, partial ordinary Procrustes analysis (POPA) is a method of registration only over translation and rotation to match two configurations. This involves the minimization of:

D_{POPA}^2(X_1,X_2):=\|X_2-(X_1\Gamma+1_k\gamma^\text{T})\|^2\tag{10}

The partial Procrustes fit of X_1 onto X_2 is then (see Eq. 7 and 8 with \beta=1), where \hat\Gamma and \hat\gamma minimize Eq. 10:

X_1^{PP}=X_1\hat\Gamma+1_k\hat\gamma^\text{T}\tag{11}

Based on POPA, the partial Procrustes distance can be defined. If the two configurations have the same size, e.g. by being pre-scaled to the unit size (i.e. the size is removed), then the following is defined as the partial Procrustes distance between two shapes (of two configurations), d_P(X_1,X_2):

d_P^2(X_1,X_2):=\min\ D_{POPA}^2(\frac{X_1}{\|X_1\|},\frac{X_2}{\|X_2\|})\tag{12}

Calculating the POPA distance, unlike the FOPA distance, does not involve additional scaling of the (pre-scaled) unit-sized configurations, i.e. \frac{X_1}{\|X_1\|},\frac{X_2}{\|X_2\|}, in the minimization of D_{POPA}^2. Nevertheless, obtaining the POPA distance still entails removing the scale of the configurations.

Remark: the two defined distances have different values.
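For unit-size, centred configurations both distances can be written in terms of c, the largest value of trace(Z_2^T Z_1 \Gamma) over rotations: minimising over the scale gives d_F^2 = 1 - c^2, while fixing the scale to one gives d_P^2 = 2 - 2c. The sketch below uses these closed forms; the function names are hypothetical, and the configurations are centred before scaling (an assumption, so that the norm equals the centroid size).

```python
import numpy as np

def _pre_shape(X):
    Xc = np.asarray(X, dtype=float)
    Xc = Xc - Xc.mean(axis=0)
    return Xc / np.linalg.norm(Xc)

def _max_trace(Z1, Z2):
    """Maximum of trace(Z2^T Z1 Gamma) over rotations Gamma in SO(m)."""
    U, s, Vt = np.linalg.svd(Z1.T @ Z2)
    s = s.copy()
    if np.linalg.det(U @ Vt) < 0:           # restrict to proper rotations
        s[-1] = -s[-1]
    return float(np.sum(s))

def full_procrustes_distance(X1, X2):
    c = _max_trace(_pre_shape(X1), _pre_shape(X2))
    return np.sqrt(max(0.0, 1.0 - c ** 2))

def partial_procrustes_distance(X1, X2):
    c = _max_trace(_pre_shape(X1), _pre_shape(X2))
    return np.sqrt(max(0.0, 2.0 - 2.0 * c))

rng = np.random.default_rng(7)
X1 = rng.normal(size=(6, 3))
X2 = rng.normal(size=(6, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(Q) < 0:
    Q[:, 0] = -Q[:, 0]                      # turn Q into a proper rotation
X1_similar = 2.5 * X1 @ Q + np.array([1.0, 2.0, 3.0])

d_f = full_procrustes_distance(X1, X2)
d_p = partial_procrustes_distance(X1, X2)
```

Since d_P^2 - d_F^2 = (1 - c)^2, the full distance never exceeds the partial one, and both vanish exactly when the two configurations have the same shape.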

Mean shape

First some motivation from the arithmetic mean of numbers. Every dinosaur knows that the arithmetic mean/average of n real numbers is defined as: \hat\mu:=\frac{1}{n} \sum_{1}^{n} x_i. Let’s define a function as g(\mu):=(\frac{1}{n} \sum_{1}^{n} x_i)-\mu. In other words:

g(\mu):=\frac{1}{n} \sum_{i=1}^{n} (x_i-\mu)

It is now obvious that if we let \mu=\hat\mu then g(\mu)=0. This is what we want out of defining the function. However, zero is not the minimum of g(\mu), as this function is strictly decreasing in \mu and hence has no local or global minimum (just take the derivative w.r.t. \mu). Let's define a function as:

f(\mu):=\frac{1}{n} \sum_{i=1}^{n} (x_i-\mu)^2

What minimizes f(\mu)? Solving \frac{\mathrm d}{\mathrm d \mu}f(\mu)=0, we get:

\frac{\mathrm d}{\mathrm d \mu}f(\mu)=\frac{1}{n} \sum_{i=1}^{n} -2(x_i-\mu)=0 \newline \implies \frac{1}{n} \sum_{i=1}^{n} (x_i-\mu)=0

This is just g(\mu)=0 which has the answer \hat\mu. The second derivative of f(\mu) is \frac{\mathrm d^2}{\mathrm d \mu^2}f(\mu)=2>0 which indicates that \hat\mu minimizes f(\mu).

Using the Euclidean distance/metric/norm of real numbers, i.e. d_E(x_1,x_2):=|x_1-x_2| with |.| being the absolute value, we can re-define f(\mu) as:

f(\mu):=\frac{1}{n} \sum_{i=1}^{n} |x_i-\mu|^2=\frac{1}{n} \sum_{1}^{n} d_{E}^2(x_i,\mu)

This results in (it can be easily proved by equating the derivative with zero):

\hat\mu=\argmin_{\mu}\frac{1}{n} \sum_{i=1}^{n} d_{E}^2(x_i,\mu)
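A quick numerical check of this statement: evaluating f on a fine grid and picking the grid point with the smallest value lands on the arithmetic mean. The data values and the grid are arbitrary choices made for this illustration.

```python
import numpy as np

x = np.array([1.0, 4.0, 7.0, 10.0])        # arbitrary sample of real numbers

def f(mu):
    """Mean of squared Euclidean distances from the sample to mu."""
    return np.mean((x - mu) ** 2)

mu_hat = x.mean()                           # arithmetic mean, 5.5 here
grid = np.linspace(0.0, 12.0, 1201)         # candidate values, step 0.01
mu_grid = grid[np.argmin([f(m) for m in grid])]
```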

This motivation suggests that for a set of objects with a defined distance function/metric, a mean/average object can be imagined. This average object is such an object that minimizes the average sum of squared distances from all objects to the average object. The Mean shape of a set of shapes sampled from a population of shapes can be defined using the distance functions as defined by Eq. 9 and Eq. 12:

The sample full Procrustes mean shape is defined as:

\hat\mu_F:=\arg\inf_{\mu}\frac{1}{n} \sum_{i=1}^{n} d_{F}^2(X_i,\mu)\tag{13}

The sample partial Procrustes mean shape is defined as:

\hat\mu_P:=\arg\inf_{\mu}\frac{1}{n} \sum_{i=1}^{n} d_{P}^2(X_i,\mu)\tag{14}

In words, the mean shape is the shape that minimizes the mean of the squared distances to the shapes of the configurations.

Note that in both definitions, the configurations will be pre-scaled to the unit size (due to the definition of the metrics).

Remark: the mean shape refers to the mean of the shapes not the mean of configurations.

Matching more than two configurations: generalized Procrustes analyses

The term FOPA/POPA refers to Procrustes matching of one configuration onto another. If there are more than two configurations, it is natural to register/match them all to a mean shape or mean form. To do so, generalized Procrustes analysis (GPA) is utilized to superimpose two or more configurations onto a common unknown mean shape/form. GPA has two methods: full and partial GPA. The GPA provides a practical method of computing the sample mean shape defined by Eq. 13 or 14. Unlike the full/partial OPA, the output of GPA (the Procrustes fits) does not depend on the order of superimposition.

Full GPA

The method of full GPA transforms n\geq 2 configurations up to the similarity transformations (scaling + rotation + translation) with respect to a common unknown mean shape [\mu] in order to minimise the following total sum of squares with respect to the transformation parameters and \mu, while imposing a size constraint to avoid a trivial solution. A trivial solution can be all \beta_i close to zero (the scale is not allowed to be exactly zero, so that no configuration vanishes) and no translation; the mean shape would then have a size close to zero, i.e. it would almost vanish.

G(X_1,\dots,X_n):=\sum_{i=1}^{n} \|\beta_iX_i\Gamma_i+1_k\gamma_i^\text{T}-\mu\|^2\tag{15} \newline \text{subject to}\quad \sum_{i=1}^{n}S^2(\beta_iX_i\Gamma_i+1_k\gamma_i^\text{T})=\sum_{i=1}^{n}S^2(X_i)

In the above expression, \beta_i,\Gamma_i,\gamma_i, \text{ and } \mu are unknown. Once the minimization is solved, the estimates of the optimum/minimizing parameters, \hat\beta_i,\hat\Gamma_i,\hat\gamma_i, and the estimate of the mean shape, \hat\mu, are found. Note that if we already knew the population mean shape, we would just align each configuration separately to that mean shape.

Once the above minimization is solved, the full (generalized) Procrustes fit of each configuration (onto the estimated mean) is:

X_i^{FP}=\hat\beta_iX_i\hat\Gamma_i+1_k\hat\gamma_i^\text{T},\quad i=1,\dots,n\tag{16}

Note that the scale \hat\beta_i is not necessarily the same for every transformed configuration (full Procrustes fit). Therefore, full GPA does not preserve the relative scale of the configurations through their optimal transformations. This means that if configuration X_i is \alpha times larger than configuration X_j, then after the optimal transformation and fitting to the mean shape their size ratio becomes \alpha\hat\beta_i/\hat\beta_j, which in general differs from \alpha.

Assuming that \hat\beta_i,\hat\Gamma_i,\hat\gamma_i,\hat\mu minimize expression 15, we can write:

\hat\mu=\argmin_{\mu}\sum_{i=1}^{n} \|\hat\beta_iX_i\hat\Gamma_i+1_k\hat\gamma_i^\text{T}-\mu\|^2\text{ or}\tag{17} \newline \hat\mu=\argmin_{\mu}\sum_{i=1}^{n} \|X_i^{FP}-\mu\|^2

The solution is then (see the proof here):

\hat\mu=\frac{1}{n}\sum_{i=1}^{n} X_i^{FP}\tag{18}

which is the arithmetic mean of the Procrustes fits. This means that once the configurations are registered to the estimate of the unknown mean, their arithmetic mean/average equals the estimated mean.
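The statement that the arithmetic mean minimizes the sum of squared Frobenius distances can be checked numerically. The following is a minimal sketch; the random arrays are hypothetical stand-ins for Procrustes fits, and the names `fits`, `sum_of_squares`, and `mean_fit` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for n Procrustes fits, each a k x m matrix.
n, k, m = 5, 4, 3
fits = rng.normal(size=(n, k, m))

def sum_of_squares(mu, fits):
    """Sum of squared Frobenius distances between each fit and mu."""
    return float(np.sum((fits - mu) ** 2))

# Candidate minimizer: the arithmetic mean of the fits (Eq. 18).
mean_fit = fits.mean(axis=0)
best = sum_of_squares(mean_fit, fits)

# Moving away from the mean in random directions should increase the sum.
perturbed = [
    sum_of_squares(mean_fit + 0.1 * rng.normal(size=(k, m)), fits)
    for _ in range(100)
]
print(best, min(perturbed))
```

Every perturbed value should exceed `best`, because shifting the mean by a matrix d adds n times the squared Frobenius norm of d to the sum.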

Now we want to show that the estimated mean shape \hat\mu=\frac{1}{n}\sum_{i=1}^{n} X_i^{FP} has the same shape as the sample full Procrustes mean shape defined by Eq. 13. Without loss of generality, we pre-scale the configurations X_i to the unit size, and replace the size constraint with S^2(\mu)=1.

Since Eq. 15 is a sum of non-negative reals, the optimum transformation parameters should also minimize each term in the sum. This means that \hat\beta_i,\hat\Gamma_i,\hat\gamma_i minimize each \|\beta_iX_i\Gamma_i+1_k\gamma_i^\text{T}-\mu\|^2. Therefore, following Eq. 13, the sample full Procrustes mean shape of the full Procrustes fits is:

\hat\mu_F:=\arg\inf_{\mu}\frac{1}{n} \sum_{i=1}^{n}\min_{\beta_i,\Gamma_i,\gamma_i}\|(\beta_i X_i\Gamma_i+1_k\gamma_i^\text{T})-\mu\|^2 \newline =\arg\inf_{\mu}\frac{1}{n} \sum_{i=1}^{n}\|(\hat\beta_i X_i\hat\Gamma_i+1_k\hat\gamma_i^\text{T})-\mu\|^2 \newline=\arg\inf_{\mu}\sum_{i=1}^{n}\|X_i^{FP}-\mu\|^2

The positive constant factor \frac{1}{n} does not change the minimizer, so it can be dropped in the last step. Therefore (by Eq. 17 and 18), \hat\mu_F and \frac{1}{n}\sum_{i=1}^{n} X_i^{FP} can differ at most by a scaling, namely the one needed to meet the size constraint S^2(\mu)=1. These two have the same shape, which means there is a similarity transformation that perfectly superimposes them on each other. Here, only a scaling is needed.

Estimating the transformation parameters of the full GPA method, in order to estimate the sample full Procrustes mean shape and to register all configurations to the estimated mean shape, is then equivalent to the following minimization problem:

\inf_{\beta_i,\Gamma_i,\gamma_i}\sum_{i=1}^{n} \|(\beta_i X_i\Gamma_i+1_k\gamma_i^\text{T}) - \frac{1}{n} \sum_{j=1}^{n} (\beta_j X_j\Gamma_j+1_k\gamma_j^\text{T})\|^2 \newline \sum_{i=1}^{n}S^2(\beta_iX_i\Gamma_i+1_k\gamma_i^\text{T})=\sum_{i=1}^{n}S^2(X_i)

It can be proved that the above is equal to (see the proof here):

\inf_{\beta_i,\Gamma_i,\gamma_i}\frac{1}{n}\sum_{i=1}^{n-1} \sum_{j=i+1}^{n}\|(\beta_i X_i\Gamma_i+1_k\gamma_i^\text{T}) - (\beta_j X_j\Gamma_j+1_k\gamma_j^\text{T})\|^2 \newline \sum_{i=1}^{n}S^2(\beta_iX_i\Gamma_i+1_k\gamma_i^\text{T})=\sum_{i=1}^{n}S^2(X_i)

The above means that the full GPA minimizes the sum of pairwise squared Frobenius distances between all the transformed versions of the configurations (full Procrustes fits) in the sample. It should be noted that the output of the full GPA is a set of configurations that are translated, rotated, and scaled versions of the original configurations. Hence, the scale, location, and rotational information of each configuration is removed. The output configurations are registered onto each other such that the sum of pairwise squared Frobenius distances between all of them is minimized. If the scale of each output configuration is removed by re-scaling it to unit size, then the full Procrustes distance between any two shapes (of the output configurations), and also between each of them and the estimated mean shape (once re-scaled to unit size), can be calculated.
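The equivalence between the sum of squared distances to the arithmetic mean and the sum of pairwise squared distances (over pairs i < j, divided by n) can be checked numerically. This is only a sanity check under assumed random data, not a proof; the array `Y` is a hypothetical stand-in for the transformed configurations.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

# Hypothetical stand-ins for n transformed configurations (k x m each).
n, k, m = 6, 5, 3
Y = rng.normal(size=(n, k, m))

# Sum of squared Frobenius distances to the arithmetic mean.
to_mean = np.sum((Y - Y.mean(axis=0)) ** 2)

# Sum of squared Frobenius distances over all pairs i < j, divided by n.
pairwise = sum(
    np.sum((Y[i] - Y[j]) ** 2) for i, j in combinations(range(n), 2)
) / n

print(to_mean, pairwise)
```

The two printed numbers should agree up to floating-point rounding.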

Remark: the full GPA does not minimize the (squared) Frobenius distance between any two particular full Procrustes fits; it minimizes the sum of the squared Frobenius distances between each full Procrustes fit and the rest of them in the set. It does minimize the (squared) Frobenius distance between each full Procrustes fit and the (full Procrustes) mean shape though.

Remark: The full GPA removes the scale of the configurations by scaling them in order to make them closer to the mean shape. The scales are not the same for all the configurations and hence the full GPA does not preserve the (original) scale (relative size) of the configurations. The full Procrustes fits and the estimated mean shape may not have unit size. However, they are scaled relative to each other in a way that minimizes the sum of squared distances. This is the notion of removing the scale information (of the original configurations). Therefore, the Procrustes fits represent the shape (information) of the configurations, and the differences between each of them and the mean shape (between corresponding landmarks) are purely due to their shape variations.

In summary, the full GPA of configurations leads to a mean shape and full Procrustes fits such that each full Procrustes fit is superimposed onto the mean shape through the similarity transformations; hence, whatever geometrical variations/discrepancies remain between each full Procrustes fit and the mean shape are purely due to their shapes.

An algorithm for full GPA is as follows:

1- Translations/Removing locations: remove the location of each configuration by centring it (each configuration is translated so that its centroid lies at the origin) and initially let each Procrustes fit be:

X_i^{FP}= CX_i\text{ for } i=1,…,n

2- Estimate \hat\mu as \frac{1}{n}\sum_{i=1}^{n} X_i^{FP}.

3- Rotations and scales: perform a FOPA on each centred configuration X_i^{FP} to rotate and scale it to \hat\mu. Each newly transformed configuration, X_i^{*FP}, is now fitted to the estimated mean.

{X}_i^{*FP}=\hat\beta_i X_i^{FP} \hat\Gamma_i

\hat\beta_i and \hat\Gamma_i are transformation parameters out of the FOPA. Note that the rotation is about the origin, and no translation is needed as the locations are already removed.

4- Update the mean shape: \hat\mu=\frac{1}{n}\sum_{i=1}^{n} {X}_i^{*FP}.

5- Let X_i^{FP}={X}_i^{*FP}. Re-centring is not necessary because the centroid of each configuration at this stage is already located at the origin and the rotations are also about the origin. Re-scaling won’t change the location of the centroid because it is isotropic.

6- Repeat 3 to 5 until everyone’s happy, i.e. the change in the Procrustes sum of squares (\sum_{i=1}^{n} \|\beta_iX_i\Gamma_i+1_k\gamma_i^\text{T}-\mu\|^2) from one iteration to the next one becomes less than a particular tolerance.

Remark: The configurations can be pre-scaled to have the unit size but it is not necessary because their size is removed once they get scaled to better fit the estimated mean at each iteration.
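The steps above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the function names (`centre`, `fopa`, `full_gpa`) are hypothetical, the FOPA step uses the usual SVD solution restricted to proper rotations, and, as an added assumption, all fits are rescaled by one common factor after each pass so that the size constraint of Eq. 15 holds and the iteration cannot drift towards the trivial vanishing solution.

```python
import numpy as np

def centre(X):
    """Step 1: translate a k x m configuration so its centroid is at the origin."""
    return X - X.mean(axis=0, keepdims=True)

def fopa(X, mu):
    """Rotate and scale a centred X onto a centred mu (SVD-based solution)."""
    U, s, Vt = np.linalg.svd(X.T @ mu)
    D = np.ones_like(s)
    D[-1] = np.sign(np.linalg.det(U @ Vt))   # force det(Gamma) = +1
    Gamma = U @ np.diag(D) @ Vt
    beta = np.sum(s * D) / np.sum(X ** 2)
    return beta * X @ Gamma

def full_gpa(configs, tol=1e-12, max_iter=100):
    fits = np.stack([centre(X) for X in configs])        # step 1
    total_size = np.sum(fits ** 2)                       # sum of S^2(X_i)
    mu = fits.mean(axis=0)                               # step 2
    prev = np.inf
    for _ in range(max_iter):
        fits = np.stack([fopa(F, mu) for F in fits])     # step 3
        # Assumption: one common rescaling to satisfy the size constraint.
        fits *= np.sqrt(total_size / np.sum(fits ** 2))
        mu = fits.mean(axis=0)                           # step 4
        ss = np.sum((fits - mu) ** 2)                    # Procrustes sum of squares
        if abs(prev - ss) < tol:                         # step 6
            break
        prev = ss                                        # step 5 is implicit
    return fits, mu

# Usage: four similarity-transformed copies of one hypothetical base shape.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 3))
configs = []
for _ in range(4):
    Q = np.linalg.qr(rng.normal(size=(3, 3)))[0]
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1                                    # make Q a proper rotation
    configs.append(rng.uniform(0.5, 2.0) * base @ Q + rng.normal(size=(1, 3)))

fits, mu = full_gpa(configs)
residual = np.sum((fits - mu) ** 2)
print(residual)
```

Because the example configurations differ only by similarity transformations, their full Procrustes fits should coincide with the estimated mean and the residual should be numerically zero. With real, noisy landmark data the residual stays positive and reflects shape variation.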

Partial GPA

Partial GPA involves merely registering a set of configurations by translation and rotation (not scaling). Partial GPA preserves the scale/size of the configurations, as opposed to full GPA. This method is appropriate for the analysis of forms, i.e. size-and-shapes. Partial GPA is carried out by minimizing the following sum with respect to the transformation parameters (\Gamma_i, \gamma_i) and an unknown form \mu. A size constraint is not needed as there is no scaling involved.

G_P(X_1,…X_n):=\sum_{i=1}^{n} \|X_i\Gamma_i+1_k\gamma_i^\text{T}-\mu\|^2\tag{19}

Once the above minimization is solved, the partial Procrustes fit of each configuration onto the common form is:

X_i^{PP}=X_i\hat\Gamma_i+1_k\hat\gamma_i^\text{T}\tag{20}

Similar to Eq. 17 and 18, we can write:

\hat\mu=\frac{1}{n}\sum_{i=1}^{n} X_i^{PP}\tag{21}

However, \hat\mu is not going to have the same shape as the mean shapes previously defined (sample full/partial Procrustes mean shape). \hat\mu is then a form whose sum of squared Frobenius distances to the partial Procrustes fits is minimal. Some dinosaur and I think that this partial GPA is appropriate for group-registering a set of configurations onto each other; however, we do not know its application at this moment.

The algorithm for partial GPA is the same as that of the full GPA except that the scaling is omitted, i.e. every \beta_i is set to 1.
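A rotation-only variant of the iteration can be sketched as follows. Again this is a minimal sketch under assumptions: the function names (`centre`, `rotate_onto`, `partial_gpa`) are hypothetical, and the rotation step uses the SVD solution restricted to proper rotations. No size constraint is imposed because no scaling takes place.

```python
import numpy as np

def centre(X):
    """Translate a k x m configuration so its centroid is at the origin."""
    return X - X.mean(axis=0, keepdims=True)

def rotate_onto(X, mu):
    """Rotation-only fit of a centred X onto a centred mu (beta fixed to 1)."""
    U, s, Vt = np.linalg.svd(X.T @ mu)
    D = np.ones_like(s)
    D[-1] = np.sign(np.linalg.det(U @ Vt))   # force a proper rotation
    return X @ (U @ np.diag(D) @ Vt)

def partial_gpa(configs, tol=1e-12, max_iter=100):
    fits = np.stack([centre(X) for X in configs])
    mu = fits.mean(axis=0)
    prev = np.inf
    for _ in range(max_iter):
        fits = np.stack([rotate_onto(F, mu) for F in fits])
        mu = fits.mean(axis=0)
        ss = np.sum((fits - mu) ** 2)
        if abs(prev - ss) < tol:
            break
        prev = ss
    return fits, mu

# Usage: copies of one hypothetical base form with different sizes and poses.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 3))
scales = [1.0, 2.0, 0.5]
configs = []
for c in scales:
    Q = np.linalg.qr(rng.normal(size=(3, 3)))[0]
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1                          # make Q a proper rotation
    configs.append(c * base @ Q + rng.normal(size=(1, 3)))

fits, mu = partial_gpa(configs)
sizes = np.sqrt(np.sum(fits ** 2, axis=(1, 2)))  # centroid size of each fit
print(sizes)
```

In this example the partial Procrustes fits keep their original centroid sizes, so they end up aligned in orientation and location but still differ by the scale factors in `scales`.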

* The definitions follow the book “Statistical Shape Analysis, Ian L. Dryden, Kanti V. Mardia”, and the article “Procrustes Methods in the Statistical Analysis of Shape, Colin Goodall”. I may not have paraphrased them all; hence, some sentences may be subject to copyright and they are not intended to be reused in official publications.