Manifold Warping: Manifold Alignment Over Time

Published in AAAI'12 Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence
07/22/2012

Abstract

Knowledge transfer is computationally challenging, due in part to the curse of dimensionality, compounded by source and target domains expressed using different features (e.g., documents written in different languages). Recent work on manifold learning has shown that data collected in real-world settings often have high-dimensional representations, but lie on low-dimensional manifolds. Furthermore, data sets collected from similar generating processes often present different high-dimensional views, even though their underlying manifolds are similar. The ability to align these data sets and extract this common structure is critical for many transfer learning tasks. In this paper, we present a novel framework for aligning two sequentially-ordered data sets, taking advantage of a shared low-dimensional manifold representation. Our approach combines traditional manifold alignment and dynamic time warping algorithms using alternating projections. We also show that the previously-proposed canonical time warping algorithm is a special case of our approach. We provide a theoretical formulation as well as experimental results on synthetic and real-world data, comparing manifold warping to other alignment methods.

The advent of large, often high-dimensional, digital data sets has made automated knowledge extraction a critical research focus in the field of machine learning. Often, we find real-world sequential data sets that encode the same information with disparate surface feature representations, such as sensor network data, activity and object recognition corpora, and audio/video streams. In these cases, an automated technique for discovering correlations between sets will allow easy transfer of knowledge from one domain to another, avoiding costly or infeasible re-learning. In this paper, we present a framework that combines manifold alignment (Ham 2005, Wang 2009) and dynamic time warping (DTW) (Sakoe 1978) for aligning two such sequential data sets. We also show that the previously proposed method of canonical time warping (CTW) (Zhou 2009) is a special case of our approach.

Dynamic time warping has been used effectively for time-series alignment, but it requires an inter-set distance function, which usually implies that both input data sets must have the same dimensionality. DTW may also fail under arbitrary affine transformations of one or both inputs.

CTW aims to solve these two problems by alternating between canonical correlation analysis (CCA) (Anderson 2003) and DTW until convergence, as illustrated in Figure \ref{fig:CTWillustration}. In the case where inputs are of different dimensions, CTW first projects both data sets into a shared space using principal component analysis (PCA) (Jolliffe 2002). The algorithm does not always converge to a global optimum, but CTW still improves the performance of the alignment when compared to applying DTW directly. However, CTW fails when the two related data sets require nonlinear transformations to uncover the shared manifold space. We illustrate such a case in Figure \ref{fig:NonlinearManifoldIllustration}, in which the two sequential data sets are \(sin(x)\) curves: one lying on a plane and the other on a Swiss roll. In this case, the CCA projection that CTW relies on will fail to unroll the second curve, making a good DTW alignment impossible.

We first provide a brief overview of manifold alignment and dynamic time warping before combining the two to formulate manifold warping.

In manifold alignment (Wang 2009), we are given two data sets \(X \in \mathbb{R}^{n_{X} \times d_{X} }\) and \(Y \in \mathbb{R}^{n_{Y} \times d_{Y}}\), where \(\mathbb{R}^{m \times n}\) denotes an \(m \times n\) real matrix. In \(X\) and \(Y\), each row is an *instance*. \(W^{(X)} \in \mathbb{R}^{n_{X} \times n_{X} }\) and \(W^{(Y)} \in \mathbb{R}^{n_{Y} \times n_{Y}}\) are similarity matrices that give the similarities between instances in \(X\) and in \(Y\), respectively. These two matrices are usually constructed as adjacency matrices of nearest-neighbor graphs, optionally weighted by a heat kernel function. In addition, we are given a correspondence matrix \(W^{(X,Y)} \in \mathbb{R}^{n_{X} \times n_{Y}}\) that specifies the correspondences between instances in \(X\) and \(Y\). Typically, \[W^{(X,Y)}_{i,j}=\left\{ \begin{array}{ll}
1 & \mbox{if $X_{i}$ corresponds to $Y_{j}$} \\
0 & \mbox{otherwise}
\end{array}
\right.\] Suppose we have a mapping that takes \(X,Y\) to \(F^{(X)} \in \mathbb{R}^{n_{X}\times d},F^{(Y)}\in \mathbb{R}^{n_{Y}\times d}\) in a latent space of dimension \(d\leq \min(d_{X},d_{Y})\). Note that in terms of the underlying matrix representation, any row \(i\) of \(X\) is mapped to row \(i\) of \(F^{(X)}\), and similarly for \(Y\) and \(F^{(Y)}\). We define the following loss function for the mapping. The first term says that corresponding points across data sets should remain close to each other in the embedding. The last two terms say that, within each input set, points close in the original space should remain close in the embedding. The factor \(\mu\) controls how much we preserve inter-set correspondences versus local geometry. \[\begin{aligned}
L_1\left(F^{(X)},F^{(Y)}\right) =
\mu\sum_{i \in X, j \in Y} {||F^{(X)}_{i}-F^{(Y)}_{j}||^2 W^{(X,Y)}_{i,j}} \nonumber \\
+ (1-\mu) \sum_{i,j \in X} {||F^{(X)}_{i} - F^{(X)}_{j}||^2 W^{(X)}_{i,j} } \nonumber \\
+ (1-\mu) \sum_{i,j \in Y} {||F^{(Y)}_{i} - F^{(Y)}_{j}||^2 W^{(Y)}_{i,j} } \end{aligned}\] The notation \(i \in X\) simply means \(1 \leq i \leq n_{X}\) . We can combine \(W^{(X)}\),\(W^{(Y)}\), and \(W^{(X,Y)}\) into a joint similarity matrix \(W\): \[W=\left[\begin{array}{cc}
(1-\mu) W^{\left(X\right)} & \mu W^{\left(X,Y\right)}\\
\mu W^{\left(Y,X\right)} & (1-\mu) W^{\left(Y\right)}
\end{array}\right]\] Then, we combine \(F^{(X)},F^{(Y)}\) into \(F\) where \(F= \left[ \begin{array}{l} F^{(X)} \\ F^{(Y)} \end{array} \right].\) Let \(F_{i,k}\) denote the element \((i,k)\) of \(F\) and \(F_{\cdot,k}\) denote the \(k\)th column of \(F\). Then the loss function can be rewritten: \[\begin{aligned}
L_{1}(F)= & \displaystyle \sum_{i,j} {||F_{i}-F_{j}||^{2} W_{i,j} } \nonumber \\
=& \displaystyle \sum_{k} \sum_{i,j} {\left(F_{i,k}-F_{j,k}\right)^{2} W_{i,j} } \nonumber \\
=& 2 \displaystyle \sum_{k} F_{\cdot,k}^{T} L F_{\cdot,k} = 2\, tr\left(F^{T}LF\right)\end{aligned}\] where \(L = D - W\) is the graph Laplacian of \(W\) and \(D\) is the diagonal degree matrix with \(D_{i,i} = \sum_{j} W_{i,j}\). The optimization problem becomes: \[\underset{F}{\arg\min} \left(L_{1}\right) = \underset{F}{\arg\min} \left(tr\left(F^{T}LF\right)\right)\]
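As a concrete sketch, the joint similarity matrix \(W\) described above can be assembled directly from its blocks. The helper name `joint_similarity` is illustrative, not from the paper; NumPy is assumed:

```python
import numpy as np

def joint_similarity(W_X, W_Y, W_XY, mu):
    """Assemble the joint similarity matrix W from the within-set
    similarities W_X, W_Y and the correspondence matrix W_XY."""
    top = np.hstack([(1 - mu) * W_X, mu * W_XY])
    bottom = np.hstack([mu * W_XY.T, (1 - mu) * W_Y])
    return np.vstack([top, bottom])
```

The result is symmetric whenever \(W^{(X)}\) and \(W^{(Y)}\) are symmetric, since the off-diagonal blocks are transposes of each other.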

This matches the optimization problem of Laplacian Eigenmaps (Belkin 2001), except that here the similarity matrix is a joint matrix produced from two similarity matrices. As with Laplacian Eigenmaps, we add the constraint \(F^{T}DF=I\) to remove an arbitrary scaling factor and to avoid a collapse to a subspace of dimension less than \(d\). For example, this constraint prevents the trivial mapping to a single point. The solution \(F=[f_{1}, f_{2}, \ldots, f_{d} ]\) is given by the eigenvectors corresponding to the \(d\) smallest nonzero eigenvalues of the generalized eigenvalue problem \(Lf_i = \lambda D f_i\) for \(i=1,\ldots,d\).
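The generalized eigenvalue problem \(Lf_i = \lambda D f_i\) can be solved directly with a symmetric solver such as SciPy's `eigh`, which returns eigenvectors normalized so that \(F^{T}DF=I\). A minimal sketch (function name illustrative):

```python
import numpy as np
from scipy.linalg import eigh

def manifold_alignment_embedding(W, d):
    """Return the embedding F given by the eigenvectors for the d
    smallest nonzero eigenvalues of the pencil (L, D)."""
    D = np.diag(W.sum(axis=1))
    L = D - W                            # graph Laplacian of the joint matrix
    vals, vecs = eigh(L, D)              # generalized symmetric eigenproblem
    keep = np.where(vals > 1e-9)[0][:d]  # skip the (near-)zero eigenvalues
    return vecs[:, keep]
```

The returned columns satisfy the \(F^{T}DF=I\) constraint because `eigh` makes the eigenvectors \(D\)-orthonormal.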

We can also restrict the mapping to be linear by instead solving the optimization problem: \[\label{eqn:linearmaeqn} \underset{\phi}{\arg\min} \left(tr\left(\phi^{T}V^{T}LV\phi\right)\right) \mbox { subject to $\phi^{T}V^{T}DV\phi=I$}\] where \[V = \left( \begin{array}{cc} X & 0 \\ 0 & Y \end{array} \right)\] and \(\phi = \left[ \begin{array}{l} \phi^{(X)} \\ \phi^{(Y)} \end{array} \right] \) is the joint projection; \(L,D\) are the graph Laplacian and degree matrix of the joint similarity matrix \(W\), respectively. The resulting linear embedding is then \(X\phi^{(X)}\) and \(Y\phi^{(Y)}\), instead of \(F^{(X)}\) and \(F^{(Y)}\). The solution \(\phi = [\phi_{1},\phi_{2},\ldots,\phi_{d}]\) is given by the eigenvectors corresponding to the \(d\) smallest nonzero eigenvalues of the generalized eigenvalue problem \(V^{T}LV\phi_{i}=\lambda V^{T}DV \phi_{i}\) for \(i = 1,\ldots,d\).
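The linear variant can be sketched the same way, forming the block-diagonal data matrix and solving the generalized eigenproblem over the pencil \((V^{T}LV, V^{T}DV)\). This assumes dense inputs and a \(V^{T}DV\) that is positive definite; the helper name is hypothetical:

```python
import numpy as np
from scipy.linalg import eigh, block_diag

def linear_manifold_alignment(X, Y, W, d):
    """Solve V^T L V phi = lambda V^T D V phi for the joint linear
    projection, then split it into phi^(X) and phi^(Y)."""
    V = block_diag(X, Y)                 # block-diagonal data matrix V
    D = np.diag(W.sum(axis=1))
    L = D - W
    A = V.T @ L @ V
    B = V.T @ D @ V                      # assumed positive definite here
    vals, vecs = eigh(A, B)
    keep = np.where(vals > 1e-9)[0][:d]  # skip near-zero eigenvalues
    phi = vecs[:, keep]
    return phi[:X.shape[1]], phi[X.shape[1]:]
```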

We are given two sequential data sets \(X= [ x_{1}^{T},\ldots,x_{n}^{T} ]^{T} \in \mathbb{R}^{n \times d} \), \(Y=[ y_{1}^{T},\ldots,y_{m}^{T}]^{T} \in \mathbb{R}^{m \times d}\) in the same space with a distance function \(dist:X \times Y \rightarrow \mathbb{R}\). Let \(P=\{p_1,...,p_s\}\) represent an alignment between \(X\) and \(Y\), where each \(p_k=(i,j)\) is a pair of indices such that \(x_i\) corresponds with \(y_j\). Since the alignment is restricted to sequentially-ordered data, we impose the additional constraints:

\[\begin{aligned} p_{1} & = & (1,1) \label{eq:Wcontraint1} \\ p_{s} & = & (n,m) \label{eq:Wcontraint2} \\ p_{k+1}-p_{k} & = & (1,0)\: or\:(0,1)\: or\:(1,1) \label{eq:Wcontraint3}\end{aligned}\]

That is, an alignment must match the first and last instances and cannot skip any intermediate instance. This also yields the property that no two sub-alignments cross each other. Figure \ref{fig:dtw} is an example of a valid alignment. We can also represent the alignment in matrix form \(W\) where: \[W_{i,j}= \left\{ \begin{array}{ll} 1 & \mbox{if $(i,j) \in P$} \\ 0 & \mbox{otherwise} \end{array} \right.\]
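Converting an alignment path \(P\) into this matrix form is mechanical; a small illustrative helper (name hypothetical):

```python
import numpy as np

def path_to_matrix(path, n, m):
    """Binary alignment matrix W from a path P of 1-indexed (i, j) pairs."""
    W = np.zeros((n, m), dtype=int)
    for i, j in path:
        W[i - 1, j - 1] = 1
    return W
```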

To ensure that \(W\) represents an alignment satisfying the constraints in Equations \ref{eq:Wcontraint1}, \ref{eq:Wcontraint2}, and \ref{eq:Wcontraint3}, \(W\) must have the following form: \(W_{1,1}=1\), \(W_{n,m}=1\), no column or row of \(W\) is a \(0\) vector, and there must not be any \(0\) between two \(1\)'s in a row or column of \(W\). We call a \(W\) which satisfies these conditions a *DTW matrix*. An optimal alignment is the one which minimizes the loss function over DTW matrices \(W\): \[\begin{aligned}
L_2(W) = \sum_{i,j}dist\left(x_i,y_j\right)W_{i,j}\end{aligned}\]

A naïve search over the space of all valid alignments would take exponential time; however, dynamic programming can produce an optimal alignment in \(O(nm)\).
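The dynamic program can be sketched as follows; this is a minimal illustrative implementation (not tuned for speed), returning the 1-indexed path described above:

```python
import numpy as np

def dtw(X, Y, dist=lambda a, b: float(np.sum((a - b) ** 2))):
    """O(nm) dynamic program for the optimal alignment under the
    endpoint and unit-step constraints."""
    n, m = len(X), len(Y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist(X[i - 1], Y[j - 1]) + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from (n, m) to (1, 1) to recover the alignment P.
    path, i, j = [(n, m)], n, m
    while (i, j) != (1, 1):
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                   key=lambda s: cost[s])
        path.append((i, j))
    return cost[n, m], path[::-1]
```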

We now present a novel framework for aligning two sequentially-ordered data sets that share a common manifold representation. In our approach, we use the warping matrix produced by DTW as a heuristic correspondence matrix for manifold alignment. The proposed algorithm uses alternating projections (von Neumann 1950), picking new correspondences with DTW and reprojecting both inputs using manifold alignment until the loss function is minimized. This presents an improvement over CTW in cases where nonlinear transformations are required to recover the underlying manifold structure of one or both input data sets. We introduce the following loss function for manifold warping: \[\begin{array}{l} L_3 \left(F^{(X)},F^{(Y)},W^{(X,Y)}\right) = \\ \mu \displaystyle\sum_{i\in X,j\in Y}\Vert F_i^{(X)}-F_{j}^{(Y)}\Vert^2 W_{i,j}^{(X,Y)} \\ + (1-\mu) \displaystyle\sum_{i,j\in X}\Vert F_i^{(X)}-F_{j}^{(X)}\Vert^2 W_{i,j}^{(X)} \\ + (1-\mu) \displaystyle\sum_{i,j\in Y}\Vert F_i^{(Y)}-F_{j}^{(Y)}\Vert^2 W_{i,j}^{(Y)} \end{array}\] The optimization becomes \(\underset{F^{(X)},F^{(Y)},W^{(X,Y)} }{\operatorname*{\arg\!\min}} (L_{3})\) subject to \(F^{T}DF=I\) where \(F= \left[ \begin{array}{l} F^{(X)} \\ F^{(Y)} \end{array} \right]\) and \(W^{(X,Y)}\) is a DTW matrix. Note that unlike manifold alignment, the correspondence matrix \(W^{(X,Y)}\) is now an argument in the optimization problem. The intuition behind this loss function is similar to that of manifold alignment: the last two error terms ensure that the embedding preserves the local geometry of the inputs, and the first term promotes a high quality DTW alignment between two sequential data sets. Again, these goals are controlled by the parameter \(\mu\). We now propose an algorithm that minimizes \(L_3\):

In Algorithm 1, \(\mbox{MA(X,Y,W,d,$\mu$)}\) is a function that returns the embedding of \(X,Y\) in a \(d\) dimensional space using manifold alignment with the joint similarity matrix \(W\) and parameter \(\mu\) described in the manifold alignment section. The function \(\mbox{DTW(X,Y)}\) returns a DTW matrix after aligning two sequences \(X,Y\) using dynamic time warping. The \(\mbox{KNNGraph(X,k)}\) function returns the \(k\)-nearest neighbors graph of a data set \(X\). In some cases, it is useful to replace the \(k\)-nearest neighbor graph approach with an \(\epsilon\)-neighborhood graph (Belkin 2001).
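A minimal sketch of the \(\mbox{KNNGraph}\) step, here as a symmetrized binary adjacency matrix (the symmetrization choice and helper name are assumptions, not specified by the paper):

```python
import numpy as np

def knn_graph(X, k):
    """Symmetrized binary k-nearest-neighbor adjacency matrix for X."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)          # exclude self-neighbors
    W = np.zeros_like(D)
    nbrs = np.argsort(D, axis=1)[:, :k]  # k closest points per row
    rows = np.repeat(np.arange(len(X)), k)
    W[rows, nbrs.ravel()] = 1.0
    return np.maximum(W, W.T)            # symmetrize the directed graph
```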

Let \(L_{3,t}\) be the loss function \(L_3\) evaluated at \(F^{(X),t},F^{(Y),t},W^{(X,Y),t}\) of Algorithm 1. The sequence \(L_{3,t}\) converges to a minimum as \(t \rightarrow \infty\). Therefore, Algorithm 1 will terminate.

In every iteration \(t\), two steps are performed: using manifold alignment to solve for new projections \(F^{\left(X\right),t+1},F^{\left(Y\right),t+1}\), and using DTW to change the correspondences to \(W^{(X,Y),t+1}\).

Recall that the loss function \(L_1\) is just \(L_3\) with \(W^{(X,Y)}\) fixed. In the first step, with \(W^{(X,Y),t}\) fixed, Algorithm 1 solves for new projections \(F^{(X),t+1},F^{(Y),t+1}\) using manifold alignment. In the manifold alignment section, we showed that manifold alignment's mappings minimize the loss function \(L_{3}\) when the correspondence matrix is fixed. Hence: \[\begin{array}{l} L_{3}(F^{(X),t+1},F^{(Y),t+1},W^{(X,Y),t}) \\ \leq L_{3}(F^{(X),t},F^{(Y),t},W^{(X,Y),t}) \end{array} \label{eqn:mainequality}\] In the second step, the projections are fixed at \(F^{(X),t+1},F^{(Y),t+1}\). Algorithm 1 changes the correspondence matrix from \(W^{(X,Y),t}\) to \(W^{(X,Y),t+1}\), which does not affect the last two terms in \(L_3\). If we replace \(dist(F^{(X)}_{i},F^{(Y)}_{j})\) by \(\mu ||F^{(X),t+1}_{i}-F^{(Y),t+1}_{j}||^2\) in the loss function \(L_{2}\) of DTW, we recover the first term in \(L_{3}\) of manifold warping. Since \(W^{(X,Y),t+1}\) is produced by DTW, it minimizes the first term of \(L_3\). Therefore, we have: \[\begin{array}{c} \mu \displaystyle\sum_{i\in X,j\in Y} || F_{i}^{(X),t+1}-F_{j}^{(Y),t+1}||^{2}W_{i,j}^{(X,Y),t+1} \\ \leq \mu \displaystyle\sum_{i\in X,j\in Y} || F_{i}^{(X),t+1}-F_{j}^{(Y),t+1}||^{2}W_{i,j}^{(X,Y),t} \end{array} \label{eqn:dtwinequality}\] Changing the correspondence matrix does not affect the last two terms of \(L_3\), so: \[\begin{array}{l} L_{3}(F^{(X),t+1},F^{(Y),t+1},W^{(X,Y),t+1}) \\ \leq L_{3}(F^{(X),t+1},F^{(Y),t+1},W^{(X,Y),t}) \\ \leq L_{3}(F^{(X),t},F^{(Y),t},W^{(X,Y),t}) \mbox{ from inequality \ref{eqn:mainequality} } \\ \Leftrightarrow L_{3,t+1} \leq L_{3,t} \end{array}\] Therefore, \(L_{3,t}\) is a decreasing sequence. We also have \(L_{3,t} \ge 0\), so it is convergent. Therefore, Algorithm 1 will eventually terminate.

We now propose an algorithm that exploits the observation that if the local geometries of two data sets are roughly the same, their similarity matrices will also be very similar to each other (Wang 2008). Thus, if we first perform a nonlinear projection on each input set independently, the embeddings are likely to be linearly alignable using either manifold warping or CTW.

In Algorithm 2, \(\mbox{DimReduction}(X,W,d)\) is a dimensionality reduction function that maps \(X\), with similarity matrix \(W\), to a space of lower dimension \(d\). In this paper, we use Laplacian Eigenmaps to be consistent with manifold alignment, though other methods such as LLE (Roweis 2000) or Isomap (Tenenbaum 2000) could be applied. \(\mbox{LMA(X,Y,W,d,$\mu$)}\) is a function that performs the linear manifold alignment described above on \(X\) and \(Y\) with the joint similarity matrix \(W\) and the target dimension \(d\), and returns the projection matrices \(\phi^{(X)}\) and \(\phi^{(Y)}\). We can think of \(\mbox{DimReduction}\) as a preprocessing step, then reformulate the loss function as: \[\begin{array}{l} L_{4} (\phi^{(X)},\phi^{(Y)},W^{(X,Y)}) \\= (1-\mu)\displaystyle\sum_{i,j\in X} ||F_{i}^{(X)}\phi^{(X)}-F_{j}^{(X)}\phi^{(X)}||^{2}W_{i,j}^{(X)} \\ +(1-\mu)\displaystyle\sum_{i,j\in Y}||F_{i}^{(Y)}\phi^{(Y)}-F_{j}^{(Y)}\phi^{(Y)}||^{2}W_{i,j}^{(Y)}\\ +\mu \displaystyle\sum_{i\in X,j\in Y}||F_{i}^{(X)}\phi^{(X)}-F_{j}^{(Y)}\phi^{(Y)}||^{2}W_{i,j}^{(X,Y)} \end{array}\] which is the same loss function as in linear manifold alignment except that \(W^{(X,Y)}\) is now a variable. The two constraints are the constraint in Equation \ref{eqn:linearmaeqn} of linear manifold alignment, and that \(W^{(X,Y)}\) must be a DTW matrix.

Let \(L_{4,t}\) be the loss function \(L_4\) evaluated at \(\prod_{i=1}^{t}\phi^{(X),i},\prod_{i=1}^{t}\phi^{(Y),i},W^{(X,Y),t}\) of Algorithm 2. The sequence \(L_{4,t}\) converges to a minimum as \(t \rightarrow \infty\). Therefore, Algorithm 2 will terminate.

The proof is similar to that of Theorem 1. At any iteration \(t\), Algorithm 2 first fixes the correspondence matrix at \(W^{(X,Y),t}\). Now let \(L_4'\) be \(L_4\) with \(F_{i}^{(X)},F_{i}^{(Y)}\) replaced by \(F_{i}^{(X),t},F_{i}^{(Y),t}\); Algorithm 2 minimizes \(L_4'\) over \(\phi^{(X),t+1},\phi^{(Y),t+1}\) using linear manifold alignment. Thus, \[\begin{array}{l} \label {eqn:linearinequality} L_{4}'(\phi^{(X),t+1},\phi^{(Y),t+1},W^{(X,Y),t}) \\ \leq L_{4}'(I,I,W^{(X,Y),t}) \\ = L_{4}(\prod_{i=1}^{t}\phi^{(X),i},\prod_{i=1}^{t} \phi^{(Y),i},W^{(X,Y),t}) \\ = L_{4,t} \end{array}\]

since \(F^{(X),t}=F^{(X)}\prod_{i=1}^{t}\phi^{(X),i}\) and \(F^{(Y),t}=F^{(Y)}\prod_{i=1}^{t}\phi^{(Y),i}\). We also have: \[\begin{array}{l} L_{4}'(\phi^{(X),t+1},\phi^{(Y),t+1},W^{(X,Y),t}) \\ = L_{4}(\prod_{i=1}^{t+1}\phi^{(X),i},\prod_{i=1}^{t+1} \phi^{(Y),i},W^{(X,Y),t}) \\ \leq L_{4,t} \end{array}\]

Algorithm 2 then performs DTW to change \(W^{(X,Y),t}\) to \(W^{(X,Y),t+1}\). Using the same argument as in the proof of Theorem 1, we have: \[\begin{array}{l} L_{4}(\prod_{i=1}^{t+1}\phi^{(X),i},\prod_{i=1}^{t+1} \phi^{(Y),i},W^{(X,Y),t+1}) \\ \leq L_{4}(\prod_{i=1}^{t+1}\phi^{(X),i},\prod_{i=1}^{t+1} \phi^{(Y),i},W^{(X,Y),t}) \\ \leq L_{4,t}\\ \Leftrightarrow L_{4,t+1} \leq L_{4,t}. \end{array}\] So, the convergence follows.

Furthermore, when we set \(\mu=1\), the loss function \(L_{4}\) will become similar to that of CTW. We can also substitute CTW in place of the loop in the algorithm.

We compare the performance of CTW and manifold warping by trying to align two \(sin(x^{2})\) curves: one on a flat plane, and the other projected onto a Swiss roll, as illustrated in Figure \ref{fig:NonlinearManifoldIllustration}. Some duplicate points are added along the curves to create many-to-one correspondences in the alignment.

As shown in Figure \ref{fig:SytheticEmbeddings}, manifold warping produced similar embeddings for the two curves based on their local geometry, while CTW linearly collapsed the Swiss roll curve onto the plane.

As a result, the warping path (that is, the alignment path) produced by manifold warping stays closer to the true warping path than that produced by CTW. The error is calculated by the area between the result path and the ground truth path as suggested in (Zhou 2009). We also normalize the error by dividing by the whole plot area, \(n_{X}\times n_{Y}\).
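One simple way to approximate this area metric is to collapse each warping path to a step function over the row indices and sum the column gaps, normalizing by the plot area. This is an illustrative approximation under that assumption, not the exact computation from (Zhou 2009):

```python
import numpy as np

def path_to_curve(path, n_rows):
    """Collapse a warping path to a step function over the row indices,
    keeping the largest matched column for each row (1-indexed pairs)."""
    curve = np.zeros(n_rows)
    for i, j in path:
        curve[i - 1] = max(curve[i - 1], j)
    return curve

def alignment_error(path, truth, n_x, n_y):
    """Approximate area between the two paths, normalized by n_x * n_y."""
    gap = np.abs(path_to_curve(path, n_x) - path_to_curve(truth, n_x))
    return gap.sum() / (n_x * n_y)
```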

The warping paths and the calculated errors, shown in Figure \ref{fig:SyntheticWarpingPaths} and Table \ref{tab:AlignmentErrors}, show that manifold warping yields a smaller error than CTW.

We also test these algorithms on a real-world vision data set from the Columbia Object Image Library (COIL-100). The corpus consists of different series of images taken of different objects on a rotating platform. Each series has \(72\) images, each \(128\times128\) pixels. We try to align two series of images of two different objects, with differences in shape and brightness producing very different high-dimensional representations. To demonstrate our algorithm’s ability to work with data sets of different dimensionality, we compress one image series to a smaller resolution (\(64\times64\) pixels). Additionally, some duplicate images are added to each series, to ensure that the correct mapping is not trivially one-to-one.

In both experiments, manifold warping methods achieve alignments with a much smaller error than CTW. The depiction in Figure \ref{fig:DogAndCatEmbeddings} provides an intuitive picture of the manifold warping algorithm. In the first projection to two dimensions, both image series are mapped to circles. The next several iterations rotate these circles to match the first and last points, then the points in between. For the case of one-step Manifold Warping (where all mappings are nonlinear), we pick a small \(\mu\) to prioritize preserving local geometry of each series. This avoids over-fitting the embedding to a potentially bad intermediate DTW correspondence.

We perform the experiment with two pairs of COIL image series, illustrated in Figure \ref{fig:COIL_series}.

| | Synthetic | Dog+Cat | Cups | Kitchen |
|---|---|---|---|---|
| 1-step MW | 0.0768 | 0.0447 | 0.0464 | 0.0257 |
| 2-step MW | 0.0817 | 0.0282 | 0.0125 | 0.0396 |
| 2-step + CTW | 0.0652 | 0.0298 | 0.0143 | 0.0772 |
| CTW | 0.2784 | 0.2656 | 0.1668 | 0.0966 |

\label{tab:AlignmentErrors}

Our last experiment uses the kitchen data set (De la Torre 2008) from the CMU Quality of Life Grand Challenge, which records human subjects cooking a variety of dishes. Here, we attempt nonlinear alignments between the same subject and task, across different sensors.

Our experiment considers two separate views of the same moment in time, during which the subject prepares a brownie. The two views are 9-dimensional inertial measurement unit (IMU) readings and 87-dimensional motion capture suit coordinates (MOCAP). Aligning two views of the same task provides a straightforward evaluation metric, because the time stamps on each reading yield ground-truth correspondence information. To make the problem computationally feasible, we subsampled the original data sets. Each manifold warping method performs better than CTW, based on the results shown below in Figure \ref{fig:SandwichEmbeddings}, Figure \ref{fig:SandwichWarpingPaths}, and Table \ref{tab:AlignmentErrors}.

Because it imposes no linearity constraint, manifold warping consistently performs better than CTW when the inputs lie on manifolds that are not accessible via linear transformations. Even in the linear case, the alignment quality of manifold warping is just as good as that of CTW.

Importantly, this improved alignment quality does not impose significant runtime overhead. Both algorithms rely on the same DTW step, and tuned implementations of manifold alignment are comparable in runtime to the CCA step used in CTW. We found that while each manifold warping iteration is marginally slower than a similar CTW iteration, manifold warping tends to converge with fewer steps. Speedups may be possible by using a relaxed variation of DTW for the first few iterations, and parallelizing the initial alignments of the two-step algorithm.

Both the one-step and two-step manifold warping algorithms are natural extensions of canonical time warping on manifolds, and the results presented in this paper indicate that this added information has the potential to significantly improve alignment quality.

Several variants of DTW and manifold alignment may prove beneficial within the manifold warping framework. Future work will explore multiscale and local alignment strategies, enabling broader applications for manifold warping.

Jihun Ham, Daniel D. Lee, Lawrence K. Saul. Semisupervised alignment of manifolds. In *10th International Workshop on Artificial Intelligence and Statistics*, 120–127 (2005).

C. Wang, S. Mahadevan. A general framework for manifold alignment. In *AAAI Fall Symposium on Manifold Learning and its Applications* (2009).

H. Sakoe, S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. *IEEE Transactions on Acoustics, Speech and Signal Processing* **26**, 43–49 (1978).

F. Zhou, F. De la Torre. Canonical time warping for alignment of human behavior. In *Advances in Neural Information Processing Systems (NIPS)* (2009).

T.W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley-Interscience (2003).

I.T. Jolliffe. Principal Component Analysis. Springer-Verlag (2002).

M. Belkin, P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. *Advances in Neural Information Processing Systems* **14**, 585–591 (2001).

J. von Neumann. Functional Operators, Vol. II: The Geometry of Orthogonal Spaces. Annals of Mathematics Studies **22** (reprint of 1933 notes). Princeton University Press (1950).

C. Wang, S. Mahadevan. Manifold alignment using Procrustes analysis. In *Proceedings of the 25th International Conference on Machine Learning*, 1120–1127 (2008).

S.T. Roweis, L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. *Science* **290** (2000).

J.B. Tenenbaum, V. de Silva, J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. *Science* (2000).

F. De la Torre, J. Hodgins, A. Bargteil, X. Martin, J. Macey, A. Collado, P. Beltran. Guide to the Carnegie Mellon University Multimodal Activity (CMU-MMAC) Database. Carnegie Mellon University (2008).