C = KL(P || Q) = Σ_{i=1}^{N} Σ_{j=1}^{N} p_ij log(p_ij / q_ij)    (6)
where N is the total number of datapoints, and p_ij and q_ij denote the joint probabilities in the high- and low-dimensional spaces, respectively. The first adjustment simplifies the form of the loss function, improving training efficiency. The second adjustment effectively alleviates the "crowding problem", a common challenge faced by SNE and many other dimensionality reduction techniques (Goodfellow et al., 2016): datapoints that lie moderately far apart in high-dimensional space tend to be crushed together in the embedded space, which prevents gaps from forming between natural clusters. Using
a heavy-tailed distribution (i.e. the Student t-distribution) to calculate pairwise similarities in the low-dimensional representation effectively
alleviates the crowding problem and preserves the local and global
structure of datapoints in the embedded space. In all trials of van der
Maaten and Hinton (2008), t-SNE produced considerably better
visualization than other embedding techniques, including SNE, Sammon
mapping, curvilinear components analysis, Isomap, maximum variance
unfolding, locally linear embedding, and Laplacian Eigenmaps.
The arrangement of datapoints on a t-SNE map is sensitive to two parameters: 1) perplexity and 2) learning rate. The perplexity is considered a smooth measure of the effective number of neighbors and is used to determine σ_i of the Gaussian distribution for high-dimensional datapoints (van der Maaten and Hinton, 2008).
Intuitively speaking, it controls the "roundness" of the arrangement of the datapoints in the embedded space. Conventionally, perplexity is chosen from a range between 5 and 50, with van der Maaten and Hinton (2008) recommending a perplexity of 30. The learning rate (η) is an optimization parameter that determines the final convergence of the loss function. In this study, we follow van der Maaten and Hinton (2008) and set the perplexity to 30 and η to 100.
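As an illustration, a minimal sketch of this t-SNE configuration with scikit-learn could look as follows; the ADH matrix shown is a synthetic placeholder rather than the dataset used in this study.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder data: in practice the matrix holds one ADH per row (N hydrographs x d values).
adh_matrix = np.random.rand(500, 365)

# t-SNE with the settings used in this study: perplexity = 30, learning rate = 100.
tsne = TSNE(n_components=2, perplexity=30, learning_rate=100, random_state=0)
adh_2d = tsne.fit_transform(adh_matrix)  # (N, 2) coordinates on the 2D t-SNE map
```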
Results from the t-SNE map were compared with those from Principal Component Analysis (PCA), likely the most popular linear embedding technique, and evaluated based on the separability of ADHs with distinct flow patterns. A subset of ADHs was labelled with flow regime type as detailed earlier, and then projected onto the 2D map using t-SNE and PCA
respectively. K-Nearest Neighbors (KNN, Goldberger et al., 2005) was
used to classify datapoints on both embedded maps, with k set to 30.
A favourable embedding technique should arrange the ADHs of different types into separable clusters on the 2D map and allow accurate classification with simple classifiers (e.g. KNN). Here, we employed classification accuracy as a quantitative indicator of the separability of datapoints; a better embedding technique should result in higher accuracy. Tools for t-SNE, PCA, and KNN are all available in the scikit-learn package in Python (Pedregosa et al., 2011).
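A hedged sketch of this comparison is shown below; the labelled ADH matrix and flow-regime labels are synthetic placeholders, and the five-fold cross-validation used to compute accuracy is an illustrative assumption rather than the exact evaluation protocol of this study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data: a labelled subset of ADHs and their flow-regime labels.
rng = np.random.default_rng(0)
labelled_adh = rng.random((300, 365))
flow_regime = rng.integers(0, 3, size=300)   # e.g. three flow regime classes

# Embed the labelled subset with t-SNE and PCA, then score a k = 30 KNN classifier on each map.
embeddings = {
    "t-SNE": TSNE(n_components=2, perplexity=30, learning_rate=100).fit_transform(labelled_adh),
    "PCA": PCA(n_components=2).fit_transform(labelled_adh),
}
for name, coords in embeddings.items():
    knn = KNeighborsClassifier(n_neighbors=30)
    accuracy = cross_val_score(knn, coords, flow_regime, cv=5).mean()  # 5-fold CV (assumption)
    print(f"{name}: mean KNN accuracy = {accuracy:.3f}")
```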
As a non-parametric method, t-SNE does not produce a mapping function between the high- and low-dimensional data representations. It is therefore impossible to project additional data onto an existing t-SNE map, which has been recognized as a major shortcoming of the t-SNE technique (van der Maaten and Hinton, 2008). One solution is to merge the new data into the original dataset and re-run t-SNE, yet this is computationally inefficient for large datasets. In addition, datapoints in the original dataset could be displaced on the new t-SNE map, as the inclusion of new data alters the similarity matrix (i.e. the pairwise joint probabilities). This location inconsistency of embedded datapoints is highly unfavourable. To address this, van der Maaten (2009) proposed an alternative solution that builds a parametric t-SNE by incorporating an autoencoder.
Autoencoder
An autoencoder is a neural network that is trained to copy inputs to outputs through the use of an encoder and a decoder. The encoder function h = f(x) converts input data x to latent features h, while the decoder function r = g(h) reconstructs the data from the latent features back to their original format. As a lossy technique, autoencoders are not trained to copy perfectly, but to transfer the most salient information of the input data to the latent features (with fewer dimensions) while ignoring noise. This approach is widespread in dimensionality reduction and feature learning (Goodfellow et al., 2016).
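As a concrete illustration, a minimal encoder/decoder pair of this form could be written in Keras as follows; the input dimension (365) and latent size (32) are illustrative assumptions, not the configuration used in this study.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative dimensions (assumptions): a 365-value ADH compressed to 32 latent features.
input_dim, latent_dim = 365, 32

inputs = keras.Input(shape=(input_dim,))
h = layers.Dense(latent_dim, activation="relu")(inputs)   # encoder: h = f(x)
r = layers.Dense(input_dim)(h)                            # decoder: r = g(h)

autoencoder = keras.Model(inputs, r)
autoencoder.compile(optimizer="adam", loss="mae")
autoencoder.summary()
# Training reconstructs the inputs themselves, e.g. autoencoder.fit(x, x, epochs=50)
```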
Encoder networks are often considered universal function approximators. They are trained to approximate complicated, non-linear functions that map high-dimensional data to low-dimensional representations. Here, we use an encoder network to approximate the
mapping between ADHs and t-SNE 2D data. Consequently, newly collected
ADHs can be projected on the existing t-SNE map using the trained
encoder, and there is no need to re-run t-SNE with the entire dataset.
The objective is to minimize the distances between the datapoints projected by t-SNE and those projected by the encoder. Mean Absolute Error (MAE) was
employed as the loss function.
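A minimal sketch of this training objective is given below, using synthetic placeholder arrays for the ADHs and their t-SNE coordinates; the single hidden layer shown is only a stand-in for the architectures discussed in the next paragraph, and the epoch and batch settings are illustrative assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder data: ADHs and their t-SNE coordinates (e.g. produced by the t-SNE step above).
rng = np.random.default_rng(0)
adh_matrix = rng.random((500, 365))
adh_2d = rng.random((500, 2))

# Stand-in encoder with one hidden layer; the tested architectures are described below.
encoder = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(adh_matrix.shape[1],)),
    layers.Dense(2),                                   # 2D coordinates on the t-SNE map
])
encoder.compile(optimizer="adam", loss="mae")          # MAE between encoder and t-SNE projections
encoder.fit(adh_matrix, adh_2d, epochs=10, batch_size=64, validation_split=0.1, verbose=0)

new_coords = encoder.predict(rng.random((5, 365)))     # project newly collected ADHs
```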
The performance of the encoder network largely depends on the choice of
model hyperparameters, which are referred to as untrainable parameters
as they do not change during the training procedure. The network
architecture (i.e. number of layers and nodes) and activation function
are critical hyperparameters for the encoder. In an iterative manner, we
tested a variety of architectures and activation functions in order to
search for the optimal model configuration (see Table 2). Network
architecture defines the depth (i.e. number of layers) and width (i.e.
number of nodes) of the encoder network. Our baseline architecture is a
three-layer fully-connected network, with 512, 256, and 128 nodes in its successive layers. Activation functions bring non-linearity to neural networks,
and enable the encoder model to approximate complicated, non-linear
functions. The choice of activation function affects the optimization
process (Goodfellow et al., 2016), and we selected the Rectified Linear Unit (ReLU, f(x) = max(0, x)) as the default activation function for the baseline model (Goodfellow et al., 2016). A modified version of ReLU, called leaky ReLU (f(x) = max(a·x, x)), has been recommended as providing additional optimization benefits (?), and a number of leaky ReLU variants were tested, with the slope parameter a tuned between 0.02 and 0.4. A dropout layer (Srivastava et al., 2014) was added after the dense layers to avoid overfitting. Adam (Kingma and Ba, 2015) was chosen as the optimizer for all tested encoders. The encoder networks were built using Keras (version 2.4.3), a Python package for deep learning (Chollet et al., 2015).
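Below is a hedged sketch of how such encoder configurations might be assembled; the dropout rate of 0.2, the example leaky ReLU slopes, and the input dimension of 365 are illustrative assumptions rather than the exact settings listed in Table 2.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_encoder(input_dim, widths=(512, 256, 128), leaky_alpha=None, dropout_rate=0.2):
    """Encoder mapping an ADH vector to 2D coordinates, trained with MAE against the t-SNE map."""
    inputs = keras.Input(shape=(input_dim,))
    h = inputs
    for width in widths:
        h = layers.Dense(width)(h)
        # Baseline activation: ReLU; alternative: leaky ReLU with slope parameter a.
        h = layers.ReLU()(h) if leaky_alpha is None else layers.LeakyReLU(alpha=leaky_alpha)(h)
    h = layers.Dropout(dropout_rate)(h)                 # dropout to limit overfitting
    outputs = layers.Dense(2)(h)                        # 2D map coordinates
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(), loss="mae")
    return model

# Baseline (512-256-128 with ReLU) plus example leaky ReLU variants with a in [0.02, 0.4].
baseline = build_encoder(input_dim=365)
variants = [build_encoder(input_dim=365, leaky_alpha=a) for a in (0.02, 0.1, 0.4)]
```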