C = \mathrm{KL}(P \parallel Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N} \quad (6)
where N is the total number of datapoints, and p_ij and q_ij denote the joint probabilities in high- and low-dimensional space, respectively. The first adjustment simplifies the form of the loss function, improving training efficiency. The second adjustment effectively alleviates the "crowding problem", a common challenge faced by SNE and many other dimensionality reduction techniques (Goodfellow et al., 2016): datapoints that lie moderately far apart in high-dimensional space tend to be crushed together in the embedded space, which prevents gaps from forming between natural clusters. Using a heavy-tailed distribution (i.e. the Student t-distribution) to calculate pairwise similarities in the low-dimensional representation effectively alleviates the crowding problem and preserves both the local and global structure of the datapoints in the embedded space. In all trials of van der Maaten and Hinton (2008), t-SNE produced considerably better visualizations than other embedding techniques, including SNE, Sammon mapping, curvilinear components analysis, Isomap, maximum variance unfolding, locally linear embedding, and Laplacian Eigenmaps.
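For reference, the heavy-tailed low-dimensional similarity introduced by van der Maaten and Hinton (2008) is a normalized Student t-kernel with one degree of freedom,

q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}},

where y_i and y_j are the low-dimensional counterparts of datapoints x_i and x_j.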
The arrangement of datapoints on a t-SNE map is sensitive to two parameters: 1) perplexity and 2) learning rate. Perplexity can be interpreted as a smooth measure of the effective number of neighbors and is used to determine σ_i of the Gaussian distribution for the high-dimensional datapoints (van der Maaten and Hinton, 2008). Intuitively, it controls the "roundness" of the arrangement of the datapoints in the embedded space. Conventionally, perplexity is chosen from a range between 5 and 50, with van der Maaten and Hinton (2008) recommending a perplexity of 30. The learning rate (η) is an optimization parameter that governs the convergence of the loss function. In this study, we follow van der Maaten and Hinton (2008) and set the perplexity to 30 and η to 100.
Results from the t-SNE map were compared with Principal Component Analysis (PCA), likely the most popular linear embedding technique, and evaluated based on the separability of ADHs with distinct flow patterns. A subset of ADHs was labelled with flow regime type as detailed earlier, and then projected onto a 2D map using t-SNE and PCA, respectively. K-Nearest Neighbors (KNN, Goldberger et al., 2005) was used to classify datapoints on both embedded maps, with k set to 30. A favorable embedding technique should arrange ADHs of different types into separable clusters on the 2D map and allow accurate classification with a simple classifier (e.g. KNN). Here, we employed classification accuracy as a quantitative indicator of the separability of datapoints; a better embedding technique should result in higher accuracy. Tools for t-SNE, PCA, and KNN are all available in the scikit-learn package in Python (Pedregosa et al., 2011).
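The comparison can be reproduced with scikit-learn along the following lines. This is a minimal sketch rather than the exact evaluation script: the arrays X (labelled ADHs as feature vectors) and y (flow regime labels) are placeholders, and the cross-validated KNN accuracy used here is one reasonable way to score separability.

```python
# Sketch: embed labelled ADHs with t-SNE and PCA, then score separability with KNN (k = 30).
# X (ADH feature vectors) and y (flow regime labels) are placeholder arrays.
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 2D embeddings with the parameter values used in this study
emb_tsne = TSNE(n_components=2, perplexity=30, learning_rate=100,
                random_state=0).fit_transform(X)
emb_pca = PCA(n_components=2).fit_transform(X)

# Classification accuracy on each 2D map as a proxy for cluster separability
knn = KNeighborsClassifier(n_neighbors=30)
acc_tsne = cross_val_score(knn, emb_tsne, y, cv=5).mean()
acc_pca = cross_val_score(knn, emb_pca, y, cv=5).mean()
print(f"KNN accuracy on t-SNE map: {acc_tsne:.3f}, on PCA map: {acc_pca:.3f}")
```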
As a non-parametric method, t-SNE does not produce an explicit mapping function between the high- and low-dimensional data representations. It is therefore impossible to project additional data onto an existing t-SNE map, which has been recognized as a major shortcoming of the t-SNE technique (van der Maaten and Hinton, 2008). One solution is to merge new data into the original dataset and re-run t-SNE, yet this is computationally inefficient for large datasets. In addition, datapoints in the original dataset could be displaced on the new t-SNE map, as the inclusion of new data alters the similarity matrix (i.e. the pairwise joint probabilities). This inconsistency in the locations of embedded datapoints is highly unfavourable. To address this, van der Maaten (2009) proposed an alternative solution that builds a parametric t-SNE by incorporating an autoencoder.

Autoencoder

An autoencoder is a neural network that is trained to copy its inputs to its outputs through the use of an encoder and a decoder. The encoder function h = f(x) converts input data x to latent features h, while the decoder function r = g(h) reconstructs data from the latent features back to the original format. As a lossy technique, autoencoders are not trained to copy perfectly, but to transfer the most salient information of the input data to the latent features (of lower dimension) while ignoring noise. This approach is widespread in dimensionality reduction and feature learning (Goodfellow et al., 2016).
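As a minimal illustration of this structure (not the model used in this study), a single-layer encoder and decoder could be written in Keras as follows; the input and latent dimensions are placeholders.

```python
# Minimal autoencoder sketch: encoder h = f(x), decoder r = g(h), trained to reproduce x.
from tensorflow import keras
from tensorflow.keras import layers

n_features = 365   # placeholder input dimension (e.g. one value per day of an ADH)
latent_dim = 16    # placeholder latent dimension

inputs = keras.Input(shape=(n_features,))
h = layers.Dense(latent_dim, activation="relu")(inputs)   # encoder: h = f(x)
r = layers.Dense(n_features)(h)                           # decoder: r = g(h)

autoencoder = keras.Model(inputs, r)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X, X, epochs=50, batch_size=32)  # inputs are also the targets
```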
Encoder networks are often considered universal function approximators: they are trained to approximate the complicated, non-linear functions that map high-dimensional data to low-dimensional representations. Here, we use an encoder network to approximate the mapping between ADHs and the t-SNE 2D coordinates. Consequently, newly collected ADHs can be projected onto the existing t-SNE map using the trained encoder, and there is no need to re-run t-SNE on the entire dataset. The objective is to minimize the distances between the datapoints projected by t-SNE and those predicted by the encoder. Mean Absolute Error (MAE) was employed as the loss function.
The performance of the encoder network largely depends on the choice of model hyperparameters, which are referred to as untrainable parameters because they are not updated during the training procedure. The network architecture (i.e. number of layers and nodes) and the activation function are critical hyperparameters for the encoder. In an iterative manner, we tested a variety of architectures and activation functions in order to search for the optimal model configuration (see Table 2). The network architecture defines the depth (i.e. number of layers) and width (i.e. number of nodes) of the encoder network. Our baseline architecture is a three-layer fully-connected network, with 512, 256, and 128 nodes in each layer. Activation functions bring non-linearity to neural networks and enable the encoder model to approximate complicated, non-linear functions. The choice of activation function affects the optimization process (Goodfellow et al., 2016), and we selected the Rectified Linear Unit (ReLU, f(x) = max(0, x)) as the default function for the baseline model (Goodfellow et al., 2016). A modified version of ReLU, called leaky ReLU (f(x) = max(ax, x)), has been recommended as providing additional optimization benefits (?), and a number of variants were tested with the tuned slope a ranging from 0.02 to 0.4. A dropout layer (Srivastava et al., 2014) was added after the dense layers to avoid overfitting. Adam (Kingma and Ba, 2015) was chosen as the optimizer for all tested encoders. The encoder networks were built using Keras (version 2.4.3), a Python package for deep learning (Chollet et al., 2015).
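A sketch of the baseline encoder configuration described above is given below: three fully-connected layers of 512, 256, and 128 nodes with ReLU activation, a dropout layer, the Adam optimizer, and an MAE loss against the t-SNE coordinates. The input dimension, dropout rate, dropout placement, and training settings are placeholders, and the leaky ReLU variant is indicated only in a comment.

```python
# Baseline encoder sketch: maps an ADH feature vector to its 2D t-SNE coordinates.
# The 512-256-128 architecture, ReLU activation, dropout, Adam optimizer, and MAE loss
# follow the text; input dimension, dropout rate, and training settings are placeholders.
from tensorflow import keras
from tensorflow.keras import layers

n_features = 365  # placeholder: length of an ADH feature vector

encoder = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(512, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.2),        # placeholder dropout rate
    layers.Dense(2),            # predicted 2D t-SNE coordinates
])
# Leaky ReLU variant: use layers.Dense(n) without activation followed by
# layers.LeakyReLU(alpha=a), with a tuned between 0.02 and 0.4.

encoder.compile(optimizer=keras.optimizers.Adam(), loss="mae")
# encoder.fit(X_train, tsne_train, validation_split=0.1, epochs=100, batch_size=32)
# new_coords = encoder.predict(X_new)   # project newly collected ADHs onto the existing map
```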