Results and Discussion
The first part of our curated perovskite amine database consists of 184
amines that correspond to ammonium cations reported in the literature,
which we call “existing perovskite amines”. A structural similarity
search on PubChem, followed by further screening, yields an additional
264 amines that we consider “potential perovskite amines”, i.e., amines
whose structures are similar to the existing ones. In total, the curated
perovskite amine database contains 448 amine structures. The full table
of the database is provided in the supplementary material. The main
reason for expanding the database is to make full use of data on amines
that have been tested for bioactivity or toxicity, regardless of whether
they have been studied as perovskite amines. Including more amines
in the analysis may make it easier to find toxicity trends and relate
them to amine structure.
Artificial intelligence enters the creation of the Amine Atlas and the
toxicity screening of amine chemistries through two methods: the
MinHash fingerprint, up to six bonds (MHFP6), and Uniform
Manifold Approximation and Projection (UMAP). MHFP6 is an improved
version of the extended connectivity fingerprint
(ECFP)27 that lowers the dimensionality needed to
describe detailed molecular substructures and improves the
performance of nearest neighbor search.30 The
MHFP6 fingerprint has been used in recently published chemistry
databases31,32 and a data visualization
tool33 in big-data settings. MinHash is a locality-sensitive
hashing (LSH) scheme that applies a family of hash
functions to the substrings in the molecular shingling and stores the
minimum hash value generated by each hash function in a set. These sets
of minimum hash values have the useful property that
they can be indexed by an LSH algorithm for approximate nearest neighbor
(ANN) search, which avoids the curse of dimensionality.30 MinHash allows
chemical structures to be indexed in extremely
sparse Jaccard (Tanimoto) space, a metric more appropriate for
fingerprint-based similarity calculations.30 On the
other hand, UMAP is a recently
developed non-linear dimensionality reduction
algorithm28 that has been used to analyze various
types of scientific data, mainly in the field of biological sciences
including genome aggregation34, single-cell mass flow
cytometry35, and single-cell RNA sequencing
(scRNA-seq)35-37. UMAP is a manifold learning method
that preserves both the local and global structure of high-dimensional
data points by minimizing information loss. It captures the
connectivity of the data by building a k-nearest neighbor (KNN) graph in
the high-dimensional space and then estimates a low-dimensional
coordinate system that reproduces the same graph structure, so that
the edge connectivity of the high-dimensional data is kept intact
in the low-dimensional space. Compared with the
more frequently used t-distributed stochastic neighbor
embedding (t-SNE) algorithm, which has limited capability to represent
the global structure of the data, UMAP has been found to retain both the
local and global structure of the data by simultaneously capturing the
small differences and the continuity between the data subsets.
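The MinHash step described above can be illustrated with a minimal Python sketch. It is not the MHFP6 implementation: the shingles are toy placeholder strings rather than substructure SMILES, and the seeded BLAKE2b hash family, the function names, and the choice of 512 hash functions are all assumptions made for illustration. The sketch only demonstrates the property that the fraction of matching minimum hashes estimates the Jaccard similarity of two shingle sets.

```python
import hashlib


def shingle_hash(shingle, seed):
    """Hash one shingle to a 64-bit integer; the seed selects the hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingles, n_hashes=512):
    """Keep the minimum hash value produced by each of n_hashes hash functions."""
    if not shingles:
        raise ValueError("shingle set must not be empty")
    return [
        min(shingle_hash(s, seed) for s in shingles)
        for seed in range(n_hashes)
    ]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard (Tanimoto) similarity of two sets."""
    return len(set_a & set_b) / len(set_a | set_b)


# Toy shingle sets (hypothetical placeholders, not real MHFP6 shingles).
shingles_a = {"A", "B", "C", "D"}
shingles_b = {"C", "D", "E", "F"}

sig_a = minhash_signature(shingles_a)
sig_b = minhash_signature(shingles_b)
print("exact:", exact_jaccard(shingles_a, shingles_b))
print("estimated:", estimated_jaccard(sig_a, sig_b))
```

A signature of this kind has a fixed length no matter how many shingles a molecule produces, which is what makes it convenient to index for approximate nearest neighbor search.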
At the highest level of classification, amines are categorized into
aliphatic amines (cyclic and noncyclic),
heterocyclic aromatic amines, and
other aromatic amines, including phenylalkyl amines and anilines.
When this classification is combined with the results of UMAP
on the MHFP6 fingerprints of perovskite amines, the clustering of these
amine classes can be observed on the Amine Atlas. The best clustering
is obtained when the MHFP permutation number, the UMAP number of
neighbors, and the UMAP
minimum distance are set to 2048, 50, and 0.25, respectively. With this
combination of parameters, the main classes are well separated from each
other on the Amine Atlas (Figure 2), and the same parameters are used
for all Amine Atlas plots below.
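One way to read the permutation number is through the sampling error of a MinHash similarity estimate. The sketch below is an assumption-laden illustration, not part of the reported workflow: it treats the estimate as the fraction of matching signature positions and treats the hash functions as independent, in which case the standard error is the square root of J(1 - J)/k for a true Jaccard similarity J and k permutations. The function name is hypothetical.

```python
import math


def minhash_standard_error(jaccard, n_permutations):
    """Standard error of a MinHash Jaccard estimate.

    Assumes the estimate is the mean of n_permutations independent
    match indicators, each equal to 1 with probability `jaccard`.
    """
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must lie in [0, 1]")
    if n_permutations <= 0:
        raise ValueError("n_permutations must be positive")
    return math.sqrt(jaccard * (1.0 - jaccard) / n_permutations)


# Worst case (J = 0.5) for a few permutation numbers, including 2048.
for k in (128, 512, 2048):
    print(k, minhash_standard_error(0.5, k))
```

Under these assumptions the error shrinks with the square root of the permutation number, so quadrupling the number of permutations halves it.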
For each amine class, the Amine Atlas can display further
classifications as subclasses. The subclasses of heterocyclic aromatic
amines are shown in Figure 3. This class of amines is clearly divided
into common nitrogen-containing aromatics, including pyrrole, imidazole,
pyridine, and thiazole, and sulfur-containing thiophene. No overlap is
observed between the clusters, which may be due to the effectiveness of
the MHFP6 fingerprint in capturing the characteristics of common aromatic
compounds.
Similarly, for the class of phenylalkyl amines, the subclasses are
well separated on the Amine Atlas (Figure 4). This figure shows the power of
UMAP in capturing both the local and global structure of the data. Here,
UMAP captures subtle differences between subclasses (such as those
with the same carbon number) by dividing them into different clusters,
e.g., 1-phenylethylamines (C6H5-C(C)NH3) and
phenylethylamines (C6H5-CCNH3). At the
same time, UMAP shows the continuity of closely related subclasses by
placing them in adjacent positions, such as the benzylamines
(C6H5-CNH3) and phenylethylamines (C6H5-CCNH3), whose alkyl
chains differ in length by one carbon.
Because branched alkyl chains are structurally complex, some clusters of
noncyclic aliphatic amines are less organized (Figure 5).
However, a trend is still visible among the amines with linear alkyl
chains, such as the linear diamines (purple) and linear monoamines (orange)
subclasses, where the length of the alkyl chain decreases along the
UMAP-1 axis. Moreover, amines that carry other functional groups besides
the amine group (dark green) lie far from the unsubstituted amines
(purple and orange).
One important purpose of this study is to screen the relative hazard of
amines used in 2D and 3D perovskite synthesis, distinguishing the most
hazardous from the less hazardous. We retrieved the toxicity data of
perovskite amines from the PubChem Bioassay Database23,26, an
open-source repository holding a collection of bioactivity and toxicity
data of small molecules; these molecules are cross-linked to
their chemical structure data stored in the PubChem Compounds
Database22. Using our programming
tools, we compiled a list of PubChem Bioassays that focus on the
toxicity of chemicals and also include perovskite amines as
test substances. The complete list of assays is provided in the
supplementary material. Examples of the toxicity effects and the
corresponding assay identifiers (AIDs) are shown in Table 1.
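The cross-link between compounds and assays can be sketched in Python against the PubChem PUG REST interface. This is a hedged illustration and not the programming tools used in this work: the endpoint path and the JSON layout (`InformationList` / `Information` / `AID`) reflect our understanding of PUG REST and should be checked against its documentation, and the function names, the compound identifier, and the set of toxicity AIDs in the example are hypothetical placeholders.

```python
import json
from urllib.request import urlopen

# Base URL of PubChem PUG REST (assumed; verify against the PubChem docs).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def aids_url(cid):
    """URL that lists the assay identifiers (AIDs) in which a compound was tested."""
    return f"{PUG_REST}/compound/cid/{int(cid)}/aids/JSON"


def parse_aids(payload):
    """Extract a sorted, de-duplicated AID list from a decoded JSON response."""
    entries = payload.get("InformationList", {}).get("Information", [])
    aids = set()
    for entry in entries:
        aids.update(entry.get("AID", []))
    return sorted(aids)


def fetch_aids(cid, timeout=30):
    """Download and parse the AID list for one compound (needs network access)."""
    with urlopen(aids_url(cid), timeout=timeout) as response:
        return parse_aids(json.load(response))


def toxicity_assays_by_compound(aids_by_cid, toxicity_aids):
    """Keep, for each compound, only the AIDs that are in the toxicity set."""
    wanted = set(toxicity_aids)
    return {
        cid: sorted(wanted.intersection(aids))
        for cid, aids in aids_by_cid.items()
    }


# Offline example with made-up identifiers (placeholders, not real results).
sample_response = {"InformationList": {"Information": [{"CID": 12345, "AID": [3, 1, 2, 3]}]}}
tested = {12345: parse_aids(sample_response)}
print(toxicity_assays_by_compound(tested, toxicity_aids={2, 3, 99}))
```

Separating URL construction and response parsing from the network call keeps the filtering logic testable without contacting PubChem.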
Table 1. Examples of selected PubChem Bioassays and the
toxicity effects they study