Methods
As shown in Figure 1, the research consists of five main steps: curation
of the perovskite amine database, two-level classification of the
perovskite amines, chemical informatics and machine learning
computations, the search for toxicity data, and visualization of the
Amine Atlas.
Curation of perovskite amines database. The amines
corresponding to the perovskite ammonium cations mentioned in recently
published reviews6,24 and a database25 are sorted into the “existing perovskite amines” list (e.g.,
ethylamine corresponds to ethylammonium). This list is the basis of the
perovskite amines database. Next, the database is further expanded by
including “potential perovskite amines”, which have similar structures
to existing amines. A PubChem similarity search is performed on each
existing perovskite amine with a similarity threshold of 95%. The search
results are then screened by removing ions, non-amines, and structures
already present in the database. In addition, an amine must have been
tested in at least one activity assay in the PubChem BioAssay
database23,26 in order to be included as a potential perovskite amine.
The above search and filtering steps are
completed by our programming tool based on the open-source Python
packages PubChemPy
(https://pubchempy.readthedocs.io/en/latest/)
and RDKit
(https://www.rdkit.org/docs/api-docs.html).
In the final database, each amine is recorded with its PubChem Compound
ID (CID), name, SMILES, and a list of its associated PubChem BioAssay
IDs (AIDs). The CID and name of the corresponding ammonium cation are
also included.
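The screening rules described above can be sketched as follows. This is a minimal illustration and not the authors' tool: the SMARTS patterns used to decide what counts as an amine, the definition of an ion (non-zero net charge or several disconnected fragments), and the function names are all assumptions made for this example. The PubChem similarity search itself is omitted, and each candidate is assumed to arrive with its SMILES and the number of bioassays it was tested in.

```python
from rdkit import Chem

# Assumed amine definitions (not from the paper): a neutral trivalent
# nitrogen that is not an amide nitrogen, or a neutral aromatic nitrogen.
ALIPHATIC_AMINE = Chem.MolFromSmarts("[NX3;+0;!$(N-C=O)]")
AROMATIC_N = Chem.MolFromSmarts("[n;+0]")


def is_amine(mol):
    """True if the molecule contains at least one assumed amine nitrogen."""
    return mol.HasSubstructMatch(ALIPHATIC_AMINE) or mol.HasSubstructMatch(AROMATIC_N)


def is_ion(mol):
    """Assumed ion test: non-zero net charge, or a multi-fragment salt."""
    return Chem.GetFormalCharge(mol) != 0 or len(Chem.GetMolFrags(mol)) > 1


def screen_candidates(candidates, existing_smiles):
    """Filter (smiles, n_bioassays) pairs returned by a similarity search.

    Keeps neutral amines that are not already in the database and that
    were tested in at least one bioassay. Returns canonical SMILES.
    """
    seen = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in existing_smiles}
    kept = []
    for smiles, n_bioassays in candidates:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:  # unparsable structure
            continue
        canonical = Chem.MolToSmiles(mol)
        if canonical in seen:  # existing structure or duplicate hit
            continue
        if is_ion(mol) or not is_amine(mol):
            continue
        if n_bioassays < 1:  # never tested in PubChem BioAssay
            continue
        seen.add(canonical)
        kept.append(canonical)
    return kept


if __name__ == "__main__":
    existing = ["CCN"]  # ethylamine
    hits = [("NCC", 3), ("CCCN", 2), ("CC[NH3+]", 5), ("CCO", 4), ("CCCCN", 0)]
    print(screen_candidates(hits, existing))
```

Comparing canonical SMILES is one simple way to detect structures that are already in the database; comparing CIDs would serve the same purpose when they are available.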
Two-level classification. The amines are first classified
according to their aromaticity and the position of their amine group
(e.g., on the aromatic ring, directly attached to the aromatic ring, or
on an alkyl substituent of the aromatic ring). Subclasses are then
assigned based on finer structural details, such as functional groups
and the linearity of alkyl chains. The identification
of functional groups and chemical fragments is achieved through our
RDKit-based programming tool (provided in supplementary material). It is
worth noting that the purpose of classification here is not to establish
a new standard of amine classification but to distinguish the amines in
our database as much as possible.
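As an illustration of the first classification level, the sketch below assigns a class from substructure matches with RDKit. It is not the tool from the supplementary material: the SMARTS patterns, the class labels, and the order in which the rules are applied (a ring nitrogen takes priority over an attached amine group) are assumptions chosen for this example.

```python
from rdkit import Chem

# Assumed patterns (not taken from the paper).
RING_N = Chem.MolFromSmarts("[n]")                     # nitrogen that is part of an aromatic ring
AMINE_N = Chem.MolFromSmarts("[NX3;+0;!$(N-C=O)]")     # neutral non-amide amine nitrogen
N_ATTACHED = Chem.MolFromSmarts("[NX3;+0;!$(N-C=O)]a") # amine nitrogen bonded to an aromatic atom


def first_level_class(smiles):
    """Return a hypothetical first-level class label for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"cannot parse SMILES: {smiles!r}")
    if mol.HasSubstructMatch(RING_N):
        return "amine on aromatic ring"
    if not mol.HasSubstructMatch(AMINE_N):
        return "not an amine"
    if mol.HasSubstructMatch(N_ATTACHED):
        return "amine attached to aromatic ring"
    if any(atom.GetIsAromatic() for atom in mol.GetAtoms()):
        return "amine on substituent of aromatic ring"
    return "non-aromatic amine"


if __name__ == "__main__":
    for name, smi in [
        ("ethylamine", "CCN"),
        ("pyridine", "c1ccncc1"),
        ("aniline", "Nc1ccccc1"),
        ("phenethylamine", "NCCc1ccccc1"),
    ]:
        print(name, "->", first_level_class(smi))
```

A second level could be added in the same way, with one SMARTS pattern per functional group or chain motif of interest.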
Chemical informatics and machine learning computations. The
MHFP6 fingerprint27 is calculated for each perovskite
amine molecule using the open-source Python package MHFP
(https://github.com/reymond-group/mhfp).
The dimensionality of the fingerprint is then reduced by Uniform
Manifold Approximation and Projection (UMAP)28 with
the open-source Python package UMAP
(https://umap-learn.readthedocs.io/en/latest/index.html).
The parameters of these two tools, namely the number of permutations in
MHFP and the number of neighbors and the minimum distance in UMAP, are
optimized for the clustering of different amine classes and subclasses.
Data processing during the computations is performed with the
open-source Python packages Pandas
(https://pandas.pydata.org/docs/)
and Scikit-learn29. The code for the computations is
provided in the supplementary material.
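To make the role of the "number of permutations" concrete, the following sketch shows the MinHash idea that underlies MHFP-type fingerprints, using only the Python standard library. It is a simplification and not the MHFP package: salted hash functions stand in for random permutations, and the input "shingles" are hypothetical fragment strings written by hand, whereas the real fingerprint derives them from the molecular structure.

```python
import hashlib


def _hash(value, salt):
    """64-bit hash of a string; each salt acts as one pseudo-permutation."""
    digest = hashlib.blake2b(f"{salt}:{value}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(shingles, n_permutations=128):
    """MinHash signature: the minimum hash of the set under each salt."""
    shingles = set(shingles)
    if not shingles:
        raise ValueError("shingle set must not be empty")
    return [min(_hash(s, salt) for s in shingles) for salt in range(n_permutations)]


def estimated_distance(sig_a, sig_b):
    """Estimated Jaccard distance: fraction of signature positions that differ."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return 1.0 - matches / len(sig_a)


if __name__ == "__main__":
    # Hypothetical fragment sets for two similar molecules.
    frags_a = {"CN", "CC", "CCN", "CCC"}
    frags_b = {"CN", "CC", "CCN", "CCO"}
    sig_a = minhash_signature(frags_a)
    sig_b = minhash_signature(frags_b)
    print(estimated_distance(sig_a, sig_b))
```

More permutations give a longer signature and a less noisy distance estimate, at the cost of more computation. In the workflow described above, such fingerprints would then be passed to UMAP for projection to two dimensions.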
Search for toxicity data. Detailed information on all
bioassays whose AIDs are recorded in our perovskite amine database is
retrieved from the PubChem BioAssay database23,26 using
our PubChemPy-based programming tool. Only bioassays in which more than
one perovskite amine is reported as “active” are kept. In addition,
bioassays that measure bioactivity other than toxicity are eliminated.
Finally, a table of PubChem bioassays is obtained, listing each AID, the
number of perovskite amines tested as active, and the assay name.
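The filtering logic of this step can be sketched in a few lines of Python. The input layout is an assumption made for illustration: `outcomes` is a list of (AID, CID, outcome) records, and `assay_info` maps each AID to its name and a hypothetical `is_toxicity` flag, since the text does not state how toxicity-related assays are identified. The AIDs and CIDs in the usage example are placeholders, not real PubChem identifiers.

```python
from collections import defaultdict


def summarize_assays(outcomes, assay_info):
    """Build the bioassay table described in the text.

    outcomes:   iterable of (aid, cid, outcome) tuples
    assay_info: dict aid -> {"name": str, "is_toxicity": bool}
    Returns rows with AID, number of active amines, and assay name,
    keeping only toxicity assays with more than one active amine.
    """
    active = defaultdict(set)
    for aid, cid, outcome in outcomes:
        if outcome.strip().lower() == "active":
            active[aid].add(cid)  # a set, so each amine is counted once

    table = []
    for aid, cids in active.items():
        info = assay_info[aid]
        if len(cids) > 1 and info["is_toxicity"]:
            table.append({"aid": aid, "n_active": len(cids), "name": info["name"]})
    # Most frequently active assays first; ties broken by AID.
    return sorted(table, key=lambda row: (-row["n_active"], row["aid"]))


if __name__ == "__main__":
    outcomes = [
        (1, 101, "Active"), (1, 102, "Active"), (1, 103, "Inactive"),
        (2, 101, "Active"),
        (3, 102, "Active"), (3, 103, "Active"),
    ]
    assay_info = {
        1: {"name": "placeholder cytotoxicity assay", "is_toxicity": True},
        2: {"name": "placeholder toxicity assay", "is_toxicity": True},
        3: {"name": "placeholder binding assay", "is_toxicity": False},
    }
    for row in summarize_assays(outcomes, assay_info):
        print(row)
```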
Visualization of Amine Atlas. The Amine Atlas is visualized
using Plotly, a Python open-source graphing library
(https://plotly.com/python/).
The Amine Atlas can be viewed with or without amine toxicity data. In
the Amine Atlas, each data point represents an amine: the UMAP results
are used as the two-dimensional coordinates of the data point, and its
color encodes either the classification result or the hit ratio of the
compound. Detailed information on the corresponding compound, such as
CID, SMILES, type (existing or potential), and classifications, is
displayed in the hover data tab. All Amine Atlases shown in the
following sections have corresponding interactive versions in the
supplementary material.