Methods

As shown in Figure 1, the research consists of five main steps: curation of the perovskite amines database, two-level classification of the perovskite amines, chemical informatics and machine learning computations, the search for toxicity data, and visualization of the Amine Atlas.
Curation of perovskite amines database. The amines corresponding to the perovskite ammonium cations mentioned in recently published reviews6,24 and a database25 are compiled into the “existing perovskite amines” list (e.g., ethylamine corresponds to ethylammonium). This list forms the basis of the perovskite amines database. The database is then expanded with “potential perovskite amines”, which are structurally similar to the existing amines. A PubChem similar-structure search is performed on each existing perovskite amine with a similarity threshold of 95%. The search results are then screened by removing ions, non-amines, and structures already in the database. In addition, an amine must have been tested in at least one activity assay in the PubChem BioAssay database23,26 to be included as a potential perovskite amine. These search and filtering steps are carried out with our programming tool, which is based on the open-source Python packages PubChemPy (https://pubchempy.readthedocs.io/en/latest/) and RDKit (https://www.rdkit.org/docs/api-docs.html). In the final database, each amine is recorded with its PubChem Compound ID (CID), name, SMILES, and a list of its corresponding PubChem BioAssay IDs (AIDs). The CID and name of the corresponding ammonium cation are also included.
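The screening of similarity-search hits described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the tool used in this work: the `Candidate` record, its field names, and the function `screen_candidates` are hypothetical, and the hits are assumed to have already been retrieved from PubChem and annotated upstream (for example, with charge and amine flags derived with RDKit). The CIDs and AIDs are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """One similarity-search hit; every field name here is hypothetical."""
    cid: int
    smiles: str
    charge: int      # net formal charge; non-zero marks an ion
    is_amine: bool   # assumed to be set upstream, e.g. by substructure matching
    aids: list = field(default_factory=list)  # BioAssay IDs that tested the compound


def screen_candidates(hits, existing_cids):
    """Apply the four filters from the text and return the surviving hits."""
    kept = {}
    for hit in hits:
        if hit.charge != 0:            # remove ions
            continue
        if not hit.is_amine:           # remove non-amines
            continue
        if hit.cid in existing_cids:   # remove structures already in the database
            continue
        if not hit.aids:               # require at least one activity assay
            continue
        kept.setdefault(hit.cid, hit)  # the same hit can come from several queries
    return list(kept.values())


# Placeholder CIDs and AIDs, for illustration only (not real PubChem identifiers).
existing = {1}
hits = [
    Candidate(1, "CCN", 0, True, [101]),        # already in the database
    Candidate(2, "CC[NH3+]", 1, False, [101]),  # ion
    Candidate(3, "CCO", 0, False, [102]),       # not an amine
    Candidate(4, "CCCN", 0, True, []),          # never assayed
    Candidate(5, "CCCCN", 0, True, [103]),      # passes every filter
    Candidate(5, "CCCCN", 0, True, [103]),      # duplicate from a second query
]
potential = screen_candidates(hits, existing)
print([c.cid for c in potential])
```

Keying the result on CID reflects that the same compound may be returned by the searches on several existing amines; whether the original tool deduplicates in this way is an assumption.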
Two-level classification. The amines are first classified according to their aromaticity and the position of the amine group (e.g., on the aromatic ring, directly attached to the aromatic ring, or on an alkyl substituent of the aromatic ring). Subclasses are then assigned based on more detailed structural features, such as functional groups and the linearity of alkyl chains. Functional groups and chemical fragments are identified with our RDKit-based programming tool (provided in the supplementary material). It is worth noting that the purpose of this classification is not to establish a new standard for classifying amines but to distinguish the amines in our database as finely as possible.
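The two-level scheme can be read as a small decision procedure. The sketch below is one possible reading and not the supplementary tool itself: it assumes the structural features have already been extracted (for instance by RDKit substructure matching), interprets "on the aromatic ring" as the nitrogen being a ring member, and uses hypothetical names (`AmineFeatures`, `first_level`, `second_level`) and illustrative class labels throughout.

```python
from dataclasses import dataclass


@dataclass
class AmineFeatures:
    """Pre-computed structural features of one amine (hypothetical field names)."""
    has_aromatic_ring: bool
    n_in_aromatic_ring: bool = False         # e.g. pyridine-type nitrogen
    n_bonded_to_aromatic_ring: bool = False  # e.g. aniline-type nitrogen
    functional_groups: tuple = ()            # e.g. ("hydroxyl",)
    linear_alkyl_chain: bool = True


def first_level(f):
    """Level 1: aromaticity and position of the amine group."""
    if not f.has_aromatic_ring:
        return "non-aromatic"
    if f.n_in_aromatic_ring:
        return "aromatic, N on ring"
    if f.n_bonded_to_aromatic_ring:
        return "aromatic, N attached to ring"
    return "aromatic, N on alkyl substituent"


def second_level(f):
    """Level 2: finer structure (functional groups, then chain linearity)."""
    if f.functional_groups:
        return "+".join(sorted(f.functional_groups))
    return "linear alkyl" if f.linear_alkyl_chain else "branched alkyl"


def classify(f):
    return first_level(f), second_level(f)


benzylamine = AmineFeatures(has_aromatic_ring=True)
aniline = AmineFeatures(has_aromatic_ring=True, n_bonded_to_aromatic_ring=True)
isobutylamine = AmineFeatures(has_aromatic_ring=False, linear_alkyl_chain=False)
ethanolamine = AmineFeatures(has_aromatic_ring=False, functional_groups=("hydroxyl",))

for name, feats in [("benzylamine", benzylamine), ("aniline", aniline),
                    ("isobutylamine", isobutylamine), ("ethanolamine", ethanolamine)]:
    print(name, classify(feats))
```

The order of the checks matters for molecules with more than one nitrogen; the precedence chosen here is an assumption.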
Chemical informatics and machine learning computations. The MHFP6 fingerprint27 is calculated for each perovskite amine molecule using the open-source Python package MHFP (https://github.com/reymond-group/mhfp). The dimensionality of the fingerprint is then reduced by Uniform Manifold Approximation and Projection (UMAP)28 with the open-source Python package UMAP (https://umap-learn.readthedocs.io/en/latest/index.html). The parameters of these two tools, namely the number of permutations for MHFP and the number of neighbors and the minimum distance for UMAP, are optimized so that the different amine classes and subclasses form distinct clusters. Data processing during the computations is performed with the open-source Python packages Pandas (https://pandas.pydata.org/docs/) and Scikit-learn29. The code for the computations is provided in the supplementary material.
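To make the role of the number of permutations concrete, the following sketch implements a generic MinHash fingerprint over a set of substructure strings, which is the idea underlying MHFP. It is a conceptual illustration only and does not reproduce the MHFP package: the hashing details, the constants, the function names, and the example substructure strings are all assumptions, and no SMILES parsing is performed.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # large prime modulus for the hash permutations (assumed value)


def make_permutations(n_permutations, seed=42):
    """Draw (a, b) pairs; each pair defines one map h -> (a*h + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(n_permutations)]


def shingle_hash(shingle):
    """Hash one substructure string to a 32-bit integer."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def minhash(shingles, permutations):
    """Fingerprint = minimum permuted hash under each permutation."""
    hashes = [shingle_hash(s) for s in shingles]
    return [min((a * h + b) % PRIME for h in hashes) for a, b in permutations]


def estimated_jaccard_distance(fp1, fp2):
    """Fraction of fingerprint positions that differ."""
    matches = sum(x == y for x, y in zip(fp1, fp2))
    return 1.0 - matches / len(fp1)


perms = make_permutations(n_permutations=128)

# Hypothetical substructure strings standing in for circular substructures.
amine_a = {"CN", "CCN", "CC"}
amine_b = {"CN", "CCN", "CCC", "CCCN"}

fp_a = minhash(amine_a, perms)
fp_b = minhash(amine_b, perms)
print(len(fp_a), estimated_jaccard_distance(fp_a, fp_b))
```

More permutations give a longer fingerprint and a less noisy estimate of the set similarity, at a higher computational cost. In umap-learn, the number of neighbors and the minimum distance correspond to the `n_neighbors` and `min_dist` arguments.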
Search for toxicity data. Detailed information on all bioassays whose AIDs are recorded in our perovskite amines database is retrieved from the PubChem BioAssay database23,26 using our PubChemPy-based programming tool. Only bioassays in which more than one perovskite amine is reported as “active” are kept. In addition, bioassays that measure bioactivity other than toxicity are excluded. Finally, a table of PubChem bioassays is obtained, listing for each its AID, the number of perovskite amines tested as active, and the assay name.
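The assay filtering can be summarized in a few lines of Python. The sketch below assumes that the assay records have already been downloaded and that each one carries a curated `is_toxicity` flag; the record layout, the function name `toxicity_table`, and the sorting of the output are assumptions, and the AIDs, CIDs, and assay names are placeholders rather than real PubChem entries.

```python
def toxicity_table(assays, min_active=2):
    """Return (AID, number of active amines, assay name) rows for toxicity assays.

    min_active=2 encodes "more than one perovskite amine showing active".
    """
    rows = []
    for assay in assays:
        if not assay["is_toxicity"]:   # drop assays measuring other bioactivity
            continue
        n_active = sum(1 for outcome in assay["outcomes"].values()
                       if outcome == "active")
        if n_active < min_active:      # drop assays with at most one active amine
            continue
        rows.append((assay["aid"], n_active, assay["name"]))
    rows.sort(key=lambda row: row[1], reverse=True)  # most hits first (our choice)
    return rows


# Placeholder records: outcomes map an amine CID to its reported activity outcome.
assays = [
    {"aid": 101, "name": "placeholder cytotoxicity assay", "is_toxicity": True,
     "outcomes": {1: "active", 2: "active", 3: "inactive"}},
    {"aid": 102, "name": "placeholder toxicity assay, single hit", "is_toxicity": True,
     "outcomes": {1: "active", 2: "inactive"}},
    {"aid": 103, "name": "placeholder receptor-binding assay", "is_toxicity": False,
     "outcomes": {1: "active", 2: "active", 3: "active"}},
    {"aid": 104, "name": "placeholder acute toxicity assay", "is_toxicity": True,
     "outcomes": {1: "active", 2: "active", 3: "active"}},
]

table = toxicity_table(assays)
for row in table:
    print(row)
```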
Visualization of Amine Atlas. The Amine Atlas is visualized using Plotly, an open-source Python graphing library (https://plotly.com/python/). The Amine Atlas can be viewed with or without amine toxicity data. In the Amine Atlas, each data point represents an amine: the UMAP results are used as the two-dimensional coordinates of the data point, and the classification or the hit ratio of the compound is encoded in the color of the data point. Detailed information on the corresponding compound, such as CID, SMILES, type (existing or potential), and classifications, is displayed on hover. All versions of the Amine Atlas shown in the following sections have corresponding interactive versions in the supplementary material.
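As an illustration of how such a plot can be assembled, the sketch below builds a Plotly-style figure specification as a plain dictionary, with one marker trace per class so that each class receives its own color and legend entry. It is not the plotting code used for the figures in this work: the function name `build_atlas_figure`, the record keys, the class labels, the coordinates, and the CIDs are placeholders.

```python
import json


def build_atlas_figure(points, title="Amine Atlas"):
    """Group points by class into scatter traces with per-point hover text."""
    traces = {}
    for p in points:
        trace = traces.setdefault(p["class"], {
            "type": "scatter",
            "mode": "markers",
            "name": p["class"],   # legend entry; one color per trace
            "x": [],
            "y": [],
            "text": [],
            "hoverinfo": "text",  # show only the custom hover text
        })
        trace["x"].append(p["x"])
        trace["y"].append(p["y"])
        trace["text"].append(
            f"CID: {p['cid']}<br>SMILES: {p['smiles']}"
            f"<br>Type: {p['type']}<br>Class: {p['class']}"
        )
    layout = {
        "title": {"text": title},
        "xaxis": {"title": {"text": "UMAP 1"}},
        "yaxis": {"title": {"text": "UMAP 2"}},
    }
    return {"data": list(traces.values()), "layout": layout}


# Placeholder coordinates, CIDs, and class labels, for illustration only.
points = [
    {"cid": 1, "smiles": "CCN", "type": "existing", "class": "class A", "x": 0.1, "y": 0.2},
    {"cid": 2, "smiles": "CCCN", "type": "potential", "class": "class A", "x": 0.3, "y": 0.1},
    {"cid": 3, "smiles": "NCc1ccccc1", "type": "existing", "class": "class B", "x": 2.0, "y": 1.5},
]

fig = build_atlas_figure(points)
print(json.dumps(fig, indent=2)[:200])
```

With Plotly installed, a dictionary of this shape can be passed to `plotly.graph_objects.Figure` for interactive display. Coloring by hit ratio instead of class would use a single trace with a numeric marker color, which is not shown here.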