ABSTRACT
Revealing how proteins fold is challenging in both discovery and description. Beyond acquiring an accurate 3D structure for a protein's stable state, a further hurdle is discovering the structural flexibility that is innate to the protein. Even when a huge number of flexible conformations are known, the difficulty remains of how to describe them. A novel approach, the protein structure fingerprint, has been developed to expose the comprehensive local folding variations and then construct folding conformations for the entire protein. The backbone of 5 amino acid residues was identified as a universal folden, and a set of Protein Folding Shape Codes (PFSC) was then derived to cover the folding space completely with an alphabetic description. Subsequently, a database was created to collect all possible folding shapes of local folding variations for all permutations of 5 amino acids. The Protein Folding Variation Matrix (PFVM) then assembles all possible local folding variations along the sequence of a protein, and it has several prominent features. First, it shows fluctuation with certain folding patterns along the sequence, which reveals how protein folding is related to the order of amino acids in the sequence. Second, all folding variations for an entire protein can be apprehended simultaneously, at a glance, within the PFVM. Third, all conformations can be determined from the local folding variations in the PFVM, so the total number of conformations is no longer ambiguous for any protein. Finally, the most probable folding conformation and its 3D structure can be acquired from the PFVM for protein structure prediction. Therefore, the protein structure fingerprint approach provides a significant means for investigating the protein folding problem.
INTRODUCTION
Protein folding is one of the challenging subjects in science,[1][2] and it has attracted particular attention since the recent progress made by AlphaFold. Using an artificial intelligence (AI) approach, AlphaFold achieved a significant breakthrough in accurately predicting 3D structures from protein sequence.[3][4] However, the protein folding problem has not yet been thoroughly resolved, because a protein is not a static structure.[5][6] Studies of intrinsically disordered proteins (IDPs) have already shown that many proteins lack a fixed three-dimensional structure and that many protein functions are accomplished by an ensemble of flexible conformations.[7][8][9][10] Thus, beyond structural accuracy, biologists also want to know how many different ways a protein can fold, why proteins adopt such folds, and which biological functions are affected by folding. It is well known that protein folding patterns are primarily decided by multiple weak interactions across the protein itself, such as hydrogen bonds, disulfide bonds, van der Waals forces, electrostatic interactions and hydrophobic interactions. Folding is also influenced by environmental factors, such as protein-protein interactions, ligands, ions, solvent, pH, temperature and chaperones. Under these constraints, a protein can still fold into various conformations between random coil and native state, or undergo reversible transitions between disorder and order. In 1957, Francis Crick indicated that protein folding was simply a function of the order of amino acids.[11] Indeed, a different order of amino acids, or the replacement of a residue in the sequence, may change the folding conformation. In 1969, Cyrus Levinthal indicated that a protein may have an astronomical number of local minima in conformational space,[12] and further pointed out that understanding the relationship between sequence and protein folding is a challenging problem. The basic tasks are how to obtain all possible folding conformations, how to present an astronomical number of folding conformations, and how to acquire the most probable conformation in the stable state together with its 3D structure.
[1] Science, 1 July, Vol. 309, no. 5731, pp. 78-102 (2005).
[2] Dill K A, MacCallum J L. The protein-folding problem, 50 years on. Science, 338(6110):1042 (2012).
[3] J. Jumper et al. "Highly accurate protein structure prediction with AlphaFold". Nature. 596 (7873): 583–589 (2021).
[4] Callaway, Ewen. "'It will change everything': DeepMind's AI makes gigantic leap in solving protein structures". Nature. 588 (7837): 203–204 (2020).
[5] Stephen Curry, No, DeepMind has not solved protein folding, Reciprocal Space (blog), 2 December (2020).
[6] Ball, Philip. "Behind the screens of AlphaFold". Chemistry World (2020).
[7] Robin van der Lee et al., Classification of intrinsically disordered regions and proteins. Chemical Reviews, 114(13):6589 (2014).
[8] Dunker AK, Lawson JD, Brown CJ, Williams RM, Romero P, Oh JS, Oldfield CJ, Campen AM, Ratliff CM, Hipps KW, Ausio J, Nissen MS, Reeves R, Kang C, Kissinger CR, Bailey RW, Griswold MD, Chiu W, Garner EC, Obradovic Z (2001). "Intrinsically disordered protein". Journal of Molecular Graphics & Modelling. 19 (1): 26–59.
[9] Dyson HJ, Wright PE (March 2005). "Intrinsically unstructured proteins and their functions". Nature Reviews Molecular Cell Biology. 6 (3): 197–208.
[10] Dunker AK, Silman I, Uversky VN, Sussman JL (December 2008). "Function and structure of inherently disordered proteins". Current Opinion in Structural Biology. 18 (6): 756–64.
[11] Cobb M. 60 years ago, Francis Crick changed the logic of biology. PLoS Biology, 15(9):e2003243 (2017).
[12] Levinthal, C. How to Fold Graciously. In Mössbauer Spectroscopy in Biological Systems, pp 22-24, Allerton House, Monticello, IL (1969).
Despite the lack of a systematic approach, some protein conformations are already known. They may be obtained from protein 3D structures that come from experimental measurements or from computational approaches. Experimental methods such as X-ray crystallography, Nuclear Magnetic Resonance (NMR) and Transmission Electron Cryomicroscopy (CryoTEM) can accurately determine the atomic coordinates of protein 3D structures. However, they provide only a limited set of folding conformations, for protein stable states under specific conditions. Experimental results are snapshots of protein structures: they provide significant folding information, but they cannot cover the enormous conformational space. In addition, experimental determination of protein 3D structures cannot keep pace with the rapid growth in known protein sequences, because a huge number of protein sequences are determined from the genetic code. To date, over 35,000,000 gene codes are available in the National Center for Biotechnology Information (NCBI) database,[13] and over 225,000,000 protein sequences in the Universal Protein Resource (UniProt) database.[14] So far, only about 187,000 3D structures are available in the Protein Data Bank (PDB).[15] In other words, fewer than 1% of all protein sequences (about 0.08% by these figures) have a known 3D structure. The development of computational approaches has therefore become an important route to predicting protein 3D structures. Protein structure prediction, however, focuses primarily on obtaining thermodynamically stable structures, not the multiple states of various folding conformations.
[13] https://www.ncbi.nlm.nih.gov/
[14] http://www.uniprot.org/
[15] https://www.rcsb.org/
In view of protein structural flexibility, many databases have accumulated information about proteins or sequence regions involving intrinsic disorder.[16][17][18][19][20] The definitions of IDP are based on annotations of data coming mainly from Nuclear Magnetic Resonance (NMR) and Small-angle X-ray Scattering (SAXS) measurements and from Molecular Dynamics (MD) simulations. However, an optimal approach to protein folding should obtain all possible folding conformations and expose folding differences between regions within a protein or between different proteins, including those arising from mutation or differentiation. In addition, the most probable conformation and its 3D structure should be extracted from the massive number of conformations.
[16] Lazar, T., Martínez-Pérez, E., Quaglia, F., Hatos, A., Chemes, L.B., Iserte, J.A., Méndez, N.A., Garrone, N.A., Saldaño, T.E., Marchetti, J., Velez Rueda, A.J., Bernadó, P., Blackledge, M., Cordeiro, T.N., Fagerberg, E., Forman-Kay, J.D., Fornasari, M.S., Gibson, T.J., Gomes, G-N.W., Gradinaru, C.C., Head-Gordon, T., Ringkjøbing Jensen, M., Lemke, E.A., Longhi, S., Marino-Buslje, C., Minervini, G., Mittag, T., Monzon, A.M., Pappu, R.V., Parisi, G., Ricard-Blum, S., Ruff, K.M., Salladini, E., Skepö, M., Svergun, D., Vallet, S.D., Varadi, M., Tompa, P., Tosatto, S.C.E., Piovesan D., PED in 2021: a major update of the Protein Ensemble Database for intrinsically disordered proteins, Nucleic Acids Research, Volume 49, Issue D1, (2021) D404–D411.
[17] Damiano Piovesan, Marco Necci, Nahuel Escobedo, Alexander Miguel Monzon, András Hatos … Nucleic Acids Research, Volume 49, Issue D1, 8 January 2021, Pages D361–D367.
[18] Fukuchi, Satoshi et al. "IDEAL: Intrinsically Disordered proteins with Extensive Annotations and Literature." Nucleic Acids Research vol. 40, Database issue (2012): D507-11.
[19] Federica Quaglia, Bálint Mészáros, Edoardo Salladini, András Hatos, Rita Pancsa … DisProt in 2022: improved quality and accessibility of protein intrinsic disorder annotation, Nucleic Acids Research, Volume 50, Issue D1, 7 January 2022, Pages D480–D487.
[20] Damiano Piovesan, Marco Necci, Nahuel Escobedo, Alexander Miguel Monzon, András Hatos, Ivan Mičetić, Federica Quaglia, Lisanna Paladin, Pathmanaban Ramasamy, Zsuzsanna Dosztányi, Wim F Vranken, Norman E Davey, Gustavo Parisi, Monika Fuxreiter, Silvio C E Tosatto, MobiDB: intrinsically disordered proteins in 2021, Nucleic Acids Research, Volume 49, Issue D1, 8 January 2021, Pages D361–D367.
Here, the protein structure fingerprint is presented as a novel approach that reveals protein folding variations as well as the most probable conformation. A folden, an element of 5 amino acid residues, is first defined to probe the attributes of local folds, and the local folds are then extended to the entire protein to discover all possible folding conformations. First, a folden of 5 points connected as a ball-and-stick chain serves as the initial model for mathematical derivation. Without biological structural constraints, all folds are equivalent around each joint with topological uniformity, and all possible folds in geometric space form a complete and continuous aggregation. Second, the continuous aggregation of folding descriptions is simplified by partitioning the space to reduce the variable dimensions, and is then applied to the biological space of proteins. This yields a set of 27 folding shapes that completely covers the folding patterns of 5 successive amino acid residues. Third, in alphabetic description, these 27 folding shapes are represented by the 26 letters and the “$” symbol. The topological model thus establishes the mathematical foundation for describing protein backbone folding. When this set of 27 symbols is applied to protein systems, it is called the Protein Folding Shape Code (PFSC),[21] which essentially represents the folding shapes of 5 amino acid residues.
[21] Yang J, Comprehensive description of protein structures using protein folding shape code. Proteins; 71.3:1497-1518 (2008).
For a protein with a known 3D structure, its complete folding conformation can be described by a PFSC string. The folding shape of any set of 5 successive amino acid residues is identified by a PFSC letter according to the given coordinates of the alpha carbon atoms. Along the sequence from N-terminus to C-terminus, the conformation is described by a string of PFSC letters. Because one PFSC letter represents the folding shape of 5 successive amino acid residues, two adjacent PFSC letters share an overlap of 4 amino acid residues. Thus any protein folding conformation can be completely described by a PFSC string, covering regular secondary structure fragments as well as irregular tertiary structure fragments.
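The windowing behind a PFSC string can be sketched in a few lines of Python. This is only an illustration of the bookkeeping: it does not assign PFSC letters from coordinates, and the function names and the example sequence are hypothetical. It shows that a protein of N residues yields N - 4 windows (hence N - 4 PFSC letters) and that adjacent windows share 4 residues.

```python
from __future__ import annotations

import string

# The 27 PFSC symbols: 26 capital letters plus "$" (as stated in the text).
PFSC_ALPHABET = string.ascii_uppercase + "$"
WINDOW = 5  # residues covered by one PFSC letter


def windows(sequence: str, size: int = WINDOW) -> list[str]:
    """Return every run of `size` successive residues, N- to C-terminus."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


def is_valid_pfsc_string(pfsc: str, sequence: str) -> bool:
    """Check only the length and alphabet of a PFSC string (hypothetical helper)."""
    expected = max(len(sequence) - WINDOW + 1, 0)
    return len(pfsc) == expected and all(c in PFSC_ALPHABET for c in pfsc)


seq = "MKTAYIAKQR"  # hypothetical 10-residue sequence, for illustration only
w = windows(seq)
print(len(w))                 # 6 windows, so 6 PFSC letters
print(w[0], w[1])             # MKTAY KTAYI
print(w[0][1:] == w[1][:-1])  # True: adjacent windows share 4 residues
```

The validity check covers string length and alphabet only; whether a given letter is the correct shape for a window depends on the alpha carbon coordinates, which this sketch does not model.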
For a protein without a given 3D structure, the comprehensive folding conformations can be exposed through local folding variations. To achieve this, all possible permutations of 5 amino acids, as well as all possible local folding shapes for each set of 5 amino acids, must be known. There are 3,200,000 permutations in total (20^5) for 5 amino acids drawn from the 20 standard amino acids. For the permutations of 5 amino acids available in the PDB, all folds were collected first. Then, for the permutations not available in the PDB, 3D structures were calculated by molecular dynamics simulations and the folding shapes were obtained. Consequently, a new database, 5AAPFSC, in which the folding shapes of 5 amino acids are described by PFSC, was created to assemble all possible folding shapes for each permutation of 5 amino acid residues. A set of 5 amino acids may have one or more PFSC letters, but no more than 27. Each set of 5 amino acids may have different folding patterns and a different number of PFSC letters. Therefore, from the sequence alone, all possible folding shapes for each successive set of 5 amino acids from N-terminus to C-terminus can be thoroughly represented by successive sets of PFSC letters. The local folding variations are displayed in the Protein Folding Variation Matrix (PFVM),[22] in which the protein sequence is listed horizontally and all folding shapes, as PFSC letters, for each 5 successive amino acids are displayed vertically. The PFVM provides rich information to promote the investigation of protein folding. First, the comprehensive local folding variations along the sequence of a protein are exhibited simultaneously by the PFVM. Second, the local folding variations fluctuate in folding pattern and in number. Third, all possible conformations of a protein, astronomical in number, can be assembled from various combinations of local folding variations, and the most probable conformation and 3D structure can be readily determined. Finally, the protein structure fingerprint produces an ensemble of conformations for probing protein structures, with applications in biological drug design and disease research.[23][24][25] Thus, the protein structure fingerprint provides a significant foundation for solving the protein folding problem and for its applications.
[22] Yang J. Protein Structure Fingerprint Technology. J Bioinform, Genomics, Proteomics X: 3(2): 1036, (2008).
[23] Yang J & Lee WH, Protein Structure Alphabetic Alignment, Protein Structure, Edited by Eshel Faraggi, InTech Publishers, (ISBN 978-953-51-0555-8), 133-156 (2012).
[24] Yang J, Wu G, From Sequence to Protein Folding Variations. Biomedical Journal of Scientific & Technical Research, (2019).
[25] Yang J, Zhang P, Cheng W X, et al. Exposing Structural Variations in SARS-CoV-2 Evolution. Scientific Reports, 11:22042, (2021).
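As a rough illustration of how a PFVM could be assembled from a sequence, the Python sketch below uses a tiny invented lookup table in place of the 5AAPFSC database. The table entries, function names and example sequence are all hypothetical. The conformation count assumes that letters in different columns combine freely, which is a simplifying assumption and not something the text specifies.

```python
from math import prod

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues
print(len(AMINO_ACIDS) ** 5)           # 3200000 ordered sets of 5 residues

# Hypothetical stand-in for the 5AAPFSC database: pentapeptide -> PFSC letters.
# These entries are invented for illustration only.
TOY_5AAPFSC = {
    "MKTAY": ["A", "V"],
    "KTAYI": ["A", "J", "B"],
    "TAYIA": ["A"],
}


def build_pfvm(sequence, table):
    """Return one column of candidate PFSC letters per 5-residue window."""
    return [table[sequence[i:i + 5]] for i in range(len(sequence) - 4)]


def count_conformations(pfvm):
    """Product of column sizes, assuming columns combine freely."""
    return prod(len(column) for column in pfvm)


pfvm = build_pfvm("MKTAYIA", TOY_5AAPFSC)
for window_start, column in enumerate(pfvm):
    print(window_start, column)
print(count_conformations(pfvm))       # 2 * 3 * 1 = 6
```

Storing the matrix as a list of columns mirrors the description in the text: the sequence runs along one axis and the alternative folding shapes for each window run along the other. Columns may have different heights, since each set of 5 amino acids can have a different number of PFSC letters.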
METHODS
Protein Folding Shape Code (PFSC).
With the protein folding fingerprint, a PFSC alphabetic string can provide a complete description of a protein conformation. Mathematically, 5 successively connected points in geometric space were first considered as a topological folding model. Through derivation, the initially higher dimensions of the topological folding space were reduced and the continuous space was partitioned, yielding a set of 27 folding shapes that completely covers the folding patterns of 5 sequentially connected points. These folding shapes can be represented by 27 symbols, the 26 letters plus the “$” symbol, as a digitized expression. For a biological protein, however, a set of 5 successive amino acid residues may not actually adopt all 27 folding shapes because of structural constraints. In alphabetical expression, the 27 folding shapes for 5 successive amino acid residues are called the Protein Folding Shape Code (PFSC) and are displayed in the cube of Figure 1. As an integrated set, any PFSC letter in the cube has partial folding similarity with its neighboring letters, so the 27 PFSC letters can be transformed from one to another through neighboring similarity. For example, the letter “A” represents a typical alpha-helix, and “H”, “D”, “V”, “Y”, “J” and “P” around “A” each have partial alpha-helical character in their folds. In the same column as “A”, “H” is an extended helix fold and “D” is a compressed helix fold. Near “A”, “V”, “Y”, “J” and “P” each have a partial helical fold at the N- or C-terminus. The letter “B” represents a typical beta-strand, and “E”, “G”, “V”, “J”, “M” and “S” around “B” each have partial beta-strand character in their folds. The remaining letters relate to irregular folds of tertiary structure and also have partial similarity with their neighboring letters in the cube. In brief, no PFSC letter is isolated, and all 27 PFSC letters form a meaningful ensemble with structural relevance.
The blue arrows in Figure 1 indicate that the folding conformation of a protein with a known 3D structure can be completely described by a PFSC string without gaps. A PFSC letter is assigned to the folding shape of each 5 consecutive residues from N-terminus to C-terminus, and two adjacent PFSC letters share the folding shape of four amino acids. In summary, for any protein with a given 3D structure, the PFSC alphabetical string can completely describe the folding conformation, covering secondary structure fragments as well as tertiary structure fragments.