ABSTRACT
Revealing how proteins fold is challenging in both discovery and
description. Beyond acquiring an accurate 3D structure for the stable
state of a protein, a further hurdle is discovering the structural
flexibility that is innate to the protein. Even when a huge number of
flexible conformations are known, the difficulty is how to describe
them. A novel approach, the protein structure fingerprint, has been
developed to expose comprehensive local folding variations and then
construct folding conformations for the entire protein. The backbone of
5 amino acid residues was identified as a universal folden, and a set of
Protein Folding Shape Codes (PFSC) was derived that completely covers
the folding space with an alphabetic description. Subsequently, a
database was created to collect all possible folding shapes of local
folding variations for all permutations of 5 amino acids. The Protein
Folding Variation Matrix (PFVM) then assembles all possible local
folding variations along the sequence of a protein, and it has several
prominent features. First, it shows fluctuation with certain folding
patterns along the sequence, which reveals how protein folding is
related to the order of amino acids in the sequence. Second, all folding
variations for an entire protein can be apprehended at a glance within
the PFVM. Third, all conformations can be determined from the local
folding variations in the PFVM, so the total number of conformations is
no longer ambiguous for any protein. Finally, the most probable folding
conformation and its 3D structure can be acquired from the PFVM for
protein structure prediction. Therefore, the protein structure
fingerprint approach provides a significant means for investigating the
protein folding problem.
INTRODUCTION
Protein folding is one of the most challenging subjects in science
[1: Science, 1 July 2005, Vol. 309, no. 5731, pp. 78-102; 2: Dill KA,
MacCallum JL. The protein-folding problem, 50 years on. Science,
338(6110):1042 (2012)], and it has attracted particular attention since
the recent progress made by AlphaFold. Using an artificial intelligence
(AI) approach, AlphaFold made a significant breakthrough in accurately
predicting 3D structures from protein sequence [3: Jumper J, et al.
“Highly accurate protein structure prediction with AlphaFold”. Nature.
596 (7873): 583–589 (2021); 4: Callaway, Ewen. “‘It will change
everything’: DeepMind’s AI makes gigantic leap in solving protein
structures”. Nature. 588 (7837): 203–204 (2020)]. However, the protein
folding problem has not been thoroughly resolved, because a protein is
not a static structure [5: Curry, Stephen. No, DeepMind has not solved
protein folding. Reciprocal Space (blog), 2 December (2020); 6: Ball,
Philip. “Behind the screens of AlphaFold”. Chemistry World, (2020)].
Research on intrinsically disordered proteins (IDPs) has already shown
that many proteins lack a fixed three-dimensional structure and that
many protein functions are accomplished by an ensemble of flexible
conformations [7: van der Lee R, et al. Classification of intrinsically
disordered regions and proteins. Chemical Reviews, 114(13):6589 (2014);
8: Dunker AK, Lawson JD, Brown CJ, Williams RM, Romero P, Oh JS,
Oldfield CJ, Campen AM, Ratliff CM, Hipps KW, Ausio J, Nissen MS, Reeves
R, Kang C, Kissinger CR, Bailey RW, Griswold MD, Chiu W, Garner EC,
Obradovic Z. “Intrinsically disordered protein”. Journal of Molecular
Graphics & Modelling. 19 (1): 26–59 (2001); 9: Dyson HJ, Wright PE.
“Intrinsically unstructured proteins and their functions”. Nature
Reviews Molecular Cell Biology. 6 (3): 197–208 (March 2005); 10: Dunker
AK, Silman I, Uversky VN, Sussman JL. “Function and structure of
inherently disordered proteins”. Current Opinion in Structural Biology.
18 (6): 756–64 (December 2008)].
Thus, in addition to structural accuracy, biologists also want to know
in how many different ways a protein can fold, why a protein adopts such
folds, and which biological functions are affected by folding. It is
well known that protein folding patterns are primarily determined by
multiple, mostly weak, global interactions within the protein itself,
such as hydrogen bonds, disulfide bonds, van der Waals forces,
electrostatic interactions and hydrophobic interactions. Folding is also
influenced by environmental factors, such as protein-protein
interactions, ligands, ions, solvent, pH, temperature and chaperones.
Under these constraints, a protein can still fold into various
conformations between the random coil and the native state, or undergo
reversible transitions between disorder and order. In 1957, Francis
Crick indicated that protein folding was simply a function of the order
of amino acids [11: Cobb M. 60 years ago, Francis Crick changed the
logic of biology. PLoS Biology, 15(9):e2003243 (2017)]. Indeed, a
different order of amino acids, or the replacement of a residue in the
sequence, may change the folding conformation. In 1969, Cyrus Levinthal
indicated that a protein may have an astronomical number of local minima
in conformational space [12: Levinthal C. How to Fold Graciously. In
Mössbauer Spectroscopy in Biological Systems, pp 22-24, Allerton House,
Monticello, IL (1969)], and further pointed out that understanding the
relationship between sequence and protein folding was a challenging
problem. The basic tasks are how to obtain all possible folding
conformations, how to present an astronomical number of folding
conformations, and how to acquire the most probable conformation in the
stable state together with its 3D structure.
Despite the lack of a systematic approach, some protein conformations
are already known. Protein conformations may be obtained from protein 3D
structures, which are either experimental measurements or the results of
computational approaches. Experimental methods, such as X-ray
crystallography, Nuclear Magnetic Resonance (NMR) and Transmission
Electron Cryomicroscopy (CryoTEM), can accurately determine the atomic
coordinates of protein 3D structures. However, they provide only a
limited set of folding conformations for the stable states of a protein
under specific conditions. Experimental results are snapshots of protein
structures: they provide significant folding information, but they
cannot cover the enormous conformational space. Moreover, the
experimental determination of protein 3D structures cannot keep pace
with the rapid growth in known protein sequences, since a huge number of
protein sequences are derived from genetic code. To date, over
35,000,000 gene codes are available in the National Center for
Biotechnology Information (NCBI) database [13:
https://www.ncbi.nlm.nih.gov/] and over 225,000,000 protein sequences in
the Universal Protein Resource (UniProt) database [14:
http://www.uniprot.org/]. So far, only about 187,000 3D structures are
available in the Protein Data Bank (PDB) [15: https://www.rcsb.org/]. In
other words, fewer than 0.1% of all protein sequences (about 187,000 of
225,000,000, or roughly 0.08%) have a known 3D structure. The
development of computational approaches has therefore become an
important route to predicting protein 3D structures. The effort in
protein structure prediction, however, focuses primarily on obtaining
thermodynamically stable structures, not on the multiple states of
various folding conformations. In view of protein structure
flexibility, many databases have accumulated information about proteins
or sequence regions involving intrinsically disordered proteins (IDPs)
[16: Lazar, T., Martínez-Pérez, E., Quaglia, F., Hatos, A., Chemes,
L.B., Iserte, J.A., Méndez, N.A., Garrone, N.A., Saldaño, T.E.,
Marchetti, J., Velez Rueda, A.J., Bernadó, P., Blackledge, M., Cordeiro,
T.N., Fagerberg, E., Forman-Kay, J.D., Fornasari, M.S., Gibson, T.J.,
Gomes, G-N.W., Gradinaru, C.C., Head-Gordon, T., Ringkjøbing Jensen, M.,
Lemke, E.A., Longhi, S., Marino-Buslje, C., Minervini, G., Mittag, T.,
Monzon, A.M., Pappu, R.V., Parisi, G., Ricard-Blum, S., Ruff, K.M.,
Salladini, E., Skepö, M., Svergun, D., Vallet, S.D., Varadi, M., Tompa,
P., Tosatto, S.C.E., Piovesan D., PED in 2021: a major update of the
Protein Ensemble Database for intrinsically disordered proteins, Nucleic
Acids Research, Volume 49, Issue D1, (2021) D404–D411; 17: Damiano
Piovesan, Marco Necci, Nahuel Escobedo, Alexander Miguel Monzon, András
Hatos … Nucleic Acids Research, Volume 49, Issue D1, 8 January 2021,
Pages D361–D367; 18: Fukuchi, Satoshi et al. “IDEAL: Intrinsically
Disordered proteins with Extensive Annotations and Literature.” Nucleic
Acids Research vol. 40, Database issue (2012): D507-11; 19: Federica
Quaglia, Bálint Mészáros, Edoardo Salladini, András Hatos, Rita Pancsa …
DisProt in 2022: improved quality and accessibility of protein intrinsic
disorder annotation, Nucleic Acids Research, Volume 50, Issue D1, 7
January 2022, Pages D480–D487; 20: Damiano Piovesan, Marco Necci, Nahuel
Escobedo, Alexander Miguel Monzon, András Hatos, Ivan Mičetić, Federica
Quaglia, Lisanna Paladin, Pathmanaban Ramasamy, Zsuzsanna Dosztányi, Wim
F Vranken, Norman E Davey, Gustavo Parisi, Monika Fuxreiter, Silvio C E
Tosatto, MobiDB: intrinsically disordered proteins in 2021, Nucleic
Acids Research, Volume 49, Issue D1, 8 January 2021, Pages D361–D367].
The definitions of IDPs are based on annotations of data coming mainly
from Nuclear Magnetic Resonance (NMR) and Small-angle X-ray Scattering
(SAXS) measurements and from Molecular Dynamics (MD) simulations.
However, an optimal approach to protein folding should obtain all
possible folding conformations and expose folding differences between
regions within a protein or between different proteins, including those
caused by mutation or differentiation. In addition, the most probable
conformation and its 3D structure should be extractable from a massive
number of conformations. Here, the protein structure fingerprint is
presented as a novel approach that reveals protein folding variations as
well as the most probable conformation. A folden, an element of 5 amino
acid residues, is first defined to probe the attributes of local folds,
and the local folds are then extended to the entire protein to discover
all possible folding conformations.
First, a folden is initially modeled as 5 points connected in
ball-and-stick fashion, from which a mathematical derivation is made.
Without biological structural constraints, all folds are equivalent
around each joint with topological uniformity, and all possible folds in
geometric space form a complete and continuous aggregation. Second, this
continuous aggregation of folding descriptions is simplified by
partitioning the space to reduce the variable dimensions, and it is then
applied to the biological space of proteins. A set of 27 folding shapes
is thus obtained, which completely covers the various folding patterns
of 5 successive amino acid residues. Third, for an alphabetic
description, these 27 folding shapes are represented by 26 letters and
the “$” symbol. The topological model thereby establishes the
mathematical foundation for describing protein backbone folding. When
this set of 27 symbols is applied to protein systems, it is called the
Protein Folding Shape Code (PFSC) [21: Yang J. Comprehensive description
of protein structures using protein folding shape code. Proteins;
71.3:1497-1518 (2008)], which essentially represents the folding shapes
of 5 amino acid residues.
For a protein with a known 3D structure, its complete folding
conformation can be described by a PFSC string. The folding shape of any
set of 5 successive amino acid residues is identified by a PFSC letter
according to the given coordinates of the alpha carbon atoms. Along the
sequence from N-terminus to C-terminus, the conformation can be
described by a string of PFSC letters. Because one PFSC letter
represents the folding shape of 5 successive amino acid residues, two
adjacent PFSC letters share the folding shape of the 4 amino acid
residues on which they overlap. Thus any protein folding conformation
can be completely described by a PFSC string, covering regular secondary
structure fragments as well as irregular tertiary structure fragments.
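The sliding 5-residue window described above can be sketched in a few
lines. This is only an illustration of the windowing, not the published
assignment procedure. The names `five_residue_windows` and `pfsc_string`
are introduced here for the example, and `classify` is a hypothetical
stand-in for the step that maps five alpha carbon coordinates to a PFSC
letter, which the text does not specify.

```python
from typing import Callable, List, Sequence, Tuple

WINDOW = 5  # one PFSC letter describes 5 successive residues

Coord = Tuple[float, float, float]


def five_residue_windows(n_residues: int) -> List[range]:
    """Return the residue index range covered by each PFSC position."""
    if n_residues < WINDOW:
        return []
    return [range(i, i + WINDOW) for i in range(n_residues - WINDOW + 1)]


def pfsc_string(ca_coords: Sequence[Coord],
                classify: Callable[[List[Coord]], str]) -> str:
    """Assemble a string by classifying every 5-residue window in order.

    `classify` is a hypothetical placeholder supplied by the caller; the
    real shape assignment from coordinates is not reproduced here.
    """
    return "".join(
        classify([ca_coords[i] for i in window])
        for window in five_residue_windows(len(ca_coords))
    )


windows = five_residue_windows(8)
shared = sorted(set(windows[0]) & set(windows[1]))
print(len(windows), shared)  # 4 [1, 2, 3, 4]
```

A chain of n residues therefore yields n - 4 letters, and each pair of
adjacent windows overlaps on exactly 4 residues.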
For a protein without a given 3D structure, the comprehensive folding
conformations can be exposed through local folding variations. To
achieve this goal, all possible permutations of 5 amino acids, as well
as all possible local folding shapes for each of them, need to be known.
Based on the 20 standard amino acids, there are 20^5 = 3,200,000
permutations of 5 amino acids in total. For the permutations of 5 amino
acids available in the PDB, all folds were collected first. Then, for
the permutations of 5 amino acids not available in the PDB, their 3D
structures were calculated by molecular dynamics simulations and the
folding shapes were obtained. Consequently, a new database, 5AAPFSC, in
which the folding shapes of 5 amino acids are described by PFSC, was
created to assemble all possible folding shapes for each permutation of
5 amino acid residues. A set of 5 amino acids may have one or more PFSC
letters, but no more than 27 PFSC assignments. Each set of 5 amino acids
may have different folding patterns and a different number of PFSC
letters. Therefore, from the sequence alone, all possible folding shapes
for each successive 5 amino acids from N-terminus to C-terminus can be
thoroughly represented by consecutive sets of PFSC letters. The local
folding variations are displayed in the Protein Folding Variation Matrix
(PFVM) [22: Yang J. Protein Structure Fingerprint Technology. J
Bioinform, Genomics, Proteomics X: 3(2): 1036, (2008)], in which the
protein sequence is listed horizontally and all folding shapes, as PFSC
letters, for each 5 successive amino acids are displayed vertically.
The PFVM provides rich information for investigating protein folding.
First, the comprehensive local folding variations along the sequence of
a protein are exhibited simultaneously by the PFVM. Second, the local
folding variations fluctuate along the sequence in both folding pattern
and number. Third, all possible conformations of a protein, which are
astronomical in number, can be assembled from the various combinations
of local folding variations, and the most probable conformation and its
3D structure can be readily determined. Finally, the protein structure
fingerprint produces an ensemble of conformations for probing protein
structures, with applications in biological drug design and disease
research [23: Yang J & Lee WH, Protein Structure Alphabetic Alignment,
Protein Structure, Edited by Eshel Faraggi, InTech Publishers, (ISBN
978-953-51-0555-8), 133-156 (2012); 24: Yang J, Wu G, From Sequence to
Protein Folding Variations. Biomedical Journal of Scientific & Technical
Research, (2019); 25: Yang J, Zhang P, Cheng W X, et al. Exposing
Structural Variations in SARS-CoV-2 Evolution. Scientific Reports,
11:22042, (2021)]. Thus, the protein structure fingerprint provides a
significant foundation for solving the protein folding problem and for
its applications.
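A small sketch may help make the PFVM idea concrete: one column per
5-residue window, each column holding the PFSC letters available to that
window, with the number of letter combinations given by the product of
the column sizes. The lookup table below is invented for illustration
and is not taken from the 5AAPFSC database. The function names are
assumptions made for this example. The product is an upper bound that
treats every combination of neighboring letters as allowed, which the
text does not state explicitly.

```python
from math import prod
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes

# 20 choices at each of 5 positions: the size of the 5-residue space.
N_FIVE_RESIDUE_SETS = len(AMINO_ACIDS) ** 5  # 3,200,000


def build_pfvm(sequence: str, shapes: Dict[str, List[str]]) -> List[List[str]]:
    """Return one column per 5-residue window with its listed PFSC letters."""
    return [shapes[sequence[i:i + 5]] for i in range(len(sequence) - 4)]


def count_combinations(pfvm: List[List[str]]) -> int:
    """Multiply column sizes; assumes every combination is permitted."""
    return prod(len(column) for column in pfvm)


# Toy lookup table: made-up letters for illustration only, NOT 5AAPFSC data.
toy_shapes = {
    "MKTAY": ["A", "V"],
    "KTAYI": ["A"],
    "TAYIA": ["A", "B", "J"],
}

pfvm = build_pfvm("MKTAYIA", toy_shapes)
print(N_FIVE_RESIDUE_SETS, count_combinations(pfvm))  # 3200000 6
```

Even with only a few letters per column, the product grows
exponentially with sequence length, which is consistent with the
astronomical number of conformations discussed above.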
METHODS
Protein Folding Shape Code (PFSC).
With the protein structure fingerprint, a PFSC alphabetic string can
provide a complete description of a protein conformation.
Mathematically, 5 points connected successively in geometric space were
first considered as a topological folding model. Through derivation, the
initially high dimensionality of the topological folding space was
reduced and the continuous space was partitioned, yielding a set of 27
folding shapes that completely covers the various folding patterns of 5
sequentially connected points. These folding shapes can be represented
by 27 symbols, the 26 letters and the “$” symbol, as a digitized
expression. For a biological protein, however, a set of 5 successive
amino acid residues may not actually adopt all 27 folding shapes because
of structural constraints. In alphabetical expression, the 27 folding
shapes for 5 successive amino acid residues are called the Protein
Folding Shape Code (PFSC) and are displayed in the cube of Figure 1. As
an integrated set, any PFSC letter in the cube has partial folding
similarity with its neighboring letters, so the 27 PFSC letters can be
transformed from one to another through neighboring similarity. For
example, the letter “A” represents a typical alpha-helix, and each of
“H”, “D”, “V”, “Y”, “J” and “P” around “A” has partial alpha-helical
character in its fold. In the same column as “A”, “H” is an extended
helical fold and “D” is a compressed helical fold. Near “A”, “V”, “Y”,
“J” and “P” each have a partial helical fold at the N- or C-terminus.
The letter “B” represents a typical beta-strand, and each of “E”, “G”,
“V”, “J”, “M” and “S” around “B” has partial beta-strand character in
its fold. The remaining letters relate to irregular tertiary folds and
also have partial similarity with their neighboring letters in the cube.
In brief, no PFSC letter is isolated, and all 27 PFSC letters form a
meaningful ensemble with structural relevance.
The blue arrows in Figure 1 indicate that the folding conformation of a
protein with a known 3D structure can be completely described by a PFSC
string without gaps. A PFSC letter is assigned to the folding shape of
each 5 consecutive residues from N-terminus to C-terminus, and two
adjacent PFSC letters share the folding shape of four amino acids. In
summary, for any protein with a given 3D structure, the PFSC
alphabetical string can completely describe the folding conformation,
covering secondary structure fragments as well as tertiary structure
fragments.
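The "without gaps" property can be checked mechanically. The sketch
below assumes that the 27 symbols are the 26 uppercase letters plus “$”.
The text says only "26 letters", so the uppercase choice is an
assumption, as are the function names and the example string, which is
made up and does not describe any real protein.

```python
import string

# Assumption: the 27 symbols are the 26 uppercase letters plus "$".
PFSC_ALPHABET = set(string.ascii_uppercase) | {"$"}


def check_pfsc(pfsc: str, n_residues: int) -> bool:
    """True if only PFSC symbols are used, with one letter per window."""
    return set(pfsc) <= PFSC_ALPHABET and len(pfsc) == max(n_residues - 4, 0)


def covered_residues(pfsc: str) -> set:
    """Return 0-based residue indices covered by at least one letter."""
    return {i + k for i in range(len(pfsc)) for k in range(5)}


example = "AAAVB$"  # made-up string for a 10-residue chain, illustration only
print(check_pfsc(example, 10))                      # True
print(covered_residues(example) == set(range(10)))  # True
```

Because consecutive windows advance by one residue, every residue from
the first to the last falls inside at least one window, which is why the
description leaves no gap.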