Database preparation
Database 1, containing 29360 amino acid sequences, was built from an NCBI
BLAST search with the following settings:
- Database: non-redundant protein sequences (nr);
- Algorithm: blastp (protein-protein BLAST);
- Expect threshold (E-value): 10⁻²⁰.
Database 2, containing 2416 nucleotide sequences, was created using the
NCBI Gene Database
(https://www.ncbi.nlm.nih.gov/gene/).
The original databases contained many unwanted entries, such as duplicate
and truncated sequences.
Truncated sequences were excluded by length: Hsp60 amino acid sequences
were retained only if they contained 450 to 650 residues, and Hsp60 gene
nucleotide sequences only if they contained 1350 to 1950 bp. Duplicate
amino acid sequences were removed using the criterion 99% ≤ PID ≤ 100%,
where PID is the percent identity of two compared sequences.
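
The filtering criteria above can be illustrated with a minimal Python sketch. It is not the procedure used to build the databases: the function names are hypothetical, and the PID here is a simplified ungapped, position-by-position identity divided by the longer sequence length. The text does not state how PID was computed, so an alignment-based PID would normally be substituted. Only the length ranges and the 99% threshold are taken from the text.

```python
# Thresholds taken from the text.
AA_LENGTH_RANGE = (450, 650)    # amino acid residues
NT_LENGTH_RANGE = (1350, 1950)  # base pairs
DUPLICATE_PID = 99.0            # percent identity at or above which a pair is a duplicate


def within_length(seq, length_range):
    """Return True if the sequence length lies inside the inclusive range."""
    lo, hi = length_range
    return lo <= len(seq) <= hi


def percent_identity(a, b):
    """Simplified PID (assumption): ungapped position-wise matches
    divided by the length of the longer sequence, in percent."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))


def filter_and_deduplicate(records, length_range, pid_threshold=DUPLICATE_PID):
    """Drop sequences outside the length range, then greedily keep a
    sequence only if its PID to every already kept sequence is below
    the threshold. `records` maps a sequence name to its sequence."""
    kept = {}
    for name, seq in records.items():
        if not within_length(seq, length_range):
            continue  # too short or too long
        if any(percent_identity(seq, other) >= pid_threshold
               for other in kept.values()):
            continue  # duplicate of a sequence already kept
        kept[name] = seq
    return kept
```

In the text the PID criterion is applied to the amino acid sequences only. The nucleotide set would use `within_length(seq, NT_LENGTH_RANGE)` alone, for example `[s for s in seqs if within_length(s, NT_LENGTH_RANGE)]`. The greedy pass compares every candidate with every kept sequence, so it scales quadratically; for tens of thousands of sequences a dedicated clustering tool would be a more practical choice.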