Discussion:
The SARS-CoV-2 pandemic has already cost more than 333 thousand lives, with many more deaths being reported each day. During this period, the global community has been unable to predict the virus's virulence, seasonal variation, carriage and immunity. However, it is clear that the fatality rate varies by region and that the degree of virulence varies from person to person. Some regions in Europe and North America were affected the most, while most of Asia, Africa and Australia remain less affected. A close analysis of this ssRNA genome has therefore become a basic scientific need.
This study has characterized the SARS-CoV-2 virus circulating in Southeast Asia into 4 major groups and 2 sub-groups by studying common non-synonymous (NS) mutations. Group 1 consists of 5 out of 7 Indonesian sequences, 3 out of 8 sequences from Thailand and the only sequence from Nepal. Group 2 involves 40% of the variants in this study. Strains belonging to this group carry the characteristic co-evolved NS mutations NSP12_P323L and Spike_D614G. These variants were initially prevalent in Europe and North America, and now constitute 68% of viral sequences worldwide. A recent study analyzed 95 sequences and also found NSP12_P323L variants to be at a higher frequency, and reported that this variant was mostly found outside of Wuhan, China (Khailany, Safdar, & Ozaslan, 2020). Another study suggests that the RNA-dependent RNA polymerase (RdRp) aa substitution at the 323rd position (NSP12_P323L) alters RdRp fidelity, which, in turn, increases the number of mutations within the virus and gives rise to co-evolved mutations (Pachetti et al., 2020). NSP12_P323L co-evolved with Spike_D614G; this particular non-silent spike protein mutation generates an additional elastase cleavage site near the S1-S2 junction and thus facilitates fusion and cell entry (Koyama, Weeraratne, Snowdon, & Parida, 2020). This variant (Spike_D614G) was first observed on January 28, 2020 and was initially prevalent in Europe. Within 4 months, it rapidly outcompeted its ancestral subtype all over the world (Bhattacharyya et al., 2020). This explains the frequency of Group 2 variants in Southeast Asia and why these variants have subdivided into additional sub-groups involving co-evolving mutations.
We differentiated Group 2 into 2 subgroups, 2a and 2b, which involve the N_203-204: RG> KR and NS3_Q57H amino acid substitutions, respectively, along with NSP12_P323L and Spike_D614G. Several studies (Ayub, 2020; Lorusso et al., 2020; Yin, 2020) mention a trinucleotide block mutation (nucleotides 28881-28883: GGG>AAC) that results in 2 amino acid changes (N_203-204: RG> KR) and affects the Serine-Arginine-rich motif of the N protein. This trinucleotide block mutation was found in 8 sequences, 3 of which were from Dhaka, Bangladesh. NS3_Q57H mutation variants have been commonly found in the USA (Mercatelli & Giorgi, 2020) and Europe and are predicted to be deleterious (Issa, Merhi, Panossian, Salloum, & Tokajian, 2020).
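As a worked illustration of how one trinucleotide block substitution can change two adjacent residues, the following Python sketch translates the affected codons before and after the change. It is only a sketch under stated assumptions: the N ORF start (nucleotide 28274) and the reference nucleotides AGGGGA at positions 28880-28885 are taken from the reference genome NC_045512.2 rather than from this study and should be checked against the reference record. The names N_ORF_START, codon_number and translate are hypothetical helpers introduced here.

```python
# Assumption: 1-based start of the N ORF in reference genome NC_045512.2.
N_ORF_START = 28274

# Minimal codon lookup covering only the codons needed for this example.
CODON_TO_AA = {"AGG": "R", "GGA": "G", "AAA": "K", "CGA": "R"}


def codon_number(genome_pos, orf_start=N_ORF_START):
    """Return the 1-based codon number that contains a 1-based genome position."""
    return (genome_pos - orf_start) // 3 + 1


def translate(nucleotides):
    """Translate an in-frame nucleotide string using the minimal lookup above."""
    return "".join(
        CODON_TO_AA[nucleotides[i:i + 3]] for i in range(0, len(nucleotides), 3)
    )


# Assumption: reference nucleotides at positions 28880-28885 (codons 203-204 of N).
WINDOW_START = 28880
reference = "AGGGGA"

# Apply the block substitution 28881-28883: GGG>AAC.
mutated = list(reference)
for pos, base in zip(range(28881, 28884), "AAC"):
    mutated[pos - WINDOW_START] = base
mutated = "".join(mutated)

affected_codons = sorted({codon_number(p) for p in range(28881, 28884)})
print(affected_codons, translate(reference), "->", translate(mutated))
```

Under these assumptions, the three substituted nucleotides cover the last two bases of codon 203 and the first base of codon 204, which is why a single block change yields the two-residue substitution RG> KR.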
Group 3 was distinct from the others, with 4 co-evolving mutations. Of these, the NSP6_L37F mutation variant was common (Mercatelli & Giorgi, 2020); this mutation variant has also been frequently found in the UK, USA, Australia and India. The other 3 mutations are less common and are found mostly in India and Australia. Group 4, on the other hand, is defined by the characteristic NS8_L84S mutation variant, which was designated the S type by Tang et al. (Tang et al., 2020). This mutation was later reported as the C type by another group (Forster, Forster, Renfrew, & Forster, 2020) and was clustered as the S clade by GISAID (Fuertes et al., 2020). Group 4 included 4 Bangladeshi variants, isolated from the Chittagong district in May 2020, along with 3 of the 8 strain sequences from Thailand and only one strain sequence from India.
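The grouping logic described above can be summarized in a short Python sketch. This is an illustration only, not the procedure used in this study: the function name assign_group, the normalized mutation labels (for example "N_203-204:RG>KR") and the order in which the markers are checked are all assumptions. Group 1 is not defined by a marker in this section and is therefore reported here as "unassigned", and Group 3 is flagged on NSP6_L37F alone because its other 3 co-evolving mutations are not named here.

```python
# Marker mutations named in the text; label spellings are assumed normalizations.
GROUP2_CORE = {"NSP12_P323L", "Spike_D614G"}
SUBGROUP_2A_MARKER = "N_203-204:RG>KR"
SUBGROUP_2B_MARKER = "NS3_Q57H"
GROUP3_MARKER = "NSP6_L37F"   # only one of the 4 Group 3 mutations is named
GROUP4_MARKER = "NS8_L84S"


def assign_group(mutations):
    """Assign a sequence to a group from its set of non-synonymous mutations.

    The precedence (Group 2 first, then 4, then 3) is an assumption made for
    this sketch; sequences without any named marker are left unassigned.
    """
    muts = set(mutations)
    if GROUP2_CORE <= muts:  # both core Group 2 mutations are present
        if SUBGROUP_2A_MARKER in muts:
            return "2a"
        if SUBGROUP_2B_MARKER in muts:
            return "2b"
        return "2"
    if GROUP4_MARKER in muts:
        return "4"
    if GROUP3_MARKER in muts:
        return "3"
    return "unassigned"


# Hypothetical toy inputs, not sequences from this study.
examples = {
    "toy_seq_A": {"NSP12_P323L", "Spike_D614G", "N_203-204:RG>KR"},
    "toy_seq_B": {"NSP12_P323L", "Spike_D614G", "NS3_Q57H"},
    "toy_seq_C": {"NS8_L84S"},
    "toy_seq_D": set(),
}
for name, muts in examples.items():
    print(name, assign_group(muts))
```

A sequence carrying only one of the two core Group 2 mutations is deliberately not placed in Group 2 by this sketch, since the text describes the pair as co-occurring.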
A recent study conducted with 10,014 sequences identified 13 frequent non-synonymous mutations (Mercatelli & Giorgi, 2020), while we found only 7 of them, along with 3 less common mutations, at high frequency in this region (Figure 2). Most of the spike protein mutations identified in this region were also observed in Europe and North America. Spike protein mutations with an aa substitution at position 614, found in 40% of the studied strains in this region, were also prevalent in Europe and North America. In contrast, the amino acid substitution at position 1109 of the spike protein, found in one Bangladeshi strain, was shared in the global database with one strain from Switzerland. We observed another amino acid substitution in the spike protein at position 76 (Spike_T76I) in an Indonesian strain, which was also found in two strains from West Bengal, India (data not shown). This specific amino acid substitution was identified on 55 occasions according to the global database. Among them, 49 were from Australia, suggesting that this variant might have been transmitted from Australia. Another spike protein amino acid substitution (Spike_E471Q) was found in the receptor binding domain of the spike protein. Glutamate (E) was replaced by Glutamine (Q), a conservative replacement that may not substantially affect binding to the ACE2 receptor.
Additionally, global mutation distribution statistics showed that the Spike_A829T mutation was observed in 31 sequences, all of them from Thailand (Table 2). The NSP2_I120F mutation was found in 9 of the 12 cases from Dhaka, Bangladesh, and the NSP2_D92G mutation was present in 4 out of the 5 sequences from Chittagong, Bangladesh (data not shown). These cities are separated by a distance of 250 kilometers, suggesting that viruses carrying these novel mutations were circulating in an area-specific manner.
NS mutation and phylogenetic analyses conducted through the Nextstrain database were particularly useful for taking a closer look at mutation variants and their possible routes of transmission. We found a common N_203-204: RG> KR amino acid substitution (9 out of 12 strains) in Dhaka, Bangladesh. However, instead of the common N_203-204: RG> KR amino acid substitution, a less common aa substitution at the 202nd position of the N protein (N_S202N) was observed in most (5 out of 7) strains from Chittagong. The mutation distribution database showed that strains having the trinucleotide block mutation in the N protein were prevalent in Europe and that the N_S202N mutant was found more commonly in recent strains from Saudi Arabia. Phylogenetic analysis by Nextstrain also revealed that the Chittagong strains (which belong to the Nextstrain B4 clade and Group 4 of our study) are closely related to the Saudi Arabian strains, while the Dhaka strains (A2 clade in Nextstrain, Group 2 in this study) are similar to the European ones.
The geographical heat-map (Figure 5) of these non-synonymous mutations indicates that most of these mutations were also frequently found in the UK, USA, Australia, Saudi Arabia and other European countries, revealing possible transmission routes to Southeast Asia. Phylogenetic analysis with 329 genomes from this region by Nextstrain produced a similar transmission route map (Figure 6). This study also confirmed, through phylogenetic and mutation analysis, that a high percentage of Group 2 strains in India and Bangladesh are linked to European and North American strains (A2 clade in Nextstrain analysis).
We could not analyze the strains from Maldives, Bhutan and Timor-Leste because no whole-genome sequence data of the virus were available from these countries at the time of our analysis. Among the six countries with available genome sequences of good quality, only India, Bangladesh and Indonesia have reported a higher number of SARS-CoV-2 infections. The frequencies of infection have increased exponentially since mid-April 2020. In our study, we additionally analyzed 187 sequences (Figure 4), of which 100 (53%) sequences from India, Bangladesh, Thailand, Indonesia and Sri Lanka showed the characteristic NSP12_P323L and Spike_D614G mutations, which put them in the Group 2 cluster (Nextstrain clade A2). Group 2 variants were not found earlier than the 10th of March in this study. The time plot shows that this Group 2 cluster emerged rapidly, rising from 0% in January and February to 85% in May 2020. In contrast, Group 1 strains (similar to the ancestral strain) were not found after the 1st of April, suggesting that the European and North American strains are the most recent predominant strains in this subcontinent. A study conducted in early March reported that the NSP12_P323L (14408C>T) and Spike_D614G (23403A>G) mutations were recurrent in Europe and had not been detected in Asia until then, supporting our statement (Pachetti et al., 2020). Along with other co-evolving mutations, NSP12_P323L and Spike_D614G probably provide variants with an evolutionary advantage over their ancestral types, allowing them to survive and circulate in this densely populated region.
Although a number of earlier studies hypothesized that high temperatures and high humidity could result in reduced SARS-CoV-2 transmission, the infection rate of SARS-CoV-2 is already increasing in this subcontinent. Given that the European and North American variants (Nextstrain clade A2) are emerging rapidly and that winter is approaching, the next wave of SARS-CoV-2 may take place in Southeast Asia.