INTRODUCTION
The vast biodiversity on Earth is the result of billions of years of evolution. All the evolutionary lineages that make up the tree of life belong to three domains, Archaea, Bacteria, and Eukaryota, along with a fourth contested category, the Viruses. Organisms across the tree of life have evolved and adapted to inhabit diverse environments on Earth. Widely accepted studies estimate that about 8.7 million (±1.3 million) species of eukaryotes (Mora et al., 2011) and up to a trillion species of microbes (Locey & Lennon, 2016) exist on Earth. Given these estimates and the completeness of the Encyclopedia of Life database (Parr et al., 2014), the majority of eukaryotic diversity and most of the microbial diversity remain unknown to science despite 250 years of scientific exploration. At the current rate of novel species discovery, it would take hundreds of years to catalogue the eukaryotic species alone (Costello et al., 2013). However, there is an impending threat of a sixth mass extinction due to various anthropogenic factors such as pollution, land-use change, habitat loss, poaching, and climate change (Ceballos et al., 2020). The population sizes of many species have dropped significantly, and species extinction rates have increased hundreds to thousands of times compared to the background rate (Ceballos et al., 2015, 2017). Extinction of species is irreversible and may have long-lasting effects on ecosystem functioning and services. A global assessment report by the U.N. estimated that up to a million eukaryotic species may be threatened with extinction and might go extinct within the next few decades (Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services, 2019). Under the current scenario, many species may go extinct even before being catalogued in the Encyclopedia of Life. It has therefore become imperative to assess and monitor biodiversity at a larger scale than ever before to chart conservation policies and guide management strategies.
Classical biomonitoring techniques are time-consuming and resource-intensive, require manual identification of specimens, and are difficult to deploy at scale. Environmental DNA (eDNA)-based molecular methods offer several advantages over classical biomonitoring methods (Thomsen & Willerslev, 2015). eDNA-based biomonitoring techniques detect the presence of various taxa in an ecosystem using DNA extracted directly from whole environmental samples (e.g., water, soil, air) (Taberlet et al., 2012a). The last decade witnessed tremendous strides in the methodological development of eDNA-based biomonitoring techniques (Seymour, 2019). Alongside these technical advances, there has been considerable effort to understand the ecology of eDNA (Barnes & Turner, 2016; Stewart, 2019) and to clearly define the term eDNA (Pawlowski et al., 2020; Rodriguez-Ezpeleta et al., 2020). By exploiting the various sources of DNA in an environmental sample (Figure 1), eDNA-based biomonitoring has emerged as a powerful technique that has revolutionised the way we survey ecological communities (Deiner et al., 2017). Numerous comparative studies have concluded that eDNA-based biomonitoring could complement, or even potentially replace, classical biomonitoring methods in the future (Leempoel et al., 2020; Piggott et al., 2020).
eDNA-based biomonitoring offers scalability at four levels: physical, economic, biological, and ecological. First, collecting eDNA samples requires considerably less effort than most classical biomonitoring techniques. For example, filtering water samples to detect fish communities requires significantly less time and resources than surveying with gill nets or electrofishing. Such physical scalability in sampling enables the collection of a large number of samples covering entire ecosystems with minimal effort (e.g., West et al., 2021). Second, the high-throughput nature of the molecular methods used in eDNA-based biomonitoring, such as quantitative PCR and next-generation sequencing, reduces the economic cost incurred per sample as the scale of the project increases. Third, eDNA samples typically contain multiple sources of DNA from many different organisms across the tree of life (Figure 1) (Barnes & Turner, 2016; Torti et al., 2015). For instance, a bulk sediment sample can contain microorganisms, invertebrates, extracellular DNA, and other biological particles of various sizes from organisms across a wide range of taxa. Thus, eDNA samples are biologically scalable in the sense that a single sample can be used to target diverse taxa, and the same sample can be repurposed to detect a different set of organisms, eliminating the need for repeated sampling (Dysthe et al., 2018). Fourth, the different types of environmental DNA present in a sample offer ecological scalability by providing a wide range of spatio-temporal resolutions of biodiversity for various applications (Figure 1) (Bohmann et al., 2014). For example, organismal DNA is best suited for studying the functional ecology of microbial communities, while extra-organismal and extracellular DNA are considered relic DNA that obscures the desired functional signal (Carini et al., 2016; Lennon et al., 2018). In contrast, extra-organismal DNA, with a moderate temporal resolution of a few days, is suitable for time-sensitive applications such as detecting invasive species where direct sampling of organisms is difficult (Thomas et al., 2020). Further, extracellular DNA bound to suspended surface-reactive soil particles, with a temporal resolution of a few weeks to a few months, might benefit long-term biomonitoring with seasonal or annual sampling strategies (Nagler et al., 2018). Finally, extracellular DNA preserved in deep sediments, with a temporal resolution ranging from years to centuries, can be used to reconstruct past and paleo-communities (Bálint et al., 2018). Such flexibility and scalability make eDNA-based techniques suitable for the very large-scale biomonitoring programs required to keep watch over fast-changing environments in the current Anthropocene.
The methodologies employed in eDNA-based biomonitoring are currently limited to targeted approaches in which a specific species or a group of related taxa is detected. Routinely, quantitative PCR assays are used to sensitively detect single species, and universal PCR-based metabarcoding is used to detect groups of related taxa sharing a common barcoding marker. Recently, PCR-free approaches such as hybridisation capture sequencing have been employed to target a single species (Jensen et al., 2020) and multiple species (Seeber et al., 2019). The scientific community readily adopted the targeted approaches of eDNA-based biomonitoring for two main reasons. First, the idea of targeted monitoring of a single species or a group of related taxa mirrored the existing classical monitoring of a single invasive species, a keystone species, or communities of fishes and macro-invertebrates (Seymour, 2019). Second, large reference databases were available for standardised DNA barcodes (e.g., COI, rbcL-matK, ITS) that are conserved across broad taxonomic groups (e.g., entire kingdoms) yet provide high taxonomic resolution between closely related taxa (Meiklejohn et al., 2019). The International Barcode of Life consortium is amassing millions of DNA barcodes from a wide range of eukaryotic species and plans to complete the DNA barcode library for all multicellular organisms as part of a planetary biodiversity mission in the coming decades. Nevertheless, several disadvantages and limitations inherent to targeted approaches restrict their applicability to the next frontier of eDNA-based biomonitoring, known as next-generation biomonitoring (Bohan et al., 2017). Next-generation biomonitoring aims to push the current limits of classical biomonitoring by incorporating a broader spectrum of ecological organisational levels, from genes to meta-ecosystems, into the eDNA-based biomonitoring framework (Makiola et al., 2020). Since an ecosystem is a continually interacting web of life (Bascompte, 2009), next-generation biomonitoring takes a holistic approach, inferring network properties from large-scale multi-layered ecological networks derived using machine-learning approaches (Derocles et al., 2018). The network properties are then directly linked to ecosystem functions and services, rather than relying on simple biodiversity metrics such as alpha, beta, and gamma diversity to detect changes in the ecosystem. The key factor limiting the use of targeted approaches in next-generation biomonitoring is their inability to retrieve all the ecological information that can, in principle, be obtained from an environmental sample. For instance, because standard DNA barcodes differ between major groups of organisms such as prokaryotes, protists, fungi, plants, and animals, a targeted approach can never determine the relative abundances of all the organisms across the tree of life from an environmental sample in a single assay. Such holistic data are necessary for building ecosystem-level ecological networks that incorporate trophic and other types of interactions between species separated by large evolutionary distances. Further, the lack of data from non-barcoding genes limits the use of targeted approaches for determining the functional potential of communities and monitoring their genomic diversity.
Finally, various known technical biases, such as DNA extraction bias, amplification bias, marker bias, and primer bias, obscure the original diversity and relative abundances when a wide range of taxa is targeted (van der Loos & Nijland, 2020).
An untargeted and unbiased detection approach that preserves the underlying structure of biodiversity, in terms of the relative abundance of biomass across the tree of life, is fundamental to reaping the full benefits of eDNA-based next-generation biomonitoring programs. Taberlet et al. were the first to envision that a metagenomic approach, performing environmental shotgun sequencing on extremely high-throughput sequencing platforms, could yield billions of DNA sequences per sample that can be used to monitor biodiversity across the tree of life and overcome the biases and limitations of targeted approaches (Taberlet et al., 2012b). Due to the absence of targeted enrichment steps such as PCR and hybridisation capture, metagenomic approaches provide an unbiased representation of the input library of environmental DNA sequences. The proportion of metagenomic sequences that can be taxonomically annotated depends directly on the completeness of reference databases containing whole-genome sequences. Currently, DNA barcode databases have far higher species coverage than whole-genome databases, which is why metabarcoding is presently preferred over metagenomics (Stat et al., 2017). Nevertheless, an international moonshot initiative in biology, the Earth BioGenome Project, is set to change the scenario of incomplete genome databases by generating genomic resources for all known eukaryotic species (about 1.5 million) in a record time of about a decade (Lewin et al., 2018). Several large-scale genome sequencing initiatives have joined this massive effort targeting a wide variety of taxa (https://www.earthbiogenome.org/affiliated-project-networks). The Earth BioGenome Project has adopted a phylogenomic approach: a representative reference genome will be generated for each family in the eukaryotic lineage in the first phase of the project, followed by representative genomes for every genus in the second phase, and finally every known species in the third phase. In the near future, with the progress and completion of various genome sequencing initiatives, the availability of reference sequences will no longer be a barrier to adopting metagenomic approaches. Instead, technical challenges in the field, the wet lab, and the dry lab will be the limiting factor for transitioning from targeted biomonitoring to untargeted next-generation biomonitoring across the tree of life. Therefore, we aimed to develop a futuristic biomonitoring pipeline for the Earth BioGenome era, from sampling to data analysis, that can benefit from the growing number of whole-genome sequences.
Based on our literature survey, we identified five main challenges encountered while monitoring biodiversity across the tree of life in large ecosystems using a metagenomic approach: (i) the collection of a large number of samples to obtain an unbiased representation of the biodiversity of a vast area (e.g., the Tara Oceans project (Sunagawa et al., 2020)); (ii) DNA extraction bias due to differences in lysis efficiencies between cell types from a wide range of taxa (Djurhuus et al., 2017); (iii) a low proportion of overall taxonomic assignment of reads due to incomplete reference databases (e.g., Singer et al., 2020); (iv) the overwhelming proportion of microbial relative to macrobial taxa due to the high abundance of microbes in environmental samples (e.g., Stat et al., 2017); and (v) a high number of uninformative duplicate reads and artefacts introduced by PCR amplification of sequencing libraries (Kebschull & Zador, 2015). We followed a top-down approach to alleviate these issues by adopting logically deduced solutions at every step of the biomonitoring pipeline: sample collection, environmental DNA extraction, library preparation, sequencing, and bioinformatics. First, we targeted the extracellular DNA in the environmental samples using large-volume water filtration and a customised DNA extraction protocol. Second, we used a completely PCR-free library preparation method to eliminate PCR-induced artefacts and duplicate reads. Third, we pushed the limits of sequencing by performing ultra-deep sequencing of the PCR-free libraries, obtaining billions of paired-end reads per sample to detect low-abundance macrobial taxa. Fourth, we adapted a pseudo-taxonomic approach of read assignment to approximate taxon richness when the target organisms are underrepresented in the reference database. Fifth, we used incidence-based community analysis rather than abundance-based statistics to minimise the overshadowing effect of microbial taxa on macrobial taxa. In this paper, we demonstrate the feasibility and reproducibility of our novel metagenomic workflow (Figure 2) for effectively monitoring biodiversity across the tree of life in large aquatic ecosystems in a pilot-scale experimental setup. Large-scale implementation of such an untargeted and unbiased approach provides molecular ecologists with a powerful tool to implement next-generation biomonitoring programs in the Earth BioGenome era.
MATERIALS AND METHODS
Pilot-scale study design
To test our methodology for effective monitoring of biodiversity across the tree of life in a large ecosystem, we selected Chilika, a highly biodiverse tropical lagoon ecosystem on the east coast of India. Chilika is the second largest brackish-water lagoon in the world, spanning about 1100 sq. km, with a unique community assemblage consisting of marine organisms from the Bay of Bengal, the north-eastern part of the Indian Ocean, and freshwater species from the tributaries of the Mahanadi river, a major river system in east-central India (Sarkar et al., 2012). We designed a pilot-scale, spatially replicated sampling strategy to demonstrate the feasibility and reproducibility of our metagenomic approach. We selected three equally spaced geolocated stations (S27, S28, S29) on a 10 km transect in the central sector of the lagoon (Figure 3). We targeted the extracellular environmental DNA in the water (Figure 1) at the sampling locations. Extracellular DNA, with its very high spatio-temporal resolution, is an excellent choice for seasonal or annual biomonitoring programs in vast ecosystems where obtaining the maximum information with minimal sampling is critical. Since Chilika is a shallow lagoon with an average depth of about 2 metres (Sarkar et al., 2012) and experiences strong coastal winds, vertical stratification of the water column is minimal, and sampling the surface water for extracellular DNA suffices for the objective of this study.
Large-volume sample collection
Since extracellular DNA makes up only a fraction of the total eDNA that can be extracted from a water sample (Torti et al., 2015), large volumes of water need to be filtered to obtain a sufficient yield of DNA for PCR-free approaches, which require larger amounts of input DNA owing to the absence of an amplification step. We selected a pore size of 0.45 µm instead of 0.22 µm to allow a greater volume of water to be filtered before clogging, given the high turbidity observed at the sampling locations. The selected pore size also retains suspended soil particles that potentially bind extracellular DNA, such as clay (<2 µm), silt (2–50 µm), and sand (50 µm–2 mm) (Nagler et al., 2018). Further, we used a 47 mm filter membrane made of mixed cellulose ester (MCE), which has been shown to bind the free form of extracellular DNA due to its chemical affinity (Liang & Keeley, 2013). We filtered about 10 litres of water at each geolocated sampling station on 28th December 2019 using the integrated eDNA sampler from Smith-Root Inc. (Thomas et al., 2018). A triplicate sampling module was used to maximise the rate and volume of water sampled per location. We set a maximum vacuum pressure of 10 psi to minimise cell lysis during filtration and maintained a flow rate of less than one litre per minute. The sampling system also avoids the risk of sample contamination by utilising sterile, single-use, self-preserving filter holders that are replaced at every sampling location. Thus, we did not use any negative filtration controls during sampling. We transported the samples to the laboratory at room temperature in the dark for DNA extraction. The self-preserving filter holder is made of a desiccating plastic material that completely removes any traces of water and preserves eDNA for several weeks at room temperature (Thomas et al., 2019).
Extracellular DNA enrichment
We adapted and modified a modular, lysis-free, phosphate buffer-based DNA extraction protocol from the literature (Lever et al., 2015; Liang & Keeley, 2013; Taberlet et al., 2012c) to enrich the extracellular DNA and minimise the proportion of organismal and extra-organismal DNA (Figure 1) extracted from the filter membranes. The main principle of the extraction protocol is to desorb the extracellular DNA bound to the surface of the MCE filter membrane and soil particles, without lysing intact cellular particles, using a saturated phosphate buffer. The phosphate groups from the buffer compete with the phosphate groups of the extracellular DNA bound to the surface of soil particles via cation bridging and desorb the DNA by chemical displacement (Taberlet et al., 2012c). The desorbed DNA is then isolated through a column-based DNA isolation protocol using reagents and columns from the NucleoSpin soil kit (Macherey-Nagel, Germany). Within a week of sample collection, we performed extracellular DNA extraction from the filter membranes in a clean lab. The bench surfaces were wiped with 50% diluted commercial bleach, followed by distilled water and 70% ethanol. We used filter tips to pipette all liquids during the extraction to avoid aerosol contamination. The phosphate buffer was freshly prepared before extraction by mixing 0.197 g of NaH2PO4 and 1.47 g of Na2HPO4 in 100 ml of DNA-free water (0.12 M, pH 8). Filter membranes were carefully taken out of the filter holders and rolled using sterile forceps before being placed into 15 ml Falcon tubes containing 5 ml of phosphate buffer and large ceramic beads (0.6–0.8 mm) from the NucleoSpin soil kit. The tubes were shaken for 10–15 minutes on a vortex mixer fitted with a vertical tube-holder module. The large ceramic beads help homogenise soil clumps recalcitrant to the desorption process without disrupting intact cells; this is principally different from the bead-beating employed in microbiology to lyse cells. We took care not to exceed 15 minutes of mixing to avoid co-extracting a large proportion of humic acids. The homogenised mixture was immediately centrifuged at 11,000 × g to precipitate particulate matter, and the supernatant was passed through the NucleoSpin inhibitor removal column. The flow-through was then adjusted to DNA-binding conditions using the binding buffer and passed through the silica column. DNA was eluted in 150 µl of Tris-EDTA buffer from the NucleoSpin soil kit. We did not include negative controls during extraction, since the high amount of input DNA required for PCR-free library preparation cannot be obtained from negative extraction controls.
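As a rough sanity check on the buffer recipe, the expected molarity and pH can be recomputed from the salt masses. The following is a minimal sketch, not part of the protocol itself: the molecular weights assume anhydrous salts, and the pH estimate uses the Henderson-Hasselbalch relation with pKa2 = 7.21, both of which are assumptions introduced here for illustration.

```python
import math

# Phosphate buffer recipe used for desorption (0.12 M, pH 8):
# 0.197 g NaH2PO4 and 1.47 g Na2HPO4 in 100 ml of DNA-free water.
MW_NAH2PO4 = 119.98   # g/mol, anhydrous monosodium phosphate (assumed)
MW_NA2HPO4 = 141.96   # g/mol, anhydrous disodium phosphate (assumed)
PKA2 = 7.21           # second dissociation constant of phosphoric acid

acid_mol = 0.197 / MW_NAH2PO4    # ~0.00164 mol H2PO4-
base_mol = 1.47 / MW_NA2HPO4     # ~0.01036 mol HPO4(2-)
volume_l = 0.100                 # 100 ml

molarity = (acid_mol + base_mol) / volume_l
ph = PKA2 + math.log10(base_mol / acid_mol)   # Henderson-Hasselbalch estimate

print(f"total phosphate: {molarity:.2f} M")   # ~0.12 M
print(f"estimated pH:    {ph:.1f}")           # ~8.0
```

Both computed values agree with the nominal 0.12 M, pH 8 buffer described above.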
PCR-free library preparation and ultra-deep sequencing
We quantified the extracellular DNA extract using a high-sensitivity double-stranded DNA assay on a Qubit 4 fluorometer (Thermo Fisher Scientific, USA). The DNA was diluted to 20 ng/µl, and one microgram was taken as input for library preparation. We chose the Illumina TruSeq DNA PCR-free library preparation method to avoid PCR-induced artefacts such as substitutions, indels, and chimaeras, and to drastically reduce the uninformative duplicate reads that arise from library amplification. We did not include technical replicates for library preparation, since the completely PCR-free workflow involves no amplification stochasticity; moreover, the spatially replicated sampling strategy allows us to test the reproducibility of our approach. The input DNA was first randomly sheared into 350 bp fragments using a Covaris ultrasonicator. The ends of the fragmented DNA were repaired and dA-tailed prior to ligation of unique dual (UD) indexing adaptors for Illumina (IDT, USA). The UD indices significantly reduce the tag jumps that cause cross-contamination across multiplexed samples. The adaptor-ligated library fragments were size-selected and purified with SPRI beads (Beckman Coulter, USA). The fragment sizes of the libraries were verified using an Agilent Bioanalyzer high-sensitivity DNA chip. The libraries were quantified using a qPCR library validation kit (Takara Bio, USA), and the concentrations were adjusted before pooling with other libraries. We adjusted the relative concentrations of our libraries to yield billions of paired-end reads per sample, enabling the detection of low-abundance macroorganisms. The pooled libraries were denatured into single-stranded DNA before loading onto a patterned flow cell and sequenced for 300 cycles on the NovaSeq platform. The NovaSeq 6000 is the latest series of next-generation sequencers from Illumina, offering drastically improved short-read quality and the extremely high-throughput sequencing capability suited to our deep sequencing requirements.
Sequence analysis strategy
The taxonomic annotation of metagenomic reads depends directly on the availability of reference genome sequences from the target organisms, the sensitivity of the algorithm used to detect homology, and the taxonomic resolution of the genomic loci. Currently, the proportion of taxonomically classifiable reads is very low because only a small fraction of all known species have their genomes assembled, annotated, and archived in public nucleotide sequence databases. Further, different regions of the genome provide variable taxonomic resolution depending on the sequence complexity, mutation rate, selection pressure, recombination, and evolutionary history of the species (Coissac et al., 2016). In general, the proportion of taxonomically classified metagenomic reads declines as we move from coarser taxonomic ranks towards species-level resolution. Under these circumstances, the Lowest Common Ancestor (LCA) algorithm, originally implemented in MEGAN (Huson et al., 2007), provides a robust taxonomic assignment for queried sequences. The LCA algorithm requires the output of a sensitive search algorithm such as BLAST to assign a taxonomic label to the query sequence. However, alignment-based homology detection algorithms such as BLAST are prohibitively slow for querying billions of reads against large reference databases. Alternative alignment-free k-mer-based algorithms such as KRAKEN (Wood et al., 2019) are thousands of times faster than BLAST but far less sensitive and cannot detect homology between divergent species (Lindgreen et al., 2016). Therefore, we selected a hybrid search algorithm, MMseqs2 (Steinegger & Söding, 2017), which first prefilters database hits for each query using shared short k-mers (k = 15) and then uses a vectorised Smith-Waterman alignment to infer homology between the query reads and the database hits. MMseqs2 also provides the flexibility to fine-tune the sensitivity parameter based on the available computing resources, the size of the query datasets, and the size of the reference database.
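To make the LCA principle concrete, the sketch below computes the lowest common ancestor of a set of hit taxa over a toy parent-pointer taxonomy. This is a minimal illustration of the idea, not the MEGAN or MMseqs2 implementation; the taxon names and tree structure are invented for the example.

```python
# Minimal LCA over a parent-pointer taxonomy (toy example).
# Each taxon maps to its parent; the root maps to itself.
PARENT = {
    "root": "root",
    "Eukaryota": "root", "Chordata": "Eukaryota",
    "Actinopteri": "Chordata", "Mammalia": "Chordata",
    "Mugil": "Actinopteri", "Lates": "Actinopteri",
}

def lineage(taxon):
    """Return the path from a taxon up to the root."""
    path = [taxon]
    while PARENT[taxon] != taxon:
        taxon = PARENT[taxon]
        path.append(taxon)
    return path

def lowest_common_ancestor(taxa):
    """LCA of all hit taxa: the deepest node shared by every lineage."""
    paths = [lineage(t) for t in taxa]
    shared = set(paths[0]).intersection(*map(set, paths[1:]))
    # Walking up from any taxon, the first shared node is the LCA.
    return next(node for node in paths[0] if node in shared)

print(lowest_common_ancestor(["Mugil", "Lates"]))      # Actinopteri
print(lowest_common_ancestor(["Mugil", "Mammalia"]))   # Chordata
```

Hits confined to one genus-rich clade thus yield a fine-grained label, while hits scattered across distant lineages collapse to a coarse rank.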
Pseudo-taxonomic assignment of reads
We adapted a pseudo-taxonomic assignment approach using the dual BLAST lowest common ancestor algorithm (2bLCA) (Hingamp et al., 2013) as implemented in MMseqs2 (Steinegger & Söding, 2017). When the reference database does not contain a sequence from the target species, the algorithm finds the best hit at a homologous locus from the nearest sequenced species. The taxonomic resolution of the locus is then assessed by querying the aligned region of the best hit against the database. If the locus has not been sequenced in any related species, the taxonomic label of the best hit is assigned to the original query; otherwise, the LCA of all hits significant relative to the best hit (in terms of E-value) is calculated. Currently, owing to incomplete reference genome databases, most taxonomic labels represent pseudo-taxonomic assignments rather than the real taxonomic identity of the query sequence. Therefore, the taxonomic labels of metagenomic reads cannot be considered a direct indicator of species presence in the sampled environment; in most cases, they may indicate only the presence of a related species that is not represented in the database. We constructed the reference database using the non-redundant nucleotide sequences (nt) and their taxonomic labels from the International Nucleotide Sequence Database Collaboration (INSDC), downloaded on 26th October 2020 from the NCBI FTP servers. The nt database also includes annotated regions from NCBI GenBank and RefSeq genome assemblies. We removed sequences shorter than 100 bp or longer than 1 Mb, as well as repetitive sequences identified by sequence annotation keywords (satellite, repeat, repetitive, transposon, intersperse, tandem). The remaining sequences were clustered at 99% identity to reduce redundancy using LINCLUST (Steinegger & Söding, 2018). We masked low-complexity regions in the representative sequence of each cluster using the DUSTMASKER algorithm implemented in BLAST+ (Camacho et al., 2009). The raw sequencing reads were also quality-trimmed and low-complexity filtered using FASTP (Chen et al., 2018) prior to taxonomic assignment. We used the easy-taxonomy workflow implemented in MMseqs2 to taxonomically annotate the reads with the 2bLCA algorithm (Hingamp et al., 2013), applying an E-value threshold of 1e-5, a minimum sequence identity of 60%, and a minimum query coverage of 80%. We obtained the 2bLCA taxonomic assignments for each read of a pair independently and then calculated the LCA for every pair. To reduce false-positive assignments, we filtered out very low-frequency taxa at the species, genus, and family levels, retaining a taxon only if its read count reached the higher of two thresholds: 1% of the reads of its parent taxon, or ten reads.
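This final filtering rule can be stated compactly in code. The following is a minimal sketch of the rule exactly as described above; the read counts in the usage examples are invented for illustration.

```python
def passes_filter(child_reads: int, parent_reads: int) -> bool:
    """Retain a species/genus/family-level taxon only if its read count
    reaches the higher of two thresholds: 1% of the parent taxon's
    reads, or ten reads."""
    threshold = max(0.01 * parent_reads, 10)
    return child_reads >= threshold

# Invented counts for illustration:
print(passes_filter(12_000, 1_000_000))  # True:  12,000 >= max(10,000, 10)
print(passes_filter(50, 10_000))         # False: 50 < max(100, 10)
print(passes_filter(8, 500))             # False: 8 < max(5, 10)
```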
Incidence-based community analysis
The number of reads assigned to each taxon (copy number) depends on the abundance of the taxon in the environment, its genome size, and the fraction of its genome sequence available in the reference database. Thus, raw read counts do not represent the true abundance of a taxon when the reference database is incomplete. Moreover, the high abundance and diversity of microorganisms in the environment inevitably cause overrepresentation of microbial taxa and obscure the taxonomic signal from macrobial taxa. Therefore, we adapted an incidence-based statistical framework instead of standard abundance-based community analysis (Colwell et al., 2012). We divided each metagenomic sample containing billions of reads into multiple sampling units of 100 million reads each. In each sampling unit, we obtained the counts of reads that were taxonomically classified at least to the genus level. We restricted our analysis to the genus level because the proportion of taxonomically misidentified sequences in the GenBank database is estimated to be less than 1% at the genus level (Leray et al., 2019) but considerably higher at the species level (Locatelli et al., 2020). We then estimated the asymptotic genus richness of various taxa in each sample (Hill number of order 0) by statistical extrapolation of genus accumulation curves built from the observed incidences of different genera across the sampling units (Chao et al., 2014). The interpolation and extrapolation of genus accumulation curves were performed using the iNEXT R package with 100 bootstrap replicates (Hsieh et al., 2016). We then assessed sample completeness as a function of sequencing depth to estimate the optimal depth required to detect maximum diversity across various taxa. Finally, we measured the community similarity among the three spatially replicated samples based on the numbers of unique and shared taxa using the SpadeR R package with 100 bootstrap replicates (Chao & Jost, 2015). SpadeR estimates the Jaccard and Sørensen indices while accounting for differences in sequencing depth among samples.
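For intuition, incidence-based richness estimation can be illustrated with the classic bias-corrected Chao2 estimator, which extrapolates total richness from the numbers of taxa observed in exactly one and exactly two sampling units. This is a standalone sketch, not the iNEXT implementation used in our analysis, and the incidence data below are invented.

```python
def chao2(units):
    """Bias-corrected Chao2 richness estimate. `units` is a list of sets,
    one per sampling unit (here, a block of 100 million reads), each
    holding the genera detected in that unit."""
    T = len(units)                                  # number of sampling units
    observed = set().union(*units)
    freq = {g: sum(g in u for u in units) for g in observed}
    q1 = sum(1 for f in freq.values() if f == 1)    # genera in exactly 1 unit
    q2 = sum(1 for f in freq.values() if f == 2)    # genera in exactly 2 units
    return len(observed) + ((T - 1) / T) * q1 * (q1 - 1) / (2 * (q2 + 1))

# Invented incidence data: 4 sampling units, 5 observed genera.
units = [{"A", "B"}, {"A", "C"}, {"A", "B", "D"}, {"A", "E"}]
print(round(chao2(units), 2))  # 6.12: rare genera inflate the estimate above 5
```

Genera seen in only one or two sampling units signal that further units (i.e., deeper sequencing) would reveal additional genera, which is exactly what the asymptotic estimate captures.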
RESULTS
Biomonitoring across the tree of life
We generated a total of 851.43 gigabases (Gb) of sequencing data from the three spatially replicated extracellular environmental DNA samples. Upon demultiplexing, we obtained about 2.6 billion paired-end reads from the Station 29 sample (S29) and 1.5 billion paired-end reads each from the Station 28 (S28) and Station 27 (S27) samples. After quality and low-complexity filtering, we retained an average of 98.19% (SD 0.39%) of reads for taxonomic assignment. Since we used a completely PCR-free library preparation method, the average duplication rate was only 6.65% (SD 0.31%), arising from optical duplicates and sequencing of the complementary strand. We derived 26 sampling units of 100 million reads from sample S29 and 14 sampling units each from S28 and S27, which were analysed separately to determine taxonomic composition. We obtained a set of 35.5 million reference sequences by filtering and clustering about 60.9 million nucleotide sequences from the INSDC at 99% sequence identity. By searching the high-quality reads against the constructed reference database using MMseqs2, we annotated on average 8.71% (SD 0.54%) of the reads to some level of the NCBI taxonomy using the 2bLCA algorithm. The paired-end structure of the reads allowed us to correct discordant taxonomic assignments using the independently derived LCA labels for each read of a pair. We obtained a total of 353.07 million pair-merged pseudo-taxonomic assignments across the three samples. The taxonomic assignments were spread across the tree of life in differing proportions: Viruses (51.41%), Archaea (0.32%), Bacteria (44.23%), and Eukaryota (4.02%). Overall, about 21.53% of the pseudo-taxonomic assignments had a resolution at least to the genus level, which provided an estimate of 1815 genera across the tree of life in our samples. Eukaryota had the highest proportion of reads classified to the genus level (45.65%), and Viruses had the lowest, with just 3.67%. Specifically, the results indicated the presence of 47 genera of Viruses, 64 genera of Archaea, 675 genera of Bacteria, and 1029 genera of Eukaryota in the sequenced samples (Figure 4). The eukaryotic genera were distributed across all the kingdoms, namely Protists (32.36%), Fungi (18.85%), Viridiplantae (18.85%), and Metazoa (29.93%) (Table 1A). Since low-abundance macroorganisms are the most difficult to detect, we inspected the number of genera observed among the commonly targeted groups of Metazoa. The 308 metazoan genera comprised 39.61% chordates and 60.39% invertebrates. The phylum Arthropoda (97 genera) and the class Actinopteri (61 genera) had the highest representation among the invertebrates and chordates, respectively. Among the less represented groups were Mammalia with 30 genera, Aves with 14 genera, Amphibia with four genera, Chondrichthyes with three genera, and Reptilia with one observed genus.
Optimal sequencing depth for biomonitoring
To estimate an optimal sequencing depth for detecting maximum diversity with minimal effort, we generated extrapolated genus accumulation curves across various taxa for each sample as a function of sequencing depth. All the accumulation curves show saturation as the sequencing depth increases beyond 1 billion paired-end reads (Figure 5). We then estimated the asymptotic genus richness of each taxon in all samples from the accumulation curves (Table 1B). The asymptotic genus richness represents the maximum diversity of a particular taxon that can be detected in a given sample. Using the asymptotic genus richness of each sample as the reference point, we then estimated sample coverage as a function of sequencing depth (Figure 6). Sample coverage is the fraction of this asymptotic richness detected by at least one read at a given depth. At the lowest depth of 100 million reads, the samples have the highest coverage for Bacteria (AVG 93.59%, SD 1.54%), followed by Viruses (AVG 87.75%, SD 1.52%), and the lowest coverage for Aves (AVG 55.51%, SD 15.03%), followed by Mammalia (AVG 60.42%, SD 14.13%). At a depth of 1 billion reads, all taxa have coverage above the 90% threshold (Table 1C), with the lowest coverage for Mammalia (AVG 91.74%, SD 13.7%). With double the sequencing effort, at 2 billion reads, the coverage increases by only 3.88% for Mammalia (AVG 95.62%, SD 7.57%), crossing the 95% threshold for all the taxa under consideration (Figure 6).
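In notation, this is simply a restatement of the definition above, not an additional analysis:

```latex
C(d) = \frac{S_{\mathrm{obs}}(d)}{\hat{S}_{\infty}}
```

where $C(d)$ is the sample coverage at sequencing depth $d$, $S_{\mathrm{obs}}(d)$ is the number of genera detected by at least one read at that depth, and $\hat{S}_{\infty}$ is the asymptotic genus richness estimated for the sample.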
Effectiveness of the pseudo-taxonomic approach
Since we adopted a pseudo-taxonomic assignment strategy to alleviate the issue of incomplete databases, the taxonomic assignments may only indicate the presence of a related species when the target species is not represented in the reference database. Our spatially replicated sampling strategy allowed us to estimate the reproducibility of the pseudo-taxonomic approach by comparing the numbers of unique and shared genera across various taxa among the samples. The community similarity among the samples was measured using the Sørensen and Jaccard indices across various taxa (Figure 7). The lowest community similarity was found among Protists, with a Sørensen index of 0.96 (Table 1D) and a Jaccard index of 0.91 (Table 1E). To further examine the effectiveness of the pseudo-taxonomic approach for detecting a low-abundance taxon that is underrepresented in the reference database, we chose as a case study the Irrawaddy dolphin (Orcaella brevirostris), an endangered flagship species of the Chilika ecosystem. Since the whole genome of the Irrawaddy dolphin has not been assembled, we examined hits to related species for which annotated regions of the genome are available in the reference database. We found 12,399 hits to Lipotes, a possibly extinct genus of Chinese river dolphin endemic to the Yangtze River in China, whose whole-genome sequence is available in the reference database. There were no other hits in our filtered taxonomy dataset to any other marine mammal in the reference database (Supporting data).
DISCUSSION
Biomonitoring across the tree of life is plausible
Our results show that PCR-free ultra-deep shotgun sequencing of extracellular environmental DNA is a promising approach for next-generation biomonitoring across the tree of life in large aquatic ecosystems. By generating one of the deepest shotgun sequencing datasets for environmental DNA to date, we push the limits of biomonitoring to detect biodiversity even from low-abundance macroorganisms in the ecosystem (Figure 4). We also achieve very low duplication rates by eliminating PCR from the laboratory workflow, which otherwise may render a considerable part of the data useless by reducing the effective depth. However, due to stringent filtering and a less sensitive search algorithm, we obtained a lower proportion of taxonomically annotated sequences (recall rate) than studies using BLAST searches (e.g., Cowart et al., 2018). This compromise in recall was necessary because searching billions of reads against large reference databases with BLAST is computationally infeasible. Nevertheless, algorithms less sensitive than BLAST, such as the MMseqs2 used in this study, are likely to miss only evolutionarily very distant homologs, not related species. Thus, most of the reads left unclassified due to lower sensitivity would likely have been assigned by the LCA algorithm only to ranks coarser than the family level, so their loss does not affect our analysis at the genus level. Conversely, our choice of analysing the paired-end reads independently and merging them after verifying their taxonomic annotations gives us greater confidence in the classification, at the cost of fewer reads assigned to the genus level, since a 150 bp read carries less information than a 300 bp merged read. Further studies are required to investigate the effect of such computational choices on the results and to optimise the computational workflow. Although Eukaryota made up only 4.02% of the total taxonomic assignments, the number of genera detected was higher in Eukaryota than in Viruses and Bacteria, which made up most of the classified reads, for three possible reasons. First, owing to the high depth of our dataset, the 4.02% of eukaryotic reads still amounted to over 14.2 million taxonomic assignments. Second, Eukaryota had the highest proportion of reads classified to the genus level, with about 6.49 million reads; such a large number of reads with high taxonomic resolution provides power to detect low-abundance macroorganisms comparable to that of targeted studies. Third, despite the high abundance of Viruses and Bacteria in the environment, more than 99% of their diversity is yet to be catalogued (Locey & Lennon, 2016). Hence, we observe a greater number of eukaryotic genera than bacterial or viral genera. Further, Viruses had the lowest proportion of reads classified to the genus level (3.67%), likely because the high mutation rates and divergence between viral lineages require protein-level comparisons to detect remote homology (Nooij et al., 2018). The low number of viral genera detected could also be due to the missing diversity of RNA viruses, which cannot be detected with DNA-based direct metagenomic methods (Zhang et al., 2019). Additionally, our results demonstrate the successful detection of a large number of genera from various groups of Metazoa, indicating the presence of detectable amounts of macro-faunal DNA in the extracellular eDNA.
The main limiting factor for detecting macroorganisms is the depth of sequencing, which determines the sensitivity towards low-abundance taxa. Indeed, a previous study with shallow depth (22.3 million reads) concluded that metagenomics is not a suitable approach for biomonitoring eukaryotes from an abundance perspective (Stat et al., 2017). In contrast, another study with greater depth (328.7 million reads) concluded that eDNA metagenomics is suitable for detecting marine benthic fauna from a presence/absence viewpoint (Cowart et al., 2018). We utilise an extremely high-throughput next-generation sequencer to generate billions of reads per sample and show that biodiversity across the tree of life, including low-abundance macroorganisms, can be reliably monitored using an incidence-based statistical framework. We detected a high diversity of commonly monitored groups such as Arthropoda and Actinopteri, consistent with the known inventories of the highly diverse species occurring in the sampled ecosystem (Suresh et al., 2018). Reptiles had the lowest diversity among the inspected taxa, likely due to their low eDNA shedding rates (Adams et al., 2019). A commonly observed silver lining in eDNA-based studies is the detection of unintended organisms, also referred to as by-catch. eDNA studies in aquatic ecosystems have demonstrated the ability to monitor even terrestrial species occurring in the catchment area (Deiner et al., 2016). Therefore, a significant proportion of the taxa detected in our study may represent terrestrial organisms inhabiting the catchment area; this is distinctly noticeable in the class Mammalia, which contains taxonomic assignments of rodents, bats, and a high number of reads from humans, dogs, cats, and domesticated farm animals (Supporting data), possibly originating from untreated sewage runoff into the ecosystem.
One billion reads may be optimal for biomonitoring
Although sequencing costs have fallen drastically over the last decade, biomonitoring projects with a large number of samples may require considerable funding to perform ultra-deep sequencing. Hence, we also determined the optimal sequencing depth required to detect maximum biodiversity with minimal sequencing effort. Since the abundance of organisms across the tree of life varies widely, we define the optimal depth as the discrete sequencing depth (in multiples of 100 million reads) at which all the examined taxa, including the low-abundance ones, cross a sample-coverage threshold (at the genus level) that cannot be raised further without disproportionate sequencing effort. In this study, we generated a total of 5.6 billion paired-end reads but obtained unequal sequencing depths across samples due to differences in library concentrations. Statistical extrapolation of the genus accumulation curves allowed us to mitigate the technical issues caused by unequal sequencing depths and to predict values up to a maximum depth of 3 billion reads per sample, the practical limit at current sequencing costs. Our results indicate that about one billion paired-end reads may be optimal for biomonitoring across the tree of life, detecting about 90% of the diversity (Figure 6). Doubling the depth from one billion to two billion reads increased the coverage threshold only to 95%, indicating saturation. Filtering false positives from low-abundance taxa with extremely small numbers of reads is challenging, especially when the number of samples is low; the presence of a low-abundance taxon in multiple spatiotemporally replicated samples can be used to differentiate it from a false positive. Currently, generating one billion paired-end reads per sample may be feasible for small to moderately sized biomonitoring programs utilising next-generation sequencers such as the Illumina NovaSeq 6000. The NovaSeq platform can deliver paired-end reads of up to 150 bp at extremely high depths in a single run, reducing the cost incurred per sample. Such sequencing depths are also now typical of various applications in biology: re-sequencing a human genome at 30x coverage requires 100 Gb of sequencing, and de novo assembly of a mammalian-sized genome at 100x coverage requires 300 Gb, whereas one billion 150 bp paired-end reads for biomonitoring requires 150 Gb. In contrast, targeted approaches such as metabarcoding may require only 10 million reads per library to achieve good sensitivity (Singer et al., 2019). Due to this relatively low depth requirement per library, metabarcoding may seem the better choice over metagenomics. However, the cost of targeted approaches also increases significantly considering the technical replicates required to alleviate PCR-induced stochasticity, the libraries with different primers per taxon needed to reduce primer bias, and the libraries with different markers needed to cover different groups of taxa for tree-of-life metabarcoding (Stat et al., 2017; van der Loos & Nijland, 2020).
For example, to implement a completely unbiased biomonitoring program across the tree of life using metabarcoding, one may need at least three technical replicates per primer pair and two different primer sets per marker for each of six groups, namely Archaea (16S, 23S), Bacteria (16S, 23S), Protists (18S, 28S), Fungi (18S, ITS), Plants (matK, rbcL), and Animals (COI, CytB), adding up to 72 libraries per location (3 technical replicates × 2 primer sets × 2 markers × 6 taxa) and requiring 720 million reads, as made explicit in the sketch below. Although this tree-of-life metabarcoding example is an extreme, higher-end case, ultra-deep shotgun sequencing of eDNA is a much simpler choice with relatively few complications and might become more attractive as sequencing costs fall further in the future.
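The back-of-the-envelope comparison can be laid out as a minimal sketch using only the figures quoted in this section; the 10-million-read-per-library metabarcoding depth follows Singer et al. (2019).

```python
# Tree-of-life metabarcoding: libraries multiply quickly per location.
replicates, primer_sets, markers, taxa_groups = 3, 2, 2, 6
libraries = replicates * primer_sets * markers * taxa_groups   # 72 libraries
metabarcoding_reads = libraries * 10_000_000                   # 720 million reads

# Untargeted metagenomics: a single PCR-free library per location,
# sequenced to the ~1 billion read optimum estimated in this study.
metagenomic_reads = 1_000_000_000

print(f"metabarcoding: {libraries} libraries, {metabarcoding_reads:,} reads")
print(f"metagenomics:  1 library,   {metagenomic_reads:,} reads")
```

The read budgets are of the same order, but the metagenomic design replaces 72 libraries with one.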
The pseudo-taxonomic approach is reproducible
Currently, for most non-model organisms whose genomes have not yet been assembled, the reference sequence databases contain only the few loci routinely used in barcoding and population genetics. Also, due to the high proportion of uncatalogued novel taxa in the environment, a large number of sequenced reads may not directly match the source taxa from which they originated. Taxonomy-free approaches are being employed in metabarcoding studies to eliminate the dependency on reference databases (Mächler et al., 2020). In marker-based targeted studies, the sequences arise from a single locus and can be clustered into operational taxonomic units (OTUs) based on sequence identity (Westcott & Schloss, 2015) or resolved into amplicon sequence variants (ASVs) based on denoised sequences (Tikhonov et al., 2015). The relative abundances of the different OTUs or ASVs then provide a taxonomy-free means of comparing diversity across samples (Callahan et al., 2017). In metagenomic studies, however, the sequences arise from loci across the genome with varying taxonomic resolutions, and higher organisms with large genomes may have only a fraction of their genome sampled; in such a scenario, a taxonomy-free approach is not feasible. Hence, we adapted a pseudo-taxonomic approach that combines the advantages of taxonomy-based and taxonomy-free approaches by being only partially dependent on reference databases: the nearest available species is chosen as the best hit when the locus from the target species is not available in the reference database. Our results from the spatially replicated samples show very high community similarity (Figure 7), indicating high reproducibility of the pseudo-taxonomic approach. To illustrate, we also provide the example of the Irrawaddy dolphin, an endangered flagship species of Chilika whose whole genome is not available in the database but which is expected to be present in our samples. Interestingly, thousands of sequences from the Irrawaddy dolphin hit the Chinese river dolphin genome in the database, even though other genomes from the family Delphinidae, which are phylogenetically closer, were present. We suspect that the annotated gene sequences from the other Delphinidae genomes were clustered with the Chinese river dolphin gene sequences, due to high sequence similarity, while reducing redundancy during database construction. The pseudo-taxonomic approach will provide hits evolutionarily closer to the source taxon as better-populated databases become available in the future.
The high similarity in community composition across the spatially replicated samples also indicates that the spatial resolution of biodiversity provided by extracellular DNA is much higher than 10 km in the sampled ecosystem. This high spatial resolution could be due to the homogeneous distribution of extracellular DNA in the water column, and it implies that extracellular DNA may be suitable for monitoring large ecosystems by effectively reducing the number of samples required to represent them. Our study also emphasises the importance of pilot-scale studies, before designing large-scale biomonitoring programs, for evaluating the spatio-temporal resolution of the targeted type of environmental DNA (Figure 1) in the ecosystem under consideration.
Limitations and conclusion
Successful application of technologies to practical problems requires a thorough understanding of their technical strengths and weaknesses. Numerous studies in the past decade have revealed the limitations of the widely used PCR-based metabarcoding technique (Elbrecht et al., 2017). Such targeted approaches may only be suitable for monitoring a particular taxon of interest, not the entire tree of life, owing to their inherent inability to provide relative abundances across major groups of organisms that have different standardised barcoding loci. Moreover, next-generation biomonitoring takes a more holistic approach, monitoring a wide range of ecological entities from genes to entire ecosystems, which requires data from across the genomes of individuals of various species spanning the tree of life (Bohan et al., 2017). Hence, advancements in metagenomic technologies are an indispensable requirement for deploying next-generation biomonitoring at a global scale (Makiola et al., 2020). Our novel metagenomic workflow (Figure 2), with adaptations at every level of the pipeline from sample collection and DNA extraction to library preparation, sequencing, and bioinformatics, is an attempt to develop futuristic technologies that can deliver the required data in an unbiased way for next-generation biomonitoring. Nevertheless, our metagenomic approach also has limitations that warrant further testing and optimisation. Although we achieved our objectives with a limited number of samples in a pilot-scale study in a model ecosystem, further testing is necessary in ecosystems where obtaining microgram quantities of extracellular DNA may be difficult (e.g., tundra ecosystems). Applicability to terrestrial ecosystems should also be rigorously tested, since extracellular DNA isolated from soil, pooled from multiple sampling plots, may only detect microbes, plants, and subterranean organisms, not large terrestrial animals; such cases may need innovative sampling strategies, for example collecting water samples from waterholes routinely visited by animals (Seeber et al., 2019). Next, obtaining one billion reads per sample may be economically prohibitive for very large-scale biomonitoring programs at current sequencing costs, and a metagenomic approach may not be widely adopted by researchers and ecosystem managers until deep sequencing becomes more affordable. Finally, our approach is currently limited to monitoring broad-scale changes in the ecosystem using pseudo-taxonomic labels at the genus level derived from uncurated reference sequence databases; for applications such as resolving fine-scale community structure at the species level, targeted approaches using curated barcode databases may be more suitable than metagenomics (Singer et al., 2020). Despite these limitations, our approach has the potential to be adapted and incorporated into large-scale next-generation biomonitoring programs in the future owing to three key features. First, it can simultaneously and effectively detect biodiversity across the tree of life in a single assay, enabling the inference of species interactions across kingdoms by tracking changes in relative abundances. Second, the multi-faceted nature of the genome-wide data allows for monitoring diversity at the gene level, mapping functional traits to specific taxa, and linking community changes to ecosystem functioning and services.
Third, by employing extracellular DNA (Figure 1), our approach provides a high spatio-temporal resolution of biodiversity that enables the monitoring of large ecosystems with significantly reduced sampling effort. Plummeting sequencing costs and the growing number of large-scale genome sequencing projects provide an excellent opportunity to further test and optimise our novel metagenomic workflow under different conditions. We believe that this study advances our understanding of the capabilities of eDNA-based metagenomics, which is necessary for a paradigm shift towards implementing large-scale next-generation biomonitoring programs across the tree of life in the Earth BioGenome era.