1. INTRODUCTION

The vast biodiversity on earth is the result of billions of years of evolution. All the evolutionary lineages that make up the tree of life belong to three domains: Archaea, Bacteria, Eukaryota, and a fourth contested category of Viruses. Organisms across the tree of life have evolved and adapted to inhabit various environments on earth. Widely accepted studies estimate that about 8.7 million (±1.3 million) eukaryotic (Mora et al., 2011) and up to a trillion species of microbes (Locey and Lennon, 2016) exist on Earth. Despite over 250 years of scientific exploration, the majority of eukaryotic diversity and most of the microbial diversity remain unknown to science (The Catalogue of Life, 2023). However, there is an impending threat of a sixth mass extinction due to anthropogenic factors such as pollution, land-use change, habitat loss, poaching, and climate change (Ceballos et al., 2020). The population sizes of many species have dropped significantly, and species extinction rates have increased hundreds of times compared to the background rate (Ceballos et al., 2017, 2015). Currently, about 28% of the 150,388 species of animals and plants assessed by the IUCN are threatened with extinction (IUCN Red List of Threatened Species, 2022). Using the IUCN Red List data, a global assessment report by the U.N. estimated that up to a million eukaryotic species may be threatened with extinction and might go extinct in the next few decades (Watson et al., 2019). The extinction of species is irreversible and may have long-lasting effects on ecosystem functions and services. Under the current scenario, many species may go extinct even before being described. Therefore, it has become imperative to assess biodiversity at larger scales than ever before to chart conservation policies, guide restoration projects, and devise management plans.
Classical biodiversity assessment techniques are time-consuming, resource-intensive, require manual identification of specimens, and are not easily scalable to deploy in large ecosystems. The advent of DNA-based identification of species using large reference databases of standardized DNA barcodes (e.g., COI, rbcL-matK, ITS) has led to the development of high-throughput methods to assess the composition of communities from bulk and environmental samples (Creer et al., 2016). Environmental DNA (eDNA)-based biodiversity assessment techniques detect the presence of species in the ecosystem using DNA extracted directly from environmental substrates (e.g., water, soil, air) without isolating the target organisms (Taberlet et al., 2012a). eDNA-based bioassessment offers several advantages over classical methods (Thomsen and Willerslev, 2015). For example, filtering water samples to detect fish communities requires less time and resources than surveying fish using trawling, gill nets, or electrofishing. Such physical scalability in sampling enables the collection of samples covering entire ecosystems with minimal effort. In addition, the same eDNA sample can be repurposed to detect a different set of organisms, eliminating the need for repeated sampling (Dysthe et al., 2018). By exploiting various sources of DNA in an environmental sample, eDNA-based bioassessment has emerged as a powerful new technique that has revolutionized the way we survey ecological communities (Deiner et al., 2017). The last decade witnessed tremendous strides in the methodological development of eDNA-based bioassessment techniques (Capo et al., 2021; Carøe and Bohmann, 2020; Hinlo et al., 2017; Ushio et al., 2022). Along with the technical advances, there has also been considerable effort to understand the ecology of eDNA (Barnes and Turner, 2016; Stewart, 2019) and to clearly define the term eDNA (Pawlowski et al., 2020; Rodriguez-Ezpeleta et al., 2021). Numerous comparative studies have concluded that eDNA-based bioassessment could complement or even potentially replace classical bioassessment methods in the future (Brantschen et al., 2021; Leempoel et al., 2020).
However, the current methodologies employed in eDNA-based bioassessment are limited to approaches where a specific species or a group of related taxa are targeted. A quantitative PCR assay is routinely used to detect a species of interest with high sensitivity using species-specific primers and a PCR-based metabarcoding is employed to detect a group of related taxa using a universal primer targeting a barcode region (Bruce et al., 2021). Recently, hybridization capture by oligonucleotide probes was also employed for the targeted detection of a single species (Jensen et al., 2021) and multiple species of interest (Seeber et al., 2019). The inherent limitations of such targeted approaches are that they only permit the assessment of organisms sharing a common barcoding marker and suffer from enrichment bias leading to considerable and unpredictable dropout of species when targeting a large number of taxa (van der Loos and Nijland, 2021, 2021). But ecosystems are a continually interacting web of life and any anthropogenic effect on one group of taxa affects many other taxa in the ecological network and influences the overall ecosystem stability (Bascompte, 2009). Hence, we must develop a more holistic approach that encompasses all the life forms inhabiting an ecosystem. Therefore, an untargeted approach that can detect organisms irrespective of taxonomic affiliation is fundamental for scaling up the biomonitoring efforts and monitoring the fast-changing environments in the Anthropocene.
In this study, we explored whether we can effectively assess the taxonomic diversity across the tree of life in an ecosystem using PCR-free approaches. Technically, shotgun sequencing of environmental DNA on high-throughput sequencing platforms could yield billions of DNA sequences that can be used to assess biodiversity in an untargeted manner and overcome the biases and limitations of targeted approaches (Taberlet et al., 2012b). Due to the absence of any target enrichment steps such as PCR with universal primers or hybridization capture with DNA/RNA probes, metagenomic approaches can provide an unbiased representation of the input library of eDNA. However, the eDNA extracted from the samples should be a good representation of the total biodiversity in the sampled location. In this regard, the ecology of eDNA encompassing the origin, state, fate, and transport of eDNA in an ecosystem has to be given due consideration (Barnes and Turner, 2016). The various sources of environmental DNA in an ecosystem offer different snapshots of biodiversity at a wide range of spatiotemporal resolutions (Fig. 1A) (Bohmann et al., 2014). The DNA released into the environment by the natural cell lysis of the organismal and extra-organismal biological entities constitutes the extracellular eDNA (Nagler et al., 2022). Once released into the environment, extracellular eDNA adsorbs onto surface-reactive soil particles through cation bridging and becomes resistant to degradation (Nagler et al., 2018). In aquatic ecosystems, the particle-bound extracellular eDNA can remain suspended and spatially dispersed in the water column until sedimentation or degradation. Therefore, we hypothesized that the extracellular eDNA is a natural repertoire of the genetic material of organisms inhabiting an ecosystem and is suitable for taxonomic diversity assessment across the tree of life over large spatiotemporal scales.
To test our hypothesis, we devised a modular workflow (Fig. 1B) from sampling to data analysis and asked the following key questions in this study: (i) Can we simultaneously detect organisms across the tree of life in a single assay using extracellular eDNA? (ii) Is it possible to estimate the total taxonomic richness of an ecosystem across the tree of life? (iii) Does extracellular eDNA provide enough spatiotemporal resolution to detect changes in biodiversity across the tree of life? Using the Chilika lagoon as a model ecosystem, we show that taxa across all the domains of life can be detected through PCR-free deep sequencing of extracellular eDNA enriched from large-volume water samples. Further, using seasonal samples, we estimate the asymptotic taxonomic richness across the tree of life and resolve the changes in biodiversity at broad spatiotemporal scales. We conclude that PCR-free deep sequencing of extracellular eDNA is an effective tool to assess the taxonomic diversity across the tree of life in large ecosystems.

2. METHODS

2.1 Model Ecosystem

To test our approach for assessing the taxonomic diversity across the tree of life, we selected Chilika Lagoon, a highly biodiverse tropical brackish water ecosystem located on the east coast of India as our model ecosystem (Supplementary Fig. 1). Chilika Lagoon was designated as India’s first Ramsar site (no. 229) of international importance in 1981. It is the second largest brackish-water lagoon in the world, extending about 64 km in length and 20 km in width and spanning about 1100 sq. km of the area during the monsoon season. As the lagoon receives freshwater and marine water, there exists a dynamic gradient of salinity ranging from 0-5 ppt in the northern sector to 5-18 ppt in the central and southern sectors. As a result, Chilika harbors a unique and diverse community assemblage consisting of marine organisms from the Bay of Bengal – the northeastern part of the Indian Ocean, and freshwater species from the tributaries of the Mahanadi River, a major river system in east-central India (Supplementary Fig. 1). It has earned an economically important status by supporting the food and livelihood of over 200,000 fisherfolk. Moreover, the fish diversity of Chilika has been well documented and a comprehensive checklist of fishes sighted in the last 100 years was recently published (Suresh et al., 2018). This checklist serves as an excellent reference to compare the results of our study. Chilika is a shallow water lagoon with an average depth of about two meters and experiences strong coastal winds. Therefore, we presumed that sampling the surface water should suffice for the objectives of this study as there is no strong vertical stratification in the water column. However, it is advisable to sample from different depths to obtain a better representation of biodiversity in the ecosystem.

2.2 Sample collection

First, we designed a pilot-scale sampling strategy to test the feasibility and reproducibility of our approach. Three equally spaced geolocated stations (S27, S28, S29) were selected on a 10 km transect in the central sector of the lagoon (Supplementary Fig. 2). Next, we designed a spatiotemporally replicated sampling strategy with seven geolocated stations (S1, S6, S14, S17, S26, S29, S30) that are spread across the lagoon to assess the taxonomic diversity (Fig. 2). The pilot-scale sample collection at the three locations was conducted in December 2019 (Winter), while the spatiotemporal sampling was conducted during March (Summer), July (Monsoon), and November (Winter) of 2020. All seven stations were sampled in the Monsoon and three randomly selected stations were sampled in Summer and Winter. As we intended to use PCR-free approaches downstream, a large volume of water had to be filtered to obtain a sufficient yield of extracellular DNA. Therefore, about 10-25 liters of water was filtered in triplicate at each of the geolocated sampling stations using the integrated eDNA sampler by Smithroot Inc. (Thomas et al., 2018). In the turbid areas, multiple filters were used to increase the volume of sampled water to at least 10 liters. Simultaneously, water temperature, salinity, and pH were also measured using a water quality sonde at every sampling station (YSI, Model No. 6600, V2). A 0.45um mixed cellulose ester (MCE) filter membrane was used for water filtration as it has been shown to bind to the free form of extracellular DNA due to its chemical affinity (Liang and Keeley, 2013). The selected pore size allows more volume of water to be filtered before clogging and also retains suspended soil particles such as clay (<2um), silt (2-50um), and sand (50um-2mm) which are adsorbed by extracellular DNA (Nagler et al., 2018). A triplicate sampling module was used to maximize the rate and volume of water sampled per location. We set a maximum vacuum pressure of 10 psi to minimize cell lysis during filtration and maintain a flow rate of less than one liter per minute. The eDNA sampling system also avoids the risk of contamination by utilizing sterile single-use self-preserving filter holders, which were replaced for every sampling location. The self-preserving filter holder comprises a desiccating plastic material that completely removes any traces of water and preserves eDNA for several weeks at room temperature (Thomas et al., 2019). Hence, we transported the filter holders to the laboratory in dark conditions at room temperature.

2.3 Extracellular eDNA enrichment

We adapted a lysis-free phosphate buffer-based DNA extraction protocol to enrich the extracellular eDNA from the filter membranes and minimize the proportion of organismal and extra-organismal DNA (Lever et al., 2015; Liang and Keeley, 2013; Taberlet et al., 2012c). The main principle of the extraction protocol is to desorb the extracellular DNA bound to the surface of the MCE filter membrane and soil particles without lysing the intact cellular and subcellular particles using the saturated phosphate buffer (Nagler et al., 2022). The phosphate groups from the buffer compete with the phosphate groups of the extracellular DNA bound to the surface of soil particles via cation bridging and desorb the DNA by chemical displacement (Taberlet et al., 2012c). The desorbed DNA is then isolated through a column-based DNA isolation protocol using reagents and columns from the Nucleospin soil kit (Macherey-Nagel, Germany). We performed extracellular DNA extraction from the filter membranes in a clean lab within a week of sample collection. The bench surfaces were wiped with 50% diluted commercial bleach, followed by distilled water and 70% ethanol. Filtered tips were used to pipette all the liquids to avoid any aerosol contamination during the extraction. The phosphate buffer was freshly prepared before extraction by mixing 0.197g of NaH2PO4 and 1.47g of Na2HPO4 in 100 ml of DNA-free water (0.12M, pH 8). Filter membranes were carefully taken out of the filter holders and rolled using sterile forceps before placing them into 15 ml falcon tubes containing 5 ml phosphate buffer and large ceramic beads (0.6-0.8mm) from the Nucleospin soil kit. The falcon tubes were shaken for 10-15 minutes by placing them on a vortex mixer with the vertical falcon holder module. The large ceramic beads help homogenize soil clumps recalcitrant to the desorption process but do not disrupt the intact cells. This process is principally different from bead-beating employed in microbiology to lyse the cells. We did not exceed the time of mixing beyond 15 minutes to avoid co-extracting a large proportion of humic acids. The homogenized mixture was immediately centrifuged at 11000 x g to precipitate particulate matter, and the supernatant was passed through the Nucleospin inhibitor removal column. The DNA-binding condition of the flow-through was then adjusted using the binding buffer and then passed through the silica column from the Nucleospin soil kit. DNA was eluted using 150 μl of warm Tris EDTA buffer with three successive elution. We quantified the extracellular eDNA elute using a high-sensitivity double-stranded DNA assay in Qubit 4 (Thermo Fisher Scientific, USA). Samples with less than 20 ng/μl concentration were concentrated until the volume decreased to about 50 μl using the SpeedVac vacuum concentrator (Thermo Fisher Scientific, USA).

2.4 Library preparation

The DNA concentration of the samples was diluted to 20ng/ul, and one microgram of DNA was taken as input for the library preparation. We chose the Illumina Truseq DNA PCR-free library preparation method to avoid PCR-induced artifacts such as substitutions, indels, and chimeras. It also helps to drastically reduce the uninformative duplicate reads that arise from library amplification. We did not include technical replicates for library preparation since there is no stochasticity from amplification due to a completely PCR-free workflow. The input DNA was first randomly sheared into 350 bp fragments using the Covaris ultrasonicator. The ends of the fragmented DNA were repaired and dA-tailed before ligation using unique dual index (UDI) adapters for Illumina (IDT, USA). The adapter-ligated library fragments were size-selected and purified with SPRI beads (Beckman Coulter, USA). The fragment sizes of the libraries were verified using the Agilent Bioanalyzer high-sensitivity DNA chip. It is to be noted that there is no amplification of libraries post-ligation. Therefore, the concentration of libraries was determined using a library quantification kit consisting of known concentrations of standards (Takara Bio, USA) on the ViiA7 real-time quantitative PCR (Applied Biosystems, USA). The failed libraries with less than 1nM concentration, possibly due to residual inhibitors in the input DNA, were excluded from sequencing.

2.5 Sequencing and quality control

We first performed a sequencing saturation analysis to determine the library complexity and target sequencing depth using a sample of 10 million reads from each of the three pilot-scale study samples. Only the forward (R1) pair of the reads were used to count the k-mers. We generated a k-mer frequency histogram with the KMERCOUNTEXACT command in the BBTOOLS package (Bushnell, 2022) using 31 bp k-mers having at least a 95% probability of correctness based on the base quality scores. Using the LC_EXTRAP command in PRESEQ v.3.2, the library complexity was extrapolated using a rational function approximation via continued fractions (Daley and Smith, 2013). The saturation of unique k-mers was estimated with 100 bootstraps, a step size of 100 million k-mers, and a maximum extrapolation of up to 100 billion k-mers. We determined the saturation point where the k-mer uniqueness reached a minimum threshold of 5%. The target sequencing depth was then calculated as the saturation point divided by the number of 31 bp k-mers in a 150 bp read. The concentrations of the extracellular eDNA libraries passing QC were then adjusted to achieve the target sequencing depth and pooled together. The pooled libraries were denatured into single-stranded DNA before loading onto an S4 patterned flow cell and sequenced for 300 cycles in paired-end mode on the Illumina Novaseq 6000 high-throughput short-read platform. The binary base call files in BCL format were demultiplexed and converted into FASTQ format using the Illumina BCL2FASTQ v2.20 software. The sequences with unexpected index combinations due to cross-contamination between multiplexed samples (via tag jumps) remain unclassified during demultiplexing due to the unique dual indexing (UDI) strategy used in the library preparation. The optical duplicates and complementary strand duplicates from the demultiplexed raw reads were filtered out using the CLUMPIFY tool in the BBTOOLS package. The deduplicated sequences containing adapter sequences and low-quality ends (q<10) were trimmed, and sequences shorter than 51 bp or containing more than three uncalled bases (N) after trimming were filtered out using the BBDUK tool in the BBTOOLS package (Bushnell, 2022).

2.6 Taxonomic assignment

We assembled a reference set of protein sequences from all the domains of life using the UniProt reference clusters database v2022_3. The UniRef100 consists of the representative sequences from the entire protein universe with all the redundant sequences and fragments filtered out at 100% identity. Unclassified and artificial sequences were removed and only those UniRef100 sequences with a valid NCBI taxonomic ID under the domains Archaea, Bacteria, Eukaryota, and Viruses were retained. The final set of references was used to build an FM-index of the Burrows-Wheeler transformed sequences compatible with the metagenomic classifier KAIJU (Menzel et al., 2016). The paired-end reads (R1 & R2) were classified separately by querying the six-frame translated reads against the UniRef100 database using the Maximal Exact Matches algorithm implemented in KAIJU. The default parameters of minimum match length (11 aa) and low complexity filtering using the SEG algorithm were retained. We then merged the classification of the respective R1 and R2 reads with the Lowest Common Ancestor algorithm to increase the specificity of the classification of each read pair. Read pairs were left unclassified if one of the reads was not classified under any domains of life or if both the read classifications were based on the same protein fragment. As the majority of species in our ecosystem are not represented in the sequence database, family-level classifications were used for analyzing the diversity within the samples. The total abundance of reads assigned for each classified family was calculated in all the samples. A background rate of classification was also calculated for each family as the total number of reads assigned to its parent order divided by the number of families classified under the parent order. We filtered out the families with less than 1000 classified reads or less than the background rate of classification. We also filtered out any common contaminants showing unusually high abundance and RNA virus families that cannot be directly detected through DNA sequencing. Finally, to assess the effectiveness of the taxonomic assignment, the list of families from the class Actinopteri was compared with the checklist of known fishes of Chilika (Suresh et al., 2018) and the list of proteomes available in UniProt.

2.7 Diversity analysis

We used the incidence-based statistical framework (Colwell et al., 2012) to estimate the asymptotic taxonomic richness using the iNEXT R package (Hsieh et al., 2016). The combined metagenomic data was divided into sampling units of 100 million reads each and the incidence frequencies of each taxon in the sampling units were calculated. We then estimated the asymptotic richness (hill number of 0th order) of various taxa using statistical extrapolation of accumulation curves of the observed incidence frequencies in the sampling units with 100 bootstraps. The Jaccard similarity index was calculated with the Spader R package (Chao and Jost, 2015) using the incidence frequencies of taxa in pairwise samples with a sampling unit size of 10 million reads. Next, we used the R package Phyloseq (McMurdie and Holmes, 2013) to calculate the Bray-Curtis dissimilarity and ordination of the samples. A count matrix of the read counts of all the families found in each sample was generated and converted into a Phyloseq object with the taxonomy and sample metadata. The raw read counts were converted into relative abundances by dividing them by the total classified read count in the respective sample. The Bray-Curtis dissimilarity between the samples was calculated using the distance function of Phyloseq. The dissimilarity matrix was used as input for the non-metric multidimensional scaling (NMDS) method of ordination. The NMDS was run for 20 iterations until the solution was reached with a stress value of less than 0.1. The effect sizes and significance of seasonal and spatial factors were calculated using PERMANOVA implemented in the adonis2 function of the Vegan R package (Oksanen et al., 2022).

3. RESULTS

3.1 Extracellular eDNA is a genetic repertoire of organisms from all the domains of life inhabiting an ecosystem

We investigated whether extracellular eDNA contains detectable amounts of genetic material from organisms across the domains of life by designing and testing a lysis-free and PCR-free workflow (Fig. 1) in a spatiotemporal study conducted at Chilika Lagoon (Fig. 2). Using a random sample of reads from each of the three pilot study samples, the minimum target sequencing depth was estimated to be 416.6 million reads based on the 95% k-mer saturation at about 50 billion observed unique k-mers (Supplementary Fig. 3). Using this as the minimum target sequencing depth, a total of 3.3 trillion bases of data was generated from 16 samples. Upon demultiplexing, we obtained 10.96 billion paired-end reads (150 bp x 2) with a median depth of 658.35 million reads (SD 185.98 million) per sample (Supplementary Table 1). After removing the optical duplicates, complementary strand duplicates, and quality filtering, about 94.37% of the paired-end reads were retained. The high-quality deduplicated paired-end reads were taxonomically classified by querying them against the UniRef100 database containing 162.78 million reference protein sequences from 8539 families of taxa across the tree of life (Supplementary Fig. 4). A total of 6.59 billion reads were classified under all the domains of life with a median classification rate of 64.29% (SD 8.4%) per sample (Supplementary Fig. 5). Remarkably, a significant number of reads were represented in all the domains of life. Overall, the majority of reads were classified under Bacteria (86.95%) followed by Viruses (6.54%), Eukaryota (5.48%), and Archaea (1.01%). This supports our hypothesis that extracellular eDNA is a repertoire of genetic material from organisms across the domains of life inhabiting an ecosystem.