Ultra-deep sequencing enables high-fidelity recovery of biodiversity for bulk arthropod samples without PCR amplification
© Zhou et al.; licensee BioMed Central Ltd. 2013
Received: 19 December 2012
Accepted: 1 March 2013
Published: 27 March 2013
Next-generation-sequencing (NGS) technologies combined with a classic DNA barcoding approach have enabled fast and credible measurement for biodiversity of mixed environmental samples. However, the PCR amplification involved in nearly all existing NGS protocols inevitably introduces taxonomic biases. In the present study, we developed new Illumina pipelines without PCR amplifications to analyze terrestrial arthropod communities.
Mitochondrial enrichment directly followed by Illumina shotgun sequencing, at an ultra-high sequence volume, enabled the recovery of Cytochrome c Oxidase subunit 1 (COI) barcode sequences, which allowed for the estimation of species composition at high fidelity for a terrestrial insect community. With 15.5 Gbp Illumina data, approximately 97% and 92% were detected out of the 37 input Operational Taxonomic Units (OTUs), whether the reference barcode library was used or not, respectively, while only 1 novel OTU was found for the latter. Additionally, relatively strong correlation between the sequencing volume and the total biomass was observed for species from the bulk sample, suggesting a potential solution to reveal relative abundance.
The ability of the new Illumina PCR-free pipeline for DNA metabarcoding to detect small arthropod specimens and its tendency to avoid most, if not all, false positives suggests its great potential in biodiversity-related surveillance, such as in biomonitoring programs. However, further improvement for mitochondrial enrichment is likely needed for the application of the new pipeline in analyzing arthropod communities at higher diversity.
KeywordsNext-generation-sequencing Species richness Abundance Biomonitoring Insect biodiversity Mitochondria PCR-independent Metabarcoding
Given the increasing needs for assessing habitat quality and conserving natural bio-resources, biodiversity composition and its temporal and spatial variations have been evaluated systematically using standardized protocols [1, 2]. National biomonitoring programs have been established globally, such as in the United States, United Kingdom, Australia and Canada [3–6]. Although specific protocols and sampling scales vary across nations, tens of thousands of sampling sites are collected multiple times a year, through which millions of biological specimens are routinely collected, preserved, identified and statistically analyzed . This biological information is used by environmental agencies as the scientific basis for decision-making [2, 3, 5, 8]. However, the major impediment to this application has been the limited capacities in morphological identification of taxonomic diversity for large volumes of biological samples in an accurate and high-throughput manner . Although many diversity analyses employed in biological assessments are of high-quality and credibility, the limited identification capacity may lead to coarse diversity resolution , inconsistent taxonomic delineation across individual researchers and institutes , much reduced sampling scale , and extended turn-around time in routine sample processes .
DNA barcoding, which utilizes a standard gene fragment for species identification, has been widely used to facilitate biodiversity and ecological studies . When the classic DNA barcoding approach, which is optimized based on individual Sanger sequencing, is coupled with next-generation-sequencing (NGS) technologies, the combined method, metabarcoding, shows even greater potential in unveiling molecular characteristics of the entire fauna or flora in question. In particular, metabarcoding has enabled sophisticated analyses of biodiversity in varied environments, ranging from deep-sea meiofauna  to terrestrial insects collected by Malaise traps , while the majority of studies have focused on microbial communities [17, 18]. In addition to the ability to reveal diversity for mixed biological samples, NGS platforms are capable of high-throughput sequencing [19, 20] with short turn-around time (e.g., 24 hours for the Illumina MiSeq and Roche 454 GS FLX + sequencers).
In this study, we aim to develop a new NGS pipeline that is independent of PCR amplifications (Figure 1), while still enabling molecular identification of insects at the species level, using bulk insect samples for the proof of concept. Two major challenges for the complete elimination of PCR amplification need to be resolved first: (a) detection of target DNA sequences at low quantity and (b) taxonomic assignment based on short NGS reads. A 650 bp sequence fragment on the 5′ end of the mtCOI gene has been widely adopted as the DNA barcode region for identifying animal species since its initial proposal . Although mitochondria are found in vast copy numbers in metazoan animals, mitochondria (MT) nucleotides only account for a small fraction of the total DNA compared to nuclear sequences (e.g., 0.05% in Bombyx mori) and sequences of microbial origin in the DNA soup. The ultra-deep sequencing capacity of the Illumina sequencing platform provides an opportunity to examine the feasibility of detecting minute trace of mitochondrial sequences directly from genomic DNA mixtures. For example, each sequencing run of the HiSeq 2000 sequencer is able to produce approximately 600 Gbp, equivalent to 200 human genomes, which is more than 1,000X as much as the capacity of the Roche 454 GS FLX + platform (see review ). On the other hand, the current Illumina sequence reads for HiSeq 2000 can only reach up to 150 bp. This short sequence length presents a limitation to the full utilization of full-length DNA barcodes (approximately 650 bp for animal COI barcodes) available globally (e.g., through the International Barcode of Life initiative ). Although a very short piece of the standard barcodes (“mini-barcode”) of only 130 bp demonstrated reliable taxonomic resolution for a number of animal groups , longer sequences are always preferred for improved identification power. Therefore, we need informatics solutions to assign species-level identity based on mixed short Illumina reads.
To overcome these two major hurdles, the new pipeline employed in the current study involves pre-sequencing enrichment of mitochondria followed by total DNA extraction, shotgun sequencing of isolated total DNA using the Illumina HiSeq 2000, and taxonomic identification of NGS reads. Two different approaches can be applied to assign species-level identity: (1) if an a priori barcode reference library exists for the investigated fauna, NGS reads are mapped to the reference sequences following defined criteria; and (2) when this reference library is absent, de novo assembly of NGS reads into mitochondrial gene fragments, especially the COI barcode region, is employed, followed by gene annotation, to ensure accurate detection of taxa from the mixed bulk sample. Based on the promising results from our in silico simulations using 209 insect mitochondrial genomes obtained from GenBank (details of the design and results of the simulation are provided in Additional file 1: Appendix S1), we apply the new Illumina pipeline to analyze real bulk insect samples. A preliminary study was first performed to gain a basic understanding for the required scale of Illumina sequencing. Details were summarized in Additional file 2: Appendix S2. A formal sample was subsequently sequenced and analyzed with ultra-high sequencing volume (15.5 Gb) to systematically test and discuss this NGS pipeline on biodiversity study in the following text.
The main question of the present work is: can PCR amplifications be avoided in analyzing arthropod biodiversity using the Illumina HiSeq 2000 platform? Specifically, we focus on: (1) how much Illumina sequencing is required with regards to a reliable estimation for species composition from bulk insect samples at a given species setting and (2) what informatics tools can facilitate such analysis? To answer these questions, we use an empirical approach to evaluate the accuracy of species composition estimates (quantifying true positives, false negatives, and false positives). We also measure the impact of two analysis strategies (i.e., the presence or absence of a reference barcode library) for the focal fauna on taxonomic discovery. Additionally, exploiting another potential benefit of abandoning PCR amplifications, we explore the indication for relative abundance of each taxon present in the bulk sample. Lastly, we discuss the pros and cons of this new pipeline and provide suggestions for potential improvements to achieve its real-world application.
Seventy-three insect individuals were collected from a mountainous habitat in the sub-tropical region of China (22°36′01.38”N, 114°16′00.76”E, approximately 340 m above sea level [ASL]) on October 5th 2011 and preserved in 99.5% ethanol at 4°C for one month. Every individual was identified morphologically by authors of the paper to the finest taxonomic level as much as possible. Genomic DNA was first extracted from a single leg of each specimen for DNA barcoding. Standard COI barcodes for all specimens were individually Sanger sequenced and compared against the Barcode of Life Data Systems for taxonomic confirmation. Subsequently, after mitochondrial enrichment using differential centrifugation, total genomic DNA was extracted from homogenized tissues for mixed insect sample. The DNA library of an insert size of 200 bp was constructed for the bulk sample and then sequenced on an Illumina HiSeq 2000 analyzer at BGI (Shenzhen, China) using 100 bp paired-end (PE) sequencing, following the manufacturer’s instruction. The DNA library was analyzed using an ultra-deep sequencing strategy (approximately 15.5 Gb) to examine the impacts of sequencing volume on biodiversity recovery.
Sequences containing adaptor contaminations (with > 15 bp matched to the adapter sequence) and poly-Ns were filtered out. PE reads were removed from subsequent analyses if >10 bases were of low quality scores (<20, i.e., sequencing error rate > 1%). A new dataset containing only high-quality reads was created after these filtering steps, and is available from the GigaScience Database .
Reference barcode library
Sample composition, sequencing information and COI recovery rates of the bulk insect sample
Number of Individuals
Number of COI barcodes obtained
Number of MOTUs (2%)
Raw data size (Gb)
High quality data size (Gb)
Discovery rate (with reference)
Discovery rate (no reference)
Assembly coverage rate (% MT genomes)
Total length and percentage of COI genes 1
Number of assembled mitochondrial genes 2
We developed two independent bioinformatics pipelines for when the COI barcode reference library is available for the focal fauna and when one is not available.
Sequence coverage (percentage of a reference sequence matched by Illumina short reads) was used as a criterion for taxonomic detection (Additional file 1: Appendix S1, Additional file 3: Figure S2). Sequence coverage was calculated by aligning full length PE reads to the reference. A reference sequence was considered matched only when > 90% nucleotides were covered by query reads at ≥ 99% identity (Additional file 3: Figure S2).
By aligning high-quality reads to the Sanger library, we detected 36 MOTUs out of 37 references (97.3%). The only exception was one species (OTU51) that was represented by a single specimen with a body length < 2 mm (Additional file 4: Table S2, Additional file 3: Figure S3).
Reference independent method
Results of de novo assembly for the 35 detected COI scaffolds and 370 mitochondrial scaffolds
Length of COI scaffolds (bp)
Length of Mitochondrial scaffolds (bp)
Impacts of sequencing volume on discoveries of taxonomic richness
Between the reference-based and reference independent methods (Figure 3A), it was also obvious that the increase of sequencing volume made larger improvement on the latter, where sequence assembly was a critical step in the bioinformatics pipeline. Our experience showed that an average sequencing depth of 10X would generally produce a good assembly. In fact, the number of MOTUs with sequencing depth < 10X was decreased from 24 to 9 at 1 Gb and 13 Gb, respectively.
Correlation between sequencing volume and biomass
Ultra-deep sequencing enables detection for small trace of mitochondrial sequences
NGS technologies have been employed in biodiversity analyses for varied environments, with proposed advantages in throughput and cost. However, most of these studies have been based on PCR amplifications, especially in eukaryotes, which typically possess large-sized genomes. Although the incorporation of PCR amplifications has the obvious advantage in producing sufficient amplicons of the targeted gene fragments, such practice often introduces taxonomic biases due to varying primer efficiencies across taxa and sequence errors caused by mismatches of complementary strains to the DNA templates [7, 21, 23, 26, 28, 47]. The main challenge in reducing these biases by eliminating PCR is to ensure that the community diversity can be accurately measured through the small amount of mitochondrial sequences in the mixed DNA soup. In this study, this objective is achieved through mitochondrial enrichment and deep sequencing.
The mitochondrial enrichment protocol employed in this study clearly has room for improvement. The whole enrichment procedure yielded mitochondrial DNA accounting for only 0.53% of the total raw data (15.5 Gb), which was approximately 10X higher than the original judging by the corresponding MT sequence proportion found in the silk worm (0.05% ). A much smaller percentage (0.03%) belonged to COI genes. However, even with this small fraction of the total sequence data, 74% of the total mitochondrial genomes presented in the bulk sample had been successfully assembled, while 96% of the total COI sequences had been covered (Table 1) – a clear benefit of the deep sequencing capacity of Illumina’s HiSeq 2000 sequencer. This minute proportion of mitochondrial sequences contains almost 5 million base pairs of high-quality data belonging to COI genes, equivalent to >7,000 full-length DNA barcodes, which is almost comparable to the total raw data capacity of an entire 454 run. On average, each of the 37 member species in the insect community has been covered by more than 200 full-length barcodes equivalence of sequence reads. It is this high sequence coverage that has ensured species recovery at high accuracy.
PCR-independent method delivers species recovery at high fidelity
These COI sequences detected by our NGS pipeline enabled recoveries for 97% (with a reference barcode library) and 92% (without reference) of the total taxa in the bulk insect sample. To our knowledge, these results represent the highest rates of “true positive” discoveries in all published NGS analyses for arthropod diversity that has a controlled reference.
All taxa missed by the NGS pipeline (false negatives) are characterized by low biomass and are subsequently covered by only low sequencing depth. Only 1 and 3 taxa were not detected by the reference-based and reference-independent methods, respectively. All missed specimens had a body length < 5 mm (Additional file 3: Figure S5).
Our pipeline, which eliminates PCR amplification, rarely picks up novel taxa, and thus avoids the “taxon inflation” common to other NGS-based biodiversity assessments [16, 33]. Only 1 detected taxon identified molecularly as Lepidoptera was found absent in the reference barcode library. Based on its good assembly quality (yet low read coverage) and highly congruent amino acid sequence composition compared to lepidopteran barcodes in BOLD, we argue that this novel sequence is not a real ‘false positive’. The exact source for this ‘novel taxon’ is unclear, but possibilities might include gut content, undetected tissue (damaged pieces, eggs, small body-size, etc.) in the insect mixture, environmental DNA and so on. The extremely low rate of novel taxa is due to the stringent algorithm involved in the matching criteria based on sequence coverage (reference-based) and de novo assembly of Illumina shotgun reads (reference independent), which eliminates nearly all sequence errors.
A potential way to obtain relative abundance?
The correlation between sequencing volume (the number of nucleotides) and biomass of a given species has not only produced biological production information of the insect community, but also provided a possible solution to investigating relative abundance of each arthropod taxon present in the bulk sample. This possibility relies on the availability of a reference barcode library for the focal fauna and a database for the range of biomass of each of the arthropod species in this fauna. Presumably, the number of individuals of a given species can be estimated by the total biomass – that was calculated from the total nucleotide base pairs of this species – divided by an average biomass of the target species. Apparently, a well-designed test, including a wider range of taxa with varied biomasses, is needed to draw solid conclusion on the feasibility of this approach, which is beyond the scope of this paper. Nevertheless, our new NGS protocol, especially the elimination of PCR amplifications, has created an alternative way to quantify abundance (although as a statistical estimation) – the critical information concerned in ecological studies.
Further improvements for applications in real-world scenario
Admittedly, the insect bulk sample tested in the present study only represents biodiversity examples at a moderate level. Although our diversity scale is comparable to some recent work (e.g., [33, 48]), community samples consisting of more complex components (e.g., more than hundreds of species from wider taxonomic ranges) can be expected in many real-world sampling efforts. Nonetheless, our work provides an invaluable first-hand knowledge on the requirement of sequencing volume when PCR is avoided. Based on our findings in the correlations between sequencing volume and discoveries of MOTU and biomass (Figure 3), a simple extrapolation suggests that it may require multiple lanes of Illumina HiSeq data to handle bulk insect samples containing hundreds of species. Therefore, a more efficient mitochondrial enrichment procedure is desired to enlarge the sampling capacity per Illumina lane. Provided that the current sequencing cost for Hiseq 2000 (based on the list price from the manufacturer) is approximately $40 per gigabase, the sequence yield per run is 600 Gb , and the sequencing volume needed to reveal insect richness from bulk samples (based on empirical data of this paper), we calculate the average cost for discovering a single insect species is less than $20. This cost is already very close to that of classic DNA barcoding of an individual specimen using Sanger sequencing. Given the trajectory of NGS cost reduction over the past years, it is reasonable to expect wide adoption of a PCR-free Illumina shotgun approach in routine biodiversity studies.
Furthermore, alternative protocols for DNA preservation and mitochondrial isolation are expected to greatly increase the portion of mtDNA sequences in the total DNA sample, which can be translated into less sequencing and lower cost. Preservation media, such as DESS , that help to maintain the integrity of mitochondria and circular DNA, coupled with enzymes digesting linear DNA may help to vastly remove nuclear and even bacterial DNA. A DNA-capture-based method might perform better for small volume of targeted DNA . While our mitochondrion enrichment protocol is able to handle specimens preserved in ethanol for 1 month, a DNA-capture-based method might be more appropriate for older specimens with more highly degraded DNA. For bulk samples containing arthropods with a large range of body sizes, a pre-sorting of large samples from smaller ones followed by sub-sampling and re-mixing (tissue normalizing) should help to improve the detection of small individuals, given the total sequencing volume remains at the same level.
It is also clear that a comprehensive DNA barcode library and a genomic database in general will significantly improve our analytical efficiency in the NGS analysis for biodiversity composition. These databases will not only improve species detection rates (reference based vs. reference independent methods) for arthropods, but also help to reach a better understanding of the presence of the much broader diversity (e.g., algae, bacteria, virus) in the bulk sample.
In summary, the ultra-high sequencing capacity of the Illumina HiSeq 2000 platform provides a new NGS solution that avoids the use of PCR amplifications of particular gene markers in arthropod biodiversity analysis. This new pipeline not only reveals species richness in high fidelity, but also creates a possibility to reveal relative abundance of each taxon present in the bulk sample. The ability to detect small arthropod specimens and its tendency to avoid most, if not all, false positives has suggested its great potential in biodiversity related surveys, such as in biomonitoring programs. However, it is crucial to improve mitochondrial enrichment to ensure applications of the new NGS pipeline in analyzing more complex biodiversity settings.
In this study, we chose insects as the model system to test our new pipeline because they had been widely used in environmental quality assessments and biomonitoring programs . Insect samples were collected using a 15 W black-light trap (Bioquip, CA, USA) and a white pan filled with 99.5% ethanol. A mountainous habitat, Beishan, close to the BGI’s headquarters in Shenzhen, China, was sampled on October 2 (Preliminary Sample, 22°35′38.94”N, 114°15′54.64”E, approximately 55 m ASL) and October 5 (Formal Sample, 22°36′01.38”N, 114°16′00.76”E, approximately 340 m ASL) in 2011. The two collecting sites differed around 290 m in elevation with similar vegetation. A total of 89 and 73 individuals were collected, respectively, and preserved as two separate bulk samples at 4°C in the laboratory. All specimens were individually barcoded, photographed and measured for body-length. Taxonomic identification was achieved using morphological characters and by comparisons of COI barcodes against the Barcode of Life Data Systems .
Mitochondrial DNA enrichment and extraction
Mitochondrial enrichment was conducted using differential centrifugation  with protocols optimized for mixed insects. Bulk samples were preserved in 99.5% ethanol at 4°C for 1 week (Preliminary Sample) and 1 month (Formal Sample) before isolation. All insects were removed from ethanol and washed with chilled 1X MS buffer (210 mM mannitol, 70 mM sucrose, 5 mM TrisHCl, 1 mM EDTA) to remove residual ethanol. Large insects were firstly sliced into small pieces to facilitate the homogenization process. All individuals were then pooled and homogenized using an IKA T-10 basic homogenizer (IKA, Germany). To alleviate mitochondrial deterioration, homogenation was performed in chilled 1X MS buffer (10 times larger than the volume of the total tissue sample) in a 50 mL polypropylene conical tube for 5 minutes. The homogenate was then centrifuged at 1,300 g at 4°C to remove nucleic and cellular debris. Collected supernatant was centrifuged at 17,000 g for 30 min to enrich mitochondria.
The isolated mitochondria were lysed in the mitochondrion lysis buffer (0.15 M NaCl, 10 mM TrisHCl, 1 mM EDTA), with 5% SDS and 0.5 mg/ml proteinase K. The extracted total DNA was purified using phenol-chloroform and isoamyl alcohol mixture. DNA was isolated using NH4Ac and absolute alcohol. Finally, DNA was dissolved in TE buffer (10 mM Tris HCl, 1 mM EDTA).
Construction of DNA barcode references
Two legs of each specimen were removed for genomic DNA extraction and subsequent acquisition of COI barcodes, with the rest of body tissues saved for NGS analyses. A DNA release method with Cell Lysis Solution and Proteinase K was used to release genomic DNA , following standard barcoding protocols used by the Canadian Center for DNA Barcoding . Two different primer sets were used in a two-tiered PCR amplification: LepF1/LepR1 , followed by LCO1490/HCO2198  when the first round failed. Individual DNA was amplified with a 25 μl-volume reaction, containing 1 μl of genomic DNA, 1 μl of each forward and reverse primers, 16.2 μl of H2O, 2.5 μl of 10X buffer, 3 μl of dNTP mix, and 0.3 μl of Ex-taq DNA polymerase (TaKaRa Biosystems). PCR products were visualized using 2% agarose gels. Positive PCR products were then Sanger sequenced using an ABI 3730XL DNA sequencer at BGI.
A NJ tree was built for all Sanger sequences using MEGA V5.0  and Interactive Tree Of Life  with a Kimura-2-Parameter setting and pair-wise distances. This COI tree based on sequence similarity shows sequence divergences within and among MOTUs and roughly represents taxonomic diversity.
Coverage calculation of reference based method
COI sequences obtained in this study via individual Sanger sequencing were used as the reference for taxonomic assignments of the short Illumina reads. Sanger references were first clustered into MOTUs using an arbitrary threshold of 2%. Shotgun reads were then aligned to the Sanger references using the program BLASTN , allowing 99% identity and full length alignment of the PE reads. The qualified reads was then used to calculate the overall coverage of reference barcodes using SOAP coverage . A reference sequence was considered matched only when > 90% coverage was reached.
Assembly and annotation of reference independent method
High quality Illumina reads (including non-COI mitochondrial reads) were assembled into contigs and scaffolds using the genome assembler program SOAPdenovo2 [45, 46], independent of a COI library. Through the construction of de Bruijn graphs, contigs were built using a k-mer with a size of 61 bp and resolving repeats by reads. Reads were then aligned to contigs with a 45 bp k-mer and unmasking contigs with high coverage. The paired-end information embedded in sequence reads was also referred in scaffold constructions.
All mitochondrial protein-coding genes, including the COI barcode region, were annotated using homolog prediction. Assembled scaffolds were aligned to a set of 244 complete mitochondrial genomes obtained from GenBank using TBLASTN with an e-value ≤10−5. The BLAST results were then used to determine gene ontology (e.g., mRNA and coding sequence regions) using Genewise , which was confirmed by DOGMA . Ribosomal RNA genes were annotated using BLAST by comparing to the 244 mitochondrial genomic rRNA database with an e-value ≤10−5 and annotated when its length was ≥ 200 bp or ≥ 300 bp, for the 12S or 16S rRNAs, respectively. Genes that could not be annotated through these steps were then manually annotated by aligning to corresponding gene sequences of the 244 mitochondrial genomes and by examining amino acid translation. For example, the ATP8 gene is relatively short (typically ~159 bp) and is difficult for automated annotation. In addition to the ATP8 genes obtained using the homolog prediction method, we further aligned regions between annotated COX2 and ATP6 and manually identified more ATP8 genes. The highly conservative protein sequences translated from all annotated ATP8 genes confirmed both assembly and annotation.
The accuracy of assembly was validated by PCR and Sanger sequencing for a set of randomly selected genes assembled and annotated from the preliminary sample (Additional file 2: Appendix S2, Additional file 3: Figure S6). All Sanger results were identical to the assembly. PCR validation was only conducted on preliminary sample because the formal sample was expected to contain assemblies of even higher quality due to its much larger volume of data.
Biomass estimation and the correlation analysis
Different biomass equations [64–66] were utilized to calculate biomass based on body size. No significant differences were observed among these methods based on R2 and P value of coefficients (Additional file 3: Figure S4). In the main text, we presented biomass estimation using one of these equations: ln(Y) = ln(a) + b*ln(X), where X represented length*width and Y referred to biomass; a, b were parameters for different insect orders . Body sizes were measured using a metric scale set beside the insect specimen during photographing. Biomass was categorized into different orders. We plotted the sequence volume (in base pairs) of a given species to the total biomass of that species (including all individuals of the same species). To examine the correlation between sequencing volume and biomass, correlation analysis was performed on the datasets. Results were tested using F statistic of regression equation and t statistic of regression coefficient test. The P value of both statistics were < 0.001, reaching a significant level.
Availability and requirements
This pipeline has been adapted from SOAPdenovo2  and therefore distributed under the same license terms.
Project name: zhou2013 (PCR-free pipeline for DNA metabarcoding)
Project home page: https://github.com/gigascience/papers/tree/master/zhou2013
Operating system(s): Unix, Linux, Mac
Programming language: PERL
Other requirements: GCC version ≥ 4.4.5
License: GNU General Public License version 3.0 (GPLv3)
Any restrictions to use by non-academics: none
Availability of supporting data
Above sea level
Cytochrome c oxidase subunit 1
Molecular operational taxonomic unit
Next generation sequencing
Operational taxonomic units
Polymerase chain reaction.
This study is supported by the National High-tech Research and Development Project (863) of China (2012AA021601), by grant to Shenzhen Key Laboratory of Environmental Microbial Genomics and Application, and by funding provided by the China National GeneBank and BGI. We thank Zheng Huang of BGI for her assistance in preparing the figures. We also thank the editors and three reviewers for their comments.
- Bonada N, Prat N, Resh VH, Statzner B: Developments in aquatic insect biomonitoring: a comparative analysis of recent approaches. Annu Rev Entomol. 2006, 51: 495-523. 10.1146/annurev.ento.51.110104.151124.View ArticlePubMed
- Rosenberg DM, Resh VH: An introduction to freshwater biomonitoring and benthic invertebrates. Freshwater Biomonitoring and Benthic Macroinvertebrates. Edited by: Rosenberg D, Resh V. 1993, New York: Chapman and Hall, 1-9.
- USEPA: Summary of biological assessment programs and biocriteria development for states, tribes, territories, and interstate commissions: streams and wadeable rivers. EPA 822-R-02-048. 2002, Washington, DC: US Environmental Protection Agency, Office of Environmental Information and Office of Water
- Brereton T, Roy D, Middlebrook I, Botham M, Warren M: The development of butterfly indicators in the United Kingdom and assessments in 2010. J Insect Conserv. 2011, 15: 139-151. 10.1007/s10841-010-9333-z.View Article
- Saintilan N, Imgraben S: Principles for the monitoring and evaluation of wetland extent, condition and function in Australia. Environ Monit Assess. 2012, 184: 595-606. 10.1007/s10661-011-2405-z.View ArticlePubMed
- Carter JL, Resh VH, Rosenberg DM, Reynoldson TB: Biological monitoring of rivers. Biomonitoring in North American rivers: a comparison of methods used for benthic macroinvertebrates in Canada and the United States. 2006, John Wiley & Sons, Ltd, 203-228.
- Carter JL, Resh VH: After site selection and before data analysis: sampling, sorting, and laboratory procedures used in stream benthic macroinvertebrate monitoring programs by USA state agencies. J N Am Bentholl Soc. 2001, 20: 658-682. 10.2307/1468095.View Article
- USEPA: Wadeable streams assessment: a collaborative survey of the Nation’s streams. EPA 841-B-06-002. 2006, US Environmental Protection Agency Washington, Office of Research and Development, Office of Water, DC
- Baird DJ, Hajibabaei M: Biomonitoring 2.0: a new paradigm in ecosystem assessment made possible by next-generation DNA sequencing. Mol Ecol. 2012, 21: 2039-2044. 10.1111/j.1365-294X.2012.05519.x.View ArticlePubMed
- Resh VH, Unzicker JD: Water quality monitoring and aquatic organisms: the importance of species identification. J Water Pollut Control Fed. 1975, 47: 9-19.PubMed
- Haase P, Murray-Bligh J, Lohse S, Pauls S, Sundermann A, Gunn R, Clarke R: Assessing the impact of errors in sorting and identifying macroinvertebrate samples. The Ecological Status of European Rivers: Evaluation and Intercalibration of Assessment Methods. 2006, 188: 505-521. 10.1007/978-1-4020-5493-8_34.
- Pfrender ME, Hawkins CP, Bagley M, Courtney GW, Creutzburg BR, Epler JH, Fend S, Ferrington LC, Hartzell PL, Jackson S: Assessing macroinvertebrate biodiversity in freshwater ecosystems: advances and challenges in DNA-based approaches. Q Rev Biol. 2010, 85: 319-340. 10.1086/655118.View ArticlePubMed
- Hebert PDN, Cywinska A, Ball SL: Biological identifications through DNA barcodes. Proc R Soc Lond B Biol Sci. 2003, 270: 313-321. 10.1098/rspb.2002.2218.View Article
- Taberlet P, Coissac E, Hajibabaei M, Rieseberg LH: Environmental DNA. Mol Ecol. 2012, 21: 1789-1793. 10.1111/j.1365-294X.2012.05542.x.View ArticlePubMed
- Lecroq B, Lejzerowicz F, Bachar D, Christen R, Esling P, Baerlocher L, Osteras M, Farinelli L, Pawlowski J: Ultra-deep sequencing of foraminiferal microbarcodes unveils hidden richness of early monothalamous lineages in deep-sea sediments. Proc Natl Acad Sci U S A. 2011, 108: 13177-13182. 10.1073/pnas.1018426108.PubMed CentralView ArticlePubMed
- Yu DW, Ji Y, Emerson BC, Wang X, Ye C, Yang C, Ding Z: Biodiversity soup: metabarcoding of arthropods for rapid biodiversity assessment and biomonitoring. Methods Ecol Evol. 2012, 3: 613-623. 10.1111/j.2041-210X.2012.00198.x.View Article
- Huse SM, Dethlefsen L, Huber JA, Welch DM, Relman DA, Sogin ML: Exploring microbial diversity and taxonomy using SSU rRNA hypervariable tag sequencing. PLoS Genet. 2008, 4: e1000255-10.1371/journal.pgen.1000255.PubMed CentralView ArticlePubMed
- Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, Turnbaugh PJ, Fierer N, Knight R: Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci U S A. 2011, 108 (Suppl 1): 4516-4522.PubMed CentralView ArticlePubMed
- Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Huntley J, Fierer N, Owens SM, Betley J, Fraser L, Bauer M: Ultra-high-throughput microbial community analysis on the Illumina HiSeq and MiSeq platforms. ISME J. 2012, 6: 1621-1624. 10.1038/ismej.2012.8.PubMed CentralView ArticlePubMed
- Shokralla S, Spall JL, Gibson JF, Hajibabaei M: Next-generation sequencing technologies for environmental DNA research. Mol Ecol. 2012, 21: 1794-1805. 10.1111/j.1365-294X.2012.05538.x.View ArticlePubMed
- Bellemain E, Carlsen T, Brochmann C, Coissac E, Taberlet P, Kauserud H: ITS as an environmental DNA barcode for fungi: an in silico approach reveals potential PCR biases. BMC Microbiol. 2010, 10: 189-10.1186/1471-2180-10-189.PubMed CentralView ArticlePubMed
- Brown DS, Jarman SN, Symondson WO: Pyrosequencing of prey DNA in reptile faeces: analysis of earthworm consumption by slow worms. Mol Ecol Resour. 2012, 12: 259-266. 10.1111/j.1755-0998.2011.03098.x.View ArticlePubMed
- Deagle BE, Kirkwood R, Jarman SN: Analysis of Australian fur seal diet by pyrosequencing prey DNA in faeces. Mol Ecol. 2009, 18: 2022-2038. 10.1111/j.1365-294X.2009.04158.x.View ArticlePubMed
- Gomez-Alvarez V, Teal TK, Schmidt TM: Systematic artifacts in metagenomes from complex microbial communities. ISME J. 2009, 3: 1314-1317. 10.1038/ismej.2009.72.View ArticlePubMed
- Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O’Dwyer JP, Green JL, Eisen JA, Pollard KS: PhylOTU: a high-throughput procedure quantifies microbial community diversity and resolves novel taxa from metagenomic data. PLoS Comput Biol. 2011, 7: e1001061-10.1371/journal.pcbi.1001061.PubMed CentralView ArticlePubMed
- Claesson MJ, Wang Q, O’Sullivan O, Greene-Diniz R, Cole JR, Ross RP, O’Toole PW: Comparison of two next-generation sequencing technologies for resolving highly complex microbiota composition using tandem variable 16S rRNA gene regions. Nucleic Acids Res. 2010, 38: e200-10.1093/nar/gkq873.PubMed CentralView ArticlePubMed
- Tedersoo L, Nilsson RH, Abarenkov K, Jairus T, Sadam A, Saar I, Bahram M, Bechem E, Chuyong G, Kõljalg U: 454 Pyrosequencing and Sanger sequencing of tropical mycorrhizal fungi provide similar results but reveal substantial methodological biases. New Phytol. 2010, 188: 291-301. 10.1111/j.1469-8137.2010.03373.x.View ArticlePubMed
- Arif IA, Khan HA, Al Sadoon M, Shobrak M: Limited efficiency of universal mini-barcode primers for DNA amplification from desert reptiles, birds and mammals. Genet Mol Res. 2011, 10: 3559-3564. 10.4238/2011.October.31.3.View ArticlePubMed
- Taberlet P, Gielly L, Pautou G, Bouvet J: Universal primers for amplification of three non-coding regions of chloroplast DNA. Plant Mol Biol. 1991, 17: 1105-1109. 10.1007/BF00037152.View ArticlePubMed
- Folmer O, Black M, Hoeh W, Lutz R, Vrijenhoek R: DNA primers for amplification of mitochondrial cytochrome c oxidase subunit I from diverse metazoan invertebrates. Mol Mar Biol Biotechnol. 1994, 3: 294-299.PubMed
- Ivanova NV, Zemlak TS, Hanner RH, Hebert PDN: Universal primer cocktails for fish DNA barcoding. Mol Ecol Notes. 2007, 7: 544-548. 10.1111/j.1471-8286.2007.01748.x.View Article
- Coissac E, Riaz T, Puillandre N: Bioinformatic challenges for DNA metabarcoding of plants and animals. Mol Ecol. 2012, 21: 1834-1847. 10.1111/j.1365-294X.2012.05550.x.View ArticlePubMed
- Hajibabaei M, Shokralla S, Zhou X, Singer GA, Baird DJ: Environmental barcoding: a next-generation sequencing approach for biomonitoring applications using river benthos. PLoS One. 2011, 6: e17497-10.1371/journal.pone.0017497.PubMed CentralView ArticlePubMed
- Rougerie R, Smith MA, Fernandez-Triana J, Lopez-Vaamonde C, Ratnasingham S, Hebert PD: Molecular analysis of parasitoid linkages (MAPL): gut contents of adult parasitoid wasps reveal larval host. Mol Ecol. 2010, 20: 179-186.View ArticlePubMed
- Li Y, Zhou X, Feng G, Hu H, Niu L, Hebert PD, Huang D: COI and ITS2 sequences delimit species, reveal cryptic taxa and host specificity of fig-associated Sycophila (Hymenoptera, Eurytomidae). Mol Ecol Resour. 2010, 10: 31-40. 10.1111/j.1755-0998.2009.02671.x.View ArticlePubMed
- Porazinska DL, Sung W, Giblin-Davis RM, Thomas WK: Reproducibility of read numbers in high-throughput sequencing analysis of nematode community composition and structure. Mol Ecol Resour. 2010, 10: 666-676.View ArticlePubMed
- Soininen EM, Valentini A, Coissac E, Miquel C, Gielly L, Brochmann C, Brysting AK, Sønstebø JH, Ims RA, Yoccoz NG: Analysing diet of small herbivores: the efficiency of DNA barcoding coupled with high-throughput pyrosequencing for deciphering the composition of complex plant mixtures. Front Zool. 2009, 6: 16-10.1186/1742-9994-6-16.PubMed CentralView ArticlePubMed
- Kowalczyk R, Taberlet P, Coissac E, Valentini A, Miquel C, Kamiński T, Wójcik JM: Influence of management practices on large herbivore diet—case of european bison in białowieża primeval forest (Poland). For Ecol Manage. 2011, 261: 821-828. 10.1016/j.foreco.2010.11.026.View Article
- Taberlet P, Coissac E, Pompanon F, Brochmann C, Willerslev E: Towards next‐generation biodiversity assessment using DNA metabarcoding. Mol Ecol. 2012, 21: 2045-2050. 10.1111/j.1365-294X.2012.05470.x.View ArticlePubMed
- Pond SK, Wadhawan S, Chiaromonte F, Ananda G, Chung WY, Taylor J, Nekrutenko A: Windshield splatter analysis with the galaxy metagenomic pipeline. Genome Res. 2009, 19: 2144-2153. 10.1101/gr.094508.109.View Article
- Xia Q, Guo Y, Zhang Z, Li D, Xuan Z, Li Z, Dai F, Li Y, Cheng D, Li R: Complete resequencing of 40 genomes reveals domestication events and genes in silkworm (Bombyx). Science. 2009, 326: 433-436. 10.1126/science.1176620.PubMed CentralView ArticlePubMed
- International Barcode of Life initiative.http://ibol.org/.
- Meusnier I, Singer GAC, Landry JF, Hickey DA, Hebert PDN, Hajibabaei M: A universal DNA mini-barcode for biodiversity analysis. BMC Genomics. 2008, 9: 214-10.1186/1471-2164-9-214.PubMed CentralView ArticlePubMed
- Zhou X, Li Y, Liu S, Yang Q, Su X, Zhou L, Tang M, Fu R, Li J: Raw data, assembly and annotation results for: “Ultra-deep sequencing enables high-fidelity recovery of biodiversity for bulk arthropod samples without PCR amplification”. GigaScience. 2013, Database http://dx.doi.org/10.5524/100045
- Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010, 20: 265-272. 10.1101/gr.097261.109.PubMed CentralView ArticlePubMed
- Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012, 1: 18-PubMed CentralView ArticlePubMed
- Frank JA, Reich CI, Sharma S, Weisbaum JS, Wilson BA, Olsen GJ: Critical evaluation of two primers commonly used for amplification of bacterial 16S rRNA genes. Appl Environ Microbiol. 2008, 74: 2461-2470. 10.1128/AEM.02272-07.PubMed CentralView ArticlePubMed
- Hajibabaei M, Spall JL, Shokralla S, van Konynenburg S: Assessing biodiversity of a freshwater benthic macroinvertebrate community through non-destructive environmental barcoding of DNA from preservative ethanol. BMC Ecol. 2012, 12: 28-10.1186/1472-6785-12-28.PubMed CentralView ArticlePubMed
- Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, Bertoni A, Swerdlow HP, Gu Y: A tale of three next generation sequencing platforms: comparison of Ion torrent, pacific biosciences and illumina MiSeq sequencers. BMC Genomics. 2012, 13: 341-10.1186/1471-2164-13-341.PubMed CentralView ArticlePubMed
- Fonseca VG, Carvalho GR, Sung W, Johnson HF, Power DM, Neill SP, Packer M, Blaxter ML, Lambshead PJD, Thomas WK: Second-generation environmental sequencing unmasks marine metazoan biodiversity. Nat Commun. 2010, 1: 98-10.1038/ncomms1095.PubMed CentralView ArticlePubMed
- Mason VC, Li G, Helgen KM, Murphy WJ: Efficient cross-species capture hybridization and next-generation sequencing of mitochondrial genomes from noninvasively sampled museum specimens. Genome Res. 2011, 21: 1695-1704. 10.1101/gr.120196.111.PubMed CentralView ArticlePubMed
- McGeoch MA: The selection, testing and application of terrestrial insects as bioindicators. Biol Rev Camb Philos Soc. 1998, 73: 181-201. 10.1017/S000632319700515X.View Article
- Barcode of Life Data Systems (BOLD).http://www.boldsystems.org,
- Tamura K, Aotsuka T: Rapid isolation method of animal mitochondrial DNA by the alkaline lysis procedure. Biochem Genet. 1988, 26: 815-819.View ArticlePubMed
- Hajibabaei M, Ivanova NV, Ratnasingham S, Dooh RT, Kirk SL, Mackie PM, Hebert PDN: Critical factors for assembling a high volume of DNA barcodes. Phil Trans Biol Sci. 2005, 360: 1959-1967. 10.1098/rstb.2005.1727. 56View Article
- Canadian Center for DNA Barcoding (CCDB).http://www.dnabarcoding.ca/pa/ge/research/protocols.
- Hebert PDN, Penton EH, Burns JM, Janzen DH, Hallwachs W: Ten species in one: DNA barcoding reveals cryptic species in the neotropical skipper butterfly astraptes fulgerator. Proc Natl Acad Sci U S A. 2004, 101: 14812-10.1073/pnas.0406166101.PubMed CentralView ArticlePubMed
- Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S: MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol. 2011, 28: 2731-2739. 10.1093/molbev/msr121.PubMed CentralView ArticlePubMed
- Letunic I, Bork P: Interactive Tree Of Life v2: online annotation and display of phylogenetic trees made easy. Nucleic Acids Res. 2011, 39: W475-W478. 10.1093/nar/gkr201.PubMed CentralView ArticlePubMed
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.View ArticlePubMed
- Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 1966, 2009: 25-
- Birney E, Clamp M, Durbin R: GeneWise and genomewise. Genome Res. 2004, 14: 988-995. 10.1101/gr.1865504.PubMed CentralView ArticlePubMed
- Wyman SK, Jansen RK, Boore JL: Automatic annotation of organellar genomes with DOGMA. Bioinformatics. 2004, 20: 3252-10.1093/bioinformatics/bth352.View ArticlePubMed
- Gruner DS: Regressions of length and width to predict arthropod biomass in the Hawaiian Islands. Pac Sci. 2003, 57: 325-336. 10.1353/psc.2003.0021.View Article
- Ganihar S: Biomass estimates of terrestrial arthropods based on body length. J Biosci. 1997, 22: 219-224. 10.1007/BF02704734.View Article
- Hódar J: The use of regression equations for the estimation of prey length and biomass in diet studies of insectivore vertebrates. Miscellània Zoologica. 1997, 20: 1-10.
- Zhou X, Li Y, Liu S, Yang Q, Su X, Zhou L, Tang M, Fu R, Li J, Huang Q: Software and supporting material for: “Ultra-deep sequencing enables high-fidelity recovery of biodiversity for bulk arthropod samples without PCR amplification”. GigaScience. 2013,http://dx.doi.org/10.5524/100046.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.