SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler
- Ruibang Luo†1, 2,
- Binghang Liu†1, 2,
- Yinlong Xie†1, 2, 3,
- Zhenyu Li1, 2,
- Weihua Huang1,
- Jianying Yuan1,
- Guangzhu He1,
- Yanxiang Chen1,
- Qi Pan1,
- Yunjie Liu1,
- Jingbo Tang1,
- Gengxiong Wu1,
- Hao Zhang1,
- Yujian Shi1,
- Yong Liu1,
- Chang Yu1,
- Bo Wang1,
- Yao Lu1,
- Changlei Han1,
- David W Cheung2,
- Siu-Ming Yiu2,
- Shaoliang Peng4,
- Zhu Xiaoqian4,
- Guangming Liu4,
- Xiangke Liao4,
- Yingrui Li1, 2,
- Huanming Yang1,
- Jian Wang1,
- Tak-Wah Lam2 (corresponding author) and
- Jun Wang1 (corresponding author)
© Luo et al.; licensee BioMed Central Ltd. 2012
Received: 24 July 2012
Accepted: 10 December 2012
Published: 27 December 2012
The number of de novo genome assemblies built from next-generation sequencing (NGS) short reads is increasing rapidly; however, several major challenges must still be overcome for such assembly to be efficient and accurate. SOAPdenovo has been successfully applied to assemble many published genomes, but it still needs improvement in continuity, accuracy and coverage, especially in repeat regions.
To overcome these challenges, we have developed its successor, SOAPdenovo2, whose new algorithm design reduces memory consumption in graph construction, resolves more repeat regions in contig assembly, increases coverage and length in scaffold construction, improves gap closing, and is optimized for large genomes.
Benchmarks using the Assemblathon1 and GAGE datasets showed that SOAPdenovo2 greatly surpasses its predecessor SOAPdenovo and is competitive with other assemblers in both assembly length and accuracy. We also provide an updated assembly of the 2008 Asian (YH) genome using SOAPdenovo2. The contig and scaffold N50 of the new YH assembly were ~20.9 kbp and ~22 Mbp, respectively, which are 3-fold and 50-fold longer than those of the first published version. Genome coverage increased from 81.16% to 93.91%, and peak memory consumption was ~2/3 lower.
Keywords: Genome, Assembly, Contig, Scaffold, Error correction, Gap-filling
The increased use of next generation sequencing (NGS) has led to rapid growth in the number of de novo genome assemblies carried out using short reads. Although several de novo assemblers are available, there remains room for improvement, as shown in recent assembly evaluation projects such as Assemblathon 1 and GAGE. Since its first publication, SOAPdenovo has been used to assemble many large eukaryotic genomes, but reports have indicated areas that would benefit from updates, including assembly coverage and length [4, 5].
SOAPdenovo2, like SOAPdenovo, is made up of six modules that handle read error correction, de Bruijn graph (DBG) construction, contig assembly, paired-end (PE) read mapping, scaffold construction, and gap closure. The major improvements we have made in SOAPdenovo2 are: 1) enhancing the error correction algorithm, 2) reducing memory consumption in DBG construction, 3) resolving longer repeat regions in contig assembly, 4) increasing assembly length and coverage in scaffolding, and 5) improving gap closure. Our data show that SOAPdenovo2 outperforms its predecessor on the majority of the metrics benchmarked in Assemblathon 1 and GAGE. In addition, it substantially improved the original assembly of the Asian (YH) genome, which was done using SOAPdenovo.
Improvements in SOAPdenovo2
Sequencing errors are unavoidable in NGS data, and genome assembly is especially sensitive to them: even a small amount of sequencing error can substantially affect the outcome. It is therefore necessary to detect and correct these errors in reads before assembly [2, 7]. However, the error correction module in SOAPdenovo was designed for short Illumina reads (35-50 bp) and consumes an excessive amount of computational time and memory on longer reads; for example, it needs over 150 GB of memory and two days of running time using 40-fold 100 bp paired-end Illumina HiSeq 2000 reads. We therefore redeveloped the module using more efficient data indexing strategies. The new module supports memory-efficient long-k-mer error correction and uses a new space k-mer scheme to improve accuracy and sensitivity (see Additional file 1: Supplementary Method 1 and Figures S1-S3). Simulation tests show that the new version runs efficiently and corrects more reads accurately (see Additional file 1: Tables S1 and S2).
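The general idea behind k-mer-based read correction can be sketched as follows. This is a minimal illustration of a generic k-mer spectrum approach, not the SOAPdenovo2 error correction module itself: the function names, the `min_count` threshold and the single-substitution search are all assumptions made for the example, and the long-k-mer and space k-mer schemes described in the supplementary methods are not modelled.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def is_solid(read, counts, k, min_count):
    """A read is 'solid' when all of its k-mers reach the count threshold."""
    return all(counts[read[i:i + k]] >= min_count
               for i in range(len(read) - k + 1))

def correct_read(read, counts, k, min_count=2):
    """Return the first single-base substitution that makes the read solid.

    The read is returned unchanged if it is already solid or if no single
    substitution makes it solid (hypothetical, simplified policy).
    """
    if is_solid(read, counts, k, min_count):
        return read
    for pos in range(len(read)):
        for base in "ACGT":
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            if is_solid(candidate, counts, k, min_count):
                return candidate
    return read

# Five error-free copies of a read plus one copy with a G->C error.
reads = ["ACGTACGGTCA"] * 5 + ["ACGTACCGTCA"]
counts = count_kmers(reads, k=5)
fixed = correct_read("ACGTACCGTCA", counts, k=5)
print(fixed)
```

In this toy input the erroneous read produces k-mers that occur only once, so they fall below the threshold and the substitution that restores high-frequency k-mers is accepted.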
In DBG-based large-genome assembly, the graph construction step consumes the largest amount of memory. To reduce this in SOAPdenovo2, we implemented a sparse de Bruijn graph method (see Additional file 1: Supplementary Method 2), in which reads are cut into k-mers and a large number of the linear unique k-mers are combined into groups instead of being stored independently.
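To make the grouping idea concrete, the sketch below merges chains of k-mers that have exactly one predecessor and one successor into single sequences, so that a linear stretch is stored once rather than as many separate k-mers. It is only an illustration of that principle under simplifying assumptions (no reverse complements, no coverage information, isolated cycles ignored), and the function names are hypothetical; it does not reproduce the data structure used by SOAPdenovo2.

```python
def build_kmers(reads, k):
    """Collect the set of distinct k-mers found in the reads."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}

def successors(kmer, kmers):
    """k-mers in the set that overlap this k-mer's last k-1 bases."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmers]

def predecessors(kmer, kmers):
    """k-mers in the set whose last k-1 bases are this k-mer's first k-1 bases."""
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in kmers]

def compact(kmers):
    """Merge maximal non-branching chains of k-mers into single sequences."""
    def is_start(kmer):
        preds = predecessors(kmer, kmers)
        # A chain starts where the k-mer cannot be reached unambiguously.
        return len(preds) != 1 or len(successors(preds[0], kmers)) != 1

    groups, visited = [], set()
    for kmer in sorted(kmers):
        if not is_start(kmer):
            continue
        seq, cur = kmer, kmer
        visited.add(cur)
        while True:
            succ = successors(cur, kmers)
            if (len(succ) != 1 or succ[0] in visited
                    or len(predecessors(succ[0], kmers)) != 1):
                break
            cur = succ[0]
            visited.add(cur)
            seq += cur[-1]  # extend the group by one base
        groups.append(seq)
    return groups

linear = compact(build_kmers(["ACGTTGCA"], 4))
branched = compact(build_kmers(["ACGTTGCA", "ACGTTGGA"], 4))
print(linear, branched)
```

In the first call five k-mers collapse into one group; in the second, the two reads diverge after `GTTG`, so the shared prefix and the two branches are stored as three groups.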
Another important factor in the success of DBG-based assembly is k-mer size selection. A large k-mer has the advantage of resolving more repeat regions, whereas a small k-mer is advantageous for assembling regions with low coverage depth and for removing sequencing errors. To fully utilize both advantages, we introduced a multiple k-mer strategy in SOAPdenovo2 (see Additional file 1: Supplementary Method 3 and Figure S4). First, we removed sequencing errors using small k-mers for graph building; then we iteratively rebuilt the graph using larger k-mers, mapping the reads back to the previous DBG to resolve longer repeats.
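The effect of k-mer size on repeat resolution can be illustrated with a toy sequence. The sketch below counts k-mers that have more than one possible successor in the graph; the sequence, the 7 bp repeat and the function name are invented for the example, and the code shows only the repeat-resolution side of the trade-off, not the iterative graph rebuilding performed by SOAPdenovo2.

```python
def branching_nodes(seq, k):
    """Count k-mers of seq that have more than one successor k-mer."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    count = 0
    for kmer in kmers:
        out_degree = sum((kmer[1:] + b) in kmers for b in "ACGT")
        if out_degree > 1:
            count += 1
    return count

# Hypothetical sequence containing the 7 bp repeat GATTACA twice,
# each copy with different flanking bases.
genome = "TTCAG" + "GATTACA" + "CCGTA" + "GATTACA" + "AGTCC"

for k in (4, 8, 9):
    print(k, branching_nodes(genome, k))
```

Branches remain as long as the k-1 base overlap fits inside the repeat (k = 4 and k = 8 here) and disappear once the overlap is longer than the repeat (k = 9), which is why rebuilding the graph with larger k-mers resolves longer repeats.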
Scaffold construction is another area that needs improvement in NGS de novo assembly programs. In the original SOAPdenovo, scaffolds were built using PE reads, starting with short insert sizes (~200 bp) and moving iteratively to large insert sizes (~10 kbp). Although this iterative method greatly decreased the complexity of scaffolding and enabled the assembly of larger genomes, many issues remained that resulted in lower scaffold quality and shorter length. For example, 1) heterozygous contigs were improperly handled; 2) chimeric scaffolds were erroneously built with the smaller insert size PE reads, which then hindered the increase of scaffold length in later steps when PE reads with larger insert sizes were added; and 3) false relationships between contigs without sufficient PE information support were occasionally created. To address these issues in SOAPdenovo2, the main changes in the scaffolding stage were as follows: 1) we detected heterozygous contig pairs using contig depth and local contig relationships, and kept only the contig with the higher depth of each heterozygous pair in the scaffold, which reduced the influence of heterozygosity on scaffold length; 2) chimeric scaffolds that were built using a smaller insert size library were rectified using information from a larger insert size library; and 3) we developed a topology-based method to reestablish relationships between contigs that had insufficient PE information support (see Additional file 1: Supplementary Method 4 and Figures S5-S7).
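The first of these changes, keeping only the higher-depth contig of a heterozygous pair, can be sketched as below. The detection rule used here is an assumption made for illustration (two contigs at roughly half the mean depth that share the same left and right neighbours); the actual criteria are described in Supplementary Method 4, and all names, thresholds and depths are hypothetical.

```python
def heterozygous_pairs(depths, neighbours, mean_depth, tolerance=0.25):
    """Find contig pairs that look like two alleles of the same locus.

    depths:     dict contig name -> read depth
    neighbours: dict contig name -> (left neighbour, right neighbour)
    Assumed rule: both contigs have about half the mean depth and are
    linked to the same flanking contigs.
    """
    half = mean_depth / 2
    candidates = [name for name, depth in depths.items()
                  if abs(depth - half) <= tolerance * mean_depth]
    pairs = []
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            if neighbours.get(a) is not None and neighbours.get(a) == neighbours.get(b):
                pairs.append((a, b))
    return pairs

def contigs_to_keep(depths, pairs):
    """Drop the lower-depth contig of each heterozygous pair."""
    dropped = {min(pair, key=lambda name: depths[name]) for pair in pairs}
    return [name for name in depths if name not in dropped]

depths = {"c1": 40.0, "c2": 21.0, "c3": 18.0, "c4": 39.0}
neighbours = {"c2": ("c1", "c4"), "c3": ("c1", "c4")}
pairs = heterozygous_pairs(depths, neighbours, mean_depth=40.0)
kept = contigs_to_keep(depths, pairs)
print(pairs, kept)
```

Here `c2` and `c3` sit between the same two contigs at about half depth, so they are treated as a heterozygous pair and only `c2`, the one with higher depth, stays in the scaffold.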
Short reads have enabled us to reconstruct large vertebrate and plant genomes, but the assembly of repetitive sequences longer than the read length remains to be tackled. In scaffold construction, contigs with a known distance relationship but without determined bases between them were connected with wildcards. The GapCloser module was designed to replace these wildcards using the sequence context and PE read information. In SOAPdenovo2, we have improved the original SOAPdenovo GapCloser module, which assembles sequences iteratively within the gaps in order to fill large gaps. At each iterative cycle, the previous release of GapCloser considered only the reads that could be aligned in the current cycle. Because repetitive sequences are highly similar, this could lead to an incorrect base being selected at inconsistent positions where there was insufficient information to distinguish between the alternatives. For SOAPdenovo2, we developed a new method that considers all reads aligned during previous cycles as well, which allows better resolution of these conflicting bases and thus improves the accuracy of gap closure (see Additional file 1: Supplementary Method 5).
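The difference between using only the current cycle and using all cycles can be shown with a simple majority vote at one gap position. This is a hedged sketch, not the GapCloser algorithm: the voting rule, the use of `N` for an unresolved base, and the function names are assumptions made for the example.

```python
from collections import Counter

def consensus(votes):
    """Return the most frequent base, or 'N' if there is no clear winner."""
    ranked = Counter(votes).most_common(2)
    if not ranked:
        return "N"
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return "N"  # tie between the two best-supported bases
    return ranked[0][0]

def call_base(cycles, use_history):
    """Call one gap position from the bases observed in each cycle.

    cycles: list of lists, one list of aligned-read bases per cycle.
    use_history=False mimics looking at the latest cycle only;
    use_history=True pools the bases from every cycle so far.
    """
    if use_history:
        votes = [base for cycle in cycles for base in cycle]
    else:
        votes = list(cycles[-1])
    return consensus(votes)

# Cycle 1 aligned three reads supporting A; cycle 2 aligned one A and one C.
cycles = [["A", "A", "A"], ["A", "C"]]
print(call_base(cycles, use_history=False), call_base(cycles, use_history=True))
```

With the latest cycle alone the position is ambiguous, while pooling the reads from both cycles gives four votes for A against one for C.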
Testing and assessment
[Table: Evaluation of Assemblathon1 dataset assemblies. Columns: Contig path NG50; Scaffold path NG50; Number of structural errors; Substitution error rate; Copy number error rate; Genome coverage (%); Run time (h).]
[Table: Assemblies of S. aureus and R. sphaeroides. Column: N50 corrected (kb).]
[Table: Assemblies of Bombus impatiens.]
We also used SOAPdenovo2 to reassemble and update the previously assembled YH Asian genome. The previous assembly was done using SOAPdenovo; in addition, it was limited by the very short read lengths (~35 bp) that were the standard output of Illumina Genome Analyzers (GAIIx) at that time and by the insert sizes available (the maximum size was 10 kb). To provide an updated assembly with the new program, we generated a new set of 100 bp PE reads with insert sizes ranging from 180 bp to 40 kbp using the Illumina HiSeq 2000 (see Additional file 1: Table S3). These new data were put through both the SOAPdenovo and SOAPdenovo2 pipelines. To test the performance of each new feature in SOAPdenovo2, we also assembled the genome with and without the multi-k-mer and sparse DBG modules.
[Table: Summary of YH dataset assemblies. Columns: Data and program; Scaffold total length (bp); Scaffold N50 (bp); Contig total length (bp); Contig N50 (bp); Peak memory at graph construction (GB). Rows: SOAPdenovo YH old data; SOAPdenovo YH new data; v2 Sparse & Multi-k-mer; ALLPATHS-LG§ YH new data.]
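For readers unfamiliar with the N50 and NG50 statistics used in these comparisons, the following sketch computes both from a list of contig or scaffold lengths, using the standard definitions: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length, and NG50 uses half of the genome size instead. The function name and the example lengths are hypothetical and are not taken from the assemblies reported here.

```python
def n50(lengths, genome_size=None):
    """Return N50, or NG50 when a genome size is supplied.

    Sequences are added from longest to shortest until their cumulative
    length reaches half of the reference total (assembly length for N50,
    genome size for NG50). Returns 0 if that point is never reached.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

lengths = [80, 70, 50, 40, 30, 20]        # hypothetical contig lengths
print(n50(lengths))                        # N50 against the assembly total
print(n50(lengths, genome_size=400))       # NG50 against a genome size of 400
```

Because NG50 is measured against the genome size rather than the assembly size, it allows assemblies of different total length to be compared on the same scale.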
A previous report indicated that most segmental duplications (SD) were lost in the earlier published version of the YH genome. To investigate the SD coverage of the new YH genome sequence, we aligned the contigs of the first and the new versions to 134 Mb of published human SD sequences. Up to 99% of the published SD sequences were sufficiently represented (≥ 90% of each sequence) in the updated assembly, whereas only 21.5% were represented in the earlier version (see Additional file 1: Table S4). The proportion of SD sequences that appeared more than once with sufficient coverage for each copy increased from 0.02% to 52.6% in the updated version. The assembly of fragmented genes (noted in) was also improved (see Additional file 1: Table S5). For example, the average coverage of the gene GRM5 increased from 90% to 96%, and its number of fragments decreased from 162 to 4.
The work here demonstrates that SOAPdenovo2 is a substantial improvement over the initial version, specifically in areas that have been highlighted as problems in currently available short-read de novo assembly programs. It thus provides an effective solution for de novo genome assembly, especially for eukaryotic genomes. We have also provided a much higher quality version of the previously assembled YH genome, which will serve as an excellent reference genome for Chinese population studies as well as for general human genome studies. SOAPdenovo2 has been successfully deployed on public computing platforms, including the TianHe series supercomputers and Amazon EC2.
Availability and requirements
Project name: SOAPdenovo2
Project home page and forum: http://soapdenovo2.sourceforge.net/
Operating system(s): Unix, Linux, Mac
Programming language: C, C++
Other requirements: GCC version ≥ 4.4.5
License: GNU General Public License version 3.0 (GPLv3)
Any restrictions to use by non-academics: none
Availability of supporting data
The raw reads from the YH genome generated in this work are available from the BGI website, the EBI short read archive with study accession [EMBL:ERP001652], and the GigaScience database. The updated assembly is also available at GigaScience. To help readers repeat the experiments, the tools and configured packages, including commands and necessary utilities, are available from our FTP server ftp://public.genomics.org.cn/BGI/SOAPdenovo2, and are also being made available from the GigaScience database.
DBG: de Bruijn graph
We would like to thank the users of SOAPdenovo who tested the program, reported bugs, and proposed improvements to make it more powerful and user-friendly. We also thank the TianHe research and development team of the National University of Defense Technology for testing, optimizing and deploying the software on TianHe series supercomputers.
The project was supported by the State Key Development Program for Basic Research of China-973 Program (2011CB809203); National High Technology Research and Development Program of China-863 program (2012AA02A201); the National Natural Science Foundation of China (90612019); the Shenzhen Key Laboratory of Trans-omics Biotechnologies (CXB201108250096A); and the Shenzhen Municipal Government of China (JC201005260191A and CXB201108250096A). Tak-Wah Lam was partially supported by RGC General Research Fund 10612042.
- Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, Yu HO, Buffalo V, Zerbino DR, Diekhans M, Nguyen N, Ariyaratne PN, Sung WK, Ning Z, Haimel M, Simpson JT, Fonseca NA, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelley DR, Phillippy AM, Koren S: Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res. 2011, 21: 2224-2241. 10.1101/gr.126599.111.
- Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, Marçais G, Pop M, Yorke JA: GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 2012, 22: 557-567. 10.1101/gr.131383.111.
- Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010, 20: 265-272. 10.1101/gr.097261.109.
- Alkan C, Sajjadian S, Eichler EE: Limitations of next-generation genome sequence assembly. Nat Methods. 2011, 8: 61-65. 10.1038/nmeth.1527.
- Gnerre S, Maccallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe T, Hall G, Shea TP, Sykes S, Berlin AM, Aird D, Costello M, Daza R, Williams L, Nicol R, Gnirke A, Nusbaum C, Lander ES, Jaffe DB: High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci U S A. 2011, 108: 1513-1518. 10.1073/pnas.1017351108.
- Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J, Guo Y, Feng B, Li H, Lu Y, Fang X, Liang H, Du Z, Li D, Zhao Y, Hu Y, Yang Z, Zheng H, Hellmann I, Inouye M, Pool J, Yi X, Zhao J, Duan J, Zhou Y, Qin J: Genome sequence of YH: the first diploid genome sequence of a Han Chinese individual. GigaScience. 2011, [http://dx.doi.org/10.5524/100015]
- Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18: 821-829. 10.1101/gr.074492.107.
- Ye C, Ma ZS, Cannon CH, Pop M, Yu DW: Exploiting sparseness in de novo genome assembly. BMC Bioinformatics. 2012, 13 Suppl 6: S1.
- Peng Y, Leung HC, Yiu SM, Chin FY: IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012, 28: 1420-1428. 10.1093/bioinformatics/bts174.
- Dayarian A, Michael TP, Sengupta AM: SOPRA: scaffolding algorithm for paired reads via statistical optimization. BMC Bioinformatics. 2010, 11: 345. 10.1186/1471-2105-11-345.
- The Assemblathon. [http://assemblathon.org]
- Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J, Guo Y, Feng B, Li H, Lu Y, Fang X, Liang H, Du Z, Li D, Zhao Y, Hu Y, Yang Z, Zheng H, Hellmann I, Inouye M, Pool J, Yi X, Zhao J, Duan J, Zhou Y, Qin J: The diploid genome sequence of an Asian individual. Nature. 2008, 456: 60-65. 10.1038/nature07484.
- Wang J, Li Y, Luo R, Liu B, Xie Y, Li Z, Fang X, Zheng H, Qin J, Yang B, Yu C, Ni P, Li N, Guo G, Ye J, Fang L, Su Y, Asan , Zheng H, Kristiansen K, Wong GK, Nielsen R, Durbin R, Bolund L, Zhang X, Li S, Yang H, Wang J: Updated genome assembly of YH: the first diploid genome sequence of a Han Chinese individual (version 2, 07/2012). GigaScience Database. 2012, [http://dx.doi.org/10.5524/100038]
- The UCSC Genome Bioinformatics site. [http://genome.ucsc.edu/]
- She X, Jiang Z, Clark RA, Liu G, Cheng Z, Tuzun E, Church DM, Sutton G, Halpern AL, Eichler EE: Shotgun sequence assembly and recent segmental duplications within the human genome. Nature. 2004, 431: 927-930. 10.1038/nature03062.
- Yan Huang - The first Asian diploid genome. [http://yh.genomics.org.cn]
- Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y, Tang J, Wu G, Zhang H, Shi Y, Liu Y, Yu C, Wang B, Lu Y, Han C, Cheung D, Yiu SM, Liu G, Zhu X, Peng S, Li Y, Yang H, Wang J, Lam TW, Wang J: Software and supporting material for “SOAPdenovo2: an empirically improved memory-efficient short read de novo assembly”. GigaScience Database. 2012, [http://dx.doi.org/10.5524/100044]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.