LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads
© Warren et al. 2015
Received: 28 May 2015
Accepted: 29 July 2015
Published: 4 August 2015
Owing to the complexity of the assembly problem, we do not yet have complete genome sequences. The difficulty in assembling reads into finished genomes is exacerbated by sequence repeats and the inability of short reads to capture sufficient genomic information to resolve those problematic regions. In this regard, established and emerging long read technologies show great promise, but their current associated higher error rates typically require computational base correction and/or additional bioinformatics pre-processing before they can be of value.
We present LINKS, the Long Interval Nucleotide K-mer Scaffolder algorithm, a method that makes use of the sequence properties of nanopore sequence data and other error-containing sequence data, to scaffold high-quality genome assemblies, without the need for read alignment or base correction. Here, we show how the contiguity of an ABySS Escherichia coli K-12 genome assembly can be increased greater than five-fold by the use of beta-released Oxford Nanopore Technologies Ltd. long reads and how LINKS leverages long-range information in Saccharomyces cerevisiae W303 nanopore reads to yield assemblies whose resulting contiguity and correctness are on par with or better than that of competing applications. We also present the re-scaffolding of the colossal white spruce (Picea glauca) draft assembly (PG29, 20 Gbp) and demonstrate how LINKS scales to larger genomes.
This study highlights the present utility of nanopore reads for genome scaffolding in spite of their current limitations, which are expected to diminish as the nanopore sequencing technology advances. We expect LINKS to have broad utility in harnessing the potential of long reads in connecting high-quality sequences of small and large genome assembly drafts.
Keywords: Nanopore sequencing; Scaffolding; Genome assembly; Next-generation sequencing; LINKS
Long-read sequencing technology has rapidly matured over the past few years, and the benefit of long reads for genome assembly is indisputable. Recently, groups have shown that de novo assembly of error-rich long reads into complete bacterial genomes is possible [2–4]. Portable long-read sequencing technology is at our doorstep, thanks to leaps in microfluidics, electronics and nanopore technologies. Expected to be a strong contender in the kilobase-long read domain, Oxford Nanopore Technologies Ltd (ONT, Oxford, UK) offers a miniature molecule “sensor” that is currently in a limited early access beta-testing phase through the MinION™ Access Programme (MAP). At present, raw uncorrected sequence reads generated by the instrument have limited utility for de novo assembly of genomes, mostly because of their high base error and indel rates. Recently, Quick and colleagues publicly released ONT E. coli long reads as part of the MAP. Although their assessment identified some of the shortcomings of the current technology, it also highlighted its great potential, including low-cost throughput and kilobase-long reads.
As with any sequencing platform, the ONT data have a unique pattern of correct base calls, mismatches and insertions/deletions (indels). The publicly available datasets utilize the R7 and R7.3 chemistry of the vendor. We observed that under both chemistries, the statistical properties of mismatches and indels follow common profiles, which can be described by mixture models. When we fit the R7 and R7.3 chemistry datasets to these distributions, we observe that they differ in their parameters, but that the structures of the mixture models hold. This is encouraging, as it indicates that the fundamental principles of these distributions can be fixed, and that the datasets can be described by parametric statistical models. It also supports our observation that accurate base calls come in bursts – a property we use in the proposed LINKS algorithm.
The datasets supporting the results of this article are available in the GigaDB repository and the European Nucleotide Archive (ENA) under accession number ERP007108 for E. coli K-12, at the Figshare repository and ENA accession ERR668747 for S. Typhi, and at a laboratory public web space for S. cerevisiae. The A. thaliana assemblies and Pacific Biosciences (PacBio) reads are available online [9, 10, 18]. All LINKS assemblies of small genomes (≤12 Mbp) presented herein can be reproduced exactly by downloading LINKS and executing the “runall.sh” script from the package/test repository. For the A. thaliana and P. glauca LINKS assemblies, we provide bash shell scripts in the package distribution. We also provide each final LINKS assembly ≤12 Mbp in the GigaDB repository.
Here a_x ∈ (0, 1) are mixture parameters, λ_m represents the expected value of the Poisson distribution, p_x are the Bernoulli trial probabilities for the geometric distributions, and l_x and κ_x, respectively, are the scale and shape parameters of the Weibull distributions (see Additional file 1: Table S1). Our substitution profile is generated based on LAST alignments and is consistent with recent reports looking at error models of ONT reads derived from the E. coli phage M13. Although the substitution rate is much higher for E. coli K-12 2D reads, the general trend in transition probability is the same as for the M13 phage. For instance, A and T are more stable than C and G, and bases are generally more likely to be substituted by C and G during sequencing. We highlight that modeling errors will be influenced by parameterization and choice of alignment software. However, the underlying error models derived here support visual inspections of ONT-to-reference pair-wise nucleotide sequence alignments, that is, stretches of accurate bases interspersed with bursts of indels and/or mismatched bases. These correct-matching base stretches are of length 13–15 bp on average, long enough to confer specificity (Fig. 1). Pairing such k-mer stretches at given distance intervals and comprehensively exploring the paired k-mer space effectively compensates for the high error rate currently observed in these data. This pairing is the basis of our method, and the strategy is transferable to higher quality DNA reads or sequences, as we show below.
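To make the idea of pairing k-mers at a fixed distance concrete, the following is a minimal Python sketch, not the LINKS implementation (which is written in PERL). The function name `extract_kmer_pairs` and its return layout are hypothetical; the sketch assumes that the distance d is measured between the 5′ ends of the two k-mers, as stated in the method description, and that the window advances by a step t.

```python
from typing import Iterator, Tuple


def extract_kmer_pairs(read: str, k: int, d: int, t: int = 1) -> Iterator[Tuple[int, str, str]]:
    """Yield (offset, first k-mer, second k-mer) for every window position.

    Assumption: the second k-mer starts exactly d bases after the first,
    so both k-mers fit in the read only while offset + d + k <= len(read).
    """
    if k <= 0 or d <= 0 or t <= 0:
        raise ValueError("k, d and t must be positive integers")
    last_start = len(read) - d - k  # largest offset at which the pair still fits
    for offset in range(0, last_start + 1, t):
        first = read[offset:offset + k]
        second = read[offset + d:offset + d + k]
        yield offset, first, second


if __name__ == "__main__":
    # Toy read, far shorter than a real nanopore read, for illustration only.
    for offset, first, second in extract_kmer_pairs("ACGTACGTAC", k=3, d=4, t=2):
        print(offset, first, second)
```

Because each pair only requires two short error-free stretches rather than an error-free read, many pairs can survive even when the bases between them contain indels or mismatches.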
Table 1. QUAST analysis of a baseline E. coli assembly and re-scaffolded assemblies using Oxford Nanopore 2D (R7 chem.) or raw (R7.3 chem.) reads. Stats based on sequences ≥ 500 bp.
Columns (assembly; base assembly + reads used for re-scaffolding):
A. ABySS contigs
B. ABySS scaffolds
C. LINKS k15d4000; A + ONT Full 2D R7
D. LINKS k15d4000; B + ONT Full 2D R7
E. SSPACE-LR g200; B + ONT Full 2D R7
F. LINKS x30 k15d500–16kbp; B + ONT Full 2D R7
G. LINKS x30 k15d500–16kbp; B + ONT All 2D R7
H. LINKS x30 k15d500–16kbp; B + raw ONT R7.3
Rows: Read input fold coverage; NG50 length (bp); Genes + parts; Number of N’s per 100 kbp
Genes + parts (A to H): 4,442 + 63; 4,443 + 62; 4,443 + 62; 4,443 + 62; 4,448 + 57; 4,443 + 62; 4,440 + 62; 4,443 + 62
Table 2. QUAST analysis of baseline and re-scaffolded E. coli K-12, S. Typhi H58, S. cerevisiae S288c and W303 assemblies using Oxford Nanopore Technologies reads.
Columns: Data/Chemistry/Fold coverage; Number of contigs (≥ 500 bp); NG50 length (bp); NA50 length (bp); Number of genes; Number of N’s per 100 kbp; Number of mis-assemblies; Mis-assembly types (relocations/translocations/inversions)
Row groups: E. coli K-12; S. Typhi H58; S. cerevisiae S288c; S. cerevisiae W303
13x of >10 kb reads
Similarly to E. coli, we ran LINKS iteratively to re-scaffold a baseline Illumina assembly of Salmonella enterica serovar Typhi (S. Typhi) haplotype H58 with 2D ONT reads, and compared the results to those provided by the authors. The study authors report a marked improvement in assembly contiguity when assembling Illumina and ONT reads concurrently (34 contigs, 319 kbp N50 length) compared to a baseline assembly of Illumina-only data (86 contigs, 154 kbp N50 length), consistent with our assessment of their data. The final LINKS assembly on this dataset took 21 min and used 8.2 GB RAM to yield 22 contigs with an N50 length of 652 kbp, approximately double the contiguity previously reported. Testing LINKS on the larger S. cerevisiae W303 ONT dataset, we obtained an assembly that is comparable in contiguity to the Celera Assembly of Illumina-corrected ONT reads (Nanocorr), but with 40% fewer errors than the Pilon-polished Celera assembly (Fig. 3, Table 2). It is worth noting, however, that LINKS is a scaffolder and, as such, merged contigs are ordered and oriented within scaffolds, separated by gaps/overlaps. Its resulting W303 assembly, much like that of other scaffolders, comprises over 3700 scaffolds, versus only 95 and 121 for the resulting Celera Assembly (CA) assembly of Nanocorr and Nanopore Synthetic-long (NaS) reads, respectively. When scaffolding the W303 baseline Illumina assemblies with Nanocorr-corrected ONT reads, we notice that the error count of the resulting LINKS assembly is marginally lower than that of the polished Celera assembly, albeit higher (2.6-fold) than that of a LINKS assembly re-scaffolded with uncorrected reads, which indicates that Nanocorr-corrected reads may introduce errors that are propagated during assembly and/or scaffolding of the yeast data (Fig. 3).
The quality of the resulting LINKS assembly depends on a few factors, including the quality of the input assembly and the stringency of the imposed linkage, which was fairly relaxed in this study (e.g., minimum 5 links). We assessed misassembly types, using the breadth of the ONT data presented and observed that LINKS outperforms all methods on the larger yeast data, including Nanocorr. Even though it introduces more relocation errors in S. Typhi (errors caused by gap/overlap size estimates over/under 1kbp), compared to SPAdes (St. Petersburg genome assembler), it never introduces inversions (Table 2).
While LINKS uses large amounts of memory with increased target genome sizes, this can be mitigated by the Bloom filter implementation (LINKS v1.5), which decreases the RAM usage 3-fold compared to earlier versions. With all versions of LINKS, a smaller memory footprint is achieved by increasing the sliding window step (−t) and augmenting the distance between k-mer pairs (−d), which in turn decreases the k-mer pair space. Because LINKS is a scaffolder, it may be used downstream of other assembly methodologies, as exemplified on the S. cerevisiae W303 data, where two additional merges from the polished CA + Nanocorr assembly were made using raw W303 ONT reads (Table 2).
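The effect of the sliding window step (−t) and the k-mer pair distance (−d) on the size of the k-mer pair space can be illustrated with simple arithmetic. The sketch below is a back-of-the-envelope bound, not a measurement of LINKS memory usage: it assumes one candidate pair per window position and that both k-mers must fit within the read, and it ignores Bloom filter screening and deduplication. The function name `candidate_pairs_per_read` is hypothetical.

```python
def candidate_pairs_per_read(read_length: int, k: int, d: int, t: int) -> int:
    """Upper bound on k-mer pairs drawn from a single read.

    Assumption: a pair at offset i uses read[i:i+k] and read[i+d:i+d+k],
    with offsets 0, t, 2t, ... while i + d + k <= read_length.
    """
    if k <= 0 or d <= 0 or t <= 0:
        raise ValueError("k, d and t must be positive integers")
    span = read_length - d - k
    if span < 0:
        return 0  # read too short to hold both k-mers at distance d
    return span // t + 1


if __name__ == "__main__":
    # Hypothetical 10 kbp read with k = 15.
    for d, t in [(4000, 1), (4000, 5), (8000, 1)]:
        print(f"d={d} t={t}: {candidate_pairs_per_read(10_000, 15, d, t)} pairs")
```

Under these assumptions, a larger step divides the number of candidate pairs roughly by t, while a larger distance shortens the part of each read from which pairs can be drawn.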
We demonstrate the broad applicability of LINKS by re-scaffolding a high-quality draft of the 120-Mbp A. thaliana Ler-1 genome with either raw or corrected [10, 18] long sequence reads from PacBio. We find that k = 21 worked best with these data, and that lower k values (k = 13 and 15 explored) did not merge any scaffolds due to increased conflicts in contig pairs (not shown). Iterative scaffolding (4 iterations) using a low sliding window step for 3 out of the 4 iterations (−t 5) completed in 1h52m and required 84 GB RAM using the 118-fold raw PacBio data set, and 3h05m and 151 GB RAM with the lower depth (28-fold) ECTools-corrected PacBio reads. The increased resource requirements are not surprising given that error correction yields 135.9 M k-mer pairs from 288,217 reads, which, on a per-read basis, is more than 3 orders of magnitude higher than the 1.33 M pairs extracted from the 3.45 M raw PacBio reads (not shown). We find that the resulting LINKS assemblies are very contiguous, especially when the PacBio reads are corrected (NG50 > 2.5 Mbp). This highlights 1) the utility of LINKS for retrospective scaffolding of draft genomes with new long read sequencing data and 2) that LINKS scaffolding can be complementary to read correction methodologies (Additional file 1: Figure S7). When compared to other assemblies of PacBio-only data, we find that the final LINKS assemblies of high-quality Illumina draft assemblies tend to harbor fewer errors, as also demonstrated on the yeast data (Fig. 3, Table 2, Additional file 1: Figure S8 and Table S2). Four LINKS iterations performed on a baseline Illumina assembly with raw PacBio reads representing 118-fold coverage increased the NG50 length over 8-fold, from 59 to 492 kbp. The use of ECTools-corrected PacBio reads further increased the contiguity, as measured by the NG50 length (765.4 kbp), but also yielded 284 additional misassemblies compared to the 4th and final LINKS iteration that used raw PacBio reads.
The Illumina Allpaths-LG assembly [9, 29] was already very contiguous at 310.7 kbp NG50 length, but re-scaffolding with the same raw and ECTools-corrected PacBio data increased the NG50 length to 1.45 and 2.65 Mbp, respectively. This exceeds the contiguity of the ECTools and PacBioToCA assemblies (NG50 = 487.2 and 370.7 kbp, in this order), but is still about three times lower than that of the Hierarchical Genome Assembly Process (HGAP) assembly (NG50 = 8.429 Mbp). Evidently, because LINKS re-scaffolded assemblies are derived from fragmented Illumina draft assemblies, they contain ambiguous bases (Ns), unlike their PacBio-only counterparts. However, the NGA50 length, a contiguity metric that is normalized on genome size and accounts for assembly error, is similar (87.5 vs. 78.0 kbp) between the highly contiguous HGAP PacBio-only assembly and the Allpaths-LG Illumina assembly re-scaffolded with LINKS using ECTools-corrected reads. This suggests that LINKS offers a good compromise between contiguity and errors, in a lightweight and easy to use software package (Additional file 1: Figure S8 and Table S2).
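For readers less familiar with the contiguity metrics quoted above, the following Python sketch shows how N50 and NG50 are commonly computed from a list of sequence lengths. It is an illustration only and not the QUAST code used in this study; the function names are hypothetical and the toy lengths are invented. NGA50 follows the same procedure, but is computed on aligned blocks after breaking sequences at misassemblies, which requires a reference alignment and is not shown.

```python
from typing import Iterable, Optional


def ng50(lengths: Iterable[int], genome_size: int) -> Optional[int]:
    """Length L such that sequences of length >= L cover half the genome size.

    Returns None when the assembly totals less than half of genome_size.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if 2 * covered >= genome_size:
            return length
    return None


def n50(lengths: Iterable[int]) -> Optional[int]:
    """Same statistic, normalized on the assembly total instead of the genome size."""
    lengths = list(lengths)
    if not lengths:
        return None
    return ng50(lengths, sum(lengths))


if __name__ == "__main__":
    toy_lengths = [100, 80, 60, 40, 20]  # invented values, in bp
    print("N50 :", n50(toy_lengths))
    print("NG50:", ng50(toy_lengths, genome_size=400))
```

Normalizing on the genome size rather than the assembly total makes values comparable between assemblies of the same genome that differ in total length.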
In recent months, there have been advances in correcting ONT reads [4, 8, 26], which make the resulting corrected long reads suitable for assembly with established overlap-layout-consensus assembly software. It is important to note that the Nanocorr and Nanocorrect/Nanopolish ONT long read correction methods are not assembly methodologies per se, but base error correction utilities; as such, the error-corrected reads they produce can be readily used by LINKS to contiguate pre-existing genome assemblies. Likewise, LINKS is a genome scaffolder, not a sequence assembler, and does not attempt to correct assembly bases or fill Ns that result from its merges. Like other scaffolding algorithms before it, it orders and orients contigs into larger scaffolds that could be used to characterize genomic loci of interest. The novelty of the algorithm lies in its scalability and its use of paired k-mers from varied long sequence sources (Oxford Nanopore Technologies, Pacific Biosciences, draft sequences), without the need to correct read bases first.
As larger genomes are sequenced with ONT and PacBio, larger k values will be needed to disambiguate linkages that would otherwise likely happen by chance at the low value of k (k = 15) used herein. Using larger k-mers may not be possible with the current R7 and R7.3 chemistries of ONT, given the error models we derived and present here. Nevertheless, rapid improvements in chemistries, base calling and error-correction algorithms already indicate that this is unlikely to be a problem for the broad applicability of LINKS to larger genomes, using diverse long-read sources. This is exemplified here by our use of raw and error-corrected PacBio long reads (k = 21) to re-scaffold the 120-Mbp A. thaliana genome, and by the use of a genotype assembly draft of white spruce (k = 26) to re-scaffold the 20-Gbp P. glauca genome.
LINKS is a scalable, alignment-free scaffolder, which extracts spaced k-mers from reads as its pairing information source to order and orient sequence contigs into scaffolds. It takes input reads from a variety of sources, including ONT and PacBio sequences, but as demonstrated, it can also work with other long sequences to contiguate assemblies. It offers a general framework that could apply to scaffolding very large genomes, such as that of white spruce, using another assembly draft or reference in lieu of long reads. This study also highlights the present utility of ONT reads for genome scaffolding in spite of their current limitations, which are expected to diminish as nanopore sequencing technology advances. LINKS is available for public use.
Sequence data, assembly, and scaffolding
E. coli K-12 substrain MG1655 Illumina MiSeq v3 TruSeq Nano read data (paired end 301 bp, fragment length 550 bp) were downloaded from BaseSpace® and randomly sub-sampled to ~250-fold coverage. Overlapping read pairs were merged with ABySS-mergepairs (−q 15) and the resulting ca. 550 bp pseudoreads were assembled with ABySS v1.5.2 (k = 480 l = 40 s = 1000), yielding 67 and 61 contigs and scaffolds ≥ 500 bp, respectively. Contigs and scaffolds (Table 1A and B, Additional file 1: Figure S2) were scaffolded with LINKS v1.5 (k = 15, d = 4000, default parameters) using the E. coli K-12 substr. MG1655 R7 Full 2D ONT data from Quick and colleagues (R7 chemistry ONI/NONI ENA:ERX708228), and results are shown in Table 1C, Additional file 1: Figure S3 and Table 1D, Additional file 1: Figure S4, in that order. SSPACE-LongRead v1.1 (abbreviated SSPACE-LR; options: g = 200, with default parameters) was run on the Table 1B assembly (Table 1E). ABySS scaffolds were also re-scaffolded iteratively with LINKS (v1.5, k = 15, d = 500 to 16000, 30 iterations) using the Full 2D ONT reads (Table 1F) and, in separate experiments, all available 2D reads (Table 1G, Additional file 1: Figure S5) and all available R7.3 chemistry raw uncorrected FASTA reads derived from poretools conversion (ENA:ERX593921; Table 1H, Additional file 1: Figure S6).
A baseline S. Typhi H58 Illumina assembly (Genbank:GCA_000944835.1) was re-scaffolded with LINKS (v1.5, k = 15, d = 500 to 4000, t = 1, a = 0.1, 11 iterations) using 2D ONT reads (ENA:ERR668747). A baseline S. cerevisiae W303 Illumina MiSeq assembly [8, 17] and S. cerevisiae S288c were respectively re-scaffolded with SSPACE-LongRead (g = 200), A Hybrid Assembler (AHA) and LINKS (v1.5, k = 15, d = 2–15 kbp, 27 or 29 iterations) using 262,463 raw ONT reads (Fig. 3).
The baseline A. thaliana Ler-1 Allpaths-LG assembly was re-scaffolded with LINKS (v1.5, t = 20|5|5|5, k = 21, d = 5–20 kbp, 4 iterations) using 19 SMRTcells of corrected (ECTools) or 93 SMRTcells of raw PacBio reads totaling 14.2 GB of data and providing 118-fold coverage of the genome, 38-fold from reads 10 kbp or larger (Additional file 1: Table S2). PacBio assemblies used for comparison were downloaded and assessed with QUAST using the reference A. thaliana TAIR10 genome (Genbank:GCA_000001735.1). The 20-Gbp white spruce [11, 12] V3 assembly (Genbank:ALWZ030000000, 4.2 M scaffolds) was re-scaffolded with LINKS 14 times (v1.1, k = 26, t = 200–50, d = 5–100 kbp) using the draft white spruce WS77111 V1 genotype assembly (Genbank:JZKD010000000, 4.1 M sequences) (Fig. 4). The white spruce MPET libraries used for validation have been described previously and are available from the dnanexus repository. Validation of merges by automated gap closure was done with the scalable gap-filling software Sealer, using the same parameters as described in a recent publication, and was performed on the final (14th) re-scaffolded LINKS assembly (Additional file 2). All benchmarking was done on a computer with Intel(R) Xeon(R) CPU E5-2699 v3 at 2.30GHz, 72 CPUs with 264 GB RAM.
FASTA sequences to scaffold are supplied as input (−f), and are shredded to k-mers on both strands, populating a Bloom filter whose number of elements corresponds to a rough approximation of the number of k-mers in the draft genome, based on file size. The size of the filter can be adjusted by controlling its false positive rate (−p). Building a Bloom filter is optional (−x), but strongly recommended, as it decreased memory usage and run time when tested on smaller genomes (<20 Mb). For large genomes (≥1 Gb), we recommend pre-building the Bloom filter with the supplied utility (./tools/writeBloom.pl in the distribution). ONT reads are supplied as input (−s option, file-of-filenames listing FASTA/FASTQ formatted files) and k-mer pairs are extracted using a user-defined k-mer length (−k) and distance between the 5’-ends of the two k-mers in each pair (−d) over a sliding window (−t). When both k-mers are found in the Bloom filter, unique k-mer pairs at the set distance are hashed, tracking the contig or scaffold of origin, k-mer positions and frequencies of observation. LINKS has two main stages: contig pairing and scaffold layout. Cycling through k-mer pairs, k-mers that are uniquely placed on contigs are identified. Putative contig pairs are formed if the two k-mers of a pair are on different contigs. Contig pairs are only considered if the calculated distances between them satisfy the mean distance provided (−d), while allowing for a deviation (−e). Contig pairs having a valid gap or overlap are allowed to proceed to the scaffolding stage. Contigs in pairs may be ambiguous: a given contig may link to multiple contigs. To mitigate this, the number of spanning k-mer pairs (links) between any given contig pair is recorded, along with a mean distance estimate. Once pairing between contigs is complete, the scaffolds are built using contigs in turn until all have been incorporated into a scaffold.
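The role of the Bloom filter in screening read k-mers against the draft assembly can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the filter shipped with LINKS: the class `BloomFilter`, the helper `load_contig_kmers`, the SHA-256 based double hashing and the sizing from an expected element count and false positive rate (the standard formulas m = −n ln p / (ln 2)² and h = (m/n) ln 2) are all choices made for this example.

```python
import hashlib
import math
from typing import Iterator


class BloomFilter:
    """Fixed-size bit array with h hash positions per item (no deletions)."""

    def __init__(self, expected_items: int, fp_rate: float) -> None:
        if expected_items <= 0 or not 0.0 < fp_rate < 1.0:
            raise ValueError("need expected_items > 0 and 0 < fp_rate < 1")
        # Standard sizing for a target false positive rate.
        self.num_bits = math.ceil(-expected_items * math.log(fp_rate) / math.log(2) ** 2)
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Double hashing: derive all positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(_COMPLEMENT)[::-1]


def load_contig_kmers(bloom: BloomFilter, contig: str, k: int) -> None:
    """Insert every k-mer of a contig, on both strands."""
    contig = contig.upper()
    for i in range(len(contig) - k + 1):
        kmer = contig[i:i + k]
        bloom.add(kmer)
        bloom.add(reverse_complement(kmer))


if __name__ == "__main__":
    bloom = BloomFilter(expected_items=1000, fp_rate=0.001)
    load_contig_kmers(bloom, "ACGTTGCAAGGCT", k=5)
    print("ACGTT" in bloom, reverse_complement("ACGTT") in bloom)
```

A Bloom filter never reports a stored k-mer as absent, so a read k-mer pair is discarded only when at least one of its k-mers is certainly not in the assembly; the occasional false positive merely lets a pair through to the exact lookup that follows.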
Scaffolding is controlled by merging sequences only when a minimum number of links (−l) join a contig pair, and when those links are dominant compared to those of another possible pairing (−a). The predecessor of LINKS is the unpublished scaffolding engine in the widely used SSAKE assembler, and the foundation of the SSPACE-LongRead scaffolder. A summary of the scaffold layout is provided (.scaffold) as a text file, and captures the linking information of successful scaffolds. A FASTA file (.scaffold.fa) is generated using that information, placing N-pads to represent the estimated lengths of gaps, and a single “n” in cases of overlaps between contigs. A log summary of k-mer pairing in the assembly is provided (.log), along with a text file describing possible issues in pairing (.pairing_issues), the pairing distribution (.pairing_distribution.csv) and a compressed Bloom filter (.bloom). The Bloom filter is intended to be re-used (supplied via -r) for iterative LINKS runs.
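The link-based control of merges described above can be sketched in a few lines of Python. This is a simplified illustration, not the LINKS scaffolding engine: contig orientation and the distance check (−d, −e) are left out, the names `tally_links` and `choose_successor` are hypothetical, and the dominance test is assumed here to be a maximum ratio between the link counts of the second-best and best candidate pairings.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Optional, Tuple

# (upstream contig, downstream contig, estimated gap in bp; negative = overlap)
Link = Tuple[str, str, int]


def tally_links(links: Iterable[Link]) -> Dict[str, Dict[str, List[int]]]:
    """Group gap estimates by contig pair, one entry per spanning k-mer pair."""
    table: Dict[str, Dict[str, List[int]]] = defaultdict(lambda: defaultdict(list))
    for upstream, downstream, gap in links:
        table[upstream][downstream].append(gap)
    return table


def choose_successor(
    candidates: Dict[str, List[int]], min_links: int, max_ratio: float
) -> Optional[Tuple[str, int, float]]:
    """Return (contig, link count, mean gap) for an unambiguous best partner."""
    ranked = sorted(candidates.items(), key=lambda item: len(item[1]), reverse=True)
    if not ranked:
        return None
    best, gaps = ranked[0]
    if len(gaps) < min_links:
        return None  # too few links to support a merge
    if len(ranked) > 1 and len(ranked[1][1]) / len(gaps) > max_ratio:
        return None  # runner-up is too close: the pairing is ambiguous
    return best, len(gaps), mean(gaps)


if __name__ == "__main__":
    # Invented observations: c1 links mostly to c2, weakly to c3.
    observed = [("c1", "c2", 100)] * 6 + [("c1", "c3", 50)] + [("c2", "c4", -10)] * 3
    table = tally_links(observed)
    print(choose_successor(table["c1"], min_links=5, max_ratio=0.3))
    print(choose_successor(table["c2"], min_links=5, max_ratio=0.3))
```

Raising the minimum link count or lowering the ratio makes merges more conservative, trading contiguity for fewer misjoins, which is the stringency trade-off noted in the results.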
2D ONT reads from a single run (ERX708228) were aligned to the reference genome using LAST (v581, options: −a 1 -r1 -b1), consistent with other reports [6, 21]. Only the best alignment of each query sequence was chosen, and alignments were clipped from both ends to the start of the first match and the end of the last match positions. Each clipped alignment is composed of match, mismatch, insertion and deletion fragments. The lengths of these fragments were tallied, and mismatch fragment lengths were stored as zero-indexed values, while the indels were stored as one-indexed values to model interarrival times of “failures”. The model fitting was performed using R. All proposed mixture model fits were tested using Kolmogorov–Smirnov tests with a p-value threshold of 0.05.
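The clipping and fragment tallying steps can be illustrated with a short Python sketch. It assumes a gapped pairwise alignment is already available as two equal-length strings with "-" marking gaps (a hypothetical input layout; parsing LAST output is not shown), and it reports plain run lengths without the zero- or one-indexed storage or the model fitting, which was done in R.

```python
from itertools import groupby
from typing import List, Tuple


def classify(ref_base: str, read_base: str) -> str:
    """Label one alignment column relative to the reference."""
    if ref_base == "-":
        return "insertion"  # base present in the read only
    if read_base == "-":
        return "deletion"   # base present in the reference only
    return "match" if ref_base == read_base else "mismatch"


def fragment_lengths(ref_aln: str, read_aln: str) -> List[Tuple[str, int]]:
    """Clip to the first and last match, then return runs as (state, length)."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have equal length")
    states = [classify(r, q) for r, q in zip(ref_aln.upper(), read_aln.upper())]
    match_columns = [i for i, state in enumerate(states) if state == "match"]
    if not match_columns:
        return []
    clipped = states[match_columns[0]:match_columns[-1] + 1]
    return [(state, sum(1 for _ in run)) for state, run in groupby(clipped)]


if __name__ == "__main__":
    # Invented toy alignment; real alignments span kilobases.
    reference = "GACGT-ACGGTAC"
    read = "TACGTTAC-CTAG"
    for state, length in fragment_lengths(reference, read):
        print(state, length)
```

Pooling the per-state run lengths over all best alignments gives the empirical distributions to which the mixture models are fitted.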
Availability and requirements
Project name: Long Interval Nucleotide K-mer Scaffolder
Operating system: Unix, Mac OS X
Programming language: PERL
Other requirements: Unix
License: GNU General Public License - GPL.
Availability of supporting data
Abbreviations
ENA: European Nucleotide Archive
LINKS: Long Interval Nucleotide K-mer Scaffolder
MAP: MinION™ Access Programme
NaS: Nanopore Synthetic-long reads
ONT: Oxford Nanopore Technologies
RAM: Random Access Memory
This work is partly funded by Genome Canada (171CGB), British Columbia Cancer Foundation, and Genome British Columbia. Research reported in this publication was also partly supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number R01HG007182. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or other funding organizations. We thank Henri van de Geest for sharing his insights on the re-scaffolding of the A. thaliana genome with PacBio long reads.
1. Koren S, Phillippy AM. One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Curr Opin Microbiol. 2014;23C:110–20.
2. Chin CS, Alexander DH, Marks P, Klammer AA, Drake J, Heiner C, et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods. 2013;10:563–69.
3. Berlin K, Koren S, Chin C-S, Drake J, Landolin JM, Phillippy AM. Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing. Nat Biotechnol. 2015;33:623–30.
4. Madoui MA, Engelen S, Cruaud C, Belser C, Bertrand L, Alberti A, et al. Genome assembly using Nanopore-guided long and error-free DNA reads. BMC Genomics. 2015;16:327.
5. Clarke J, Wu HC, Jayasinghe L, Patel A, Reid S, Bayley H. Continuous base identification for single-molecule nanopore DNA sequencing. Nat Nanotechnol. 2009;4:265–70.
6. Quick J, Quinlan AR, Loman NJ. A reference bacterial genome dataset generated on the MinION™ portable single-molecule nanopore sequencer. Gigascience. 2014;3:22.
7. Ashton PM, Nair S, Dallman T, Rubino S, Rabsch W, Mwaigwisya S, et al. MinION nanopore sequencing identifies the position and structure of a bacterial antibiotic resistance island. Nat Biotechnol. 2015;33:296–300.
8. Goodwin S, Gurtowski J, Ethe-Sayers S, Deshpande P, Schatz MC, McCombie WR. Oxford Nanopore Sequencing and de novo Assembly of a Eukaryotic Genome. bioRxiv. 2015. doi:10.1101/013490.
9. Data release of ALLPATHS-LG de novo assembly for A. thaliana Ler-1. http://1001genomes.org/data/MPI/MPISchneeberger2011/releases/current/Ler-1/Assemblies/Allpaths_LG/
10. Lee H, Gurtowski J, Yoo S, Marcus S, McCombie WR, Schatz M. Error correction and assembly complexity of single molecule sequencing reads. bioRxiv. 2014. doi:10.1101/006395.
11. Birol I, Raymond A, Jackman SD, Pleasance S, Coope R, Taylor GA, et al. Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data. Bioinformatics. 2013;29:1492–7.
12. Warren RL, Keeling C, Yuen M, Raymond A, Taylor G, Vandervalk BP, et al. Improved white spruce (Picea glauca) genome assemblies and annotation of large gene families of conifer terpenoid and phenolic defense metabolism. The Plant Journal. 2015;83:189–212.
13. Bacterial whole-genome read data from the Oxford Nanopore Technologies MinION™ nanopore sequencer. http://gigadb.org/dataset/100102.
14. Bacterial whole-genome read data from the Oxford Nanopore Technologies MinION™ nanopore sequencer at the European Nucleotide Archive. http://www.ebi.ac.uk/ena/data/view/ERP007108.
15. Oxford nanopore and Illumina read data and assemblies for Salmonella Typhi. http://figshare.com/articles/Salmonella_Typhi_H58_MinION_and_Illumina_data/1170110.
16. Salmonella Typhi whole-genome read data from the Oxford Nanopore Technologies MinION™ nanopore sequencer at the European Nucleotide Archive. http://www.ebi.ac.uk/ena/data/view/ERR668747.
17. Oxford Nanopore Sequencing, Hybrid Error Correction, and de novo Assembly data resource for S. cerevisiae. http://schatzlab.cshl.edu/data/nanocorr.
18. PacBio and Illumina data resource for the A. thaliana genome. http://schatzlab.cshl.edu/data/ectools.
19. Warren RL, Yang C, Vandervalk BP, Behsaz B, Lagman A, Jones SJM, et al. Software and supporting material for “LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads”. GigaScience Database. 2015. http://dx.doi.org/10.5524/100159.
20. Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. Adaptive seeds tame genomic sequence comparison. Genome Res. 2011;21:487–93.
21. Jain M, Fiddes IT, Miga KH, Olsen HE, Paten B, Akeson M. Improved data analysis for the MinION nanopore sequencer. Nat Methods. 2015;12:351–6.
22. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19:1117–23.
23. Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, et al. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res. 2011;21:2224–41.
24. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29:1072–5.
25. Boetzer M, Pirovano W. SSPACE-LongRead: scaffolding bacterial draft genomes using long read sequence information. BMC Bioinformatics. 2014;15:211.
26. Loman NJ, Quick J, Simpson JT. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat Methods. 2015. doi:10.1038/nmeth.3444.
27. Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, et al. A whole-genome assembly of Drosophila. Science. 2000;287:2196–204.
28. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 2014;9:e112963.
29. Gnerre S, MacCallum I, Przybylski D, Ribeiro F, Burton J, Walker B, et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci USA. 2011;108:1513–8.
30. LINKS software release pages. http://www.bcgsc.ca/bioinfo/software/links.
31. Loman NJ, Quinlan AR. Poretools: a toolkit for analyzing nanopore sequence data. Bioinformatics. 2014;30:3399–401.
32. Rasko DA, Webster DR, Sahl JW, Bashir A, Boisen N, Scheutz F, et al. Origins of the E. coli strain causing an outbreak of hemolytic-uremic syndrome in Germany. N Engl J Med. 2011;365:709–17.
33. Sequence read data for Picea glauca PG29 at the Sequence Read Archive. http://sra.dnanexus.com/studies/SRP014489
34. Paulino D, Warren RL, Vandervalk BP, Raymond A, Jackman SD, Birol I. Sealer: a scalable gap-closing application for finishing draft genomes. BMC Bioinformatics. 2015;16:230.
35. Bloom BH. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM. 1970;13:422–6.
36. Warren RL, Sutton GG, Jones SJ, Holt RA. Assembling millions of short DNA sequences using SSAKE. Bioinformatics. 2007;23:500–1.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.