Using a mixture of experimental data and statistical approaches, we have evaluated a large number of de novo genome assemblies for three vertebrate species. It is worth highlighting the fact that the true genome sequences for these species remain unknown and that this study is focused on a comparative analysis of the various assemblies.
Interspecific vs. intraspecific variation in assembly quality
Overall, we observed that bird assemblies tended to have much longer contigs (Additional file 2: Figure S1), longer scaffolds (Figures 1, 2 and 3), and had more assemblies that comprised 100% (or more) of the estimated genome size (1.2 Gbp for bird) than the other two species. This is potentially a reflection of the much higher coverage of the bird genome than the other two species (Table 2). Bird assemblies also performed better than fish and snake when assessed by the optical map data (Figures 13, 14 and 15). The optical map data also suggested that fish assemblies were notably poorer than the other two species. These widely varying results suggest that differing properties of the three genomes pose different challenges for assemblers, but it may also reflect differences in the qualities of the three optical maps.
Several other metrics suggest that it is the snake genome that, on average, had the highest scoring assemblies of any of the species. For example, the average number of CEGs detected in the competitive assemblies of bird, fish, and snake was 383, 418 and 424 respectively (Additional file 4). The fish assemblies tended to be the lowest quality of the three species with the REAPR analysis suggesting that only 68% of the bases across all competitive fish assemblies are error free compared to 73% and 75% of the bird and snake assemblies (Additional file 4).
Genome size alone does not seem to be a factor in the relative increases in quality of the snake — and to a lesser extent, bird — assemblies relative to that of fish; the snake (boa constrictor) is estimated to have the largest genome of the three species (1.6 Gbp vs 1.2 Gbp for bird and 1.0 Gbp for fish). The most likely factors that account for these interspecific differences would be the differing levels of heterozygosity and/or repeat content in the three genomes. In agreement with this hypothesis, Lake Malawi cichlids are known to be highly genetically diverse, resulting from extensive hybridization that leads to high levels of heterozygosity [52, 53]. The REAPR analysis also suggests that repeats in the fish genome might pose more of an issue for assemblers than the bird and snake genomes. The extent to which repeat content and heterozygosity made it harder to assemble the bird and snake genomes is less clear. The RepeatMasker analysis of the Fosmid sequences revealed there to be more repeats in snake than bird (13.2% of total Fosmid sequence length vs 8.6%). Observed heterozygosity for boa constrictors has been shown to be in the range of 0.36–0.42  whereas a much wider range (0.1–0.8) has been described for budgerigars (Zhang and Jarvis, personal communication).
Aside from possible differences in the genomes of the three species, it should also be noted that there are interspecific differences with regards to the sequencing data that was available for each species. For example, all of the short-insert Illumina libraries in fish had overlapping paired reads, whereas in snake they were all non-overlapping (see Additional file 1). These differences mean that assemblers that were well-suited for working with the fish data may not be as suited for working with the snake data, and vice versa. The bird genome was different to the other species in having many more libraries available with many different insert sizes (see Additional file 1) and also in having sequencing data available from three different platforms (discussed below, see ‘The effects of combining different data types’).
Bird assembly overview
For the budgerigar genome, we found the BCM-HGSC assembly to be the highest ranked competitive assembly when using the sum z-score approach (Figure 17). However, if evaluation assemblies are included then the BCM-HGSC evaluation entry (BCM*) produces a notably higher sum z-score; the reasons for the differences between these two BCM-HGSC assemblies are discussed below (see ‘The effects of combining different data types’). The two evaluation assemblies from the SOAPdenovo team (SOAP* and SOAP**) also rank higher than the competitive SOAP entry.
Among the competitive assemblies, the Allpaths and Newbler entries rank closely behind the assembly from the BCM-HGSC team in terms of overall z-score. While the competitive BCM-HGSC entry performs well across many of the key metrics, it would not be ranked 1st overall if any one of three different key metrics (CEGMA, scaffold NG50, and contig NG50) had been excluded from the calculation of the z-score. The overall heterogeneity of the bird assemblies is further underlined by the fact that 6 of the 14 entries (including evaluation assemblies) would rank 1st (or joint 1st) when ranked separately by each of the ten key metrics.
Ordering the assemblies by their average rank rather than by z-score produces a slightly different result (Additional file 2: Figure S9). Assemblies from BCM-HGSC and Allpaths switch 1st and 2nd places, but both are still placed behind the BCM* assembly. The CBCB entry ranks higher using this method, moving to 3rd place among competitive entries.
It should be noted that the top three-ranked competitive bird assemblies each used a very different combination of sequencing data: BCM-HGSC used Illumina + Roche 454 + PacBio, Allpaths only used Illumina, and Newbler only used Roche 454. Therefore, the similarity in overall assembly rankings should be weighed against the different costs that each strategy would require.
Fish assembly overview
The lack of Fosmid sequence data for the Lake Malawi cichlid removed three of the key metrics that were used for the other two species. Overall, the fish assemblies could broadly be divided into three groups with the first group consisting of the most highly ranked assemblies generated by the teams BCM-HGSC, CSHL, Symbiose and Allpaths (Figure 18). The BCM-HGSC assembly scored highly in most key metrics, and excelled in the measure of scaffold N50 length. This is the only key metric which, if excluded, would remove the BCM-HGSC assembly from 1st place when using a z-score ranking system. An ordering of assemblies based on their average rank provides only minor differences to that of the z-score ranking (Additional file 2: Figure S10).
The CSHL team submitted two additional evaluation assemblies for fish. Their competitive assembly ranked 2nd overall, and was produced by the Metassembler tool  which combined the results of two separate assemblies (CSHL* and CSHL**, that were made using the Allpaths and SOAPdenovo assemblers respectively). Both of these CSHL evaluation assemblies were produced using the default parameters for the assembly software in question (though for the SOAPdenovo assembly, CSHL team also used Quake  to error correct the reads before assembly). The CSHL Allpaths assembly ranked slightly higher than the competitive entry from the Allpaths team, though this is only apparent in the z-score rankings (they produce the same average rank, Additional file 2: Figure S10). In contrast, the CSHL SOAPdenovo entry ranked much lower than the evaluation assembly from the SOAPdenovo team entry (SOAP*).
Snake assembly overview
The snake assemblies provided the clearest situation where one competitive assembly outperformed all of the others. The SGA assembly scored highly in eight of the ten key metrics, producing a final z-score that was notably higher than that of the Phusion assembly that ranked 2nd (Figure 19). If any one of the ten key metrics were removed from the analysis, the SGA assembly would still rank 1st by either ranking method. Ordering the assemblies by their average rank produced a near identical ranking when compared to using z-scores (Additional file 2: Figure S11).
It should be noted that the SOAPdenovo entry was generated at a time when some of the Illumina mate-pair libraries were temporarily mislabelled in the data instruction file (details of 4 Kbp and 10 Kbp libraries were mistakenly switched). The fact that their assembly was produced with incorrectly labeled data was not noticed until all assemblies had been evaluated, and this may therefore have unfairly penalized their entry. A corrected assembly has subsequently been made available  which provides an approximate 6-fold increase in the scaffold NG50 length compared to the original entry.
Despite the apparent pre-eminence of the SGA assembler it should still be noted that this assembly only ranked 1st in one of the ten key metrics that was used and ranked 7th in another (the amount of gene-sized scaffolds). Furthermore, seven different assemblies would rank 1st if assessed by individual metrics from the set of ten. This reinforces the challenges of assessing the overall quality of a genome assembly when using multiple metrics.
Assembler performance across all three species
SGA, BCM-HGSC, Meraculous, and Ray were the only teams to provide competitive assemblies for all three species (SOAPdenovo provided entries for all species, but only included an evaluation assembly for fish). However, other teams included assemblies for at least two of the species so it is possible to ask how many times did an assembler rank 1st for any of the key metrics that were evaluated. Theoretically, an assembler could be ranked 1st in 27 different key metrics (ten each for bird and snake, and seven for fish).
Excluding the evaluation entries, we observed that assemblies produced by the BCM-HGSC team achieved more 1st place rankings (five) than any other team. Behind the BCM-HGSC team were Meraculous and Symbiose (four 1st place rankings each), and the Ray team (three 1st place rankings). The Meraculous assembler was notably consistent in its performance in the same metric across different species, ranking 1st, 2nd, and 1st in the level 1 coverage of the optical maps (for bird, fish, and snake respectively). The result for Ray is somewhat surprising as the three Ray assemblies only ranked 7th, 7th and 9th overall among competitive entries (for bird, fish, and snake respectively).
These analyses reveal that — at least in this competition — it is very hard to make an assembly that performs consistently when assessed by different metrics within a species, or when assessed by the same metrics in different species.
The effects of combining different data types in bird
For the bird genome, we provided three different types of sequencing data: Illumina, Roche 454, and Pacific Biosciences (PacBio). However, only four teams attempted to combine sequence data from these different platforms in their final assemblies.
The BCM-HGSC team used all three types of sequence data in their competitive entry (BCM), but did not use the PacBio data in their evaluation entry (BCM*). For their competitive assembly, PacBio data were used to fill scaffolding gaps (runs of Ns), but otherwise this assembly was generated in the same way as the evaluation entry. Although gap-filling in this manner led to longer contigs (Additional file 2: Figure S1), the overall effect was to produce a lower-ranked assembly (Figure 17). This is because inclusion of the higher error-rate PacBio data led to a marked decrease in the coverage and validity measures produced by COMPASS. This, in turn, was because the Lastz tool  that was used for alignment was run with a zero penalty for ambiguous characters (Ns), rather than the default penalty score. Consequently, errors in PacBio sequence used in the scaffolding gaps caused breaks in alignments and exclusion of shorter alignments between gaps. If this setting were changed to penalize matches to ambiguous bases in the same way as mismatched unambiguous bases, then it would likely reverse the rankings of these two assemblies.
In addition to a competitive entry, the SOAPdenovo team included two evaluation assemblies for bird which both ranked higher than their competitive entry. The evaluation entries differed in using only Illumina (SOAP*) or Illumina plus Roche 454 (SOAP**) data. Inclusion of the Roche 454 data contributed to a markedly better assembly (Figure 18), but again this was mostly achieved through increased coverage and validity when compared to the SOAP* assembly.
The other teams that combined sequencing data were the CBCB team that used all three types of data and the ABL team which used Illumina plus Roche 454 data. Both of these teams only submitted one assembly so it is not possible to accurately evaluate the effect of combining data sets compared to an assembly which only used one set of sequence data. The CBCB team have separately reported on the generation of their entry, as well as additional budgerigar assemblies, and have described the effects of correcting PacBio reads using shorter Illumina and 454 reads. Their assembly performed competently when assessed by most metrics, but was penalized by much lower NG50 scaffold lengths compared to other bird assemblies. It should also be noted that the Ray assembler  that was used for the Ray fish assembly, was designed to work with Illumina and Roche 454 reads, but this team chose to only use the Illumina data in their assembly.
Overall, the bird assemblies that attempted to combine multiple types of sequencing data ranked 1st, 2nd, 5th, 7th, and 14th when assessed by all key metrics. The two assemblies that included PacBio data (BCM and CBCB) had the highest and second-highest contig NG50 lengths among all competitive bird assemblies (Additional file 2: Figure S1), suggesting that inclusion of PacBio data may be particularly useful in this regard. However, it may be desirable to correct the PacBio reads using other sequencing data as was done by the CBCB team, a process that may have been responsible for the higher values of coverage and validity in this assembly compared to the BCM-HGSC entry.
Aside from differences in assembly quality, it should also be noted that the generation of raw sequence data from multiple platforms will typically lead to an increase in sequencing costs. This was not an aspect factored into this evaluation, but should be an important consideration for those considering mixing different data types. It should also be pointed out that not all assemblers are designed to work with data from multiple sequencing platforms.
Size isn’t everything
Assemblies varied considerably in size, with some being much bigger or smaller than the estimated genome size for the species in question. However, very large or small assemblies may still rank highly across many key metrics. For example, among competitive entries, the Ray team generated the smallest fish assembly (~80% of estimated genome size), but this had the second highest REAPR summary score (Additional file 4). The PRICE snake assembly was excluded from detailed analysis because it accounted for less than 25% of the estimated snake genome size. This team used their own assembler  and implemented a different strategy to that used by other teams, focusing only on assembling the likely genic regions of the snake genome. They did this by looking for matches from the input read data to the gene annotations from the green lizard (Anolis carolinensis); this being the closest species to snake that has a full set of genome annotations. While their assembly only comprises ~10% of the estimated genome size for the snake, it contains almost three-quarters (332 out of 438) of the core eukaryotic genes that are present across all snake assemblies (see Additional file 4). While this is still fewer than any other snake assembly, it would be ranked highest if evaluating assemblies in terms of ‘number of core genes per Mbp of assembly (Additional file 2: Figure S17).
Lessons learned from assemblathon 2
The clear take-home message from this exercise is the lack of consistency between assemblies in terms of interspecific as well as intraspecific comparisons. An assembler may produce an excellent assembly when judged by one approach, but a much poorer assembly when judged by another. The SGA snake assembly ranked 1st overall, but only ranked 1st in one individual key metric, and ranked 5th and 7th in others. Even when an assembler performs well across a range of metrics in one species, it is no guarantee that this assembler will work as well with a different genome. The BCM-HGSC team produced the top ranking assembly for bird and fish, but a much lower-ranked assembly for snake. Comparisons between the performance of the same assembler in different species are confounded by the different nature of the input sequence data that was provided for each species.
By many metrics, the best assemblies that were produced were for the snake, a species that had a larger genome than the other two species, but which had fewer repeats than the bird genome (as assessed by RepeatMasker analysis). The snake dataset also had the lowest read coverage of all three species, with less than half the coverage of the bird (Table 2). Higher levels of heterozygosity in the two other genomes are likely to be responsible for these differences.
We used ten ‘key metrics’ which each capture a slightly different facet of assembly quality. It is apparent that using a slightly different set of metrics could have produced a very different ranking for many of the assemblies (Figures 17, 18 and 19, Additional file 2: Figures S9–S11). Two of these key metrics are based on alignments of scaffolds to optical maps and these metrics sometimes revealed very different pictures of assembly quality. For example, the SGA fish assembly had very high level 1 coverage of the optical map, reflecting global alignments that indicate scaffolds lacking assembly problems. In contrast, this assembly ranked below average for the total coverage (levels 1–3) of the optical maps. This suggests that many other assemblies were better at producing shorter regions of scaffolds that were accurate, even if those scaffolds were chimeric.
N50 scaffold length — a measure that came to prominence in the analysis of the draft human genome sequence  — remains a popular metric. Although it was designed to measure the contiguity of an assembly, it is frequently used as a proxy by which to gauge the quality of a genome assembly. Continued reliance on this measure has attracted criticism (e.g., ) and others have proposed alternative metrics, such as ‘normalized N50’  to address some of the criticisms. As in Assemblathon 1 , we find that N50 remains highly correlated with our overall rankings (Figure 20). However, it may be misleading to rely solely on this metric when assessing an assembly’s quality. For example, the SOAP bird assembly has the 2nd highest N50 length but ranked 6th among competitive assemblies based on the overall z-score. Conversely, assemblies with low scaffold N50 lengths may excel in one or more specific measures of assembly quality; for example, the Ray snake assembly ranked 9th for N50 scaffold length but ranked 1st in the two COMPASS metrics of coverage and validity.
Recently, another assembly quality metric has been proposed that uses alignments of paired-end and mate-pair reads to an assembly to generate Feature-Response Curves (FRC) [15, 64]. This approach attempts to capture a trade-off between accuracy and continuity, and has recently been used to assess a number of publicly available genome assembly datasets including the snake assemblies that were submitted for Assemblathon 2 . The authors used the read alignments to generate a number of features which can be evaluated separately or combined for an overall view of assembly accuracy. They identified SGA and Meraculous as producing the highest ranking assemblies, results which agree with our findings (SGA and Meraculous ranked 1st and 3rd). They also echoed our conclusions that focusing on individual metrics can often produce different rankings for assemblers.
Combining multiple assemblies from different assembly pipelines in order to produce an improved assembly was an approach used in the assembly of the rhesus macaque genome . It might therefore be expected that an improved assembly could be made for each of the three species in this study. The results from the CEGMA analysis (Figure 5) indicate that this may be possible, at least in terms of the genic content of an assembly. Three fish assemblies (CSHL, CSHL*, and SOAP*) were all found to contain the most core genes (436 out of 458 CEGs), but 455 CEGs were present across all assemblies. Combining assemblies is the approach that the Genomic Assemblies Merger (GAM) team used for their snake assembly. The GAM program)  combined separate assemblies produced by the CLC and ABySS assemblers [8, 68]. However, the resulting assembly scored poorly in most metrics. In contrast, the metassembler entry from the CSHL team produced a high-ranking assembly, but one that was only marginally better than the two source assemblies that it was based on.
One important limitation of this study is that we did not assess the degree to which different assemblers resolved heterozygous regions of the genome into separate haplotypes. Therefore we do not know whether the larger-than-expected assemblies may simply reflect situations where an assembler successfully resolved a highly heterozygous region into two separate contigs. Some assemblers are known to combine such contigs into one scaffold where the heterozygous region appears as a spurious segmental duplication . Many assemblers only produce only a haploid consensus version of a target diploid genome. This is partly a limitation of the FASTA file format and a current effort to propose a new assembly file format is ongoing. This FASTG format  is intended to allow representation of heterozygous regions (and other uncertainties) and could lead to more accurate assessments of genome assembly quality in future.
A final, but important, point to note is that many of the assemblies entered into this competition were submitted by the authors of the software that was used to create the assembly. These entries might therefore be considered to represent the best possible assemblies that could be created with these tools; third-party users may not be able to produce as good results without first gaining considerable familiarity with the software. Related to this point are the issues of: ‘ease of installation’, ‘quality of documentation’ and ‘ease of use’ of each assembly tool. These might also be considered important metrics to many end users of such software. We did not assess these qualities and prospective users of such software should be reminded that it might not be straightforward to reproduce any of the assemblies described in this study.
Practical considerations for de novo genome assembly
Based on the findings of Assemblathon 2, we make a few broad suggestions to someone looking to perform a de novo
assembly of a large eukaryotic genome:
Don’t trust the results of a single assembly. If possible, generate several assemblies (with different assemblers and/or different assembler parameters). Some of the best assemblies entered for Assemblathon 2 were the evaluation assemblies rather than the competition entries.
Do not place too much faith in a single metric. It is unlikely that we would have considered SGA to have produced the highest ranked snake assembly if we had only considered a single metric.
Potentially choose an assembler that excels in the area you are interested in (e.g., coverage, continuity, or number of error free bases).
If you are interested in generating a genome assembly for the purpose of genic analysis (e.g., training a gene finder, studying codon usage bias, looking for intron-specific motifs), then it may not be necessary to be concerned by low N50/NG50 values or by a small assembly size.
Assess the levels of heterozygosity in your target genome before you assemble (or sequence) it and set your expectations accordingly.