Chromosomer: a reference-based genome arrangement tool for producing draft chromosome sequences
© The Author(s) 2016
Received: 8 December 2015
Accepted: 31 July 2016
Published: 22 August 2016
As the number of sequenced genomes rapidly increases, chromosome assembly is becoming an even more crucial step of any genome study. Since de novo chromosome assemblies are confounded by repeat-mediated artifacts, reference-assisted assemblies that use comparative inference have become widely used, prompting the development of several reference-assisted assembly programs for prokaryotic and eukaryotic genomes.
We developed Chromosomer – a reference-based genome arrangement tool, which rapidly builds chromosomes from genome contigs or scaffolds using their alignments to a reference genome of a closely related species. Chromosomer does not require mate-pair libraries and it offers a number of auxiliary tools that implement common operations accompanying the genome assembly process.
Despite implementing a straightforward alignment-based approach, Chromosomer is a useful tool for genomic analysis of species without chromosome maps. Putative chromosome assemblies by Chromosomer can be used in comparative genomic analysis, genomic variation assessment, potential linkage group inference and other kinds of analysis involving contig or scaffold mapping to a high-quality assembly.
Keywords: Reference-assisted assembly, Chromosome assembly, Alignment
Chromosome assembly is an important part of virtually any eukaryotic genome project. The number of assembled genomes increases each year and many of them are anchored to physical chromosome maps. A robust de novo chromosome assembly requires not only mate-pair reads with different insert sizes, but also physical and genetic maps [2–4]. The large number of high-quality assembled ‘reference genomes’ leads to an alternative approach – a reference-assisted chromosome assembly. Using this approach, the benefits of assembled chromosomes can be exploited without additional sequencing or map construction. These benefits include a known number of linkage groups and an estimated distance between markers, which is important for inferences of linkage and synteny. An assisted assembly also connects and orders large numbers of small contigs or scaffolds based on comparative analysis. In many cases, the initial number of contigs and scaffolds can exceed several hundred thousand following de novo assembly; working with such a fragmented genome can prove challenging. Arranging contigs and scaffolds into putative chromosomes using information from the reference genome of a closely related species reduces the overall number of fragments from thousands to hundreds or dozens and also simplifies the annotation and analysis of different genomic features such as repeats, genes, single-nucleotide polymorphisms, copy number variations and segmental duplications.
A disadvantage of this approach is the introduction of occasional assembly errors driven by evolutionary chromosomal rearrangements. Even a closely related reference can differ in synteny from the target genome to some degree. The number of introduced assembly artifacts generally correlates with the evolutionary distance between the target and reference genomes, although rates of chromosome rearrangements are hardly clock-like, at least for mammals [7, 8]. These assembly artifacts are easily corrected if a physical map for the target genome is developed, using a tool such as the single molecule next-generation mapping system (Irys) developed by BioNano Genomics.
Multiple programs have been developed for reference-assisted chromosome assembly: Bambus, BACCardI, Projector2, OSLay, ABACAS, MeDuSa, AlignGraph, Ragout, SyMap and RACA. Most of the listed tools were designed for bacterial or small genomes. For example, ABACAS is a convenient bacterial genome contiguation tool that may also be used for small eukaryotic genomes such as Saccharomyces cerevisiae (12.1 megabase pairs). However, ABACAS does not scale efficiently to the large genomes typical of vertebrate species.
SyMap was designed to facilitate reference-assisted chromosome assembly for eukaryotic genomes; however, it has important limitations. SyMap uses MUMmer or NUCmer for the alignment phase, requires a separate structured query language (SQL) database to work efficiently and takes a very long time to align large genomes to each other.
The most promising approach for reference-assisted assembly is based on using several reference genomes instead of a single one. RACA implements such an approach, using alignments of target, reference and outgroup genomes as inputs to generate predicted chromosome fragments (PCFs). However, RACA also requires additional evidence from mate-pair libraries for joining genome fragments, while most de novo sequenced genomes have no such libraries available. Furthermore, RACA requires extensive computations for assembling chromosomes.
In this paper we introduce Chromosomer – open-source, cross-platform software that automates the reference-assisted building of chromosomes and is especially effective for large genomes (> 1 gigabase pairs). Chromosomer constructs draft chromosomes based only on alignments between the fragments (contigs or scaffolds) to be arranged and a reference genome, thereby improving analytical and annotation opportunities for the index species assembly. Although Chromosomer does not use any sophisticated models or algorithms for chromosome assembly, we show that its results are comparable with state-of-the-art assemblies and can be used for further genomic analysis.
To map fragments to a reference genome, Chromosomer uses the results of pairwise alignments between the fragments (contigs and scaffolds) and the chromosomes of the reference genome. The alignments are required to have associated score values that reflect the length and identity of the aligned regions (for example, the BLAST bit score). In addition, the start and end positions of the aligned regions in both the fragments and the reference chromosomes are required.
The Chromosomer workflow consists of the following steps:
1. From the pairwise alignments, determine the fragments that can be anchored to the reference according to the ratio of their greatest and second greatest alignment scores. If the ratio is greater than a predefined threshold, which is a parameter of the algorithm, then the fragment is anchored to the position corresponding to its alignment with the greatest score. Otherwise, the fragment is considered unplaced if these two alignments are located on different reference chromosomes, or unlocalized if both alignments are located on the same chromosome.
2. Produce a map describing fragment positions on the reference genome and output the assembled chromosome sequences and the lists of unlocalized and unplaced fragments.
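The anchoring rule described above can be illustrated with a short sketch. This is not Chromosomer's actual code: the `Alignment` record, the function name `classify_fragment` and the default threshold of 1.2 are hypothetical and chosen only for illustration. Treating a fragment with a single alignment as anchored, and a fragment with no alignments as unplaced, are also assumptions, since the text does not cover those cases.

```python
from collections import namedtuple

# Hypothetical record for one alignment of a fragment to a reference chromosome.
Alignment = namedtuple("Alignment", ["ref_chrom", "ref_start", "ref_end", "score"])


def classify_fragment(alignments, ratio_threshold=1.2):
    """Classify one fragment as 'anchored', 'unlocalized' or 'unplaced'.

    Returns (status, alignment), where alignment is the best-scoring one
    for anchored and unlocalized fragments and None for unplaced ones.
    The default threshold is an illustrative value, not a documented default.
    """
    if not alignments:
        # Assumption: a fragment without alignments cannot be placed.
        return "unplaced", None

    ranked = sorted(alignments, key=lambda a: a.score, reverse=True)
    best = ranked[0]
    if len(ranked) == 1:
        # Assumption: a single alignment is treated as unambiguous.
        return "anchored", best

    second = ranked[1]
    # Equivalent to best.score / second.score > ratio_threshold for
    # positive scores, but avoids dividing by zero.
    if best.score > ratio_threshold * second.score:
        return "anchored", best
    if best.ref_chrom == second.ref_chrom:
        return "unlocalized", best
    return "unplaced", None
```

For example, a fragment with scores 500 on one chromosome and 480 on another is unplaced under this sketch, whereas the same two scores on a single chromosome make it unlocalized.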
Chromosomer also provides auxiliary tools that:
- transfer annotations from fragments to assembled chromosomes using a fragment map;
- visualize a reference-assisted chromosome assembly as a genome browser track containing fragment positions;
- obtain statistics on a reference-assisted chromosome assembly.
We further describe several aspects of the Chromosomer workflow: mapping fragments to reference chromosomes, transferring annotations from fragments to the assembled chromosomes and defining parameters that tune the Chromosomer assembly process. We consider all sequence coordinates to be zero-based and half-open (that is, the first nucleotide is at position 0 and the end coordinate of an interval covering the whole sequence is equal to the sequence length).
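The zero-based, half-open convention matches Python slicing, which makes coordinate arithmetic simple. The sketch below shows the convention and one way a feature's coordinates could be moved from a fragment to an assembled chromosome. The function names and the description of a placement by an offset and an orientation flag are assumptions made for illustration, not Chromosomer's fragment map format.

```python
def interval_length(start, end):
    """Length of a zero-based, half-open interval [start, end)."""
    return end - start


def to_chromosome(feat_start, feat_end, frag_length, frag_offset, reverse=False):
    """Convert feature coordinates on a fragment to chromosome coordinates.

    frag_offset is the chromosome position of the fragment's first base
    (hypothetical placement description). If the fragment was placed in
    reverse orientation, the interval is mirrored within the fragment first.
    """
    if reverse:
        feat_start, feat_end = frag_length - feat_end, frag_length - feat_start
    return frag_offset + feat_start, frag_offset + feat_end


sequence = "ACGTAC"
# The interval covering the whole sequence is [0, len(sequence)).
whole = sequence[0:len(sequence)]
```

With half-open intervals, the length is always `end - start`, and adjacent intervals share a boundary coordinate without overlapping.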
Mapping fragments to reference genome
Transferring annotations to assembled chromosomes
Chromosomer introduces two parameters that influence the assembly process. The first parameter is the alignment score ratio threshold, which is used to distinguish anchored fragments from unplaced and unlocalized ones. If the score ratio of the two fragment alignments with the highest scores exceeds the threshold, then the fragment is considered anchored; otherwise it is considered unplaced or unlocalized and is excluded from further analysis. The alignment score ratio threshold must be greater than one.
The second parameter is the insertion size – the size of the gap that is inserted between overlapping regions (see Fig. 3b). The insertion size is recommended to be equal to or greater than the sequencing library size.
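The role of the insertion size can be sketched as follows. This is a simplified illustration under stated assumptions, not Chromosomer's implementation: the function name `lay_out_fragments` is hypothetical, each fragment's length is approximated by the length of its aligned reference span, and non-overlapping fragments are assumed to keep their reference spacing. Only the insertion of a fixed-size gap between overlapping neighbours comes from the text.

```python
def lay_out_fragments(placements, insertion_size):
    """Lay out anchored fragments along one assembled chromosome.

    placements: list of (name, ref_start, ref_end) tuples with zero-based,
    half-open reference coordinates on a single chromosome.
    Returns a list of (name, chrom_start, chrom_end) tuples.
    """
    layout = []
    shift = 0        # cumulative displacement relative to the reference
    prev_end = None  # end of the previously placed fragment

    for name, ref_start, ref_end in sorted(placements, key=lambda p: p[1]):
        start = ref_start + shift
        if prev_end is not None and start < prev_end:
            # Overlap with the previous fragment: move this fragment past it
            # and leave a gap of insertion_size between the two.
            extra = prev_end + insertion_size - start
            shift += extra
            start += extra
        end = start + (ref_end - ref_start)
        layout.append((name, start, end))
        prev_end = end
    return layout
```

For instance, with an insertion size of 50, fragments aligned to [0, 100) and [80, 200) would be laid out at [0, 100) and [150, 270), and every downstream fragment would be shifted by the same 70 bases.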
Chromosomer assembly evaluation
Escherichia coli Sakai strain (E. coli K-12 strain as a reference);
Saccharomyces cerevisiae CLIB324 strain (S. cerevisiae S288c strain as a reference);
Pantholops hodgsonii (Tibetan antelope; Bos taurus as a reference);
Pan troglodytes (chimpanzee; Homo sapiens as a reference).
We also assembled the bacterial and yeast genomes using ABACAS and compared the ABACAS-derived assemblies with the Chromosomer-derived ones. Although ABACAS is not designed for assembling multichromosome genomes, we used separate ABACAS runs for each chromosome from the reference genome. The Chromosomer assembly of Tibetan antelope was compared with the RACA assembly presented in the RACA publication. The Chromosomer-derived chimpanzee chromosomes were assessed by comparison with the GenBank assembly and by checking the coding region accuracy. LASTZ was used to perform whole-genome alignments for assessing chromosomes obtained with Chromosomer.
Escherichia coli assembly
Table. Comparison of ABACAS and Chromosomer E. coli Sakai strain assemblies. Rows: mean identity (%), mean length (bp), mean mismatches (bp), coverage (bp).
Saccharomyces cerevisiae assembly
Table. Comparison of ABACAS and Chromosomer S. cerevisiae CLIB324 strain assemblies. Rows: mean identity (%), mean length (bp), mean mismatches (bp), coverage (bp).
Pantholops hodgsonii assembly
The P. hodgsonii genome was assembled from its scaffolds (GenBank accession number GCA_000400835.1) using the B. taurus UMD3.1 assembly as a reference and the net alignments between the scaffolds and the cow chromosomes from . The fragment map of the Chromosomer-derived Tibetan antelope chromosomes is given in Additional file 6.
The PCFs assembled by RACA tend to be longer than the original reference genome (cow) chromosomes.
RACA predicted two chromosomal translocations in the Tibetan antelope genome compared with the cow genome: the first one between chromosomes 7 and 10 and the second one between chromosomes 21 and 27. The predicted translocations led to elongation of chromosome 7 and shortening of chromosome 10; chromosomes 21 and 27 are also related in the same way but to a lesser extent (see Fig. 7). The ability to detect cross-species rearrangements is a feature of RACA that is related to its more complex assembly model and integration of paired-end reads, which Chromosomer does not use.
In addition, Chromosomer ran faster and required fewer computational resources than RACA. Chromosomer took about 1.7 hours and 1.5 GB of random access memory (RAM) to produce the chromosomes from the net alignments using one central processing unit (CPU). RACA took 55 hours and required 59 GB of RAM using three CPUs to obtain its result from the same net alignments. We ran the benchmark on a SuperMicro server (12 Intel Xeon E5-2690 CPUs and 396 GB RAM).
Pan troglodytes assembly
From the examples shown above, we conclude that Chromosomer is comparable to existing reference-genome assembly tools and is able to assemble and process large genomes. Chromosomer may increase the efficiency of genome annotation studies by replacing numerous genome fragments with draft chromosome assemblies.
Availability and requirements
Project name: Chromosomer
Project home page: https://github.com/gtamazian/chromosomer
Operating systems: Platform independent
Programming languages: Python
Other requirements: Python 2.7
License: BSD 3-Clause License
Any restriction to use by non-academics: none
This work was supported by the Russian Ministry of Science (11.G34.31.0068 to SJO).
Availability of supporting data
Data further supporting this paper are available in the GigaScience repository, GigaDB.
PD, GT and KPK conceived the project. GT, PD, KK, AK and SJO supervised the project. GT implemented the tool. GT, PD and KK designed and described the usage examples. GT, PD, KK, AK, KPK and SJO composed and revised the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Haussler D, O’Brien SJ, Ryder OA, Barker FK, Clamp M, Crawford AJ, Hanner R, Hanotte O, Johnson WE, McGuire JA, et al. Genome 10K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species. J Hered. 2008; 100(6):659–74.
- Mavrich TN, Jiang C, Ioshikhes IP, Li X, Venters BJ, Zanton SJ, Tomsho LP, Qi J, Glaser RL, Schuster SC, et al. Nucleosome organization in the Drosophila genome. Nature. 2008; 453(7193):358–62.
- McPherson JD, Marra M, Hillier L, Waterston RH, Chinwalla A, Wallis J, Sekhon M, Wylie K, Mardis ER, Wilson RK, et al. A physical map of the human genome. Nature. 2001; 409(6822):934–41.
- Lewin HA, Larkin DM, Pontius J, O’Brien SJ. Every genome sequence needs a good map. Genome Res. 2009; 19(11):1925–8.
- Murchison EP, Schulz-Trieglaff OB, Ning Z, Alexandrov LB, Bauer MJ, Fu B, Hims M, Ding Z, Ivakhno S, Stewart C, et al. Genome sequencing and analysis of the Tasmanian devil and its transmissible cancer. Cell. 2012; 148(4):780–91.
- Luo H, Arndt W, Zhang Y, Shi G, Alekseyev MA, Tang J, Hughes AL, Friedman R. Phylogenetic analysis of genome rearrangements among five mammalian orders. Mol Phylogenet Evol. 2012; 65(3):871–82.
- O’Brien SJ, Menotti-Raymond M, Murphy WJ, Nash WG, Wienberg J, Stanyon R, Copeland NG, Jenkins NA, Womack JE, Graves JAM. The promise of comparative genomics in mammals. Science. 1999; 286(5439):458–81.
- Murphy WJ, Larkin DM, Everts-van der Wind A, Bourque G, Tesler G, Auvil L, Beever JE, Chowdhary BP, Galibert F, Gatzke L, et al. Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science. 2005; 309(5734):613–7.
- BioNano Genomics. Whole Genome Mapping with the Irys System. http://bionanogenomics.com. Accessed 13 Aug 2016.
- Pop M, Kosack DS, Salzberg SL. Hierarchical scaffolding with Bambus. Genome Res. 2004; 14(1):149–59.
- Bartels D, Kespohl S, Albaum S, Drüke T, Goesmann A, Herold J, Kaiser O, Pühler A, Pfeiffer F, Raddatz G, et al. BACCardI — a tool for the validation of genomic assemblies, assisting genome finishing and intergenome comparison. Bioinformatics. 2005; 21(7):853–9.
- van Hijum SA, Zomer AL, Kuipers OP, Kok J. Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies. Nucleic Acids Res. 2005; 33(suppl 2):560–6.
- Richter DC, Schuster SC, Huson DH. OSLay: optimal syntenic layout of unfinished assemblies. Bioinformatics. 2007; 23(13):1573–9.
- Assefa S, Keane TM, Otto TD, Newbold C, Berriman M. ABACAS: algorithm-based automatic contiguation of assembled sequences. Bioinformatics. 2009; 25(15):1968–9.
- Bosi E, Donati B, Galardini M, Brunetti S, Sagot MF, Lió P, Crescenzi P, Fani R, Fondi M. MeDuSa: a multi-draft based scaffolder. Bioinformatics. 2015; 31(15):2443–51.
- Bao E, Jiang T, Girke T. AlignGraph: algorithm for secondary de novo genome assembly guided by closely related references. Bioinformatics. 2014; 30(12):319–28.
- Kolmogorov M, Raney B, Paten B, Pham S. Ragout—a reference-assisted assembly tool for bacterial genomes. Bioinformatics. 2014; 30(12):302–9.
- Soderlund C, Bomhoff M, Nelson WM. SyMAP v3.4: a turnkey synteny system with application to plant genomes. Nucleic Acids Res. 2011; 39(10):68–8.
- Kim J, Larkin DM, Cai Q, Zhang Y, Ge RL, Auvil L, Capitanu B, Zhang G, Lewin HA, Ma J, et al. Reference-assisted chromosome assembly. Proc Natl Acad Sci. 2013; 110(5):1785–90.
- Delcher AL, Salzberg SL, Phillippy AM. Using MUMmer to identify similar regions in large sequence sets. Curr Protocol Bioinforma. 2003;10.
- Delcher AL, Phillippy A, Carlton J, Salzberg SL. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 2002; 30(11):2478–83.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990; 215(3):403–10.
- Harris RS. Improved pairwise alignment of genomic DNA. PhD thesis, The Pennsylvania State University. 2007.
- Nurk S, Bankevich A, Antipov D, Gurevich A, Korobeynikov A, Lapidus A, Prjibelsky A, Pyshkin A, Sirotkin A, Sirotkin Y, et al. Assembling genomes and mini-metagenomes from highly chimeric reads. In: Research in Computational Molecular Biology. Berlin, Heidelberg: Springer; 2013. p. 158–70.
- Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci. 2003; 100(20):11484–9.
- Rosenbloom KR, Armstrong J, Barber GP, Casper J, Clawson H, Diekhans M, Dreszer TR, Fujita PA, Guruvadoo L, Haeussler M, et al. The UCSC genome browser database: 2015 update. Nucleic Acids Res. 2015; 43(D1):670–81.
- Tamazian G, Dobrynin P, Krasheninnikova K, Komissarov A, Koepfli K-P, O’Brien SJ. Supporting data for “Chromosomer: a reference-based genome arrangement tool for producing draft chromosome sequences”. GigaScience Database. 2016. http://dx.doi.org/10.5524/100210.