OPTIMA: sensitive and accurate wholegenome alignment of errorprone genomic maps by combinatorial indexing and technologyagnostic statistical analysis
 Davide Verzotto^{1}Email author,
 Audrey S. M. Teo^{2},
 Axel M. Hillmer^{2} and
 Niranjan Nagarajan^{1}Email author
DOI: 10.1186/s1374201601100
© Verzotto et al. 2016
Received: 9 May 2015
Accepted: 6 January 2016
Published: 19 January 2016
Abstract
Background
Resolution of complex repeat structures and rearrangements in the assembly and analysis of large eukaryotic genomes is often aided by a combination of highthroughput sequencing and genomemapping technologies (for example, optical restriction mapping). In particular, mapping technologies can generate sparse maps of large DNA fragments (150 kilo base pairs (kbp) to 2 Mbp) and thus provide a unique source of information for disambiguating complex rearrangements in cancer genomes. Despite their utility, combining highthroughput sequencing and mapping technologies has been challenging because of the lack of efficient and sensitive mapalignment algorithms for robustly aligning errorprone maps to sequences.
Results
We introduce a novel seedandextend glocal (short for globallocal) alignment method, OPTIMA (and a slidingwindow extension for overlap alignment, OPTIMAOverlap), which is the first to create indexes for continuousvalued mapping data while accounting for mapping errors. We also present a novel statistical model, agnostic with respect to technologydependent error rates, for conservatively evaluating the significance of alignments without relying on expensive permutationbased tests.
Conclusions
We show that OPTIMA and OPTIMAOverlap outperform other stateoftheart approaches (1.6−2 times more sensitive) and are more efficient (170−200 %) and precise in their alignments (nearly 99 % precision). These advantages are independent of the quality of the data, suggesting that our indexing approach and statistical evaluation are robust, provide improved sensitivity and guarantee high precision.
Keywords
Optical mapping Genomic mapping Glocal alignment Overlap alignment Maptosequence alignmentBackground
In recent years, the availability of commercial platforms for highthroughput genome mapping (from, for example, OpGen [1], BioNano Genomics [2] and Nabsys [3] have increased the interest in using these technologies, in combination with highthroughput sequencing data, for applications such as structural variation analysis and genome assembly. In particular, several recent genome assembly projects have highlighted their utility for obtaining highquality assemblies of large eukaryotic genomes (for example, goat [4] and budgerigar [5] genomes) or studying complex genomic regions [6] and cancer genomes [7]. Mapping technologies typically provide sparse information (an ordered enumeration of fragment sizes between consecutive genomic patterns, for example, restriction sites) for very large fragments of DNA (150 kilo base pairs (kbp) to 2 Mbp) and are thus orthogonal in utility to sequencing approaches that provide basepair level information for smaller fragments. Combining these two pieces of information therefore requires effective algorithms to align maps to sequences.
Alignment of maps (typically called Rmaps, for restriction maps [8]) to sequences has been widely studied as an algorithmic problem [9] with a range of practical applications, from genome scaffolding [10] to assembly improvement [11] and validation [12]. The general approach has been to translate sequence data to get in silico maps and compare these to experimentally obtained maps using dynamic programming algorithms. For large genomes and mapping datasets, naive allversusall dynamic programming can be computationally expensive. On the other hand, high error rates in mapping data (optical mapping, for example, can miss one in four restriction sites) has led to the use of modelbased scoring functions for sensitively evaluating alignments [13–15]. These often require prior knowledge and modeling of mapping error rates (for example, fragment sizing errors, false cuts and missing cuts) and can be expensive to compute [13, 16, 17]. Alternative approaches with simpler (nonmodelbased) scoring functions [10] are handicapped by the need to do expensive permutationbased statistical testing to evaluate the significance of alignments, and although recent advances have made this testing more efficient [15], it still scales linearly with genome size. Although these approaches work well for microbial genomes, they typically do not scale well for larger genomes, where they might also have reduced sensitivity. In contrast, commercially available solutions for maptosequence alignment (for example, GenomeBuilder from OpGen) scale better and have been used for the assembly of large eukaryotic genomes [4] but tend to discard a large fraction of the mapping data (more than 90 %) due to reduced sensitivity and correspondingly lead to increased mapping costs for a project.
Mapalignment algorithms are thus faced with the twin challenges of improving sensitivity and precision on the one hand and reducing computational costs for alignment and statistical evaluation on the other hand. An elegant solution to this problem from the field of sequencetosequence alignment is the use of a seedandextend approach [18]. However, because maps represent ordered lists of continuous values, this extension is not straightforward, particularly when multiple sources of mapping errors and their high error rates are taken into account [19]. In addition, because error rates can vary across technologies, and even across different runs on the same machine, it is not clear whether a general sensitive maptosequence aligner is feasible. An efficient statistical testing framework that helps control for false discovery without prior information about error rates is critical for making such an aligner easy to use and applicable across technology platforms.
In this work, we describe how a sorted search index and the use of a composite seeding strategy can help to efficiently and sensitively detect seed maptosequence alignments [20]. Our second contribution is the design of a robust and fast statistical evaluation approach that includes multiple sources of mapping errors in the alignment score and evaluates the significance of the best alignment using all identified feasible solutions. We incorporated these ideas and additional refinements to solve two common alignment problems: glocal alignment, solved with OPTIMA, where an entire map is aligned to a subsequence of a second (typically in silico) map; and overlap alignment, solved with OPTIMAOverlap, where the end of one map is aligned to the beginning of another. When benchmarked against stateoftheart aligners, OPTIMA and OPTIMAOverlap typically provide a strong boost in sensitivity (1.6–2 times) without sacrificing precision of alignments (∼99 %). Moreover, our pilot implementations exhibited runtime improvements over commercially available tools (two times faster than OpGen’s Gentig) and orders of magnitude over published, freely available algorithms and software [10, 13].
Finally, these methods are shown to be robust to variations in error distributions while being agnostic to them, suggesting that the methods can deal with different experimental outcomes of the same technology (for example, different map cards or lane types) as well as being applicable across mapping technologies (with minor modifications for preprocessing of data). Because glocal and overlap alignments form the basis of a range of applications that involve the combination of sequence and mapping data (for example, assembly scaffolding, refinement and validation, structural variation analysis and resolving complex genomic regions), OPTIMA and OPTIMAOverlap can serve as building blocks for these applications, allowing for time and costeffective analyses.
Definitions and problem formulation
Let o _{1},o _{2},…,o _{ m } be the m ordered fragment sizes of an experimentally derived map o and r _{1},r _{2},…,r _{ n } be the n fragment sizes of an in silico map r. For simplicity, we suppose here that m≤n and assume that we can derive standard deviations for in silico fragments, that is, σ _{ j } for r _{ j }, in a technologydependent fashion. In an idealized case, we can define the problem of glocally aligning o to r as a onetoone correspondence between all the fragments of o with a subset of the fragments of r, that is, r _{ l },r _{ l+1},…,r _{ l+m−1} (we could also reverse the roles of o and r here). In practice, many sources of errors affect experimentally derived maps, including missing cuts, false/extra cuts, missing fragments, fragment sizing errors and spurious maps [13]. In silico maps could also be affected by sequencing or assembly errors [10], but these are less likely to affect alignments because typically they are infrequent. To accommodate errors, we extend the definition of correspondence between map fragments to allow for matches between sets of fragments (see Fig. 1 b), as used previously in [10]:
Definition 1 (Feasible match).
where C _{ σ }=3 is an appropriate bound if sizing errors are approximately normally distributed.
Definition 2 (Glocal alignment).
A valid glocal alignment is an ordered set of matches M _{1},M _{2},…,M _{ w } between experimental and in silico fragments, such that all the experimental fragments o _{1},o _{2},…,o _{ m } are aligned to a subset of the in silico fragments r _{ t },r _{ t+1},…,r _{ v }, and both sets are orderly partitioned by all the matches M _{1…w } without overlaps, with w≤m and w≤v−t+1.
Missing fragments, which usually arise from short fragments below the experimental detection limit (for example, 2 kbp), can be handled in this framework by allowing gaps, that is, with the option of ignoring short fragments for the purpose of the C _{ σ } test (Eq. 1).
Definition 3 (Overlap alignment).
A valid overlap alignment is an ordered set of matches M _{1},M _{2},…,M _{ w } that allows experimental maps and in silico maps to only partially align with each other, with M _{1} and M _{ w } each corresponding to an end of one of the maps (for example, Fig. 1 e).
Methods
OPTIMA is the first alignment tool based on the seedandextend paradigm that can deal with erroneous mapping data. The basic paradigm is similar to that used for the alignment of discretevalued sequences (allowing for mismatches and indels) and is as follows. We start by indexing the in silico maps, so that we can use this information later, and find seeds for each experimental map o corresponding to some indexed regions of those sequences. We then extend these seeds by using dynamic programming to try to align the whole experimental map to the corresponding in silico map region. For each map o, feasible solutions – as defined by the index structure, size of the genome and maximum error rate – are then evaluated by a scoring scheme to select the optimal solution. Finally, the statistical significance and uniqueness of the optimal solution are determined by comparing and modeling all the identified feasible solutions.
Indexing continuousvalued seeds
Definition 4 (Composite seed).

(i) the cmer \(r_{j_{1}}, r_{j_{2}}\), corresponding to no false cuts in the in silico map;

(ii) the cmer \(r_{j_{1}}+r_{j_{2}}, r_{j_{3}}\), corresponding to a missing cut in the experimental map (or false cut in the in silico map); and

(iii) the cmer \(r_{j_{1}}, r_{j_{2}}+r_{j_{3}}\), corresponding to a different missing cut in the experimental map (or false cut in the in silico map).
The reference index would then contain all ctuples corresponding to a composite seed, as defined in Definition 4, for each location in the reference map. In addition, to account for false cuts in the experimental map, for each set of consecutive fragments \(o_{i_{1}}\), \(o_{i_{2}}\) and \(o_{i_{3}}\) in the experimental maps, we search for ctuples of the type \(o_{i_{1}}, o_{i_{2}}\) and \(o_{i_{1}} + o_{i_{2}}, o_{i_{3}}\) in the index (see Composite seeds (iv) depicted in Fig. 1 c).
To index the seeds, we adopt a straightforward approach where all ctuples are collected and sorted into the same index in lexicographic order by c _{1} (where the c _{ i } are elements in the ctuple). Lookups can be performed by binary search over fragmentsized intervals that satisfy the C _{ σ } bound for c _{1} and a subsequent linear scan of the other elements c _{ i }, for i≥2, while verifying the C _{ σ } bound in each case. Note that, because seeds are typically expected to be of higher quality, we can apply a more stringent threshold on seed fragment size matches (for example, we used \(C^{Seed}_{\sigma } = 2\)).
As shown in the “Results and discussion” section, this approach significantly reduces the space of candidate alignments without affecting the sensitivity of the search. A comparison between the various seeding approaches is shown in Fig. 2, which highlights the advantages of composite seeds with respect to 2mers.
Overall, the computational cost of finding seeds using this approach is O(m (logn+c # s e e d s _{ c=1})) per experimental map, where n is the total length of the in silico maps in fragments, m≪n is the length of the experimental map and # s e e d s _{ c=1} is the number of seeds found in the first level of the index lookup, before narrowing down the list to the actual number of seeds that will be extended (# s e e d s). The cost and space of creating the reference index is thus O(c n), if the number of errors considered in the composite seeds is limited and bounded (as in Definition 4), and radix sort is used to sort the index. This approach drastically reduces the number of alignments computed in comparison to more general, global alignment searches [10], as will be shown later in the “Results and discussion” section.
Dynamic programmingbased extension of seeds
where the first index of each subscript represents experimental fragments, the second index represents the in silico fragments, s−k is the number of false cuts, t−l is the number of missing cuts, C _{ ce } is a constant larger than the maximum possible total for χ ^{2}, \({\chi }_{k..s, l..t}^{2} = \left (\sum \limits _{i=k}^{s} o_{i}  \sum \limits _{j=l}^{t} r_{j}\right)^{2} / \left (\sum \limits _{j=l}^{t} {\sigma _{j}^{2}}\right)\) and S c o r e _{0,0}=0, S c o r e _{ i,0}=∞ and S c o r e _{0,j }=∞.
Note that a small in silico fragment is considered as missing if this condition allows for a valid alignment that improves the local χ ^{2} on nearby matches by half (up to three consecutive fragments).
As in [16], we band the dynamic programming and its backtracking to avoid unnecessary computation. In particular, as we show in Supplementary Note 1 in Additional file 1, based on parameter estimates for optical mapping data, restricting alignments to eight missing cuts or five false cuts, consecutively, should retain high sensitivity. In addition, we stop the dynamic programmingbased extension if no feasible solutions can be found for the current seed after having analyzed at least f fragments (default of five).
The computational cost of extending a seed (ctuple) of an experimental map with m fragments is thus, in the worst case, O((m−c) δ ^{3}) time, where δ is the bandwidth of the dynamic programming, and O((m−c)^{2}) space for allocating the dynamic programming matrix for each side of the seed.
Statistical significance and uniqueness analysis
where s _{ i }=1 if lower values of feature f _{ i } are preferable and −1 otherwise, and the corresponding pvalue is p _{π}=Pnorm(𝜗(π)), that is, the probability that a ‘random’ Zscore will be less than 𝜗(π) under the standard normal distribution.
which can be subsequently converted into the pvalue p _{π}. The candidate solution π ^{∗} with the lowest pvalue p ^{∗} is reported as the optimal solution, as shown in Fig. 3.
The statistical significance of the optimal solution can then be assessed through a false discovery rate qvalue analysis [22] based on all candidate solutions found for comparable experimental maps, for example, those with the same number of fragments (default of q=0.01). To assess uniqueness, we set a threshold on the ratio of pvalues between the best solution and the nextbest solution (default of five). See Supplementary Note 2 in Additional file 1 for further algorithmic details.
In summary, our statistical scoring approach finds an optimal solution and evaluates its statistical significance and uniqueness in a unified framework, thus allowing for savings in computational time and space compared to a permutation test, without restricting the method to a scenario where experimental error probabilities are known a priori.
Extension to overlap alignment
To extend OPTIMA to compute and evaluate overlap alignments – a key step in assembly pipelines that use mapping data [4, 5, 23] – we use a slidingwindow approach based on OPTIMA. This allows us to continue using the statistical evaluation procedure defined in OPTIMA that relies on learning parameters from comparable alignments (that is, those with the same number, size and order of experimental fragments) in a setting where the final alignments are not always of the same length and structure.
 (i)
a significant and unique submap alignment;
 (ii)
an alignment labeled as nonsignificant and/or nonunique (which will be considered as a false alignment); or
 (iii)
no feasible alignments found.
Optimal solutions to the subproblems are then ranked by pvalue (smallest to largest) and iterated through to select submaps that should be extended. At each stage we check the significance and uniqueness of the reported solutions (compared to the others), as well as checking for potential cases of identical or conflicting alignments as defined below.
Definition 5 (Conflicting alignments).

the submap of π _{1} overlaps the submap of π _{2}; or

π _{1} aligns to the same in silico map as π _{2}, but in a different location or strand.
 1.
Significance: if the current solution π _{ i } is labeled as a false alignment, then we stop iterating through the rest of the list.
 2.
Uniqueness: we skip an alignment if either: (i) π _{ i } represents the same overlap alignment as a more significant solution; (ii) π _{ i } is conflicting with a solution with a lower pvalue (that is, seen before); or (iii) π _{ i } is not unique with respect to other solutions π _{ j } with j>i (that is, having greater pvalues) that it is conflicting with.
 3.
Extension with dynamic programming: optimal overlap solutions are identified according to Eq. 2, where ties are broken in favor of longer valid alignments.
This approach allows us to report multiple overlap alignments (including split alignments) for an experimental map while using the qvalue analysis, as before, to report all alignments with q≤0.01. For the qvalue analysis, we consider all candidate solutions found for the sliding windows in order to learn the qvalue parameters. In addition, we can reuse the dynamic programming matrix computed for each seed across submap alignments and thus complete the overlap alignment with the same asymptotic time and space complexity as the glocal alignment.
Results and discussion
Generation of benchmarking datasets

Easier scenario: d=0.78 (corresponding to missing cut rate of 22 %); f _{100}=0.97; and probability at 0.5 for missing fragments of size below 1.2 kbp, 0.75 below 600 bp and 1 below 350 bp.

Harder scenario: d=0.61 (corresponding to missing cut rate of 39 %); f _{100}=1.38; and 0.5 for missing fragments of size below 2 kbp, 0.75 below 800 bp and 1 below 350 bp.
For each scenario, we first simulate the map sizes using empirically derived distributions from real maps (average size of approximately 275 kbp, containing 17 fragments) and extract the corresponding reference submaps by sampling start locations uniformly from the in silico maps (possibly creating truncated end fragments). Then we introduce cut errors using the probability distributions described in [13] with the above parameters, that is: first, we remove missing cuts following a Binomial distribution with probability 1−d; next, we insert false cuts modeled as a Poisson process with rate f _{100} (avoiding creation of small fragments less than 1.2 kbp in size); and finally, we remove small fragments with the probabilities described above. Sizing errors are introduced by sampling from the empirical errors found for each range of reference fragment sizes. Simulated experimental maps smaller than 150 kbp or with fewer than ten fragments are discarded, mimicking the preprocessing stage on the Argus system.
We generated 100 times greater coverage of maps with errors for the Drosophila melanogaster (BDGP 5) and Homo sapiens (hg19/GRCh37) genomes using the KpnI restriction pattern GGTAC’C, where the apostrophe indicates the position of the cut, which resulted in 13,920 fragments genomewide (forward and reverse strands) with an average fragment size (AFS) of 17.3 kbp and 573,276 fragments with AFS = 10.8 kbp, respectively.
Glocal alignment results
OPTIMA was compared against the stateoftheart algorithms Gentig v.2 (alignment module) [16, 17, 24], SOMA v.2 [10] and Valouev’s likelihoodbased fit alignment [13] for glocally aligning the simulated maps over their respective in silico reference genomes. TWIN [19] was not included in this comparison as it does not allow for errors and missing information in experimental maps.
We also ran variations of these algorithms from their default options (d): specifically, by providing the true error distribution parameters used in the simulations as input (tp), the adjusted AFS based on the organism under analysis (a) and parameter values published in the respective papers (instead of the software’s default ones), to provide, in addition, the true error distribution rates (p); and by allowing the trimming of map ends in the alignment (t). Moreover, SOMA [10] was modified to correctly handle missing in silico fragments up to 2 kbp, to run only for C _{ σ }=3, to make its results comparable, and by inverting the role of in silico–experimental input maps (v). We omitted SOMA’s statistical test (also for Valouev’s likelihood method, where it is not enabled by default), because it is unfeasible for large datasets [19], and applied only its uniqueness test (Ftest). Further details about the running parameters are provided in Supplementary Note 3 in Additional file 1. OPTIMA alignments were performed on both strands of the in silico maps, without trimming end fragments or removing any small in silico fragments.
Comparison of all methods and their variants on glocal maptosequence alignment
Algorithm  Drosophila (A)  Drosophila (B)  Human (A)  Human (B)  

S  P  S  P  S  P  S  P  
OPTIMA  90  100  49  99  83  100  43  98 
Gentig v.2 (d)  59  100  24  99  53  96  20  80 
Gentig v.2 (tp)  59  100  24  98  54  95  20  88 
SOMA v.2 (v)  72  73  31  39  50  50  17  20 
Likelihood (d+a)  49  49  29  30  24  24  14  14 
Likelihood (d+a+t)  64  65  38  39  33  34  18  19 
Likelihood (p+a+t)  75  75  39  39  62  62  19  20 
Running time and worstcase complexity for various glocal maptosequence aligners
Algorithm  Complexity  Running time  

Time  Space  Drosophila  Human  
OPTIMA  O((m−c) δ ^{3} # s e e d s)  O((m−c)^{2}+c n)  54 m  36 days 
Gentig v.2 (d)  O(# i t m δ ^{3} # h a s h e s)  O(m ^{2}+n+H a s h T a b l e)  1.32 h  75 days 
Gentig v.2 (tp)  1.85 h  174 days  
SOMA v.2 (v)  O(m ^{2} n ^{2})  O(m n)  1.28 years  1,067 years 
Likelihood (d+a)  O(m n δ ^{2})  O(m n)  22.22 h  2.72 years 
Likelihood (d+a+t)  19.62 h  2.38 years  
Likelihood (p+a+t)  41.73 h  5.53 years 
Real data analysis and comparison
To assess OPTIMA’s performance on real data we generated, inhouse, 309,879 and 296,217 optical maps for two human cell lines, GM12878 and HCT116, respectively, using the Argus system from OpGen [4, 25] (with the KpnI enzyme and ten map cards in total), and glocally aligned them over the human reference genome. Supplementary Note 4 in Additional file 1 provides the sizing error statistics.
 (r)relaxed filtering, which filters maps with fewer than ten fragments and smaller than 150 kbp;Table 3
Statistics for glocal alignment of real human optical maps from GM12878 HapMap cell line
Map card
F
Input maps
Details
OPTIMA
Gentig v.2
Increase w.r.t. Gentig v.2
Yield (genome coverage)
Avg. length and size
Avg. digestion rate
Avg. false/extra cut rate
Avg. WHT chi square sizing error
21157LB
(r)
73,365 (7.2X)
Avg. quality 0.50; 295 kbp, 18 f; AFS 16.5 kbp
25 %
9 %
3X
2X
21 f  324 kbp
66 %
0.74
–0.69
(s)
38,483 (4.7X)
Avg. quality 0.53; size 368 kbp, 22 f; AFS 17 kbp
36 %
14 %
2.6X
1.7X
23 f  361 kbp
65 %
0.73
–0.58
21159LB
(r)
75,761 (7.6X)
Avg. quality 0.47; size 300 kbp, 17 f; AFS 17.4 kbp
19 %
5 %
4X
1.6X
19 f  325 kbp
63 %
0.72
–1.07
(s)
41,236 (5.1X)
Avg. quality 0.50; size 370 kbp, 21 f; AFS 17.8 kbp
27 %
8 %
3.4X
1.3X
21 f  359 kbp
62 %
0.72
–0.97
21431LB
(r)
93,896 (8.6X)
Avg. quality 0.52; size 274 kbp, 17 f; AFS 15.8 kbp
20 %
8 %
2.6X
1.9X
21 f  305 kbp
68 %
0.77
–0.42
(s)
43,667 (5.1X)
Avg. quality 0.54; size 348 kbp, 21 f; AFS 16.3 kbp
30 %
13 %
2.4X
1.5X
23 f  343 kbp
67 %
0.77
–0.29
21443LB
(r)
66,857 (6X)
Avg. quality 0.51; size 271 kbp, 17 f; AFS 15.8 kbp
19 %
7 %
2.7X
1.3X
20 f  299 kbp
67 %
0.77
–0.50
(s)
29,991 (3.5X)
Avg. quality 0.53; size 346 kbp, 21 f; AFS 16.3 kbp
29 %
12 %
2.5X
1X
23 f  340 kbp
66 %
0.77
–0.35
TOTAL
(r)
309,879 (29.4X)
Avg. quality 0.50; size 285 kbp, 17 f; AFS 16.4 kbp
21 %
7 %
2.9X
6.8X
21 f  314 kbp
66 %
0.75
–0.66
(s)
153,377 (18.3X)
Avg. quality 0.52; size 359 kbp, 21 f; AFS 16.9 kbp
31 %
11 %
2.7X
5.5X
23 f  352 kbp
65 %
0.75
–0.55
Table 4Statistics for glocal alignment of real human optical maps from HCT116 colorectal cancer cell line
Map card
F
Input maps
Details
OPTIMA
Gentig v.2
Increase w.r.t. Gentig v.2
Yield (genome coverage)
Avg. length and size
Avg. digestion rate
Avg. false/extra cut rate
Avg. WHT chi square sizing error
17182LA
(r)
10,911 (0.9X)
Avg. quality 0.33; size 257 kbp, 16 f; AFS 15.7 kbp
4 %
0.5 %
8.1X
0.04X
19 f  245 kbp
66 %
1.29
–1.15
(s)
3,744 (0.4X)
Avg. quality 0.33; size 351 kbp, 20 f; AFS 17.7 kbp
4 %
0.9 %
4.5X
0.02X
22 f  326 kbp
63 %
1.23
–0.83
17184LA2
(r)
55,719 (5.7X)
Avg. quality 0.43; size 305 kbp, 19 f; AFS 16.3 kbp
18 %
9 %
1.9X
1.1X
23 f  332 kbp
68 %
0.76
–0.65
(s)
28,658 (3.7X)
Avg. quality 0.45; size 390 kbp, 23 f; AFS 17.2 kbp
25 %
15 %
1.6X
0.9X
25 f  378 kbp
67 %
0.74
–0.51
17185LA
(r)
56,879 (5.4X)
Avg. quality 0.55; size 285 kbp, 19 f; AFS 14.7 kbp
24 %
18 %
1.4X
1.5X
23 f  325 kbp
70 %
0.76
–0.17
(s)
28,003 (3.4X)
Avg. quality 0.59; size 365 kbp, 24 f; AFS 15.1 kbp
35 %
28 %
1.2X
1.2X
26 f  367 kbp
70 %
0.74
–0.04
17186LA3
(r)
52,984 (5.8X)
Avg. quality 0.54; size 328 kbp, 20 f; AFS 16.0 kbp
33 %
19 %
1.7X
2X
24 f  342 kbp
70 %
0.68
–0.35
(s)
31,588 (4.3X)
Avg. quality 0.56; size 404 kbp, 25 f; AFS 16.4 kbp
42 %
28 %
1.5X
1.7X
26 f  380 kbp
69 %
0.67
–0.26
17187LA
(r)
88,730 (7.8X)
Avg. quality 0.45; size 264 kbp, 18 f; AFS 14.8 kbp
12 %
7 %
1.7X
1X
21 f  285 kbp
69 %
0.94
–0.56
(s)
36,018 (4.2X)
Avg. quality 0.46; size 349 kbp, 22 f; AFS 15.8 kbp
17 %
11 %
1.6X
0.7X
24 f  338 kbp
68 %
0.92
–0.35
14593LB
(r)
30,994 (2.7X)
Avg. quality 0.39; size 261 kbp, 14 f; AFS 18.9 kbp
6 %
0.6 %
9.9X
0.2X
16 f  269 kbp
63 %
0.85
–1.23
(s)
10,944 (1.2X)
Avg. quality 0.39; size 337 kbp, 17 f; AFS 20.2 kbp
9 %
0.7 %
12.3X
0.1X
18 f  320 kbp
60 %
0.87
–0.97
TOTAL
(r)
296,217 (28.3X)
Avg. quality 0.47; size 287 kbp, 18 f; AFS 15.7 kbp
18 %
11 %
1.7X
5.7X
23 f  322 kbp
69 %
0.77
–0.44
(s)
138,955 (17.2X)
Avg. quality 0.50; size 372 kbp, 23 f; AFS 16.5 kbp
27 %
18 %
1.5X
4.6X
25 f  368 kbp
68 %
0.75
–0.28
 (s)
stringent filtering (as suggested in [4]), which filters maps with fewer than 12 fragments and smaller than 250 kbp.
The statistics were reported independently for each map card to capture the variability in terms of quality. In total, OPTIMA, with a stringent uniqueness threshold of 30, confidently aligned nearly three times as many maps as Gentig (with default parameters) for GM12878. Similarly, for HCT116, OPTIMA results were 1.7 times better than Gentig results, and corresponding improvements were also obtained using the stringently filtered datasets.
Further analyzing the error rates in the maps that OPTIMA confidently aligned (Tables 3 and 4), we observed that the overall statistics for average digestion rate d, false/extra cut rate f _{100} and sizing errors were found to be similar to those obtained using scenario (B) (see Supplementary Note 5 in Additional file 1).
Overlap alignment results
For overlap alignment, we compared OPTIMAOverlap with an overlapfinding extension of Gentig v.2 (implemented in the commercial software GenomeBuilder from OpGen, which contains a module called SCAFFOLDEXTENDER) [17, 24], as well as with Valouev’s likelihoodoverlap method [13].
In our first test, we randomly selected 1000 maps for each scenario (A) and (B) from our previously simulated maps for Drosophila (BDGP 5) and human (hg19/GRCh37) genomes. In addition, we simulated assembled sequence fragments for these genomes based on empirically derived scaffold size distributions (Drosophila assembly N50 of 2.7 Mbp with 239 scaffolds and human assembly N50 of 3.0 Mbp with 98,987 scaffolds); the simulated assemblies were used to generate in silico maps (filtered for those with fewer than four nonend fragments, because these cannot be confidently aligned [14, 15]), which were then aligned with the simulated experimental maps.
For our second test, we compared all methods on optical mapping data generated inhouse from the K562 human cancer cell line on OpGen’s Argus system, where a random sample of 2000 maps with at least ten fragments was extracted. In silico maps were generated from de novo assemblies of shotgun Illumina sequencing data (HiSeq) and six matepair libraries with insert sizes ranging from 1.1 kbp to 25 kbp [26] (the final assembly had an N50 of 3.1 Mbp and 76,990 scaffolds in total, using SOAPdenovo v.1.05 [27] with a kmer size of 51 for contig assembly and Opera v.1.1 [28] for scaffolding with mate pairs). It is likely that this dataset represents a harder scenario, with assembly gaps/errors and genomic rearrangements confounding the analysis. It also represents a likely use case where mapping data will be critical to detect large structural variations, disambiguate complex rearrangements and, ultimately, assemble cancer genomes de novo.
Comparison of methods for overlap maptosequence alignment
Algorithm  Drosophila (A)  Drosophila (B)  Human (A)  Human (B)  Human real data  

E  P  E  P  E  P  E  P  E  
OPTIMAoverlap  91  100  53  98  72  99  29  97  23 
Gentig v.2 (d)  69  100  29  93  51  93  19  83  14 
Likelihoodoverlap (d + a)  59  74  36  52  21  41  9  26  12 
Utility in realworld applications
Overlap alignments form a critical building block for applications such as OpGen’s GenomeBuilder and its use in boosting assembly quality [4]. As OPTIMAOverlap can work with lower quality data (scenario (B) in our simulations; GenomeBuilder would typically filter out such data) and also provide improved sensitivity in detecting overlap alignments, we estimate that its use could reduce the requirement for generating mapping data by half. As the cost of mapping data for the assembly of large eukaryotic genomes can range from USD 20,000 to 100,000, this can lead to significant cost savings.
We similarly compared OPTIMA and Gentig on the two human cell line results, shown in Tables 3 and 4, in order to calculate how much mapping data would be needed for sufficient aligned coverage of the human genome to enable structural variation analysis. By analyzing the alignment rate increase of OPTIMA compared to Gentig, a 1.5 to 2.9 times increase on average, we computed the corresponding cost reduction to be 33–66 %, with an average cost reduction of 54 % for relaxed filtering of data (r) and 49 % for stringent filtering (s). These results suggest that for structural variation analysis on the human genome, particularly for cancer genomes, OPTIMA can significantly reduce project costs (in the tens of thousands of dollars) while enabling faster analyses of the data.
Conclusion
With the availability of new mapping technologies (for example, Nabsys) and greater use of existing ones to complement highthroughput sequencing, there is a critical need for robust computational tools that can combine genomic mapping and sequence data efficiently. In this work, we introduce two new alignment tools that address this need for a wide range of applications, from genome assembly to structural variation analysis. Our benchmarking results provide evidence that these methods outperform existing approaches in sensitivity and runtime while providing highly precise alignments in a range of experimental settings. Similar results are also seen in real datasets from human cell lines, suggesting that they could help in significantly reducing the cost of optical mapping analysis and thus increase its usage.
In the development of OPTIMA and OPTIMAOverlap, we establish two key new ideas for map alignment. The first is the introduction of composite seeds, an idea that echoes the idea of spaced seeds in the context of continuousvalued sequence alignment. Composite seeds allow us to develop efficient seedandextend aligners for maptosequence alignment of erroneous mapping data. We believe that similar ideas can be applied for maptomap alignment and de novo assembly of experimental maps. The second concept is the development of a conservative statistical testing approach that does not require knowledge of the true distribution of errors or an expensive permutation test to evaluate the uniqueness and significance of alignments. This allows us to significantly reduce the runtime cost of this step without sacrificing specificity or the ability to be agnostic with respect to error rates. Although our experiments with real data in this work were limited to data generated on the Argus system from OpGen, similar ideas (with minor variations) should also be applicable to data generated by other technologies such as the Irys platform from BioNano Genomics.
In future work, we plan to implement further runtime and memory optimizations to OPTIMA and OPTIMAOverlap and to explore their use for superscaffolding of large genomes as well as for studying genomic rearrangements in cancer.
Availability and requirements

Project name: OPTIMA: Indexbased maptosequence alignment in large eukaryotic genomes

Project home page: http://www.davideverzotto.it/research/OPTIMA, https://github.com/verznet/OPTIMA

Operating system: Platform independent

Programming language: Java 7+

Other requirements: Java Development Kit 7+, Apache Commons Math 3.2, CERN Colt 1.2.0

License: LGPL (Lesser General Public License) 2.1, OSI compliant

Any restrictions to use by nonacademics: none
Availability of supporting data
Declarations
Acknowledgments
We would like to thank Juntao Li for useful discussions on the statistical analysis and Denis Bertrand and Burton KH Chia for valuable feedback and comments on a draft of this manuscript. We thank Fei Yao for help on experiments and troubleshooting of the Argus system and Chee Seng Chan and Lavanya Veeravalli for postprocessing the optical mapping data.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 OpGen Inc. 2015. Available from: http://www.opgen.com. Accessed 9 Dec 2015.
 BioNano Genomics Inc. 2015. Available from: http://www.bionanogenomics.com. Accessed 9 Dec 2015.
 Nabsys Inc. 2015. Available from: http://www.nabsys.com. Accessed 9 Dec 2015.
 Dong Y, Xie M, Jiang Y, Xiao N, Du X, Wang W, et al.Sequencing and automated wholegenome optical mapping of the genome of a domestic goat (Capra hircus). Nat Biotechnol. 2013; 31:135–41.PubMedView ArticleGoogle Scholar
 Ganapathy G, Howard J, Ward J, Zhang G, Phillippy A, Jarvis E, et al.Highcoverage sequencing and annotated assemblies of the budgerigar genome. GigaScience. 2014; 3:11.PubMedPubMed CentralView ArticleGoogle Scholar
 Lam ET, Hastie A, Lin C, Ehrlich D, Nagarajan N, Kwok PY, et al.Genome mapping on nanochannel arrays for structural variation analysis and sequence assembly. Nat Biotechnol. 2012; 30:771–6.PubMedView ArticleGoogle Scholar
 Ray M, Goldstein S, Zhou S, Potamousis K, Sarkar D, Schwartz D, et al.Discovery of structural alterations in solid tumor oligodendroglioma by single molecule analysis. BMC Genomics. 2013; 14:505.PubMedPubMed CentralView ArticleGoogle Scholar
 Waterman MS, Smith TF, Katcher H. Algorithms for restriction map comparisons. Nucleic Acids Res. 1984; 12:237–42.PubMedPubMed CentralView ArticleGoogle Scholar
 Mendelowitz L, Pop M. Computational methods for optical mapping. GigaScience. 2014; 3:33.PubMedPubMed CentralView ArticleGoogle Scholar
 Nagarajan N, Read TD, Pop M. Scaffolding and validation of bacterial genome assemblies using optical restriction maps. Bioinformatics. 2008; 24:1229–35.PubMedPubMed CentralView ArticleGoogle Scholar
 Lin H, Goldstein S, Mendelowitz L, Zhou S, Schwartz D, Pop M, et al.AGORA: assembly guided by optical restriction alignment. BMC Bioinforma. 2012; 13:189.View ArticleGoogle Scholar
 Antoniotti M, Anantharaman T, Paxia S, Mishra B. Genomics via optical mapping IV: sequence validation via optical map matching. New York, NY, USA: New York University; 2001.Google Scholar
 Valouev A, Li L, Liu YC, Schwartz DC, Yang Y, Waterman MS, et al.Alignment of optical maps. J Comput Biol. 2006; 13:442–62.PubMedView ArticleGoogle Scholar
 Anantharaman TS, Mishra B. A probabilistic analysis of false positives in optical map alignment and validation. In: First international workshop on algorithms in bioinformatics (WABI). Aarhus, Denmark: Springer: 2001. p. 27–40.Google Scholar
 Sarkar D, Goldstein S, Schwartz DC, Newton MA. Statistical significance of optical map alignments. J Comput Biol. 2012; 19:478–92.PubMedPubMed CentralView ArticleGoogle Scholar
 Anantharaman TS, Mishra B, Schwartz DC. Genomics via optical mapping II: ordered restriction maps. J Comput Biol. 1997; 4:91–118.PubMedView ArticleGoogle Scholar
 Anantharaman TS, Mishra B, Schwartz DC. Genomics via optical mapping III: contiging genomic DNA. In: Proceedings of the 7th international conference on intelligent systems for molecular biology (ISMB 1999). Heidelberg, Germany: AAAI: 1999. p. 18–27.Google Scholar
 Karp RM, Rabin MO. Efficient randomized patternmatching algorithms. IBM J Res Dev. 1987; 31:249–60.View ArticleGoogle Scholar
 Muggli MD, Puglisi SJ, Boucher C. Efficient indexed alignment of contigs to optical maps. In: Algorithms in Bioinformatics (WABI 2014). vol. 8701 of LNCS. Wrocław, Poland: Springer: 2014. p. 68–81.Google Scholar
 Verzotto D, Teo ASM, Hillmer AM, Nagarajan N. Indexbased maptosequence alignment in large eukaryotic genomes. In: Proceedings of the fifth RECOMB satellite workshop on massively parallel sequencing (RECOMBSeq 2015). Warsaw (Poland): Cold Spring Harbor Labs Journals: 2015. p. 1–11. doi:10.1101/017194.Google Scholar
 Califano A, Rigoutsos I. FLASH: a fast lookup algorithm for string homology. In: Proceedings of the 1st International Conference on Intelligent Systems for Molecular Biology (ISMB 1993). Bethesda, MD, USA: AAAI: 1993. p. 56–64.Google Scholar
 Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Nat Acad Sci. 2003; 100:9440–5.PubMedPubMed CentralView ArticleGoogle Scholar
 Kawahara Y, de la Bastide M, Hamilton J, Kanamori H, McCombie R, Matsumoto T, et al.Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice. 2013; 6:4.PubMedView ArticleGoogle Scholar
 Anantharaman TS, Mysore V, Mishra B. Fast and cheap genome wide haplotype construction via optical mapping In: Altman RB, Jung TA, Klein TE, Dunker AK, Hunter L, editors. Proceedings of the 10th Pacific Symposium on Biocomputing (PSB 2005). Hawaii, USA: World Scientific: 2005. p. 1–16.Google Scholar
 Teo ASM, Verzotto D, Yao F, Nagarajan N, Hillmer AM. Singlemolecule optical genome mapping of a human HapMap and a colorectal cancer cell line. GigaScience. 2015; 4:65. doi:10.1186/s1374201501061.PubMedPubMed CentralView ArticleGoogle Scholar
 Yao F, Ariyaratne PN, Hillmer AM, Liu ET, Ruan Y. Long span DNA PairedEndTag (DNAPET) sequencing strategy for the interrogation of genomic structural mutations and fusionpointguided reconstruction of amplicons. PLOS ONE. 2012; e46152:7.Google Scholar
 Li R, Zhu H, Ruan J, Wang J, Fang X, Wang J, et al.De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010; 20:265–72.PubMedPubMed CentralView ArticleGoogle Scholar
 Gao S, Sung WK, Nagarajan N. Opera: Reconstructing Optimal Genomic Scaffolds with HighThroughput PairedEnd Sequences. J Comput Biol. 2011; 18:1681–91.PubMedPubMed CentralView ArticleGoogle Scholar
 Verzotto D, Teo ASM, Hillmer AM, Nagarajan N. Supporting software for OPTIMA, a tool for sensitive and accurate wholegenome alignment of errorprone genomic maps by combinatorial indexing and technologyagnostic statistical analysis; 2015. GigaScience Database. doi:10.5524/100165.Google Scholar
 Teo ASM, Verzotto D, Yao F, Nagarajan N, Hillmer AM. Supporting singlemolecule optical genome mapping data from human HapMap and colorectal cancer cell lines; 2015. GigaScience Database. doi:10.5524/100182.Google Scholar