OPTIMA is the first alignment tool based on the seed-and-extend paradigm that can deal with erroneous mapping data. The basic paradigm is similar to that used for the alignment of discrete-valued sequences (allowing for mismatches and indels) and is as follows. We start by indexing the in silico maps, so that we can use this information later, and find seeds for each experimental map o corresponding to some indexed regions of those sequences. We then extend these seeds by using dynamic programming to try to align the whole experimental map to the corresponding in silico map region. For each map o, feasible solutions – as defined by the index structure, size of the genome and maximum error rate – are then evaluated by a scoring scheme to select the optimal solution. Finally, the statistical significance and uniqueness of the optimal solution are determined by comparing and modeling all the identified feasible solutions.
Indexing continuous-valued seeds
The definition of appropriate seeds is critical in a seed-and-extend approach in order to maintain a good balance between sensitivity and speed. A direct extension of discrete-valued seeds to continuous values is to consider values that are close to each other (as defined by the \(C_{\sigma}\) bound) as matches. However, as mapping data typically have high error rates [13, 16] and represent short sequences (for example, on average, optical maps contain 10–22 fragments, representing roughly a 250-kbp region of the genome), a seed of c consecutive fragments (c-mer) is likely to have low sensitivity unless we use a naive c=1 approach (see Fig. 2 for a comparison) and pay a significant runtime penalty that scales with genome size [14, 16]. Therefore, we propose and validate the following composite-seed extension for continuous-valued seeds, analogous to the work on spaced seeds for discrete-valued sequences [21].
Definition 4 (Composite seed).
Let \(r_{j_{1}}\), \(r_{j_{2}}\) and \(r_{j_{3}}\) be consecutive restriction fragments from a reference in silico map. A continuous-valued composite seed, for c=2, is given by including all of the following:

(i) the c-mer \(r_{j_{1}}, r_{j_{2}}\), corresponding to no false cuts in the in silico map;

(ii) the c-mer \(r_{j_{1}}+r_{j_{2}}, r_{j_{3}}\), corresponding to a missing cut in the experimental map (or false cut in the in silico map); and

(iii) the c-mer \(r_{j_{1}}, r_{j_{2}}+r_{j_{3}}\), corresponding to a different missing cut in the experimental map (or false cut in the in silico map).
The reference index would then contain all c-tuples corresponding to a composite seed, as defined in Definition 4, for each location in the reference map. In addition, to account for false cuts in the experimental map, for each set of consecutive fragments \(o_{i_{1}}\), \(o_{i_{2}}\) and \(o_{i_{3}}\) in the experimental maps, we search for c-tuples of the type \(o_{i_{1}}, o_{i_{2}}\) and \(o_{i_{1}} + o_{i_{2}}, o_{i_{3}}\) in the index (see Composite seeds (iv) depicted in Fig. 1c).
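The enumeration of composite seeds from Definition 4, together with the tuples searched on the experimental side, can be sketched as follows. This is a minimal illustration; the function names and list representation are ours, not OPTIMA's.

```python
def composite_seeds(r, j):
    """Composite-seed 2-tuples (c=2) anchored at reference position j, per
    Definition 4: the plain 2-mer plus the two variants that merge adjacent
    fragments to model one missing cut in the experimental map."""
    seeds = [(r[j], r[j + 1])]                     # (i) no false cuts
    if j + 2 < len(r):
        seeds.append((r[j] + r[j + 1], r[j + 2]))  # (ii) cut between r_j1, r_j2 missing
        seeds.append((r[j], r[j + 1] + r[j + 2]))  # (iii) cut between r_j2, r_j3 missing
    return seeds

def query_tuples(o, i):
    """2-tuples searched for an experimental map at position i: the plain
    2-mer and the variant merging o_i1, o_i2 (false cut in the experimental map)."""
    tuples = [(o[i], o[i + 1])]
    if i + 2 < len(o):
        tuples.append((o[i] + o[i + 1], o[i + 2]))
    return tuples
```

With fragment sizes r = [10, 20, 30, 40] (in kbp), `composite_seeds(r, 0)` yields the three tuples (10, 20), (30, 30) and (10, 50).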
To index the seeds, we adopt a straightforward approach where all c-tuples are collected and sorted into the same index in lexicographic order by \(c_{1}\) (where the \(c_{i}\) are elements in the c-tuple). Lookups can be performed by binary search over fragment-size intervals that satisfy the \(C_{\sigma}\) bound for \(c_{1}\) and a subsequent linear scan of the other elements \(c_{i}\), for i≥2, while verifying the \(C_{\sigma}\) bound in each case. Note that, because seeds are typically expected to be of higher quality, we can apply a more stringent threshold on seed fragment size matches (for example, we used \(C^{Seed}_{\sigma } = 2\)).
As shown in the “Results and discussion” section, this approach significantly reduces the space of candidate alignments without affecting the sensitivity of the search. A comparison between the various seeding approaches is shown in Fig. 2, which highlights the advantages of composite seeds with respect to 2-mers.
Overall, the computational cost of finding seeds using this approach is \(O(m (\log n + c \cdot \#seeds_{c=1}))\) per experimental map, where n is the total length of the in silico maps in fragments, m≪n is the length of the experimental map and \(\#seeds_{c=1}\) is the number of seeds found in the first level of the index lookup, before narrowing down the list to the actual number of seeds that will be extended (#seeds). The cost and space of creating the reference index is thus \(O(c \cdot n)\), if the number of errors considered in the composite seeds is limited and bounded (as in Definition 4), and radix sort is used to sort the index. This approach drastically reduces the number of alignments computed in comparison to more general, global alignment searches [10], as will be shown later in the “Results and discussion” section.
Dynamic programming-based extension of seeds
In order to extend a seed to get a glocal alignment, we adopt a scoring scheme similar to that used in SOMA (see [10]). This allows us to evaluate alignments without relying on a likelihood-based framework that requires prior information on error distributions as input [13]. In addition, we can use dynamic programming to efficiently find glocal alignments that optimize this score and contain the seed (see Fig. 1d). Specifically, for each side of the seed we proceed along the dynamic programming matrix by aligning the end of the s-th experimental fragment with the end of the t-th in silico fragment, using backtracking to find feasible matches, that is, those that satisfy Eq. 1 and minimize the total number of cut errors (#cut errors = missing cuts + false cuts + missing fragments found), with ties being broken by minimizing a χ^{2} function for fragment sizing errors:
$$ {Score}_{s, t} = \min_{k \le s,\, l \le t} C_{ce} (s - k + t - l) + {\chi}_{k..s, l..t}^{2} + {Score}_{k-1, l-1}, $$
(2)
where the first index of each subscript represents experimental fragments, the second index represents the in silico fragments, s−k is the number of false cuts, t−l is the number of missing cuts, \(C_{ce}\) is a constant larger than the maximum possible total for χ^{2}, \({\chi }_{k..s, l..t}^{2} = \left (\sum \limits _{i=k}^{s} o_{i} - \sum \limits _{j=l}^{t} r_{j}\right)^{2} / \left (\sum \limits _{j=l}^{t} {\sigma _{j}^{2}}\right)\) and Score_{0,0}=0, Score_{i,0}=∞ and Score_{0,j}=∞.
Note that a small in silico fragment is considered missing if doing so allows for a valid alignment that improves the local χ^{2} on nearby matches (up to three consecutive fragments) by half.
As in [16], we band the dynamic programming and its backtracking to avoid unnecessary computation. In particular, as we show in Supplementary Note 1 in Additional file 1, based on parameter estimates for optical mapping data, restricting alignments to eight missing cuts or five false cuts, consecutively, should retain high sensitivity. In addition, we stop the dynamic programming-based extension if no feasible solutions can be found for the current seed after having analyzed at least f fragments (default of five).
The computational cost of extending a seed (c-tuple) of an experimental map with m fragments is thus, in the worst case, O((m−c) δ^{3}) time, where δ is the bandwidth of the dynamic programming, and O((m−c)^{2}) space for allocating the dynamic programming matrix for each side of the seed.
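The recurrence in Eq. 2 can be sketched as a small banded dynamic program. This is a simplified illustration under our own assumptions: it scores the alignment of a whole experimental map o against a candidate in silico region r (already located by a seed), with no free ends, backtracking or seed anchoring; `band` loosely mimics the banding described above.

```python
import math

def chi2(o, r, k, s, l, t, sigma2):
    """Sizing-error term chi^2_{k..s, l..t} from Eq. 2 (indices 1-based, inclusive)."""
    d = sum(o[k - 1:s]) - sum(r[l - 1:t])
    return d * d / sum(sigma2[l - 1:t])

def glocal_score(o, r, sigma2, C_ce, band=5):
    """Banded DP for Eq. 2: S[s][t] is the best score of an alignment ending
    with experimental fragment s matched to in silico fragment t, charging
    C_ce per cut error (s-k false cuts plus t-l missing cuts)."""
    m, n = len(o), len(r)
    INF = math.inf
    S = [[INF] * (n + 1) for _ in range(m + 1)]
    S[0][0] = 0.0  # boundary conditions: Score_{0,0}=0, Score_{i,0}=Score_{0,j}=inf
    for s in range(1, m + 1):
        for t in range(1, n + 1):
            best = INF
            for k in range(max(1, s - band), s + 1):
                for l in range(max(1, t - band), t + 1):
                    prev = S[k - 1][l - 1]
                    if prev == INF:
                        continue
                    cand = C_ce * (s - k + t - l) + chi2(o, r, k, s, l, t, sigma2) + prev
                    best = min(best, cand)
            S[s][t] = best
    return S[m][n]
```

For example, a perfect match (o = r = [10, 20]) scores 0, while o = [30] against r = [10, 20] is explained by one missing cut and scores C_ce.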
Statistical significance and uniqueness analysis
To evaluate the statistical significance of a candidate alignment, we exploit the fact that we have explored the space of feasible alignments in our search, and we use these alignments to approximate a random sample from a (conservative) null model. The assumption here is that there is only one true alignment and that, therefore, the population of these suboptimal alignments can provide a conservative null model for evaluating the significance of an alignment. More specifically, for each candidate alignment found, we compute its distance from the null model in a feature space (to be defined later) using a Z-score transformation, and then use this score to evaluate whether it is optimal, statistically significant and unique (see Fig. 3 for an example).
We start by identifying a set F of features that are independent with respect to false positive (or random) alignments, that are expected to follow the normal distribution (for example, by the central limit theorem), and that are comparable between different alignments of the same experimental map. We then compute a Z-score for each feature f∈F and for each candidate solution π∈Π identified through the seeding method. Each Z-score takes into account the mean and standard deviation of a particular feature f among all candidate alignments found:
$$ \text{Z-score}(\pi \in \Pi, f) = \frac{f_{\pi} - {Mean}(f_{\Pi})}{{SD}(f_{\Pi})}. $$
(3)
Accounting for all considered features \(f_{i}\), with 1≤i≤k and k≥2, the resulting score is given by:
$$ \vartheta(\pi \in \Pi) = \text{Z-score}\left(\sum_{i} s_{i} \times \text{Z-score}(\pi, f_{i})\right), $$
(4)
where \(s_{i}=1\) if lower values of feature \(f_{i}\) are preferable and −1 otherwise, and the corresponding p-value is \(p_{\pi}=Pnorm(\vartheta(\pi))\), that is, the probability that a ‘random’ Z-score will be less than 𝜗(π) under the standard normal distribution.
In our case, we chose a set of features based on the number of matches (#matches), the total number of cut errors and the Wilson–Hilferty transformation (WHT) of the χ^{2} score for sizing errors (which converges quickly to a standard normal distribution):
$$ \textrm{WHT}\left({\chi}^{2}, {\# matches}\right) = \frac{\sqrt[3]{\frac{{\chi}^{2}}{{\# matches}}} - \left(1 - \frac{1}{9} \frac{2}{{\# matches}}\right)}{\sqrt{\frac{1}{9} \frac{2}{{\# matches}}}}. $$
(5)
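Eq. 5 translates directly into code. A useful sanity check: when χ^{2}/#matches = 1, the transformation reduces to \(\sqrt{2/(9 \cdot \#matches)}\), so WHT(2, 2) = 1/3.

```python
def wht(chi2_score, matches):
    """Wilson-Hilferty transformation (Eq. 5): maps a chi^2 statistic with
    `matches` degrees of freedom onto an approximately standard normal score."""
    v = 2.0 / (9.0 * matches)  # the (1/9)(2/#matches) term
    return ((chi2_score / matches) ** (1.0 / 3.0) - (1.0 - v)) / v ** 0.5
```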
Note that this set can be shown to be composed of approximately independent features for false positive alignments (accounting for pairwise covariances between the features' Z-scores did not significantly alter the final Z-scores in any of our simulations). By the central limit theorem, the mean of the first two features can be approximated by the normal distribution when the number of candidate solutions is large enough (typically, greater than 60 distinct solutions), and, by Slutsky's theorem, their sample variance does not have a significant effect on the distribution of the test statistics. As lower values of #cut errors and WHT(χ^{2}, #matches) indicate better solutions, while a higher number of matches represents more reliable alignments, we need to adjust the signs of their Z-scores accordingly. The final Z-score 𝜗(π) computed for each candidate solution π is therefore given by:
$$ \begin{aligned} \vartheta(\pi \in \Pi) &= \text{Z-score}\big(\text{Z-score}(\pi, {\# matches})\\ &\quad+ \text{Z-score}(\pi, {\# cut\ errors})\\ &\quad+ \text{Z-score}(\pi, \textrm{WHT}({\chi}^{2}, {\# matches}))\big), \end{aligned} $$
(6)
which can be subsequently converted into the p-value p_{π}. The candidate solution π^{∗} with the lowest p-value p^{∗} is reported as the optimal solution, as shown in Fig. 3.
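Eqs. 3–6 can be sketched end to end as follows. The function names are ours; we assume the sign adjustment is s = −1 for #matches (higher is better) and s = +1 for the other two features, so lower 𝜗 and hence lower p indicate better candidates.

```python
import math
from statistics import mean, stdev

def zscores(values):
    """Eq. 3: Z-score of each candidate's feature value against the mean and
    s.d. of that feature over all candidate alignments."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]

def pnorm(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def final_scores(matches, cut_errors, wht_scores):
    """Eqs. 4 and 6: combine the three per-candidate feature Z-scores (with the
    sign flip on #matches) and re-standardize; returns (theta, p-value) pairs."""
    combined = [-zm + zc + zw for zm, zc, zw in
                zip(zscores(matches), zscores(cut_errors), zscores(wht_scores))]
    return [(t, pnorm(t)) for t in zscores(combined)]
```

The candidate with the smallest p-value is then reported as the optimal solution.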
The statistical significance of the optimal solution can then be assessed through a false discovery rate q-value analysis [22] based on all candidate solutions found for comparable experimental maps, for example, those with the same number of fragments (default of q=0.01). To assess uniqueness, we set a threshold on the ratio of p-values between the best solution and the next-best solution (default of five). See Supplementary Note 2 in Additional file 1 for further algorithmic details.
In summary, our statistical scoring approach finds an optimal solution and evaluates its statistical significance and uniqueness in a unified framework, thus allowing for savings in computational time and space compared to a permutation test, without restricting the method to a scenario where experimental error probabilities are known a priori.
Extension to overlap alignment
To extend OPTIMA to compute and evaluate overlap alignments – a key step in assembly pipelines that use mapping data [4, 5, 23] – we use a sliding-window approach based on OPTIMA. This allows us to continue using the statistical evaluation procedure defined in OPTIMA that relies on learning parameters from comparable alignments (that is, those with the same number, size and order of experimental fragments) in a setting where the final alignments are not always of the same length and structure.
Briefly, for each map, OPTIMA-Overlap first finds optimal submap alignments, evaluates their significance and uniqueness, and then tries to extend the candidate alignments found until it reaches the ends of either the experimental map or the in silico map, in order to choose the most significant overlap alignments (see Fig. 1e). This approach begins by dividing an experimental map into submaps of length l with a sliding window and then glocally aligning them to in silico maps using OPTIMA (again allowing for truncated ends to account for high error rates). Each glocal alignment subproblem will then return either:

(i) a significant and unique submap alignment;

(ii) an alignment labeled as non-significant and/or non-unique (which will be considered a false alignment); or

(iii) no feasible alignments found.
Optimal solutions to the subproblems are then ranked by p-value (smallest to largest) and iterated through to select submaps that should be extended. At each stage we check the significance and uniqueness of the reported solutions (compared to the others), as well as checking for potential cases of identical or conflicting alignments as defined below.
Definition 5 (Conflicting alignments).
A submap alignment π_{1} is said to be conflicting with another alignment π_{2} if either:

(a) the submap of π_{1} overlaps the submap of π_{2}; or

(b) π_{1} aligns to the same in silico map as π_{2}, but in a different location or strand.
Conflicting alignments can result in ambiguous placement of an experimental map on a database of in silico maps, but condition (a) could be relaxed in some cases, for example, when experimental maps are known to overlap multiple in silico maps in the same region. Therefore, while iterating through the list of submaps, the following rules are implemented:

1. Significance: if the current solution \(\pi_{i}\) is labeled as a false alignment, then we stop iterating through the rest of the list.

2. Uniqueness: we skip an alignment if either: (i) \(\pi_{i}\) represents the same overlap alignment as a more significant solution; (ii) \(\pi_{i}\) conflicts with a solution that has a lower p-value (that is, one seen before); or (iii) \(\pi_{i}\) is not unique with respect to other conflicting solutions \(\pi_{j}\) with j>i (that is, those with greater p-values).

3. Extension with dynamic programming: optimal overlap solutions are identified according to Eq. 2, where ties are broken in favor of longer valid alignments.
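The selection pass over the ranked submap solutions can be sketched as below. The helpers `is_false` and `conflicts` are hypothetical stand-ins for OPTIMA's significance labeling and Definition 5; rule 2(i) (identical alignments) and the rule 2(iii) uniqueness-ratio check are folded into `conflicts`/omitted for brevity.

```python
def select_overlaps(solutions, is_false, conflicts):
    """solutions: candidate ids sorted by increasing p-value.
    is_false(a): True if a was labeled non-significant and/or non-unique.
    conflicts(a, b): True if a and b conflict per Definition 5."""
    selected = []
    for cand in solutions:
        if is_false(cand):
            break  # rule 1: stop at the first false alignment in the ranking
        if any(conflicts(cand, s) for s in selected):
            continue  # rule 2(ii): conflicts with a more significant solution
        # rule 3 (dynamic-programming extension of cand) would run here
        selected.append(cand)
    return selected
```

For instance, if candidate "b" conflicts with the better-ranked "a" and "d" is labeled false, only "a" and "c" survive from the ranking ["a", "b", "c", "d"].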
This approach allows us to report multiple overlap alignments (including split alignments) for an experimental map while using the q-value analysis, as before, to report all alignments with q≤0.01. For the q-value analysis, we consider all candidate solutions found for the sliding windows in order to learn the q-value parameters. In addition, we can reuse the dynamic programming matrix computed for each seed across submap alignments and thus complete the overlap alignment with the same asymptotic time and space complexity as the glocal alignment.