OPTIMA is the first alignment tool based on the seed-and-extend paradigm that can deal with erroneous mapping data. The basic paradigm is similar to that used for the alignment of discrete-valued sequences (allowing for mismatches and indels) and is as follows. We start by indexing the in silico maps, so that we can use this information later, and find seeds for each experimental map o corresponding to some indexed regions of those sequences. We then extend these seeds by using dynamic programming to try to align the whole experimental map to the corresponding in silico map region. For each map o, feasible solutions – as defined by the index structure, size of the genome and maximum error rate – are then evaluated by a scoring scheme to select the optimal solution. Finally, the statistical significance and uniqueness of the optimal solution are determined by comparing and modeling all the identified feasible solutions.
Indexing continuous-valued seeds
The definition of appropriate seeds is critical in a seed-and-extend approach in order to maintain a good balance between sensitivity and speed. A direct extension of discrete-valued seeds to continuous values is to consider values that are close to each other (as defined by the C
σ
bound) as matches. However, as mapping data typically have high error rates [13, 16] and represent short sequences (for example, on average, optical maps contain 10–22 fragments, representing roughly a 250 kbp region of the genome), a seed of c consecutive fragments (c-mer) is likely to have low sensitivity unless we use a naive c=1 approach (see Fig. 2 for a comparison) and pay a significant runtime penalty that scales with genome size [14, 16]. Therefore, we propose and validate the following composite seed extension for continuous-valued seeds, analogous to the work on spaced seeds for discrete-valued sequences [21].
Definition
4 (Composite seed).
Let \(r_{j_{1}}\), \(r_{j_{2}}\) and \(r_{j_{3}}\) be consecutive restriction fragments from a reference in silico map. A continuous-valued composite seed, for c=2, is given by including all of the following:
-
(i) the c-mer \(r_{j_{1}}, r_{j_{2}}\), corresponding to no false cuts in the in silico map;
-
(ii) the c-mer \(r_{j_{1}}+r_{j_{2}}, r_{j_{3}}\), corresponding to a missing cut in the experimental map (or false cut in the in silico map); and
-
(iii) the c-mer \(r_{j_{1}}, r_{j_{2}}+r_{j_{3}}\), corresponding to a different missing cut in the experimental map (or false cut in the in silico map).
The reference index would then contain all c-tuples corresponding to a composite seed, as defined in Definition 4, for each location in the reference map. In addition, to account for false cuts in the experimental map, for each set of consecutive fragments \(o_{i_{1}}\), \(o_{i_{2}}\) and \(o_{i_{3}}\) in the experimental maps, we search for c-tuples of the type \(o_{i_{1}}, o_{i_{2}}\) and \(o_{i_{1}} + o_{i_{2}}, o_{i_{3}}\) in the index (see Composite seeds (iv) depicted in Fig. 1c).
To index the seeds, we adopt a straightforward approach where all c-tuples are collected and sorted into the same index in lexicographic order by c1 (where the c
i
are elements in the c-tuple). Lookups can be performed by binary search over fragment-sized intervals that satisfy the C
σ
bound for c1 and a subsequent linear scan of the other elements c
i
, for i≥2, while verifying the C
σ
bound in each case. Note that, because seeds are typically expected to be of higher quality, we can apply a more stringent threshold on seed fragment size matches (for example, we used \(C^{Seed}_{\sigma } = 2\)).
As shown in the “Results and discussion” section, this approach significantly reduces the space of candidate alignments without affecting the sensitivity of the search. A comparison between the various seeding approaches is shown in Fig. 2, which highlights the advantages of composite seeds with respect to 2-mers.
Overall, the computational cost of finding seeds using this approach is O(m (logn+c
#seedsc=1)) per experimental map, where n is the total length of the in silico maps in fragments, m≪n is the length of the experimental map and #seedsc=1 is the number of seeds found in the first level of the index lookup, before narrowing down the list to the actual number of seeds that will be extended (#seeds). The cost and space of creating the reference index is thus O(c
n), if the number of errors considered in the composite seeds is limited and bounded (as in Definition 4), and radix sort is used to sort the index. This approach drastically reduces the number of alignments computed in comparison to more general, global alignment searches [10], as will be shown later in the “Results and discussion” section.
Dynamic programming-based extension of seeds
In order to extend a seed to get a glocal alignment we adopt a scoring scheme similar to that used in SOMA (see [10]). This allows us to evaluate alignments without relying on a likelihood-based framework that requires prior information on error distributions as input [13]. In addition, we can use dynamic programming to efficiently find glocal alignments that optimize this score and contain the seed (see Fig. 1d); specifically, for each seed side we proceed along the dynamic programming matrix by aligning the end of the sth experimental fragment with the end of the tth in silico fragment using backtracking to find feasible matches, that is, those that satisfy Eq. 1 and minimize the total number of cut errors (#cuterrors= missing cuts + false cuts + missing fragments found), with ties being broken by minimizing a χ2 function for fragment sizing errors:
$$ {Score}_{s, t} = \min_{k \le s, l \le t} C_{ce} (s - k + t - l) + {\chi}_{k..s, l..t}^{2} + {Score}_{k-1, l-1}, $$
((2))
where the first index of each subscript represents experimental fragments, the second index represents the in silico fragments, s−k is the number of false cuts, t−l is the number of missing cuts, C
ce
is a constant larger than the maximum possible total for χ2, \({\chi }_{k..s, l..t}^{2} = \left (\sum \limits _{i=k}^{s} o_{i} - \sum \limits _{j=l}^{t} r_{j}\right)^{2} / \left (\sum \limits _{j=l}^{t} {\sigma _{j}^{2}}\right)\) and Score0,0=0, Scorei,0=∞ and Score0,j=∞.
Note that a small in silico fragment is considered as missing if this condition allows for a valid alignment that improves the local χ2 on nearby matches by half (up to three consecutive fragments).
As in [16], we band the dynamic programming and its backtracking to avoid unnecessary computation. In particular, as we show in Supplementary Note 1 in Additional file 1, based on parameter estimates for optical mapping data, restricting alignments to eight missing cuts or five false cuts, consecutively, should retain high sensitivity. In addition, we stop the dynamic programming-based extension if no feasible solutions can be found for the current seed after having analyzed at least f fragments (default of five).
The computational cost of extending a seed (c-tuple) of an experimental map with m fragments is thus, in the worst case, O((m−c) δ3) time, where δ is the bandwidth of the dynamic programming, and O((m−c)2) space for allocating the dynamic programming matrix for each side of the seed.
Statistical significance and uniqueness analysis
To evaluate the statistical significance of a candidate alignment, we exploit the fact that we have explored the space of feasible alignments in our search and use these alignments to approximate a random sample from a (conservative) null model. The assumption here is that there is only one true alignment and that, therefore, the population of these sub-optimal alignments can provide a conservative null model for evaluating the significance of an alignment; more specifically, for each candidate alignment found, we compute its distance from the null model in a feature space (to be defined later) using a Z-score transformation and then use this score to evaluate whether it is optimal, statistically significant and unique (see Fig. 3 for an example).
We start by identifying a set F of features, independent with respect to false positive (or random) alignments, that are expected to follow the normal distribution (for example, using the central limit theorem) and be comparable between different alignments of the same experimental map, and we compute a Z-score for each feature f∈F and for each candidate solution π∈Π identified through the seeding method. Each Z-score takes into account the mean and standard deviation of a particular feature f among all candidate alignments found:
$$ {Z-score}(\mathrm{\pi} \in \mathrm{\Pi}, f) = \frac{f_{\mathrm{\pi}} - {Mean}(f_{\mathrm{\Pi}})}{{SD}(f_{\mathrm{\Pi}})}. $$
((3))
Accounting for all considered features f
i
, with 1≤i≤k and k≥2, the resulting score is given by:
$$ \vartheta(\mathrm{\pi} \in \mathrm{\Pi}) = {Z-score}\left(\sum_{i} s_{i} \times {Z-score}(\mathrm{\pi}, f_{i})\right), $$
((4))
where s
i
=1 if lower values of feature f
i
are preferable and −1 otherwise, and the corresponding p-value is pπ=Pnorm(𝜗(π)), that is, the probability that a ‘random’ Z-score will be less than 𝜗(π) under the standard normal distribution.
In our case, we chose a set of features based on the number of matches (#matches), the total number of cut errors and the Wilson-Hilferty transformation (WHT) of the χ2 score for sizing errors (which converges quickly to a standard normal distribution):
$$ \textrm{WHT}\left({\chi}^{2}, {\# matches}\right) = \frac{\sqrt[3]{\frac{{\chi}^{2}}{{\# matches}}} - \left(1 - \frac{1}{9} \frac{2}{{\# matches}}\right)}{\sqrt{\frac{1}{9} \frac{2}{{\# matches}}}}. $$
((5))
Note that this set can be shown to be composed of approximately independent features for false positive alignments (Z-score pairwise covariances between all features did not show a significant alteration of the final Z-scores when accounting for them in all of our simulations). By the central limit theorem, the mean of the first two features can be approximated by the normal distribution when the number of candidate solutions is large enough (typically, greater than 60 distinct solutions), and, by Slutsky’s theorem, their sample variance would not have a significant effect on the distribution of the test statistics. As lower values of #cuterrors and WHT(χ2,#matches) indicate better solutions, while a higher number of matches represents more reliable alignments, we need to adjust the signs of their Z-scores accordingly. The final Z-score 𝜗(π) computed for each candidate solution π is therefore given by:
$$ \begin{aligned} \vartheta(\mathrm{\pi} \in \mathrm{\Pi}) &= {Z-score}(-{Z-score}(\mathrm{\pi}, {\# matches})\\ &\quad+ {Z-score}(\mathrm{\pi}, {\# cut errors})\\ &\quad+ {Z-score}(\mathrm{\pi}, \textrm{WHT}({\chi}^{2}, {\# matches}))\,), \end{aligned} $$
((6))
which can be subsequently converted into the p-value pπ. The candidate solution π∗ with the lowest p-value p∗ is reported as the optimal solution, as shown in Fig. 3.
The statistical significance of the optimal solution can then be assessed through a false discovery rate q-value analysis [22] based on all candidate solutions found for comparable experimental maps, for example, those with the same number of fragments (default of q=0.01). To assess uniqueness, we set a threshold on the ratio of p-values between the best solution and the next-best solution (default of five). See Supplementary Note 2 in Additional file 1 for further algorithmic details.
In summary, our statistical scoring approach finds an optimal solution and evaluates its statistical significance and uniqueness in a unified framework, thus allowing for savings in computational time and space compared to a permutation test, without restricting the method to a scenario where experimental error probabilities are known a priori.
Extension to overlap alignment
To extend OPTIMA to compute and evaluate overlap alignments – a key step in assembly pipelines that use mapping data [4, 5, 23] – we use a sliding-window approach based on OPTIMA. This allows us to continue using the statistical evaluation procedure defined in OPTIMA that relies on learning parameters from comparable alignments (that is, those with the same number, size and order of experimental fragments) in a setting where the final alignments are not always of the same length and structure.
Briefly, for each map, OPTIMA-Overlap first finds optimal sub-map alignments, evaluates their significance and uniqueness, and then tries to extend the candidate alignments found until it reaches the ends of either the experimental map or the in silico map, in order to choose the most significant overlap alignments (see Fig. 1e). This approach begins by dividing an experimental map into sub-maps of length l with a sliding window and then glocally aligning them to in silico maps using OPTIMA (again allowing for truncated ends to account for high error rates). Each glocal alignment sub-problem will then return either:
-
(i)
a significant and unique sub-map alignment;
-
(ii)
an alignment labeled as non-significant and/or non-unique (which will be considered as a false alignment); or
-
(iii)
no feasible alignments found.
Optimal solutions to the sub-problems are then ranked by p-value (smallest to largest) and iterated through to select sub-maps that should be extended. At each stage we check the significance and uniqueness of the reported solutions (compared to the others), as well as checking for potential cases of identical or conflicting alignments as defined below.
Definition
5 (Conflicting alignments).
A sub-map alignment π1 is said to be conflicting with another alignment π2 if either:
-
the sub-map of π1 overlaps the sub-map of π2; or
-
π1 aligns to the same in silico map as π2, but in a different location or strand.
Conflicting alignments can result in ambiguous placement of an experimental map on a database of in silico maps, but condition (a) could be relaxed in some cases, for example, when experimental maps are known to overlap multiple in silico maps in the same region. Therefore, while iterating through the list of sub-maps, the following rules are implemented:
-
1.
Significance: if the current solution π
i
is labeled as a false alignment, then we stop iterating through the rest of the list.
-
2.
Uniqueness: we skip an alignment if either: (i) π
i
represents the same overlap alignment as a more significant solution; (ii) π
i
is conflicting with a solution with a lower p-value (that is, seen before); or (iii) π
i
is not unique with respect to other solutions π
j
with j>i (that is, having greater p-values) that it is conflicting with.
-
3.
Extension with dynamic programming: optimal overlap solutions are identified according to Eq. 2, where ties are broken in favor of longer valid alignments.
This approach allows us to report multiple overlap alignments (including split alignments) for an experimental map while using the q-value analysis, as before, to report all alignments with q≤0.01. For the q-value analysis, we consider all candidate solutions found for the sliding windows in order to learn the q-value parameters. In addition, we can reuse the dynamic programming matrix computed for each seed across sub-map alignments and thus complete the overlap alignment with the same asymptotic time and space complexity as the glocal alignment.