 Research
 Open Access
 Open Peer Review
 Published:
OPTIMA: sensitive and accurate wholegenome alignment of errorprone genomic maps by combinatorial indexing and technologyagnostic statistical analysis
GigaScience volume 5, Article number: 2 (2016)
Abstract
Background
Resolution of complex repeat structures and rearrangements in the assembly and analysis of large eukaryotic genomes is often aided by a combination of highthroughput sequencing and genomemapping technologies (for example, optical restriction mapping). In particular, mapping technologies can generate sparse maps of large DNA fragments (150 kilo base pairs (kbp) to 2 Mbp) and thus provide a unique source of information for disambiguating complex rearrangements in cancer genomes. Despite their utility, combining highthroughput sequencing and mapping technologies has been challenging because of the lack of efficient and sensitive mapalignment algorithms for robustly aligning errorprone maps to sequences.
Results
We introduce a novel seedandextend glocal (short for globallocal) alignment method, OPTIMA (and a slidingwindow extension for overlap alignment, OPTIMAOverlap), which is the first to create indexes for continuousvalued mapping data while accounting for mapping errors. We also present a novel statistical model, agnostic with respect to technologydependent error rates, for conservatively evaluating the significance of alignments without relying on expensive permutationbased tests.
Conclusions
We show that OPTIMA and OPTIMAOverlap outperform other stateoftheart approaches (1.6−2 times more sensitive) and are more efficient (170−200 %) and precise in their alignments (nearly 99 % precision). These advantages are independent of the quality of the data, suggesting that our indexing approach and statistical evaluation are robust, provide improved sensitivity and guarantee high precision.
Background
In recent years, the availability of commercial platforms for highthroughput genome mapping (from, for example, OpGen [1], BioNano Genomics [2] and Nabsys [3] have increased the interest in using these technologies, in combination with highthroughput sequencing data, for applications such as structural variation analysis and genome assembly. In particular, several recent genome assembly projects have highlighted their utility for obtaining highquality assemblies of large eukaryotic genomes (for example, goat [4] and budgerigar [5] genomes) or studying complex genomic regions [6] and cancer genomes [7]. Mapping technologies typically provide sparse information (an ordered enumeration of fragment sizes between consecutive genomic patterns, for example, restriction sites) for very large fragments of DNA (150 kilo base pairs (kbp) to 2 Mbp) and are thus orthogonal in utility to sequencing approaches that provide basepair level information for smaller fragments. Combining these two pieces of information therefore requires effective algorithms to align maps to sequences.
Alignment of maps (typically called Rmaps, for restriction maps [8]) to sequences has been widely studied as an algorithmic problem [9] with a range of practical applications, from genome scaffolding [10] to assembly improvement [11] and validation [12]. The general approach has been to translate sequence data to get in silico maps and compare these to experimentally obtained maps using dynamic programming algorithms. For large genomes and mapping datasets, naive allversusall dynamic programming can be computationally expensive. On the other hand, high error rates in mapping data (optical mapping, for example, can miss one in four restriction sites) has led to the use of modelbased scoring functions for sensitively evaluating alignments [13–15]. These often require prior knowledge and modeling of mapping error rates (for example, fragment sizing errors, false cuts and missing cuts) and can be expensive to compute [13, 16, 17]. Alternative approaches with simpler (nonmodelbased) scoring functions [10] are handicapped by the need to do expensive permutationbased statistical testing to evaluate the significance of alignments, and although recent advances have made this testing more efficient [15], it still scales linearly with genome size. Although these approaches work well for microbial genomes, they typically do not scale well for larger genomes, where they might also have reduced sensitivity. In contrast, commercially available solutions for maptosequence alignment (for example, GenomeBuilder from OpGen) scale better and have been used for the assembly of large eukaryotic genomes [4] but tend to discard a large fraction of the mapping data (more than 90 %) due to reduced sensitivity and correspondingly lead to increased mapping costs for a project.
Mapalignment algorithms are thus faced with the twin challenges of improving sensitivity and precision on the one hand and reducing computational costs for alignment and statistical evaluation on the other hand. An elegant solution to this problem from the field of sequencetosequence alignment is the use of a seedandextend approach [18]. However, because maps represent ordered lists of continuous values, this extension is not straightforward, particularly when multiple sources of mapping errors and their high error rates are taken into account [19]. In addition, because error rates can vary across technologies, and even across different runs on the same machine, it is not clear whether a general sensitive maptosequence aligner is feasible. An efficient statistical testing framework that helps control for false discovery without prior information about error rates is critical for making such an aligner easy to use and applicable across technology platforms.
In this work, we describe how a sorted search index and the use of a composite seeding strategy can help to efficiently and sensitively detect seed maptosequence alignments [20]. Our second contribution is the design of a robust and fast statistical evaluation approach that includes multiple sources of mapping errors in the alignment score and evaluates the significance of the best alignment using all identified feasible solutions. We incorporated these ideas and additional refinements to solve two common alignment problems: glocal alignment, solved with OPTIMA, where an entire map is aligned to a subsequence of a second (typically in silico) map; and overlap alignment, solved with OPTIMAOverlap, where the end of one map is aligned to the beginning of another. When benchmarked against stateoftheart aligners, OPTIMA and OPTIMAOverlap typically provide a strong boost in sensitivity (1.6–2 times) without sacrificing precision of alignments (∼99 %). Moreover, our pilot implementations exhibited runtime improvements over commercially available tools (two times faster than OpGen’s Gentig) and orders of magnitude over published, freely available algorithms and software [10, 13].
Finally, these methods are shown to be robust to variations in error distributions while being agnostic to them, suggesting that the methods can deal with different experimental outcomes of the same technology (for example, different map cards or lane types) as well as being applicable across mapping technologies (with minor modifications for preprocessing of data). Because glocal and overlap alignments form the basis of a range of applications that involve the combination of sequence and mapping data (for example, assembly scaffolding, refinement and validation, structural variation analysis and resolving complex genomic regions), OPTIMA and OPTIMAOverlap can serve as building blocks for these applications, allowing for time and costeffective analyses.
Definitions and problem formulation
Highthroughput genome mapping technologies typically work by linearizing large molecules of DNA (for example, in nanochannels [6]) and using enzymes such as restriction enzymes to recognize and label (for example, by cutting DNA) specific patterns throughout the genome (for example, a 6mer motif). These patterns are then read out (typically, optically) to obtain an ordered set of fragment sizes for each DNA molecule (see Fig. 1a for an example of a map). If corresponding genome sequences or assemblies are available, these can be converted into in silico maps through pattern recognition [16].
Let o_{1},o_{2},…,o_{ m } be the m ordered fragment sizes of an experimentally derived map o and r_{1},r_{2},…,r_{ n } be the n fragment sizes of an in silico map r. For simplicity, we suppose here that m≤n and assume that we can derive standard deviations for in silico fragments, that is, σ_{ j } for r_{ j }, in a technologydependent fashion. In an idealized case, we can define the problem of glocally aligning o to r as a onetoone correspondence between all the fragments of o with a subset of the fragments of r, that is, r_{ l },r_{l+1},…,r_{l+m−1} (we could also reverse the roles of o and r here). In practice, many sources of errors affect experimentally derived maps, including missing cuts, false/extra cuts, missing fragments, fragment sizing errors and spurious maps [13]. In silico maps could also be affected by sequencing or assembly errors [10], but these are less likely to affect alignments because typically they are infrequent. To accommodate errors, we extend the definition of correspondence between map fragments to allow for matches between sets of fragments (see Fig. 1b), as used previously in [10]:
Definition 1 (Feasible match).
A subset of fragments o_{ k },o_{k+1},…,o_{ s } aligned as as a whole entity to a subset of in silico fragments r_{ l },r_{l+1},…,r_{ t } is called a feasible match if and only if:
where C_{ σ }=3 is an appropriate bound if sizing errors are approximately normally distributed.
Definition 2 (Glocal alignment).
A valid glocal alignment is an ordered set of matches M_{1},M_{2},…,M_{ w } between experimental and in silico fragments, such that all the experimental fragments o_{1},o_{2},…,o_{ m } are aligned to a subset of the in silico fragments r_{ t },r_{t+1},…,r_{ v }, and both sets are orderly partitioned by all the matches M_{1…w} without overlaps, with w≤m and w≤v−t+1.
Missing fragments, which usually arise from short fragments below the experimental detection limit (for example, 2 kbp), can be handled in this framework by allowing gaps, that is, with the option of ignoring short fragments for the purpose of the C_{ σ } test (Eq. 1).
Definition 3 (Overlap alignment).
A valid overlap alignment is an ordered set of matches M_{1},M_{2},…,M_{ w } that allows experimental maps and in silico maps to only partially align with each other, with M_{1} and M_{ w } each corresponding to an end of one of the maps (for example, Fig. 1e).
In general, because maps can have truncated ends, we relax the C_{ σ } test to be only an upper bound on matches comprising the ends of experimental maps, for example:
or a lower bound on matches at the ends of in silico maps, for example:
Methods
OPTIMA is the first alignment tool based on the seedandextend paradigm that can deal with erroneous mapping data. The basic paradigm is similar to that used for the alignment of discretevalued sequences (allowing for mismatches and indels) and is as follows. We start by indexing the in silico maps, so that we can use this information later, and find seeds for each experimental map o corresponding to some indexed regions of those sequences. We then extend these seeds by using dynamic programming to try to align the whole experimental map to the corresponding in silico map region. For each map o, feasible solutions – as defined by the index structure, size of the genome and maximum error rate – are then evaluated by a scoring scheme to select the optimal solution. Finally, the statistical significance and uniqueness of the optimal solution are determined by comparing and modeling all the identified feasible solutions.
Indexing continuousvalued seeds
The definition of appropriate seeds is critical in a seedandextend approach in order to maintain a good balance between sensitivity and speed. A direct extension of discretevalued seeds to continuous values is to consider values that are close to each other (as defined by the C_{ σ } bound) as matches. However, as mapping data typically have high error rates [13, 16] and represent short sequences (for example, on average, optical maps contain 10–22 fragments, representing roughly a 250 kbp region of the genome), a seed of c consecutive fragments (cmer) is likely to have low sensitivity unless we use a naive c=1 approach (see Fig. 2 for a comparison) and pay a significant runtime penalty that scales with genome size [14, 16]. Therefore, we propose and validate the following composite seed extension for continuousvalued seeds, analogous to the work on spaced seeds for discretevalued sequences [21].
Definition 4 (Composite seed).
Let \(r_{j_{1}}\), \(r_{j_{2}}\) and \(r_{j_{3}}\) be consecutive restriction fragments from a reference in silico map. A continuousvalued composite seed, for c=2, is given by including all of the following:

(i) the cmer \(r_{j_{1}}, r_{j_{2}}\), corresponding to no false cuts in the in silico map;

(ii) the cmer \(r_{j_{1}}+r_{j_{2}}, r_{j_{3}}\), corresponding to a missing cut in the experimental map (or false cut in the in silico map); and

(iii) the cmer \(r_{j_{1}}, r_{j_{2}}+r_{j_{3}}\), corresponding to a different missing cut in the experimental map (or false cut in the in silico map).
The reference index would then contain all ctuples corresponding to a composite seed, as defined in Definition 4, for each location in the reference map. In addition, to account for false cuts in the experimental map, for each set of consecutive fragments \(o_{i_{1}}\), \(o_{i_{2}}\) and \(o_{i_{3}}\) in the experimental maps, we search for ctuples of the type \(o_{i_{1}}, o_{i_{2}}\) and \(o_{i_{1}} + o_{i_{2}}, o_{i_{3}}\) in the index (see Composite seeds (iv) depicted in Fig. 1c).
To index the seeds, we adopt a straightforward approach where all ctuples are collected and sorted into the same index in lexicographic order by c_{1} (where the c_{ i } are elements in the ctuple). Lookups can be performed by binary search over fragmentsized intervals that satisfy the C_{ σ } bound for c_{1} and a subsequent linear scan of the other elements c_{ i }, for i≥2, while verifying the C_{ σ } bound in each case. Note that, because seeds are typically expected to be of higher quality, we can apply a more stringent threshold on seed fragment size matches (for example, we used \(C^{Seed}_{\sigma } = 2\)).
As shown in the “Results and discussion” section, this approach significantly reduces the space of candidate alignments without affecting the sensitivity of the search. A comparison between the various seeding approaches is shown in Fig. 2, which highlights the advantages of composite seeds with respect to 2mers.
Overall, the computational cost of finding seeds using this approach is O(m (logn+c #seeds_{c=1})) per experimental map, where n is the total length of the in silico maps in fragments, m≪n is the length of the experimental map and #seeds_{c=1} is the number of seeds found in the first level of the index lookup, before narrowing down the list to the actual number of seeds that will be extended (#seeds). The cost and space of creating the reference index is thus O(c n), if the number of errors considered in the composite seeds is limited and bounded (as in Definition 4), and radix sort is used to sort the index. This approach drastically reduces the number of alignments computed in comparison to more general, global alignment searches [10], as will be shown later in the “Results and discussion” section.
Dynamic programmingbased extension of seeds
In order to extend a seed to get a glocal alignment we adopt a scoring scheme similar to that used in SOMA (see [10]). This allows us to evaluate alignments without relying on a likelihoodbased framework that requires prior information on error distributions as input [13]. In addition, we can use dynamic programming to efficiently find glocal alignments that optimize this score and contain the seed (see Fig. 1d); specifically, for each seed side we proceed along the dynamic programming matrix by aligning the end of the sth experimental fragment with the end of the tth in silico fragment using backtracking to find feasible matches, that is, those that satisfy Eq. 1 and minimize the total number of cut errors (#cuterrors= missing cuts + false cuts + missing fragments found), with ties being broken by minimizing a χ^{2} function for fragment sizing errors:
where the first index of each subscript represents experimental fragments, the second index represents the in silico fragments, s−k is the number of false cuts, t−l is the number of missing cuts, C_{ ce } is a constant larger than the maximum possible total for χ^{2}, \({\chi }_{k..s, l..t}^{2} = \left (\sum \limits _{i=k}^{s} o_{i}  \sum \limits _{j=l}^{t} r_{j}\right)^{2} / \left (\sum \limits _{j=l}^{t} {\sigma _{j}^{2}}\right)\) and Score_{0,0}=0, Score_{i,0}=∞ and Score_{0,j}=∞.
Note that a small in silico fragment is considered as missing if this condition allows for a valid alignment that improves the local χ^{2} on nearby matches by half (up to three consecutive fragments).
As in [16], we band the dynamic programming and its backtracking to avoid unnecessary computation. In particular, as we show in Supplementary Note 1 in Additional file 1, based on parameter estimates for optical mapping data, restricting alignments to eight missing cuts or five false cuts, consecutively, should retain high sensitivity. In addition, we stop the dynamic programmingbased extension if no feasible solutions can be found for the current seed after having analyzed at least f fragments (default of five).
The computational cost of extending a seed (ctuple) of an experimental map with m fragments is thus, in the worst case, O((m−c) δ^{3}) time, where δ is the bandwidth of the dynamic programming, and O((m−c)^{2}) space for allocating the dynamic programming matrix for each side of the seed.
Statistical significance and uniqueness analysis
To evaluate the statistical significance of a candidate alignment, we exploit the fact that we have explored the space of feasible alignments in our search and use these alignments to approximate a random sample from a (conservative) null model. The assumption here is that there is only one true alignment and that, therefore, the population of these suboptimal alignments can provide a conservative null model for evaluating the significance of an alignment; more specifically, for each candidate alignment found, we compute its distance from the null model in a feature space (to be defined later) using a Zscore transformation and then use this score to evaluate whether it is optimal, statistically significant and unique (see Fig. 3 for an example).
We start by identifying a set F of features, independent with respect to false positive (or random) alignments, that are expected to follow the normal distribution (for example, using the central limit theorem) and be comparable between different alignments of the same experimental map, and we compute a Zscore for each feature f∈F and for each candidate solution π∈Π identified through the seeding method. Each Zscore takes into account the mean and standard deviation of a particular feature f among all candidate alignments found:
Accounting for all considered features f_{ i }, with 1≤i≤k and k≥2, the resulting score is given by:
where s_{ i }=1 if lower values of feature f_{ i } are preferable and −1 otherwise, and the corresponding pvalue is p_{π}=Pnorm(𝜗(π)), that is, the probability that a ‘random’ Zscore will be less than 𝜗(π) under the standard normal distribution.
In our case, we chose a set of features based on the number of matches (#matches), the total number of cut errors and the WilsonHilferty transformation (WHT) of the χ^{2} score for sizing errors (which converges quickly to a standard normal distribution):
Note that this set can be shown to be composed of approximately independent features for false positive alignments (Zscore pairwise covariances between all features did not show a significant alteration of the final Zscores when accounting for them in all of our simulations). By the central limit theorem, the mean of the first two features can be approximated by the normal distribution when the number of candidate solutions is large enough (typically, greater than 60 distinct solutions), and, by Slutsky’s theorem, their sample variance would not have a significant effect on the distribution of the test statistics. As lower values of #cuterrors and WHT(χ^{2},#matches) indicate better solutions, while a higher number of matches represents more reliable alignments, we need to adjust the signs of their Zscores accordingly. The final Zscore 𝜗(π) computed for each candidate solution π is therefore given by:
which can be subsequently converted into the pvalue p_{π}. The candidate solution π^{∗} with the lowest pvalue p^{∗} is reported as the optimal solution, as shown in Fig. 3.
The statistical significance of the optimal solution can then be assessed through a false discovery rate qvalue analysis [22] based on all candidate solutions found for comparable experimental maps, for example, those with the same number of fragments (default of q=0.01). To assess uniqueness, we set a threshold on the ratio of pvalues between the best solution and the nextbest solution (default of five). See Supplementary Note 2 in Additional file 1 for further algorithmic details.
In summary, our statistical scoring approach finds an optimal solution and evaluates its statistical significance and uniqueness in a unified framework, thus allowing for savings in computational time and space compared to a permutation test, without restricting the method to a scenario where experimental error probabilities are known a priori.
Extension to overlap alignment
To extend OPTIMA to compute and evaluate overlap alignments – a key step in assembly pipelines that use mapping data [4, 5, 23] – we use a slidingwindow approach based on OPTIMA. This allows us to continue using the statistical evaluation procedure defined in OPTIMA that relies on learning parameters from comparable alignments (that is, those with the same number, size and order of experimental fragments) in a setting where the final alignments are not always of the same length and structure.
Briefly, for each map, OPTIMAOverlap first finds optimal submap alignments, evaluates their significance and uniqueness, and then tries to extend the candidate alignments found until it reaches the ends of either the experimental map or the in silico map, in order to choose the most significant overlap alignments (see Fig. 1e). This approach begins by dividing an experimental map into submaps of length l with a sliding window and then glocally aligning them to in silico maps using OPTIMA (again allowing for truncated ends to account for high error rates). Each glocal alignment subproblem will then return either:

(i)
a significant and unique submap alignment;

(ii)
an alignment labeled as nonsignificant and/or nonunique (which will be considered as a false alignment); or

(iii)
no feasible alignments found.
Optimal solutions to the subproblems are then ranked by pvalue (smallest to largest) and iterated through to select submaps that should be extended. At each stage we check the significance and uniqueness of the reported solutions (compared to the others), as well as checking for potential cases of identical or conflicting alignments as defined below.
Definition 5 (Conflicting alignments).
A submap alignment π_{1} is said to be conflicting with another alignment π_{2} if either:

the submap of π_{1} overlaps the submap of π_{2}; or

π_{1} aligns to the same in silico map as π_{2}, but in a different location or strand.
Conflicting alignments can result in ambiguous placement of an experimental map on a database of in silico maps, but condition (a) could be relaxed in some cases, for example, when experimental maps are known to overlap multiple in silico maps in the same region. Therefore, while iterating through the list of submaps, the following rules are implemented:

1.
Significance: if the current solution π_{ i } is labeled as a false alignment, then we stop iterating through the rest of the list.

2.
Uniqueness: we skip an alignment if either: (i) π_{ i } represents the same overlap alignment as a more significant solution; (ii) π_{ i } is conflicting with a solution with a lower pvalue (that is, seen before); or (iii) π_{ i } is not unique with respect to other solutions π_{ j } with j>i (that is, having greater pvalues) that it is conflicting with.

3.
Extension with dynamic programming: optimal overlap solutions are identified according to Eq. 2, where ties are broken in favor of longer valid alignments.
This approach allows us to report multiple overlap alignments (including split alignments) for an experimental map while using the qvalue analysis, as before, to report all alignments with q≤0.01. For the qvalue analysis, we consider all candidate solutions found for the sliding windows in order to learn the qvalue parameters. In addition, we can reuse the dynamic programming matrix computed for each seed across submap alignments and thus complete the overlap alignment with the same asymptotic time and space complexity as the glocal alignment.
Results and discussion
Generation of benchmarking datasets
To benchmark OPTIMA and OPTIMAOverlap against other stateoftheart map aligners, we first developed synthetic datasets that aim to represent two ends of the spectrum of errors in mapping data for eukaryotic genomes. These scenarios were defined by confidently aligning (using SOMA [10] and manual curation) two sets of maps from different experimental runs for optical mapping (using the Argus system from OpGen) on a human cell line. The first scenario, (A), was defined based on lanes that were reported by the Argus machine to have high quality scores, while the second scenario, (B), was defined by lanes with the lower qualities that were typically obtained on the system. We estimated three key parameters from the data: d, the average restriction enzyme digestion rate; f_{100}, the average false cut rate per 100 kbp; and the fragment sizing errors for predefined (reference) in silico fragment size ranges (these were fixed for both scenarios and recorded as relative deviations of the empirical sizes from the reference sizes):

Easier scenario: d=0.78 (corresponding to missing cut rate of 22 %); f_{100}=0.97; and probability at 0.5 for missing fragments of size below 1.2 kbp, 0.75 below 600 bp and 1 below 350 bp.

Harder scenario: d=0.61 (corresponding to missing cut rate of 39 %); f_{100}=1.38; and 0.5 for missing fragments of size below 2 kbp, 0.75 below 800 bp and 1 below 350 bp.
For each scenario, we first simulate the map sizes using empirically derived distributions from real maps (average size of approximately 275 kbp, containing 17 fragments) and extract the corresponding reference submaps by sampling start locations uniformly from the in silico maps (possibly creating truncated end fragments). Then we introduce cut errors using the probability distributions described in [13] with the above parameters, that is: first, we remove missing cuts following a Binomial distribution with probability 1−d; next, we insert false cuts modeled as a Poisson process with rate f_{100} (avoiding creation of small fragments less than 1.2 kbp in size); and finally, we remove small fragments with the probabilities described above. Sizing errors are introduced by sampling from the empirical errors found for each range of reference fragment sizes. Simulated experimental maps smaller than 150 kbp or with fewer than ten fragments are discarded, mimicking the preprocessing stage on the Argus system.
We generated 100 times greater coverage of maps with errors for the Drosophila melanogaster (BDGP 5) and Homo sapiens (hg19/GRCh37) genomes using the KpnI restriction pattern GGTAC’C, where the apostrophe indicates the position of the cut, which resulted in 13,920 fragments genomewide (forward and reverse strands) with an average fragment size (AFS) of 17.3 kbp and 573,276 fragments with AFS = 10.8 kbp, respectively.
Glocal alignment results
OPTIMA was compared against the stateoftheart algorithms Gentig v.2 (alignment module) [16, 17, 24], SOMA v.2 [10] and Valouev’s likelihoodbased fit alignment [13] for glocally aligning the simulated maps over their respective in silico reference genomes. TWIN [19] was not included in this comparison as it does not allow for errors and missing information in experimental maps.
We also ran variations of these algorithms from their default options (d): specifically, by providing the true error distribution parameters used in the simulations as input (tp), the adjusted AFS based on the organism under analysis (a) and parameter values published in the respective papers (instead of the software’s default ones), to provide, in addition, the true error distribution rates (p); and by allowing the trimming of map ends in the alignment (t). Moreover, SOMA [10] was modified to correctly handle missing in silico fragments up to 2 kbp, to run only for C_{ σ }=3, to make its results comparable, and by inverting the role of in silico–experimental input maps (v). We omitted SOMA’s statistical test (also for Valouev’s likelihood method, where it is not enabled by default), because it is unfeasible for large datasets [19], and applied only its uniqueness test (Ftest). Further details about the running parameters are provided in Supplementary Note 3 in Additional file 1. OPTIMA alignments were performed on both strands of the in silico maps, without trimming end fragments or removing any small in silico fragments.
As can be seen from the results in Table 1, OPTIMA reports alignments with very high precision, greater than 99 % in most cases, independent of the genome size and the dataset error rate. In comparison, Gentig has similarly high precision on the Drosophila genome but lower precision on the human genome, with as low as 80 % precision under scenario (B) (with default parameters). Without their computationally expensive statistical tests, which can increase the runtime by a factor of greater than 100, SOMA and the likelihoodbased method have much lower precision, particularly on the human genome. In addition, in terms of sensitivity, OPTIMA was found to be notably better than other aligners, even when the true error distribution rates were provided as input to these algorithms. In particular, for the higher quality scenario (A), OPTIMA is more than 1.5 times as sensitive as Gentig, and for the commonly obtained scenario (B), OPTIMA is more than twice as sensitive as Gentig.
These results are further broken down in Fig. 4 as a function of the number of fragments in the experimental maps, showing that OPTIMA uniformly achieves more than twice the sensitivity for the smaller maps that are frequently obtained in real datasets. The relatively higher sensitivities of SOMA and the likelihoodbased method in these experiments are likely to be artifacts of relaxed settings in the absence of their statistical tests. These results highlight OPTIMA’s high precision and improved sensitivity across experimental conditions and suggest that it could adapt well to other experimental settings.
In Table 2, we further compare all methods on their running time and worstcase complexity (runtime and space). It can be seen that SOMA and the likelihoodbased methods are at least an order of magnitude slower than OPTIMA and Gentig. Gentig’s proprietary algorithm is based on work that has been previously published [17, 24], but its current version uses an unpublished hashing approach. In comparison, OPTIMA is two times faster while being more than 50 % more sensitive than Gentig.
Real data analysis and comparison
To assess OPTIMA’s performance on real data we generated, inhouse, 309,879 and 296,217 optical maps for two human cell lines, GM12878 and HCT116, respectively, using the Argus system from OpGen [4, 25] (with the KpnI enzyme and ten map cards in total), and glocally aligned them over the human reference genome. Supplementary Note 4 in Additional file 1 provides the sizing error statistics.
Tables 3 and 4 report statistics of the alignments for raw maps filtered under two settings:

(r)
relaxed filtering, which filters maps with fewer than ten fragments and smaller than 150 kbp;

(s)
stringent filtering (as suggested in [4]), which filters maps with fewer than 12 fragments and smaller than 250 kbp.
The statistics were reported independently for each map card to capture the variability in terms of quality. In total, OPTIMA, with a stringent uniqueness threshold of 30, confidently aligned nearly three times as many maps as Gentig (with default parameters) for GM12878. Similarly, for HCT116, OPTIMA results were 1.7 times better than Gentig results, and corresponding improvements were also obtained using the stringently filtered datasets.
Further analyzing the error rates in the maps that OPTIMA confidently aligned (Tables 3 and 4), we observed that the overall statistics for average digestion rate d, false/extra cut rate f_{100} and sizing errors were found to be similar to those obtained using scenario (B) (see Supplementary Note 5 in Additional file 1).
Overlap alignment results
For overlap alignment, we compared OPTIMAOverlap with an overlapfinding extension of Gentig v.2 (implemented in the commercial software GenomeBuilder from OpGen, which contains a module called SCAFFOLDEXTENDER) [17, 24], as well as with Valouev’s likelihoodoverlap method [13].
In our first test, we randomly selected 1000 maps for each scenario (A) and (B) from our previously simulated maps for Drosophila (BDGP 5) and human (hg19/GRCh37) genomes. In addition, we simulated assembled sequence fragments for these genomes based on empirically derived scaffold size distributions (Drosophila assembly N50 of 2.7 Mbp with 239 scaffolds and human assembly N50 of 3.0 Mbp with 98,987 scaffolds); the simulated assemblies were used to generate in silico maps (filtered for those with fewer than four nonend fragments, because these cannot be confidently aligned [14, 15]), which were then aligned with the simulated experimental maps.
For our second test, we compared all methods on optical mapping data generated inhouse from the K562 human cancer cell line on OpGen’s Argus system, where a random sample of 2000 maps with at least ten fragments was extracted. In silico maps were generated from de novo assemblies of shotgun Illumina sequencing data (HiSeq) and six matepair libraries with insert sizes ranging from 1.1 kbp to 25 kbp [26] (the final assembly had an N50 of 3.1 Mbp and 76,990 scaffolds in total, using SOAPdenovo v.1.05 [27] with a kmer size of 51 for contig assembly and Opera v.1.1 [28] for scaffolding with mate pairs). It is likely that this dataset represents a harder scenario, with assembly gaps/errors and genomic rearrangements confounding the analysis. It also represents a likely use case where mapping data will be critical to detect large structural variations, disambiguate complex rearrangements and, ultimately, assemble cancer genomes de novo.
For each test, we evaluated the precision of alignments as well as the number of (correctly) reported alignments that provide an extension to the in silico maps through experimentally determined fragments, as this is key for the application of overlap alignments in genome assembly. We begin by noting that there is an important tradeoff between sensitivity with a specific window size in OPTIMAOverlap and the correctness of alignments, as can be seen in Fig. 5. As expected, even though small window sizes (less than ten in Fig. 5) provide more sensitive results, they also make true alignments indistinguishable from noise and reduce the number of correct alignments detected; on the other hand, larger window sizes improve the signaltonoise ratio but result in a drop in sensitivity. This leads to a sweet spot in the middle (10–13 fragments) where the method is most sensitive across a range of datasets. In particular, real datasets are slightly more challenging than our simulations (see human (B) compared to real data in Fig. 5) and so we have conservatively chosen a window size of 12 as the default for OPTIMAOverlap. By benchmarking OPTIMAOverlap with this setting, we observed high precision similar to that observed with OPTIMA for glocal alignment (Table 5). This was seen uniformly across datasets with disparate profiles in terms of genome size and error rates, suggesting that our statistical evaluation is reasonably robust. As before, we also note that Gentig’s approach and the likelihoodbased method might not always exhibit high precision. Finally, in terms of sensitivity, OPTIMAOverlap shows a 30–80 % improvement over competing approaches, and this is also seen in the harder real datasets.
Utility in realworld applications
Overlap alignments form a critical building block for applications such as OpGen’s GenomeBuilder and its use in boosting assembly quality [4]. As OPTIMAOverlap can work with lower quality data (scenario (B) in our simulations; GenomeBuilder would typically filter out such data) and also provide improved sensitivity in detecting overlap alignments, we estimate that its use could reduce the requirement for generating mapping data by half. As the cost of mapping data for the assembly of large eukaryotic genomes can range from USD 20,000 to 100,000, this can lead to significant cost savings.
We similarly compared OPTIMA and Gentig on the two human cell line results, shown in Tables 3 and 4, in order to calculate how much mapping data would be needed for sufficient aligned coverage of the human genome to enable structural variation analysis. By analyzing the alignment rate increase of OPTIMA compared to Gentig, a 1.5 to 2.9 times increase on average, we computed the corresponding cost reduction to be 33–66 %, with an average cost reduction of 54 % for relaxed filtering of data (r) and 49 % for stringent filtering (s). These results suggest that for structural variation analysis on the human genome, particularly for cancer genomes, OPTIMA can significantly reduce project costs (in the tens of thousands of dollars) while enabling faster analyses of the data.
Conclusion
With the availability of new mapping technologies (for example, Nabsys) and greater use of existing ones to complement highthroughput sequencing, there is a critical need for robust computational tools that can combine genomic mapping and sequence data efficiently. In this work, we introduce two new alignment tools that address this need for a wide range of applications, from genome assembly to structural variation analysis. Our benchmarking results provide evidence that these methods outperform existing approaches in sensitivity and runtime while providing highly precise alignments in a range of experimental settings. Similar results are also seen in real datasets from human cell lines, suggesting that they could help in significantly reducing the cost of optical mapping analysis and thus increase its usage.
In the development of OPTIMA and OPTIMAOverlap, we establish two key new ideas for map alignment. The first is the introduction of composite seeds, an idea that echoes the idea of spaced seeds in the context of continuousvalued sequence alignment. Composite seeds allow us to develop efficient seedandextend aligners for maptosequence alignment of erroneous mapping data. We believe that similar ideas can be applied for maptomap alignment and de novo assembly of experimental maps. The second concept is the development of a conservative statistical testing approach that does not require knowledge of the true distribution of errors or an expensive permutation test to evaluate the uniqueness and significance of alignments. This allows us to significantly reduce the runtime cost of this step without sacrificing specificity or the ability to be agnostic with respect to error rates. Although our experiments with real data in this work were limited to data generated on the Argus system from OpGen, similar ideas (with minor variations) should also be applicable to data generated by other technologies such as the Irys platform from BioNano Genomics.
In future work, we plan to implement further runtime and memory optimizations to OPTIMA and OPTIMAOverlap and to explore their use for superscaffolding of large genomes as well as for studying genomic rearrangements in cancer.
Availability and requirements

Project name: OPTIMA: Indexbased maptosequence alignment in large eukaryotic genomes

Project home page:http://www.davideverzotto.it/research/OPTIMA, https://github.com/verznet/OPTIMA

Operating system: Platform independent

Programming language: Java 7+

Other requirements: Java Development Kit 7+, Apache Commons Math 3.2, CERN Colt 1.2.0

License: LGPL (Lesser General Public License) 2.1, OSI compliant

Any restrictions to use by nonacademics: none
References
 1
OpGen Inc. 2015. Available from: http://www.opgen.com. Accessed 9 Dec 2015.
 2
BioNano Genomics Inc. 2015. Available from: http://www.bionanogenomics.com. Accessed 9 Dec 2015.
 3
Nabsys Inc. 2015. Available from: http://www.nabsys.com. Accessed 9 Dec 2015.
 4
Dong Y, Xie M, Jiang Y, Xiao N, Du X, Wang W, et al.Sequencing and automated wholegenome optical mapping of the genome of a domestic goat (Capra hircus). Nat Biotechnol. 2013; 31:135–41.
 5
Ganapathy G, Howard J, Ward J, Zhang G, Phillippy A, Jarvis E, et al.Highcoverage sequencing and annotated assemblies of the budgerigar genome. GigaScience. 2014; 3:11.
 6
Lam ET, Hastie A, Lin C, Ehrlich D, Nagarajan N, Kwok PY, et al.Genome mapping on nanochannel arrays for structural variation analysis and sequence assembly. Nat Biotechnol. 2012; 30:771–6.
 7
Ray M, Goldstein S, Zhou S, Potamousis K, Sarkar D, Schwartz D, et al.Discovery of structural alterations in solid tumor oligodendroglioma by single molecule analysis. BMC Genomics. 2013; 14:505.
 8
Waterman MS, Smith TF, Katcher H. Algorithms for restriction map comparisons. Nucleic Acids Res. 1984; 12:237–42.
 9
Mendelowitz L, Pop M. Computational methods for optical mapping. GigaScience. 2014; 3:33.
 10
Nagarajan N, Read TD, Pop M. Scaffolding and validation of bacterial genome assemblies using optical restriction maps. Bioinformatics. 2008; 24:1229–35.
 11
Lin H, Goldstein S, Mendelowitz L, Zhou S, Schwartz D, Pop M, et al.AGORA: assembly guided by optical restriction alignment. BMC Bioinforma. 2012; 13:189.
 12
Antoniotti M, Anantharaman T, Paxia S, Mishra B. Genomics via optical mapping IV: sequence validation via optical map matching. New York, NY, USA: New York University; 2001.
 13
Valouev A, Li L, Liu YC, Schwartz DC, Yang Y, Waterman MS, et al.Alignment of optical maps. J Comput Biol. 2006; 13:442–62.
 14
Anantharaman TS, Mishra B. A probabilistic analysis of false positives in optical map alignment and validation. In: First international workshop on algorithms in bioinformatics (WABI). Aarhus, Denmark: Springer: 2001. p. 27–40.
 15
Sarkar D, Goldstein S, Schwartz DC, Newton MA. Statistical significance of optical map alignments. J Comput Biol. 2012; 19:478–92.
 16
Anantharaman TS, Mishra B, Schwartz DC. Genomics via optical mapping II: ordered restriction maps. J Comput Biol. 1997; 4:91–118.
 17
Anantharaman TS, Mishra B, Schwartz DC. Genomics via optical mapping III: contiging genomic DNA. In: Proceedings of the 7th international conference on intelligent systems for molecular biology (ISMB 1999). Heidelberg, Germany: AAAI: 1999. p. 18–27.
 18
Karp RM, Rabin MO. Efficient randomized patternmatching algorithms. IBM J Res Dev. 1987; 31:249–60.
 19
Muggli MD, Puglisi SJ, Boucher C. Efficient indexed alignment of contigs to optical maps. In: Algorithms in Bioinformatics (WABI 2014). vol. 8701 of LNCS. Wrocław, Poland: Springer: 2014. p. 68–81.
 20
Verzotto D, Teo ASM, Hillmer AM, Nagarajan N. Indexbased maptosequence alignment in large eukaryotic genomes. In: Proceedings of the fifth RECOMB satellite workshop on massively parallel sequencing (RECOMBSeq 2015). Warsaw (Poland): Cold Spring Harbor Labs Journals: 2015. p. 1–11. doi:10.1101/017194.
 21
Califano A, Rigoutsos I. FLASH: a fast lookup algorithm for string homology. In: Proceedings of the 1st International Conference on Intelligent Systems for Molecular Biology (ISMB 1993). Bethesda, MD, USA: AAAI: 1993. p. 56–64.
 22
Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Nat Acad Sci. 2003; 100:9440–5.
 23
Kawahara Y, de la Bastide M, Hamilton J, Kanamori H, McCombie R, Matsumoto T, et al.Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice. 2013; 6:4.
 24
Anantharaman TS, Mysore V, Mishra B. Fast and cheap genome wide haplotype construction via optical mapping In: Altman RB, Jung TA, Klein TE, Dunker AK, Hunter L, editors. Proceedings of the 10th Pacific Symposium on Biocomputing (PSB 2005). Hawaii, USA: World Scientific: 2005. p. 1–16.
 25
Teo ASM, Verzotto D, Yao F, Nagarajan N, Hillmer AM. Singlemolecule optical genome mapping of a human HapMap and a colorectal cancer cell line. GigaScience. 2015; 4:65. doi:10.1186/s1374201501061.
 26
Yao F, Ariyaratne PN, Hillmer AM, Liu ET, Ruan Y. Long span DNA PairedEndTag (DNAPET) sequencing strategy for the interrogation of genomic structural mutations and fusionpointguided reconstruction of amplicons. PLOS ONE. 2012; e46152:7.
 27
Li R, Zhu H, Ruan J, Wang J, Fang X, Wang J, et al.De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010; 20:265–72.
 28
Gao S, Sung WK, Nagarajan N. Opera: Reconstructing Optimal Genomic Scaffolds with HighThroughput PairedEnd Sequences. J Comput Biol. 2011; 18:1681–91.
 29
Verzotto D, Teo ASM, Hillmer AM, Nagarajan N. Supporting software for OPTIMA, a tool for sensitive and accurate wholegenome alignment of errorprone genomic maps by combinatorial indexing and technologyagnostic statistical analysis; 2015. GigaScience Database. doi:10.5524/100165.
 30
Teo ASM, Verzotto D, Yao F, Nagarajan N, Hillmer AM. Supporting singlemolecule optical genome mapping data from human HapMap and colorectal cancer cell lines; 2015. GigaScience Database. doi:10.5524/100182.
Acknowledgments
We would like to thank Juntao Li for useful discussions on the statistical analysis and Denis Bertrand and Burton KH Chia for valuable feedback and comments on a draft of this manuscript. We thank Fei Yao for help on experiments and troubleshooting of the Argus system and Chee Seng Chan and Lavanya Veeravalli for postprocessing the optical mapping data.
Author information
Additional information
Competing interests
This work was partly supported under a research collaboration agreement with Sciencewerke Pte. Ltd., the Singapore distributor for OpGen Inc. No employees of Sciencewerke or OpGen played a role in the work described in this manuscript. DV and NN are inventors on a patent application related to this work.
Authors’ contributions
The overall project was conceived and planned by AMH and NN, while DV and NN defined subprojects of this work. ASMT carried out all experiments, to generate the optical mapping data used in this manuscript, under the guidance of AMH. DV designed and implemented the algorithms and conducted the analysis under the guidance of NN. DV and NN wrote the manuscript and all authors read and approved the final manuscript.
Additional file
Additional file 1
Supplementary material. Supplementary notes and figures to accompany the main manuscript. (PDF 32358 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Verzotto, D., M. Teo, A.S., Hillmer, A.M. et al. OPTIMA: sensitive and accurate wholegenome alignment of errorprone genomic maps by combinatorial indexing and technologyagnostic statistical analysis. GigaSci 5, 2 (2016) doi:10.1186/s1374201601100
Received
Accepted
Published
DOI
Keywords
 Optical mapping
 Genomic mapping
 Glocal alignment
 Overlap alignment
 Maptosequence alignment