OPTIMA: sensitive and accurate whole-genome alignment of error-prone genomic maps by combinatorial indexing and technology-agnostic statistical analysis

Verzotto, Davide; M. Teo, Audrey S.; Hillmer, Axel M.; Nagarajan, Niranjan

doi:10.1186/s13742-016-0110-0

Research
Open access
Published: 19 January 2016

OPTIMA: sensitive and accurate whole-genome alignment of error-prone genomic maps by combinatorial indexing and technology-agnostic statistical analysis

Davide Verzotto¹,
Audrey S. M. Teo²,
Axel M. Hillmer² &
…
Niranjan Nagarajan¹

GigaScience volume 5, Article number: 2 (2016) Cite this article

5657 Accesses
6 Citations
29 Altmetric
Metrics details

Abstract

Background

Resolution of complex repeat structures and rearrangements in the assembly and analysis of large eukaryotic genomes is often aided by a combination of high-throughput sequencing and genome-mapping technologies (for example, optical restriction mapping). In particular, mapping technologies can generate sparse maps of large DNA fragments (150 kilo base pairs (kbp) to 2 Mbp) and thus provide a unique source of information for disambiguating complex rearrangements in cancer genomes. Despite their utility, combining high-throughput sequencing and mapping technologies has been challenging because of the lack of efficient and sensitive map-alignment algorithms for robustly aligning error-prone maps to sequences.

Results

We introduce a novel seed-and-extend glocal (short for global-local) alignment method, OPTIMA (and a sliding-window extension for overlap alignment, OPTIMA-Overlap), which is the first to create indexes for continuous-valued mapping data while accounting for mapping errors. We also present a novel statistical model, agnostic with respect to technology-dependent error rates, for conservatively evaluating the significance of alignments without relying on expensive permutation-based tests.

Conclusions

We show that OPTIMA and OPTIMA-Overlap outperform other state-of-the-art approaches (1.6−2 times more sensitive) and are more efficient (170−200 %) and precise in their alignments (nearly 99 % precision). These advantages are independent of the quality of the data, suggesting that our indexing approach and statistical evaluation are robust, provide improved sensitivity and guarantee high precision.

Peer Review reports

Background

In recent years, the availability of commercial platforms for high-throughput genome mapping (from, for example, OpGen [1], BioNano Genomics [2] and Nabsys [3] have increased the interest in using these technologies, in combination with high-throughput sequencing data, for applications such as structural variation analysis and genome assembly. In particular, several recent genome assembly projects have highlighted their utility for obtaining high-quality assemblies of large eukaryotic genomes (for example, goat [4] and budgerigar [5] genomes) or studying complex genomic regions [6] and cancer genomes [7]. Mapping technologies typically provide sparse information (an ordered enumeration of fragment sizes between consecutive genomic patterns, for example, restriction sites) for very large fragments of DNA (150 kilo base pairs (kbp) to 2 Mbp) and are thus orthogonal in utility to sequencing approaches that provide base-pair level information for smaller fragments. Combining these two pieces of information therefore requires effective algorithms to align maps to sequences.

Alignment of maps (typically called Rmaps, for restriction maps [8]) to sequences has been widely studied as an algorithmic problem [9] with a range of practical applications, from genome scaffolding [10] to assembly improvement [11] and validation [12]. The general approach has been to translate sequence data to get in silico maps and compare these to experimentally obtained maps using dynamic programming algorithms. For large genomes and mapping datasets, naive all-versus-all dynamic programming can be computationally expensive. On the other hand, high error rates in mapping data (optical mapping, for example, can miss one in four restriction sites) has led to the use of model-based scoring functions for sensitively evaluating alignments [13–15]. These often require prior knowledge and modeling of mapping error rates (for example, fragment sizing errors, false cuts and missing cuts) and can be expensive to compute [13, 16, 17]. Alternative approaches with simpler (non-model-based) scoring functions [10] are handicapped by the need to do expensive permutation-based statistical testing to evaluate the significance of alignments, and although recent advances have made this testing more efficient [15], it still scales linearly with genome size. Although these approaches work well for microbial genomes, they typically do not scale well for larger genomes, where they might also have reduced sensitivity. In contrast, commercially available solutions for map-to-sequence alignment (for example, Genome-Builder from OpGen) scale better and have been used for the assembly of large eukaryotic genomes [4] but tend to discard a large fraction of the mapping data (more than 90 %) due to reduced sensitivity and correspondingly lead to increased mapping costs for a project.

Map-alignment algorithms are thus faced with the twin challenges of improving sensitivity and precision on the one hand and reducing computational costs for alignment and statistical evaluation on the other hand. An elegant solution to this problem from the field of sequence-to-sequence alignment is the use of a seed-and-extend approach [18]. However, because maps represent ordered lists of continuous values, this extension is not straightforward, particularly when multiple sources of mapping errors and their high error rates are taken into account [19]. In addition, because error rates can vary across technologies, and even across different runs on the same machine, it is not clear whether a general sensitive map-to-sequence aligner is feasible. An efficient statistical testing framework that helps control for false discovery without prior information about error rates is critical for making such an aligner easy to use and applicable across technology platforms.

In this work, we describe how a sorted search index and the use of a composite seeding strategy can help to efficiently and sensitively detect seed map-to-sequence alignments [20]. Our second contribution is the design of a robust and fast statistical evaluation approach that includes multiple sources of mapping errors in the alignment score and evaluates the significance of the best alignment using all identified feasible solutions. We incorporated these ideas and additional refinements to solve two common alignment problems: glocal alignment, solved with OPTIMA, where an entire map is aligned to a subsequence of a second (typically in silico) map; and overlap alignment, solved with OPTIMA-Overlap, where the end of one map is aligned to the beginning of another. When benchmarked against state-of-the-art aligners, OPTIMA and OPTIMA-Overlap typically provide a strong boost in sensitivity (1.6–2 times) without sacrificing precision of alignments (∼99 %). Moreover, our pilot implementations exhibited runtime improvements over commercially available tools (two times faster than OpGen’s Gentig) and orders of magnitude over published, freely available algorithms and software [10, 13].

Finally, these methods are shown to be robust to variations in error distributions while being agnostic to them, suggesting that the methods can deal with different experimental outcomes of the same technology (for example, different map cards or lane types) as well as being applicable across mapping technologies (with minor modifications for pre-processing of data). Because glocal and overlap alignments form the basis of a range of applications that involve the combination of sequence and mapping data (for example, assembly scaffolding, refinement and validation, structural variation analysis and resolving complex genomic regions), OPTIMA and OPTIMA-Overlap can serve as building blocks for these applications, allowing for time- and cost-effective analyses.

Definitions and problem formulation

High-throughput genome mapping technologies typically work by linearizing large molecules of DNA (for example, in nanochannels [6]) and using enzymes such as restriction enzymes to recognize and label (for example, by cutting DNA) specific patterns throughout the genome (for example, a 6-mer motif). These patterns are then read out (typically, optically) to obtain an ordered set of fragment sizes for each DNA molecule (see Fig. 1 a for an example of a map). If corresponding genome sequences or assemblies are available, these can be converted into in silico maps through pattern recognition [16].

Let o₁,o₂,…,o_m be the m ordered fragment sizes of an experimentally derived map o and r₁,r₂,…,r_n be the n fragment sizes of an in silico map r. For simplicity, we suppose here that m≤n and assume that we can derive standard deviations for in silico fragments, that is, σ_j for r_j, in a technology-dependent fashion. In an idealized case, we can define the problem of glocally aligning o to r as a one-to-one correspondence between all the fragments of o with a subset of the fragments of r, that is, r_l,r_l+1,…,r_l+m−1 (we could also reverse the roles of o and r here). In practice, many sources of errors affect experimentally derived maps, including missing cuts, false/extra cuts, missing fragments, fragment sizing errors and spurious maps [13]. In silico maps could also be affected by sequencing or assembly errors [10], but these are less likely to affect alignments because typically they are infrequent. To accommodate errors, we extend the definition of correspondence between map fragments to allow for matches between sets of fragments (see Fig. 1 b), as used previously in [10]:

Definition 1 (Feasible match).

A subset of fragments o_k,o_k+1,…,o_s aligned as as a whole entity to a subset of in silico fragments r_l,r_l+1,…,r_t is called a feasible match if and only if:

$$ \left|{\frac{\sum\limits_{i=k}^{s} o_{i} - \sum\limits_{j=l}^{t} r_{j}}{\sqrt{\sum\limits_{j=l}^{t} {\sigma_{j}^{2}}}}}\right| \le C_{\sigma}, $$

((1))

where C_σ=3 is an appropriate bound if sizing errors are approximately normally distributed.

Definition 2 (Glocal alignment).

A valid glocal alignment is an ordered set of matches M₁,M₂,…,M_w between experimental and in silico fragments, such that all the experimental fragments o₁,o₂,…,o_m are aligned to a subset of the in silico fragments r_t,r_t+1,…,r_v, and both sets are orderly partitioned by all the matches M_1…w without overlaps, with w≤m and w≤v−t+1.

Missing fragments, which usually arise from short fragments below the experimental detection limit (for example, 2 kbp), can be handled in this framework by allowing gaps, that is, with the option of ignoring short fragments for the purpose of the C_σ test (Eq. 1).

Definition 3 (Overlap alignment).

A valid overlap alignment is an ordered set of matches M₁,M₂,…,M_w that allows experimental maps and in silico maps to only partially align with each other, with M₁ and M_w each corresponding to an end of one of the maps (for example, Fig. 1 e).

In general, because maps can have truncated ends, we relax the C_σ test to be only an upper bound on matches comprising the ends of experimental maps, for example:

$$\sum\limits_{i=k}^{m} o_{i} - \sum\limits_{j=l}^{t} r_{j} \le C_{\sigma} \sqrt{\sum\limits_{j=l}^{t} {\sigma_{j}^{2}}}, $$

or a lower bound on matches at the ends of in silico maps, for example:

$$\sum\limits_{i=k}^{s} o_{i} - \sum\limits_{j=l}^{v} r_{j} \ge C_{\sigma} \sqrt{\sum\limits_{j=l}^{v} {\sigma_{j}^{2}}}. $$

Methods

OPTIMA is the first alignment tool based on the seed-and-extend paradigm that can deal with erroneous mapping data. The basic paradigm is similar to that used for the alignment of discrete-valued sequences (allowing for mismatches and indels) and is as follows. We start by indexing the in silico maps, so that we can use this information later, and find seeds for each experimental map o corresponding to some indexed regions of those sequences. We then extend these seeds by using dynamic programming to try to align the whole experimental map to the corresponding in silico map region. For each map o, feasible solutions – as defined by the index structure, size of the genome and maximum error rate – are then evaluated by a scoring scheme to select the optimal solution. Finally, the statistical significance and uniqueness of the optimal solution are determined by comparing and modeling all the identified feasible solutions.

Indexing continuous-valued seeds

The definition of appropriate seeds is critical in a seed-and-extend approach in order to maintain a good balance between sensitivity and speed. A direct extension of discrete-valued seeds to continuous values is to consider values that are close to each other (as defined by the C_σ bound) as matches. However, as mapping data typically have high error rates [13, 16] and represent short sequences (for example, on average, optical maps contain 10–22 fragments, representing roughly a 250 kbp region of the genome), a seed of c consecutive fragments (c-mer) is likely to have low sensitivity unless we use a naive c=1 approach (see Fig. 2 for a comparison) and pay a significant runtime penalty that scales with genome size [14, 16]. Therefore, we propose and validate the following composite seed extension for continuous-valued seeds, analogous to the work on spaced seeds for discrete-valued sequences [21].

Definition 4 (Composite seed).

Let $r_{j_{1}}$, $r_{j_{2}}$ and $r_{j_{3}}$ be consecutive restriction fragments from a reference in silico map. A continuous-valued composite seed, for c=2, is given by including all of the following:

(i) the c-mer $r_{j_{1}}, r_{j_{2}}$, corresponding to no false cuts in the in silico map;
(ii) the c-mer $r_{j_{1}}+r_{j_{2}}, r_{j_{3}}$, corresponding to a missing cut in the experimental map (or false cut in the in silico map); and
(iii) the c-mer $r_{j_{1}}, r_{j_{2}}+r_{j_{3}}$, corresponding to a different missing cut in the experimental map (or false cut in the in silico map).

The reference index would then contain all c-tuples corresponding to a composite seed, as defined in Definition 4, for each location in the reference map. In addition, to account for false cuts in the experimental map, for each set of consecutive fragments $o_{i_{1}}$, $o_{i_{2}}$ and $o_{i_{3}}$ in the experimental maps, we search for c-tuples of the type $o_{i_{1}}, o_{i_{2}}$ and $o_{i_{1}} + o_{i_{2}}, o_{i_{3}}$ in the index (see Composite seeds (iv) depicted in Fig. 1 c).

To index the seeds, we adopt a straightforward approach where all c-tuples are collected and sorted into the same index in lexicographic order by c₁ (where the c_i are elements in the c-tuple). Lookups can be performed by binary search over fragment-sized intervals that satisfy the C_σ bound for c₁ and a subsequent linear scan of the other elements c_i, for i≥2, while verifying the C_σ bound in each case. Note that, because seeds are typically expected to be of higher quality, we can apply a more stringent threshold on seed fragment size matches (for example, we used $C^{Seed}_{\sigma } = 2$).

As shown in the “Results and discussion” section, this approach significantly reduces the space of candidate alignments without affecting the sensitivity of the search. A comparison between the various seeding approaches is shown in Fig. 2, which highlights the advantages of composite seeds with respect to 2-mers.

Overall, the computational cost of finding seeds using this approach is O(m (logn+c #seeds_c=1)) per experimental map, where n is the total length of the in silico maps in fragments, m≪n is the length of the experimental map and #seeds_c=1 is the number of seeds found in the first level of the index lookup, before narrowing down the list to the actual number of seeds that will be extended (#seeds). The cost and space of creating the reference index is thus O(c n), if the number of errors considered in the composite seeds is limited and bounded (as in Definition 4), and radix sort is used to sort the index. This approach drastically reduces the number of alignments computed in comparison to more general, global alignment searches [10], as will be shown later in the “Results and discussion” section.

Dynamic programming-based extension of seeds

In order to extend a seed to get a glocal alignment we adopt a scoring scheme similar to that used in SOMA (see [10]). This allows us to evaluate alignments without relying on a likelihood-based framework that requires prior information on error distributions as input [13]. In addition, we can use dynamic programming to efficiently find glocal alignments that optimize this score and contain the seed (see Fig. 1 d); specifically, for each seed side we proceed along the dynamic programming matrix by aligning the end of the sth experimental fragment with the end of the tth in silico fragment using backtracking to find feasible matches, that is, those that satisfy Eq. 1 and minimize the total number of cut errors (#cuterrors= missing cuts + false cuts + missing fragments found), with ties being broken by minimizing a χ² function for fragment sizing errors:

$$ {Score}_{s, t} = \min_{k \le s, l \le t} C_{ce} (s - k + t - l) + {\chi}_{k..s, l..t}^{2} + {Score}_{k-1, l-1}, $$

((2))

where the first index of each subscript represents experimental fragments, the second index represents the in silico fragments, s−k is the number of false cuts, t−l is the number of missing cuts, C_ce is a constant larger than the maximum possible total for χ², ${\chi }_{k..s, l..t}^{2} = \left (\sum \limits _{i=k}^{s} o_{i} - \sum \limits _{j=l}^{t} r_{j}\right)^{2} / \left (\sum \limits _{j=l}^{t} {\sigma _{j}^{2}}\right)$ and Score_0,0=0, Score_i,0=∞ and Score_0,j=∞.

Note that a small in silico fragment is considered as missing if this condition allows for a valid alignment that improves the local χ² on nearby matches by half (up to three consecutive fragments).

As in [16], we band the dynamic programming and its backtracking to avoid unnecessary computation. In particular, as we show in Supplementary Note 1 in Additional file 1, based on parameter estimates for optical mapping data, restricting alignments to eight missing cuts or five false cuts, consecutively, should retain high sensitivity. In addition, we stop the dynamic programming-based extension if no feasible solutions can be found for the current seed after having analyzed at least f fragments (default of five).

The computational cost of extending a seed (c-tuple) of an experimental map with m fragments is thus, in the worst case, O((m−c) δ³) time, where δ is the bandwidth of the dynamic programming, and O((m−c)²) space for allocating the dynamic programming matrix for each side of the seed.

Statistical significance and uniqueness analysis

To evaluate the statistical significance of a candidate alignment, we exploit the fact that we have explored the space of feasible alignments in our search and use these alignments to approximate a random sample from a (conservative) null model. The assumption here is that there is only one true alignment and that, therefore, the population of these sub-optimal alignments can provide a conservative null model for evaluating the significance of an alignment; more specifically, for each candidate alignment found, we compute its distance from the null model in a feature space (to be defined later) using a Z-score transformation and then use this score to evaluate whether it is optimal, statistically significant and unique (see Fig. 3 for an example).

We start by identifying a set F of features, independent with respect to false positive (or random) alignments, that are expected to follow the normal distribution (for example, using the central limit theorem) and be comparable between different alignments of the same experimental map, and we compute a Z-score for each feature f∈F and for each candidate solution π∈Π identified through the seeding method. Each Z-score takes into account the mean and standard deviation of a particular feature f among all candidate alignments found:

$$ {Z-score}(\mathrm{\pi} \in \mathrm{\Pi}, f) = \frac{f_{\mathrm{\pi}} - {Mean}(f_{\mathrm{\Pi}})}{{SD}(f_{\mathrm{\Pi}})}. $$

((3))

Accounting for all considered features f_i, with 1≤i≤k and k≥2, the resulting score is given by:

$$ \vartheta(\mathrm{\pi} \in \mathrm{\Pi}) = {Z-score}\left(\sum_{i} s_{i} \times {Z-score}(\mathrm{\pi}, f_{i})\right), $$

((4))

where s_i=1 if lower values of feature f_i are preferable and −1 otherwise, and the corresponding p-value is p_π=Pnorm(𝜗(π)), that is, the probability that a ‘random’ Z-score will be less than 𝜗(π) under the standard normal distribution.

In our case, we chose a set of features based on the number of matches (#matches), the total number of cut errors and the Wilson-Hilferty transformation (WHT) of the χ² score for sizing errors (which converges quickly to a standard normal distribution):

$$ \textrm{WHT}\left({\chi}^{2}, {\# matches}\right) = \frac{\sqrt[3]{\frac{{\chi}^{2}}{{\# matches}}} - \left(1 - \frac{1}{9} \frac{2}{{\# matches}}\right)}{\sqrt{\frac{1}{9} \frac{2}{{\# matches}}}}. $$

((5))

Note that this set can be shown to be composed of approximately independent features for false positive alignments (Z-score pairwise covariances between all features did not show a significant alteration of the final Z-scores when accounting for them in all of our simulations). By the central limit theorem, the mean of the first two features can be approximated by the normal distribution when the number of candidate solutions is large enough (typically, greater than 60 distinct solutions), and, by Slutsky’s theorem, their sample variance would not have a significant effect on the distribution of the test statistics. As lower values of #cuterrors and WHT(χ²,#matches) indicate better solutions, while a higher number of matches represents more reliable alignments, we need to adjust the signs of their Z-scores accordingly. The final Z-score 𝜗(π) computed for each candidate solution π is therefore given by:

$$ \begin{aligned} \vartheta(\mathrm{\pi} \in \mathrm{\Pi}) &= {Z-score}(-{Z-score}(\mathrm{\pi}, {\# matches})\\ &\quad+ {Z-score}(\mathrm{\pi}, {\# cut errors})\\ &\quad+ {Z-score}(\mathrm{\pi}, \textrm{WHT}({\chi}^{2}, {\# matches}))\,), \end{aligned} $$

((6))

which can be subsequently converted into the p-value p_π. The candidate solution π^∗ with the lowest p-value p^∗ is reported as the optimal solution, as shown in Fig. 3.

The statistical significance of the optimal solution can then be assessed through a false discovery rate q-value analysis [22] based on all candidate solutions found for comparable experimental maps, for example, those with the same number of fragments (default of q=0.01). To assess uniqueness, we set a threshold on the ratio of p-values between the best solution and the next-best solution (default of five). See Supplementary Note 2 in Additional file 1 for further algorithmic details.

In summary, our statistical scoring approach finds an optimal solution and evaluates its statistical significance and uniqueness in a unified framework, thus allowing for savings in computational time and space compared to a permutation test, without restricting the method to a scenario where experimental error probabilities are known a priori.

Extension to overlap alignment

To extend OPTIMA to compute and evaluate overlap alignments – a key step in assembly pipelines that use mapping data [4, 5, 23] – we use a sliding-window approach based on OPTIMA. This allows us to continue using the statistical evaluation procedure defined in OPTIMA that relies on learning parameters from comparable alignments (that is, those with the same number, size and order of experimental fragments) in a setting where the final alignments are not always of the same length and structure.

Briefly, for each map, OPTIMA-Overlap first finds optimal sub-map alignments, evaluates their significance and uniqueness, and then tries to extend the candidate alignments found until it reaches the ends of either the experimental map or the in silico map, in order to choose the most significant overlap alignments (see Fig. 1 e). This approach begins by dividing an experimental map into sub-maps of length l with a sliding window and then glocally aligning them to in silico maps using OPTIMA (again allowing for truncated ends to account for high error rates). Each glocal alignment sub-problem will then return either:

(i)
a significant and unique sub-map alignment;
(ii)
an alignment labeled as non-significant and/or non-unique (which will be considered as a false alignment); or
(iii)
no feasible alignments found.

Optimal solutions to the sub-problems are then ranked by p-value (smallest to largest) and iterated through to select sub-maps that should be extended. At each stage we check the significance and uniqueness of the reported solutions (compared to the others), as well as checking for potential cases of identical or conflicting alignments as defined below.

Definition 5 (Conflicting alignments).

A sub-map alignment π₁ is said to be conflicting with another alignment π₂ if either:

the sub-map of π₁ overlaps the sub-map of π₂; or
π₁ aligns to the same in silico map as π₂, but in a different location or strand.

Conflicting alignments can result in ambiguous placement of an experimental map on a database of in silico maps, but condition (a) could be relaxed in some cases, for example, when experimental maps are known to overlap multiple in silico maps in the same region. Therefore, while iterating through the list of sub-maps, the following rules are implemented:

1.
Significance: if the current solution π_i is labeled as a false alignment, then we stop iterating through the rest of the list.
2.
Uniqueness: we skip an alignment if either: (i) π_i represents the same overlap alignment as a more significant solution; (ii) π_i is conflicting with a solution with a lower p-value (that is, seen before); or (iii) π_i is not unique with respect to other solutions π_j with j>i (that is, having greater p-values) that it is conflicting with.
3.
Extension with dynamic programming: optimal overlap solutions are identified according to Eq. 2, where ties are broken in favor of longer valid alignments.

This approach allows us to report multiple overlap alignments (including split alignments) for an experimental map while using the q-value analysis, as before, to report all alignments with q≤0.01. For the q-value analysis, we consider all candidate solutions found for the sliding windows in order to learn the q-value parameters. In addition, we can reuse the dynamic programming matrix computed for each seed across sub-map alignments and thus complete the overlap alignment with the same asymptotic time and space complexity as the glocal alignment.

Results and discussion

Generation of benchmarking datasets

To benchmark OPTIMA and OPTIMA-Overlap against other state-of-the-art map aligners, we first developed synthetic datasets that aim to represent two ends of the spectrum of errors in mapping data for eukaryotic genomes. These scenarios were defined by confidently aligning (using SOMA [10] and manual curation) two sets of maps from different experimental runs for optical mapping (using the Argus system from OpGen) on a human cell line. The first scenario, (A), was defined based on lanes that were reported by the Argus machine to have high quality scores, while the second scenario, (B), was defined by lanes with the lower qualities that were typically obtained on the system. We estimated three key parameters from the data: d, the average restriction enzyme digestion rate; f₁₀₀, the average false cut rate per 100 kbp; and the fragment sizing errors for predefined (reference) in silico fragment size ranges (these were fixed for both scenarios and recorded as relative deviations of the empirical sizes from the reference sizes):

Easier scenario: d=0.78 (corresponding to missing cut rate of 22 %); f₁₀₀=0.97; and probability at 0.5 for missing fragments of size below 1.2 kbp, 0.75 below 600 bp and 1 below 350 bp.
Harder scenario: d=0.61 (corresponding to missing cut rate of 39 %); f₁₀₀=1.38; and 0.5 for missing fragments of size below 2 kbp, 0.75 below 800 bp and 1 below 350 bp.

For each scenario, we first simulate the map sizes using empirically derived distributions from real maps (average size of approximately 275 kbp, containing 17 fragments) and extract the corresponding reference sub-maps by sampling start locations uniformly from the in silico maps (possibly creating truncated end fragments). Then we introduce cut errors using the probability distributions described in [13] with the above parameters, that is: first, we remove missing cuts following a Binomial distribution with probability 1−d; next, we insert false cuts modeled as a Poisson process with rate f₁₀₀ (avoiding creation of small fragments less than 1.2 kbp in size); and finally, we remove small fragments with the probabilities described above. Sizing errors are introduced by sampling from the empirical errors found for each range of reference fragment sizes. Simulated experimental maps smaller than 150 kbp or with fewer than ten fragments are discarded, mimicking the pre-processing stage on the Argus system.

We generated 100 times greater coverage of maps with errors for the Drosophila melanogaster (BDGP 5) and Homo sapiens (hg19/GRCh37) genomes using the KpnI restriction pattern GGTAC’C, where the apostrophe indicates the position of the cut, which resulted in 13,920 fragments genome-wide (forward and reverse strands) with an average fragment size (AFS) of 17.3 kbp and 573,276 fragments with AFS = 10.8 kbp, respectively.

Glocal alignment results

OPTIMA was compared against the state-of-the-art algorithms Gentig v.2 (alignment module) [16, 17, 24], SOMA v.2 [10] and Valouev’s likelihood-based fit alignment [13] for glocally aligning the simulated maps over their respective in silico reference genomes. TWIN [19] was not included in this comparison as it does not allow for errors and missing information in experimental maps.

We also ran variations of these algorithms from their default options (d): specifically, by providing the true error distribution parameters used in the simulations as input (tp), the adjusted AFS based on the organism under analysis (a) and parameter values published in the respective papers (instead of the software’s default ones), to provide, in addition, the true error distribution rates (p); and by allowing the trimming of map ends in the alignment (t). Moreover, SOMA [10] was modified to correctly handle missing in silico fragments up to 2 kbp, to run only for C_σ=3, to make its results comparable, and by inverting the role of in silico–experimental input maps (v). We omitted SOMA’s statistical test (also for Valouev’s likelihood method, where it is not enabled by default), because it is unfeasible for large datasets [19], and applied only its uniqueness test (F-test). Further details about the running parameters are provided in Supplementary Note 3 in Additional file 1. OPTIMA alignments were performed on both strands of the in silico maps, without trimming end fragments or removing any small in silico fragments.

As can be seen from the results in Table 1, OPTIMA reports alignments with very high precision, greater than 99 % in most cases, independent of the genome size and the dataset error rate. In comparison, Gentig has similarly high precision on the Drosophila genome but lower precision on the human genome, with as low as 80 % precision under scenario (B) (with default parameters). Without their computationally expensive statistical tests, which can increase the runtime by a factor of greater than 100, SOMA and the likelihood-based method have much lower precision, particularly on the human genome. In addition, in terms of sensitivity, OPTIMA was found to be notably better than other aligners, even when the true error distribution rates were provided as input to these algorithms. In particular, for the higher quality scenario (A), OPTIMA is more than 1.5 times as sensitive as Gentig, and for the commonly obtained scenario (B), OPTIMA is more than twice as sensitive as Gentig.

Table 1 Comparison of all methods and their variants on glocal map-to-sequence alignment

Full size table

These results are further broken down in Fig. 4 as a function of the number of fragments in the experimental maps, showing that OPTIMA uniformly achieves more than twice the sensitivity for the smaller maps that are frequently obtained in real datasets. The relatively higher sensitivities of SOMA and the likelihood-based method in these experiments are likely to be artifacts of relaxed settings in the absence of their statistical tests. These results highlight OPTIMA’s high precision and improved sensitivity across experimental conditions and suggest that it could adapt well to other experimental settings.

In Table 2, we further compare all methods on their running time and worst-case complexity (runtime and space). It can be seen that SOMA and the likelihood-based methods are at least an order of magnitude slower than OPTIMA and Gentig. Gentig’s proprietary algorithm is based on work that has been previously published [17, 24], but its current version uses an unpublished hashing approach. In comparison, OPTIMA is two times faster while being more than 50 % more sensitive than Gentig.

Table 2 Running time and worst-case complexity for various glocal map-to-sequence aligners

Full size table

Real data analysis and comparison

To assess OPTIMA’s performance on real data we generated, in-house, 309,879 and 296,217 optical maps for two human cell lines, GM12878 and HCT116, respectively, using the Argus system from OpGen [4, 25] (with the KpnI enzyme and ten map cards in total), and glocally aligned them over the human reference genome. Supplementary Note 4 in Additional file 1 provides the sizing error statistics.

Tables 3 and 4 report statistics of the alignments for raw maps filtered under two settings:

(r)
relaxed filtering, which filters maps with fewer than ten fragments and smaller than 150 kbp;
Table 3 Statistics for glocal alignment of real human optical maps from GM12878 HapMap cell line
Full size table
Table 4 Statistics for glocal alignment of real human optical maps from HCT116 colorectal cancer cell line
Full size table
(s)
stringent filtering (as suggested in [4]), which filters maps with fewer than 12 fragments and smaller than 250 kbp.

The statistics were reported independently for each map card to capture the variability in terms of quality. In total, OPTIMA, with a stringent uniqueness threshold of 30, confidently aligned nearly three times as many maps as Gentig (with default parameters) for GM12878. Similarly, for HCT116, OPTIMA results were 1.7 times better than Gentig results, and corresponding improvements were also obtained using the stringently filtered datasets.

Further analyzing the error rates in the maps that OPTIMA confidently aligned (Tables 3 and 4), we observed that the overall statistics for average digestion rate d, false/extra cut rate f₁₀₀ and sizing errors were found to be similar to those obtained using scenario (B) (see Supplementary Note 5 in Additional file 1).

Overlap alignment results

For overlap alignment, we compared OPTIMA-Overlap with an overlap-finding extension of Gentig v.2 (implemented in the commercial software Genome-Builder from OpGen, which contains a module called SCAFFOLDEXTENDER) [17, 24], as well as with Valouev’s likelihood-overlap method [13].

In our first test, we randomly selected 1000 maps for each scenario (A) and (B) from our previously simulated maps for Drosophila (BDGP 5) and human (hg19/GRCh37) genomes. In addition, we simulated assembled sequence fragments for these genomes based on empirically derived scaffold size distributions (Drosophila assembly N50 of 2.7 Mbp with 239 scaffolds and human assembly N50 of 3.0 Mbp with 98,987 scaffolds); the simulated assemblies were used to generate in silico maps (filtered for those with fewer than four non-end fragments, because these cannot be confidently aligned [14, 15]), which were then aligned with the simulated experimental maps.

For our second test, we compared all methods on optical mapping data generated in-house from the K562 human cancer cell line on OpGen’s Argus system, where a random sample of 2000 maps with at least ten fragments was extracted. In silico maps were generated from de novo assemblies of shotgun Illumina sequencing data (HiSeq) and six mate-pair libraries with insert sizes ranging from 1.1 kbp to 25 kbp [26] (the final assembly had an N50 of 3.1 Mbp and 76,990 scaffolds in total, using SOAPdenovo v.1.05 [27] with a k-mer size of 51 for contig assembly and Opera v.1.1 [28] for scaffolding with mate pairs). It is likely that this dataset represents a harder scenario, with assembly gaps/errors and genomic rearrangements confounding the analysis. It also represents a likely use case where mapping data will be critical to detect large structural variations, disambiguate complex rearrangements and, ultimately, assemble cancer genomes de novo.

For each test, we evaluated the precision of alignments as well as the number of (correctly) reported alignments that provide an extension to the in silico maps through experimentally determined fragments, as this is key for the application of overlap alignments in genome assembly. We begin by noting that there is an important trade-off between sensitivity with a specific window size in OPTIMA-Overlap and the correctness of alignments, as can be seen in Fig. 5. As expected, even though small window sizes (less than ten in Fig. 5) provide more sensitive results, they also make true alignments indistinguishable from noise and reduce the number of correct alignments detected; on the other hand, larger window sizes improve the signal-to-noise ratio but result in a drop in sensitivity. This leads to a sweet spot in the middle (10–13 fragments) where the method is most sensitive across a range of datasets. In particular, real datasets are slightly more challenging than our simulations (see human (B) compared to real data in Fig. 5) and so we have conservatively chosen a window size of 12 as the default for OPTIMA-Overlap. By benchmarking OPTIMA-Overlap with this setting, we observed high precision similar to that observed with OPTIMA for glocal alignment (Table 5). This was seen uniformly across datasets with disparate profiles in terms of genome size and error rates, suggesting that our statistical evaluation is reasonably robust. As before, we also note that Gentig’s approach and the likelihood-based method might not always exhibit high precision. Finally, in terms of sensitivity, OPTIMA-Overlap shows a 30–80 % improvement over competing approaches, and this is also seen in the harder real datasets.

Table 5 Comparison of methods for overlap map-to-sequence alignment

Full size table

Utility in real-world applications

Overlap alignments form a critical building block for applications such as OpGen’s Genome-Builder and its use in boosting assembly quality [4]. As OPTIMA-Overlap can work with lower quality data (scenario (B) in our simulations; Genome-Builder would typically filter out such data) and also provide improved sensitivity in detecting overlap alignments, we estimate that its use could reduce the requirement for generating mapping data by half. As the cost of mapping data for the assembly of large eukaryotic genomes can range from USD 20,000 to 100,000, this can lead to significant cost savings.

We similarly compared OPTIMA and Gentig on the two human cell line results, shown in Tables 3 and 4, in order to calculate how much mapping data would be needed for sufficient aligned coverage of the human genome to enable structural variation analysis. By analyzing the alignment rate increase of OPTIMA compared to Gentig, a 1.5 to 2.9 times increase on average, we computed the corresponding cost reduction to be 33–66 %, with an average cost reduction of 54 % for relaxed filtering of data (r) and 49 % for stringent filtering (s). These results suggest that for structural variation analysis on the human genome, particularly for cancer genomes, OPTIMA can significantly reduce project costs (in the tens of thousands of dollars) while enabling faster analyses of the data.

Conclusion

With the availability of new mapping technologies (for example, Nabsys) and greater use of existing ones to complement high-throughput sequencing, there is a critical need for robust computational tools that can combine genomic mapping and sequence data efficiently. In this work, we introduce two new alignment tools that address this need for a wide range of applications, from genome assembly to structural variation analysis. Our benchmarking results provide evidence that these methods outperform existing approaches in sensitivity and runtime while providing highly precise alignments in a range of experimental settings. Similar results are also seen in real datasets from human cell lines, suggesting that they could help in significantly reducing the cost of optical mapping analysis and thus increase its usage.

In the development of OPTIMA and OPTIMA-Overlap, we establish two key new ideas for map alignment. The first is the introduction of composite seeds, an idea that echoes the idea of spaced seeds in the context of continuous-valued sequence alignment. Composite seeds allow us to develop efficient seed-and-extend aligners for map-to-sequence alignment of erroneous mapping data. We believe that similar ideas can be applied for map-to-map alignment and de novo assembly of experimental maps. The second concept is the development of a conservative statistical testing approach that does not require knowledge of the true distribution of errors or an expensive permutation test to evaluate the uniqueness and significance of alignments. This allows us to significantly reduce the runtime cost of this step without sacrificing specificity or the ability to be agnostic with respect to error rates. Although our experiments with real data in this work were limited to data generated on the Argus system from OpGen, similar ideas (with minor variations) should also be applicable to data generated by other technologies such as the Irys platform from BioNano Genomics.

In future work, we plan to implement further runtime and memory optimizations to OPTIMA and OPTIMA-Overlap and to explore their use for super-scaffolding of large genomes as well as for studying genomic rearrangements in cancer.

Availability and requirements

Project name: OPTIMA: Index-based map-to-sequence alignment in large eukaryotic genomes
Project home page:http://www.davideverzotto.it/research/OPTIMA, https://github.com/verznet/OPTIMA
Operating system: Platform independent
Programming language: Java 7+
Other requirements: Java Development Kit 7+, Apache Commons Math 3.2, CERN Colt 1.2.0
License: LGPL (Lesser General Public License) 2.1, OSI compliant
Any restrictions to use by non-academics: none

Availability of supporting data

Snapshots of the code and benchmarking and real datasets are available from the GigaScience GigaDB database [29, 30].

References

OpGen Inc. 2015. Available from: http://www.opgen.com. Accessed 9 Dec 2015.
BioNano Genomics Inc. 2015. Available from: http://www.bionanogenomics.com. Accessed 9 Dec 2015.
Nabsys Inc. 2015. Available from: http://www.nabsys.com. Accessed 9 Dec 2015.
Dong Y, Xie M, Jiang Y, Xiao N, Du X, Wang W, et al.Sequencing and automated whole-genome optical mapping of the genome of a domestic goat (Capra hircus). Nat Biotechnol. 2013; 31:135–41.
Article CAS PubMed Google Scholar
Ganapathy G, Howard J, Ward J, Zhang G, Phillippy A, Jarvis E, et al.High-coverage sequencing and annotated assemblies of the budgerigar genome. GigaScience. 2014; 3:11.
Article PubMed PubMed Central Google Scholar
Lam ET, Hastie A, Lin C, Ehrlich D, Nagarajan N, Kwok PY, et al.Genome mapping on nanochannel arrays for structural variation analysis and sequence assembly. Nat Biotechnol. 2012; 30:771–6.
Article CAS PubMed Google Scholar
Ray M, Goldstein S, Zhou S, Potamousis K, Sarkar D, Schwartz D, et al.Discovery of structural alterations in solid tumor oligodendroglioma by single molecule analysis. BMC Genomics. 2013; 14:505.
Article PubMed PubMed Central Google Scholar
Waterman MS, Smith TF, Katcher H. Algorithms for restriction map comparisons. Nucleic Acids Res. 1984; 12:237–42.
Article CAS PubMed PubMed Central Google Scholar
Mendelowitz L, Pop M. Computational methods for optical mapping. GigaScience. 2014; 3:33.
Article PubMed PubMed Central Google Scholar
Nagarajan N, Read TD, Pop M. Scaffolding and validation of bacterial genome assemblies using optical restriction maps. Bioinformatics. 2008; 24:1229–35.
Article CAS PubMed PubMed Central Google Scholar
Lin H, Goldstein S, Mendelowitz L, Zhou S, Schwartz D, Pop M, et al.AGORA: assembly guided by optical restriction alignment. BMC Bioinforma. 2012; 13:189.
Article Google Scholar
Antoniotti M, Anantharaman T, Paxia S, Mishra B. Genomics via optical mapping IV: sequence validation via optical map matching. New York, NY, USA: New York University; 2001.
Google Scholar
Valouev A, Li L, Liu YC, Schwartz DC, Yang Y, Waterman MS, et al.Alignment of optical maps. J Comput Biol. 2006; 13:442–62.
Article CAS PubMed Google Scholar
Anantharaman TS, Mishra B. A probabilistic analysis of false positives in optical map alignment and validation. In: First international workshop on algorithms in bioinformatics (WABI). Aarhus, Denmark: Springer: 2001. p. 27–40.
Google Scholar
Sarkar D, Goldstein S, Schwartz DC, Newton MA. Statistical significance of optical map alignments. J Comput Biol. 2012; 19:478–92.
Article CAS PubMed PubMed Central Google Scholar
Anantharaman TS, Mishra B, Schwartz DC. Genomics via optical mapping II: ordered restriction maps. J Comput Biol. 1997; 4:91–118.
Article CAS PubMed Google Scholar
Anantharaman TS, Mishra B, Schwartz DC. Genomics via optical mapping III: contiging genomic DNA. In: Proceedings of the 7th international conference on intelligent systems for molecular biology (ISMB 1999). Heidelberg, Germany: AAAI: 1999. p. 18–27.
Google Scholar
Karp RM, Rabin MO. Efficient randomized pattern-matching algorithms. IBM J Res Dev. 1987; 31:249–60.
Article Google Scholar
Muggli MD, Puglisi SJ, Boucher C. Efficient indexed alignment of contigs to optical maps. In: Algorithms in Bioinformatics (WABI 2014). vol. 8701 of LNCS. Wrocław, Poland: Springer: 2014. p. 68–81.
Google Scholar
Verzotto D, Teo ASM, Hillmer AM, Nagarajan N. Index-based map-to-sequence alignment in large eukaryotic genomes. In: Proceedings of the fifth RECOMB satellite workshop on massively parallel sequencing (RECOMB-Seq 2015). Warsaw (Poland): Cold Spring Harbor Labs Journals: 2015. p. 1–11. doi:10.1101/017194.
Google Scholar
Califano A, Rigoutsos I. FLASH: a fast look-up algorithm for string homology. In: Proceedings of the 1st International Conference on Intelligent Systems for Molecular Biology (ISMB 1993). Bethesda, MD, USA: AAAI: 1993. p. 56–64.
Google Scholar
Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Nat Acad Sci. 2003; 100:9440–5.
Article CAS PubMed PubMed Central Google Scholar
Kawahara Y, de la Bastide M, Hamilton J, Kanamori H, McCombie R, Matsumoto T, et al.Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice. 2013; 6:4.
Article PubMed PubMed Central Google Scholar
Anantharaman TS, Mysore V, Mishra B. Fast and cheap genome wide haplotype construction via optical mapping In: Altman RB, Jung TA, Klein TE, Dunker AK, Hunter L, editors. Proceedings of the 10th Pacific Symposium on Biocomputing (PSB 2005). Hawaii, USA: World Scientific: 2005. p. 1–16.
Google Scholar
Teo ASM, Verzotto D, Yao F, Nagarajan N, Hillmer AM. Single-molecule optical genome mapping of a human HapMap and a colorectal cancer cell line. GigaScience. 2015; 4:65. doi:10.1186/s13742-015-0106-1.
Article PubMed PubMed Central Google Scholar
Yao F, Ariyaratne PN, Hillmer AM, Liu ET, Ruan Y. Long span DNA Paired-End-Tag (DNA-PET) sequencing strategy for the interrogation of genomic structural mutations and fusion-point-guided reconstruction of amplicons. PLOS ONE. 2012; e46152:7.
Google Scholar
Li R, Zhu H, Ruan J, Wang J, Fang X, Wang J, et al.De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010; 20:265–72.
Article CAS PubMed PubMed Central Google Scholar
Gao S, Sung WK, Nagarajan N. Opera: Reconstructing Optimal Genomic Scaffolds with High-Throughput Paired-End Sequences. J Comput Biol. 2011; 18:1681–91.
Article CAS PubMed PubMed Central Google Scholar
Verzotto D, Teo ASM, Hillmer AM, Nagarajan N. Supporting software for OPTIMA, a tool for sensitive and accurate whole-genome alignment of error-prone genomic maps by combinatorial indexing and technology-agnostic statistical analysis; 2015. GigaScience Database. doi:10.5524/100165.
Google Scholar
Teo ASM, Verzotto D, Yao F, Nagarajan N, Hillmer AM. Supporting single-molecule optical genome mapping data from human HapMap and colorectal cancer cell lines; 2015. GigaScience Database. doi:10.5524/100182.
Google Scholar

Download references

Acknowledgments

We would like to thank Juntao Li for useful discussions on the statistical analysis and Denis Bertrand and Burton KH Chia for valuable feedback and comments on a draft of this manuscript. We thank Fei Yao for help on experiments and troubleshooting of the Argus system and Chee Seng Chan and Lavanya Veeravalli for post-processing the optical mapping data.

Author information

Authors and Affiliations

Computational and Systems Biology, Genome Institute of Singapore, 60 Biopolis Street, Singapore, 138672, Singapore
Davide Verzotto & Niranjan Nagarajan
Cancer Therapeutics and Stratified Oncology, Genome Institute of Singapore, 60 Biopolis Street, Singapore, 138672, Singapore
Audrey S. M. Teo & Axel M. Hillmer

Authors

Davide Verzotto
View author publications
You can also search for this author in PubMed Google Scholar
Audrey S. M. Teo
View author publications
You can also search for this author in PubMed Google Scholar
Axel M. Hillmer
View author publications
You can also search for this author in PubMed Google Scholar
Niranjan Nagarajan
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Davide Verzotto or Niranjan Nagarajan.

Additional information

Competing interests

This work was partly supported under a research collaboration agreement with Sciencewerke Pte. Ltd., the Singapore distributor for OpGen Inc. No employees of Sciencewerke or OpGen played a role in the work described in this manuscript. DV and NN are inventors on a patent application related to this work.

Authors’ contributions

The overall project was conceived and planned by AMH and NN, while DV and NN defined sub-projects of this work. ASMT carried out all experiments, to generate the optical mapping data used in this manuscript, under the guidance of AMH. DV designed and implemented the algorithms and conducted the analysis under the guidance of NN. DV and NN wrote the manuscript and all authors read and approved the final manuscript.

Additional file

Additional file 1

Supplementary material. Supplementary notes and figures to accompany the main manuscript. (PDF 32358 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Verzotto, D., M. Teo, A.S., Hillmer, A.M. et al. OPTIMA: sensitive and accurate whole-genome alignment of error-prone genomic maps by combinatorial indexing and technology-agnostic statistical analysis. GigaSci 5, 2 (2016). https://doi.org/10.1186/s13742-016-0110-0

Download citation

Received: 09 May 2015
Accepted: 06 January 2016
Published: 19 January 2016
DOI: https://doi.org/10.1186/s13742-016-0110-0

OPTIMA: sensitive and accurate whole-genome alignment of error-prone genomic maps by combinatorial indexing and technology-agnostic statistical analysis

Abstract

Background

Results

Conclusions

Background

Definitions and problem formulation

Definition 1 (Feasible match).

Definition 2 (Glocal alignment).

Definition 3 (Overlap alignment).

Methods

Indexing continuous-valued seeds

Definition 4 (Composite seed).

Dynamic programming-based extension of seeds

Statistical significance and uniqueness analysis

Extension to overlap alignment

Definition 5 (Conflicting alignments).

Results and discussion

Generation of benchmarking datasets

Glocal alignment results

Real data analysis and comparison

Overlap alignment results

Utility in real-world applications

Conclusion

Availability and requirements

Availability of supporting data

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding authors

Additional information

Competing interests

Authors’ contributions

Additional file

Additional file 1

Rights and permissions

About this article

Cite this article

Share this article

Keywords

GigaScience

Contact us