Image processing for optical mapping
© Ravindran and Gupta. 2015
Received: 19 February 2015
Accepted: 5 November 2015
Published: 26 November 2015
Optical Mapping is an established single-molecule, whole-genome analysis system, which has been used to gain a comprehensive understanding of genomic structure and to study structural variation of complex genomes. A critical component of the Optical Mapping system is the image processing module, which extracts single-molecule restriction maps from image datasets of immobilized, restriction-digested and fluorescently stained large DNA molecules. In this review, we describe robust and efficient image processing techniques to process these massive datasets and extract accurate restriction maps in the presence of noise, ambiguity and confounding artifacts. We also highlight a few applications of the Optical Mapping system.
Optical Mapping [1–3] is a high-throughput, single-molecule system that generates ordered restriction maps (also called Rmaps) from high molecular weight genomic DNA molecules, ranging in size from 300 kilobases to a few megabases. The Rmaps are then used to construct genome-wide physical restriction maps using computational approaches, which provide insights into long-range genome structure and genome variation. Optical Mapping is made possible by the integration of many diverse components that draw from surface chemistry, microfluidics, fluorescence microscopy, image processing and other computational approaches. The physical maps generated using Optical Mapping have served as scaffolds to guide and/or validate DNA sequencing based genome assemblies [4–7]. More recently, because of its ability to resolve repeat-rich and other low-complexity genomic loci, Optical Mapping has been used to identify structural polymorphisms in normal human genomes and structural variants in disease-risk and cancer genomes [9, 10].
The elongated and immobilized DNA molecules are digested with a restriction endonuclease of choice. Upon digestion, the cleavage sites in the double-stranded DNA appear as gaps between fragments, formed by DNA relaxation at the cut ends [1, 12]. Next, the digested DNA is stained with the intercalating fluorochrome YOYO-1 and imaged using automated laser-illuminated epifluorescence microscopy systems [2, 15–17]. Custom in-house software allows automated imaging of an entire array of microchannels with very little setup time. Once the images have been collected, they are automatically processed using custom image processing software to generate Rmaps, which are obtained as ordered series of fragment sizes derived from digested single DNA molecules [2, 3]. Once a large dataset of Rmaps has been collected, a computational pipeline that uses Bayesian inference approaches and cluster computing assembles the Rmaps into genome-wide contigs and generates genome-wide consensus maps [3, 19–21].
The description above highlights that the image processing module acts as a filter/bridge within the Optical Mapping pipeline that extracts the useful essence, the Rmaps, from massive optical microscopy datasets. Image processing is a critical contributor to successful implementation of Optical Mapping and works in synchrony with the other components of the system. The image processing module is the central focus of this manuscript.
Image processing methodology
Skeletonization: The single pixel centerline for each fragment is detected as a column ordered connected component or skeletal segment.
Tiling: A single Rmap can span multiple images. In order to extract multi-frame Rmaps, a mosaic of images acquired from a single microchannel is created by aligning adjacent images using the skeletal segments.
Grouping: The skeletal segments are grouped such that each group corresponds to fragments from the same DNA molecule.
Sizing: Groups of skeletal segments that correspond to standards are detected; conversion factors for these skeletal segments are computed using integrated fluorescence intensity and the known size of fragments for the standards. The estimated conversion factor is applied to construct Rmaps (in kilobases) for genomic DNA molecules.
Different versions of image processing software for Optical Mapping have been implemented over the last two decades. In the early days of Optical Mapping, useful map data was obtained using completely manual methods for detecting and sizing the fragments. Improvements to the image acquisition  and image processing software  culminated in the development of the semi-automatic Autovis system . Lim and coworkers described Semi-Autovis in , where they used it to generate Rmaps for the E. coli genome. As the name suggests, Semi-Autovis was a semi-automatic image processing system; it required the user to identify the approximate location of suitable molecules. Once such locations were identified, Semi-Autovis handled skeletonization, grouping and sizing automatically. This system also dealt with crossing molecules, bright spots near molecules and other object imperfections, which was not possible with prior image processing systems. For E. coli, a total of 840 Rmaps were collected (494 with XhoI; 346 with NheI), of which 471 were included in the final contigs (251 for XhoI and 220 for NheI), reflecting a contig rate of 56 %. Although Semi-Autovis was much faster than previous systems, there was a clear need for a completely automated image processing system for Optical Mapping of larger genomes. PathFinder [2, 17] was the first fully featured, automated image processing system developed for Optical Mapping and was instrumental in making large scale Optical Mapping projects feasible. The image processing methodology detailed below is inspired by the techniques implemented in the PathFinder system.
While the conditions in Eqs. 1 and 2 maintain connectedness of the skeletal pixels in a segment, skeletons that are not one pixel wide may also be produced. Hence, for each segment, the one pixel wide skeleton is extracted as the shortest path between the segment end points using Dijkstra’s algorithm . The accurate localization of end points is aided by the increased intensity pixels due to coil relaxation at enzyme cleavage sites. An example extracted skeleton with end points is shown in Fig. 3(c).
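The shortest-path step can be illustrated with a minimal sketch. It assumes the skeletal pixels of one segment are available as a set of (row, column) coordinates, that pixels are 8-connected, and that each step is weighted by its Euclidean length; the function name `one_pixel_skeleton` and this weighting are illustrative assumptions, not details taken from PathFinder.

```python
import heapq
import math


def one_pixel_skeleton(pixels, start, end):
    """Shortest 8-connected path from start to end through `pixels`.

    pixels: set of (row, col) tuples forming a connected, possibly thick skeleton.
    start, end: segment end points, both assumed to be members of `pixels`.
    Returns the path as a list of pixels, or [] if end cannot be reached.
    """
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, p = heapq.heappop(heap)
        if p == end:
            break
        if d > dist.get(p, math.inf):
            continue  # stale heap entry, a shorter route was already found
        r, c = p
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                q = (r + dr, c + dc)
                if q == p or q not in pixels:
                    continue
                nd = d + math.hypot(dr, dc)  # assumed step cost: 1 or sqrt(2)
                if nd < dist.get(q, math.inf):
                    dist[q] = nd
                    prev[q] = p
                    heapq.heappush(heap, (nd, q))
    if end not in dist:
        return []
    path = [end]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]
```

Because every pixel on the returned path has exactly one predecessor and one successor, the result is one pixel wide by construction, even when the input skeleton is locally thicker.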
The image acquisition system captures multiple overlapping images along each microchannel. Accordingly, long DNA molecules that span several frames are imaged, which necessitates tiling. As a linear stage is employed to acquire overlapping images, the geometric transformation that is used to model the tiling of adjacent images is a translation.
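As a rough illustration of translation-only tiling, the sketch below estimates the offset between two overlapping frames by letting pairs of skeletal pixels vote for the shift that would superimpose them. The voting scheme, the function name `estimate_translation`, and the `nominal` and `max_dev` parameters (the expected stage shift and the allowed deviation from it, in pixels) are assumptions made for this example; the source states only that adjacent images are aligned by a translation using the skeletal segments.

```python
from collections import Counter


def estimate_translation(skel_a, skel_b, nominal=(0, 0), max_dev=10):
    """Estimate (dr, dc) such that a pixel (r, c) in frame B maps to
    (r + dr, c + dc) in frame A.

    skel_a, skel_b: iterables of (row, col) skeletal pixels from adjacent frames.
    nominal: expected shift from the stage motion (hypothetical parameter).
    max_dev: largest accepted deviation from `nominal`, in pixels.
    Returns the most voted shift, or None if no pixel pair is admissible.
    """
    votes = Counter()
    for ra, ca in skel_a:
        for rb, cb in skel_b:
            dr, dc = ra - rb, ca - cb
            if abs(dr - nominal[0]) <= max_dev and abs(dc - nominal[1]) <= max_dev:
                votes[(dr, dc)] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

The pairwise loop is quadratic in the number of skeletal pixels, so a practical version would restrict the vote to pixels inside the expected overlap region.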
A skeletal segment can belong to at most one group.
Adjacent skeletal segments in the same group have no overlap.
For segments S_l and S_r, the ordered grouping (S_l, S_r) is valid if S_r is the best segment to pair with S_l when “growing” S_l on its right and vice versa. When determining the best segment, both spatial proximity and orientational similarity of the segments (via straight line fits) are used.
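The three grouping conditions can be sketched as a mutual-best-match search. In this minimal example each segment is reduced to its two end points, the orientation is taken from the end points as a stand-in for a straight line fit, and the "best" partner is assumed to be the admissible segment with the smallest end-to-end gap. The `Segment` class, the cost function, and the default thresholds (9 pixels and 15 degrees, the typical values quoted later in this review) are illustrative choices, not the PathFinder implementation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    left: tuple   # (x, y) of the left end point
    right: tuple  # (x, y) of the right end point

    @property
    def angle(self):
        """Orientation in degrees; end points stand in for a straight line fit."""
        return math.degrees(math.atan2(self.right[1] - self.left[1],
                                       self.right[0] - self.left[0]))


def _gap(l, r, max_gap, max_angle):
    """Gap between l's right end and r's left end, or None if not admissible."""
    if r.left[0] < l.right[0]:  # adjacent segments must not overlap
        return None
    gap = math.dist(l.right, r.left)
    if gap >= max_gap or abs(l.angle - r.angle) >= max_angle:
        return None
    return gap


def group_segments(segments, max_gap=9.0, max_angle=15.0):
    """Return groups as lists of segment indices, ordered left to right."""
    best_right, best_left = {}, {}
    cost_right, cost_left = {}, {}
    for i, l in enumerate(segments):
        for j, r in enumerate(segments):
            if i == j:
                continue
            g = _gap(l, r, max_gap, max_angle)
            if g is None:
                continue
            if g < cost_right.get(i, math.inf):
                best_right[i], cost_right[i] = j, g
            if g < cost_left.get(j, math.inf):
                best_left[j], cost_left[j] = i, g
    # Keep (l, r) only if each is the other's best partner.
    nxt = {i: j for i, j in best_right.items() if best_left.get(j) == i}
    has_prev = set(nxt.values())
    groups = []
    for i in range(len(segments)):
        if i in has_prev:
            continue  # not the leftmost segment of its group
        chain = [i]
        while chain[-1] in nxt:
            chain.append(nxt[chain[-1]])
        groups.append(chain)
    return groups
```

Since every segment has at most one accepted right partner and one accepted left partner, each segment ends up in exactly one chain, which enforces the first condition; the overlap check in `_gap` enforces the second. The sketch assumes roughly horizontal segments with a positive extent along x, so angle wrap-around is not handled.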
The two main factors that influence fragment sizing are: (i) intensity fluctuations due to local variations in the elongation of the DNA molecule or staining, and (ii) regions of increased gray level intensity adjacent to enzyme cleavage sites due to coil relaxation. Fragment sizes obtained using integrated fluorescence values are robust to these local effects. In order to convert the integrated fluorescence values into kilobases, standards are used to estimate the conversion factor.
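A minimal sketch of this conversion follows. It assumes that integrated fluorescence is a background-subtracted sum of pixel intensities over a fragment, and that the conversion factor is the ratio of total known size to total integrated fluorescence over the standards; both choices, the function names, and the numbers in the usage note are assumptions for illustration only.

```python
def integrated_fluorescence(intensities, background=0.0):
    """Sum of background-subtracted pixel intensities for one fragment."""
    return sum(max(v - background, 0.0) for v in intensities)


def conversion_factor(standards):
    """Kilobases per unit of integrated fluorescence.

    standards: list of (integrated_fluorescence, known_size_kb) pairs
    measured on fragments of the sizing standard.
    """
    total_fluorescence = sum(f for f, _ in standards)
    total_kb = sum(kb for _, kb in standards)
    if total_fluorescence <= 0:
        raise ValueError("standards carry no fluorescence signal")
    return total_kb / total_fluorescence


def size_rmap(fragment_fluorescence, kb_per_unit):
    """Convert an ordered list of fragment fluorescence values to kilobases."""
    return [f * kb_per_unit for f in fragment_fluorescence]
```

With made-up standards `[(4850.0, 48.5), (9700.0, 97.0)]` the factor is about 0.01 kb per unit, so a fragment with an integrated fluorescence of 1200 would be sized at roughly 12 kb. Summing over the whole fragment is what makes the estimate insensitive to local stretching and to the bright pixels near cleavage sites.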
The skeletonization technique presented here robustly detects each fragment as a single skeletal segment. This technique can be easily adapted to other optical single molecule platforms such as nanocoding  and Irys  by evaluating the skeletonization conditions (Eqs. 1–4) in a direction perpendicular to the dominant direction of the presented molecules.
The spatial proximity and orientation similarity parameters that are used for grouping adjacent skeletal segments are empirically derived. For two skeletal segments to be grouped, we typically require spatial proximity to be less than 9 pixels (at 100 nm per pixel) and orientation difference to be less than 15 degrees. Higher enzyme restriction density can confound these thresholds as smaller fragments can “float” away and may not be ideally localized for Rmap grouping. In such cases, ambiguities in grouping are handled using bioinformatics filters . It should be noted that perfect handling of this situation is highly non-trivial; however, when intact molecules (without restriction induced fragments) are localized in nanoslits [23, 24, 26], grouping becomes trivial.
Uncertainties in sizing are caused by variations in image intensities, ambiguities in localizing end points and distracting elements that can intersect molecules. The integrated fluorescence based sizing is highly resilient to the first two sources of uncertainties. Nearby and intersecting distractors are handled by “flagging” the affected fragment(s) in the Rmap and addressed using bioinformatics .
We highlight the effectiveness of the Optical Mapping system in providing detailed characterization of structural variants at the single molecule level using two exemplar large scale studies [3, 10]. For the 4 human genomes that were studied in , over 95 % of fragments (≥ 10 kb) were within 10 % of their corresponding reference fragment size, indicating high accuracy in fragment sizing. Collectively for the four genomes, close to 27 % of all marked up molecules were assembled into contigs for final assemblies. In a more recent study that characterized a highly reorganized multiple myeloma genome , close to 29 % of all marked up molecules were assembled. We stress that these performance metrics encompass errors at the different stages of Optical Mapping, namely: DNA presentation, digestion, labeling, surface inconsistencies, imaging and image processing. While the effectiveness of the system as a whole is quantifiable (and ultimately what matters), it is still unclear how a stage-wise characterization of errors can be performed, especially for very large genomes (the genome size regime in which Optical Mapping has the greatest impact).
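The sizing accuracy figure quoted above is a simple proportion, and the sketch below shows one way such a metric could be computed from matched fragment pairs. It assumes the 10 kb cutoff is applied to the reference size and that the tolerance is relative to the reference; the function name and both conventions are assumptions, since the source does not spell them out.

```python
def fraction_within_tolerance(pairs, tol=0.10, min_kb=10.0):
    """Fraction of fragments whose measured size is within `tol` of the reference.

    pairs: iterable of (measured_kb, reference_kb) for aligned fragments.
    Only fragments with reference_kb >= min_kb are counted (assumed convention).
    Returns None when no fragment passes the size cutoff.
    """
    eligible = [(m, r) for m, r in pairs if r >= min_kb]
    if not eligible:
        return None
    within = sum(1 for m, r in eligible if abs(m - r) <= tol * r)
    return within / len(eligible)
```

Because the pairs come from aligned Rmaps, a value computed this way reflects every upstream stage of the pipeline and not image processing alone, which is the caveat raised in the paragraph above.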
The time taken to process the images from a single channel is typically shorter than the time taken to collect them. Hence we have not employed parallelization strategies for the image processing. Parallelization will become highly attractive as the speed of data collection improves. The skeletonization, tiling and sizing modules described in this paper can readily exploit data parallelism.
Fully automated image processing allowed for rapid analysis of DNA molecules deposited in microchannels, which helped us understand key physical characteristics of the deposition process (such as DNA elongation and deposition density along the microchannel) and design optimal operating parameters for Optical Mapping . This enabled the generation of massive Rmap datasets, which facilitated high resolution analysis of genomes of various sizes.
Rmap assemblies provide long range structural information about the genome. Consequently, they generate a scaffold that can be used to verify or guide DNA sequencing based genome assemblies. Optical Mapping was first used to verify sequencing based chromosomal  and genome assemblies . With an increase in throughput, it was used to generate physical assemblies to aid sequencing based genome assembly for many microbial genomes. These include some bacterial genomes like Deinococcus radiodurans , Escherichia coli O157:H7 , Yersinia pestis  and Rhodobacter sphaeroides 2.4.1 . By comparing different bacterial strains to identify genomic differences, Optical Mapping was used for comparative genomics . More recently, plant genomes like rice  and maize [7, 30] and normal  and cancer  human genomes have been mapped. These assemblies have helped in validation of sequencing based assemblies and have also provided high-resolution scaffolds for gap closure and for correcting sequencing based assembly errors .
In the past decade, advances in genome analysis methods have highlighted the widespread presence of structural changes in normal and disease-affected human genomes [32–34]. However, these variants have been found to be selectively enriched in segmentally duplicated and other low-complexity regions of the genome [32, 35]. Because short-read DNA sequencing data cannot uniquely differentiate these regions, true positives are difficult to discern in them. Additionally, false negative rates as high as 37 % have been reported , which could still be an underestimate. For these reasons, different sequencing based structural variation calling algorithms show very little overlap . Optical Mapping of human genomes has uncovered a wide array of structural variation in these genomes. Teague et al. identified thousands of structural polymorphisms, ranging in size from a few kilobases to megabases, in a complete hydatidiform mole and three lymphoblast-derived cell lines . The authors also identified many structural variants that could not be detected by other genomic analysis platforms. Later, Ray et al. studied tumor genomes from two oligodendroglioma patient samples, the first use of Optical Mapping to study a solid tumor genome, and revealed many somatic structural variants and copy number heterogeneity . More recently, we integrated long-range structural variation analysis from Optical Mapping and short-range variation analysis from DNA sequencing data to comprehensively characterize variation in a multiple myeloma genome at different stages of disease progression .
Many other genome analysis platforms have been developed in recent years to understand long-range genome structure and structural variation. BioNano Genomics Irys technology has been used to identify structural variants in human genomes . Pacific Biosciences SMRT sequencing  and Oxford Nanopore Technologies sequencing  have increased the average read length from hundreds of bases to tens of kilobases. Although affected by significantly higher error rates than their short-read sequencing counterparts, these platforms can provide long-range sequencing information about genomes. Moving forward, developing computational methods and pipelines that integrate results from mapping- and sequencing-based platforms, or better, leverage raw datasets to improve sequencing pipelines, will help us learn more about whole genomes.
The successful implementation of the automated image processing techniques described in this review has allowed the high resolution analysis of many complex genomes. It has also enabled the study of the physical characteristics of DNA deposition using microfluidic systems. In addition, variants of the image processing techniques described in this review have been incorporated into the Nanocoding system, a higher resolution and more accurate successor to Optical Mapping [23, 26].
We would like to thank Prof. David C. Schwartz for his invaluable insights, support and guidance during the writing of this review. We would also like to acknowledge Konstantinos Potamousis, Michael Place, Steve Goldstein, Shiguo Zhou and Michael Bechner for helpful discussions and feedback. PR and AG would like to thank NHGRI for support (R01HG000225).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Schwartz DC, Li X, Hernandez LI, Ramnarain SP, Huff EJ, Wang YK. Ordered restriction maps of Saccharomyces cerevisiae chromosomes constructed by optical mapping. Science. 1993; 262(5130):110–4.
- Dimalanta ET, Lim A, Runnheim R, Lamers C, Churas C, Forrest DK, et al. A microfluidic system for large DNA molecule arrays. Anal Chem. 2004; 76(18):5293–301.
- Teague B, Waterman MS, Goldstein S, Potamousis K, Zhou S, Reslewic S, et al. High-resolution human genome structure by single-molecule analysis. Proc Natl Acad Sci USA. 2010; 107(24):10848–53.
- Lin J, Qi R, Aston C, Jing J, Anantharaman TS, Mishra B, et al. Whole-genome shotgun optical mapping of Deinococcus radiodurans. Science. 1999; 285(5433):1558–62.
- Lim A, Dimalanta ET, Potamousis KD, Yen G, Apodoca J, Tao C, et al. Shotgun optical maps of the whole Escherichia coli O157:H7 genome. Genome Res. 2001; 11(9):1584–93.
- Zhou S, Bechner MC, Place M, Churas CP, Pape L, Leong SA, et al. Validation of rice genome sequence by optical mapping. BMC Genomics. 2007; 8:278.
- Zhou S, Wei F, Nguyen J, Bechner M, Potamousis K, Goldstein S, et al. A single molecule scaffold for the maize genome. PLoS Genet. 2009; 5(11):1000711.
- Antonacci F, Kidd JM, Marques-Bonet T, Teague B, Ventura M, Girirajan S, et al. A large and complex structural polymorphism at 16p12.1 underlies microdeletion disease risk. Nat Genet. 2010; 42(9):745–50.
- Ray M, Goldstein S, Zhou S, Potamousis K, Sarkar D, Newton MA, et al. Discovery of structural alterations in solid tumor oligodendroglioma by single molecule analysis. BMC Genomics. 2013; 14:505.
- Gupta A, Place M, Goldstein S, Sarkar D, Zhou S, Potamousis K, et al. Single-molecule analysis reveals widespread structural variation in multiple myeloma. Proc Natl Acad Sci USA. 2015; 112(25):7689–694. doi:10.1073/pnas.1418577112.
- Schwartz DC, Cantor CR. Separation of yeast chromosome-sized DNAs by pulsed field gradient gel electrophoresis. Cell. 1984; 37(1):67–75.
- Meng X, Benson K, Chada K, Huff EJ, Schwartz DC. Optical mapping of lambda bacteriophage clones using restriction endonucleases. Nat Genet. 1995; 9(4):432–8.
- Cai W, Aburatani H, Stanton VP, Housman DE, Wang YK, Schwartz DC. Ordered restriction endonuclease maps of yeast artificial chromosomes created by optical mapping on surfaces. Proc Natl Acad Sci USA. 1995; 92(11):5164–168.
- Hu X. Development of optical primer extension (ope), and, improvement and characterization of the optical mapping system. 1997. PhD thesis, New York University.
- Jing J, Reed J, Huang J, Hu X, Clarke V, Edington J, et al. Automated high resolution optical mapping using arrayed, fluid-fixed DNA molecules. Proc Natl Acad Sci USA. 1998; 95(14):8046–051.
- Skiadas J, Aston C, Samad A, Anantharaman TS, Mishra B, Schwartz DC. Optical PCR: genomic analysis by long-range PCR and optical mapping. Mamm Genome. 1999; 10(10):1005–9.
- Zhou S, Kile A, Bechner M, Place M, Kvikstad E, Deng W, et al. Single-molecule approach to bacterial genomic comparisons via optical mapping. J Bacteriol. 2004; 186(22):7773–782.
- Anantharaman T, Mishra B, Schwartz D. Genomics via optical mapping. III: Contiging genomic DNA. In: Proc Int Conf Intell Syst Mol Biol: 1999. p. 18–27. http://www.bioinformatics.org/texmed/.
- Valouev A, Li L, Liu YC, Schwartz DC, Yang Y, Zhang Y, et al. Alignment of optical maps. J Comput Biol. 2006; 13(2):442–62.
- Valouev A, Zhang Y, Schwartz DC, Waterman MS. Refinement of optical map assemblies. Bioinforma. 2006; 22(10):1217–24.
- Valouev A, Schwartz DC, Zhou S, Waterman MS. An algorithm for assembly of ordered restriction maps from single DNA molecules. Proc Natl Acad Sci USA. 2006; 103(43):15770–75.
- Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to Algorithms, 3rd edn. Cambridge, MA: The MIT Press; 2009.
- Jo K, Dhingra DM, Odijk T, de Pablo JJ, Graham MD, Runnheim R, Forrest D, Schwartz DC. A single-molecule barcoding system using nanoslits for DNA analysis. Proc Natl Acad Sci USA. 2007; 104(8):2673–678.
- Cao H, Hastie AR, Cao D, Lam ET, Sun Y, Huang H, et al. Rapid detection of structural variation in a human genome using nanochannel-based genome mapping technology. Gigascience. 2014; 3(1):34.
- Mendelowitz L, Pop M. Computational methods for optical mapping. Gigascience. 2014; 3(1):33.
- Zhou S, Potamousis K, Goldstein S, Place M, Bechner M, Ravindran P, et al. Optical mapping and nanocoding systems: Single molecule discovery for genome assembly and structural variation [abstract]. Plant and Animal Genome XXI. 2013:6653. https://pag.confex.com/pag/xxi/webprogram/Paper6653.html.
- Jing J, Lai Z, Aston C, Lin J, Carucci DJ, Gardner MJ, et al. Optical mapping of Plasmodium falciparum chromosome 2. Genome Res. 1999; 9(2):175–81.
- Zhou S, Deng W, Anantharaman TS, Lim A, Dimalanta ET, Wang J, et al. A whole-genome shotgun optical map of Yersinia pestis strain KIM. Appl Environ Microbiol. 2002; 68(12):6321–31.
- Zhou S, Kvikstad E, Kile A, Severin J, Forrest D, Runnheim R, et al. Whole-genome shotgun optical mapping of Rhodobacter sphaeroides strain 2.4.1 and its use for whole-genome shotgun sequence assembly. Genome Res. 2003; 13(9):2142–151.
- Wei F, Zhang J, Zhou S, He R, Schaeffer M, Collura K, et al. The physical and genetic framework of the maize B73 genome. PLoS Genet. 2009; 5(11):1000715.
- Reslewic S, Zhou S, Place M, Zhang Y, Briska A, Goldstein S, et al. Whole-genome shotgun optical mapping of Rhodospirillum rubrum. Appl Environ Microbiol. 2005; 71(9):5511–522.
- Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science. 2007; 318(5849):420–6.
- Zhang F, Gu W, Hurles ME, Lupski JR. Copy number variation in human health, disease, and evolution. Annu Rev Genomics Hum Genet. 2009; 10(1):451–81. doi:10.1146/annurev.genom.9.081307.164217. PMID: 19715442.
- Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, et al. Origins and functional impact of copy number variation in the human genome. Nature. 2010; 464(7289):704–12.
- Sudmant PH, Kitzman JO, Antonacci F, Alkan C, Malig M, Tsalenko A, et al. Diversity of human copy number variation and multicopy genes. Science. 2010; 330(6004):641–6.
- Biankin AV, Waddell N, Kassahn KS, Gingras MC, Muthuswamy LB, Johns AL, et al. Pancreatic cancer genomes reveal aberrations in axon guidance pathway genes. Nature. 2012; 491(7424):399–405.
- Alkan C, Coe BP, Eichler EE. Genome structural variation discovery and genotyping. Nat Rev Genet. 2011; 12(5):363–76.
- Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, et al. Real-time DNA sequencing from single polymerase molecules. Science. 2009; 323(5910):133–8.
- Branton D, Deamer DW, Marziali A, Bayley H, Benner SA, Butler T, et al. The potential and challenges of nanopore sequencing. Nat Biotechnol. 2008; 26(10):1146–53.