Comparison of variations detection between whole-genome amplification methods used in single-cell resequencing
- Yong Hou†1,
- Kui Wu†1,
- Xulian Shi†1, 2,
- Fuqiang Li†1,
- Luting Song†1,
- Hanjie Wu†1,
- Michael Dean3,
- Guibo Li1,
- Shirley Tsang4,
- Runze Jiang1,
- Xiaolong Zhang1, 5,
- Bo Li1,
- Geng Liu1,
- Niharika Bedekar6,
- Na Lu1, 2,
- Guoyun Xie1,
- Han Liang1,
- Liao Chang1,
- Ting Wang7,
- Jianghao Chen7,
- Yingrui Li1,
- Xiuqing Zhang8,
- Huanming Yang1, 9, 10,
- Xun Xu1Email author,
- Ling Wang7Email author and
- Jun Wang1, 9, 11Email author
© Hou et al. 2015
Received: 3 November 2014
Accepted: 13 June 2015
Published: 6 August 2015
Single-cell resequencing (SCRS) provides many biomedical advances in variations detection at the single-cell level, but it currently relies on whole genome amplification (WGA). Three methods are commonly used for WGA: multiple displacement amplification (MDA), degenerate-oligonucleotide-primed PCR (DOP-PCR) and multiple annealing and looping-based amplification cycles (MALBAC). However, a comprehensive comparison of variations detection performance between these WGA methods has not yet been performed.
We systematically compared the advantages and disadvantages of different WGA methods, focusing particularly on variations detection. Low-coverage whole-genome sequencing revealed that DOP-PCR had the highest duplication ratio, but an even read distribution and the best reproducibility and accuracy for detection of copy-number variations (CNVs). However, MDA had significantly higher genome recovery sensitivity (~84 %) than DOP-PCR (~6 %) and MALBAC (~52 %) at high sequencing depth. MALBAC and MDA had comparable single-nucleotide variations detection efficiency, false-positive ratio, and allele drop-out ratio. We further demonstrated that SCRS data amplified by either MDA or MALBAC from a gastric cancer cell line could accurately detect gastric cancer CNVs with comparable sensitivity and specificity, including amplifications of 12p11.22 (KRAS) and 9p24.1 (JAK2, CD274, and PDCD1LG2).
Our findings provide a comprehensive comparison of variations detection performance using SCRS amplified by different WGA methods. It will guide researchers to determine which WGA method is best suited to individual experimental needs at single-cell level.
KeywordsWhole genome amplification Single-cell resequencing Variations detection DOP-PCR MDA MALBAC Next-generation sequencing
Variations detection in single-cell resequencing (SCRS) research has enabled numerous advances in heterogeneity analysis , including cancer research [2–5], haplotype studies [6, 7], single-neuron sequencing , and detection of aneuploidy and unbalanced chromosomal rearrangement in pre-implantation screening/diagnosis [9, 10]. The direct sequencing of single cells has been limited by the picogram amount of DNA in individual cells; hence, whole genome amplification (WGA) is usually used to increase the amount of DNA before sequencing library preparation.
Currently, three WGA strategies are widely used for SCRS: degenerate-oligonucleotide-primed polymerase chain reaction (DOP-PCR) [11, 12], multiple displacement amplification (MDA) [13–15], and a combination of displacement pre-amplification and PCR amplification (marketed as PicoPlex kit by Rubicon Genomics [16, 17], MALBAC kit by Yikon Genomics [7, 18, 19]). These three WGA strategies differ in the enzymes used and in the experimental protocol design, which may yield different performances and biases, allowing for different specific applications. Quake et al. reported the comparison of CNVs detection, single-nucleotide variations (SNVs) detection and de-novo genome assembly using single-cell Escherichia coli DNA amplified by these three methods, with the corresponding bulk DNA as control . He et al. compared the performance of genome coverage efficiency, reproducibility, GC bias, genome coverage uniformity and CNVs detection of 11 hippocampal neurons also amplified by these three methods at low-coverage sequencing depth . Voet et al. reported the variations detection performance comparison using human cell line and blastomeres amplified by MDA and PicoPlex WGA .
However, although it is known that the WGA strategies may introduce artifacts and cause errors in variations detection , there is still no comprehensive comparison of the amplification bias and variations detection performance of the widely used commercialized kits completely based on these three strategies. To systematically evaluate the SCRS performance of commonly used WGA methods, we performed single-cell WGA using seven kits, with several experimental replicates for each kit, and then sequenced the whole genome of the successfully amplified DNA. We designed a narrowing-down strategy to investigate the amplification and variations detection performance cost-efficiently. First, we evaluated the mapping ratio, duplication ratio, and genome coverage uniformity using the single-cell low-coverage whole genome sequencing (LWGS) data or the extracted single-cell LWGS data. By evaluating the amplification quality during LWGS comparison, we selected the kits with best genome recovery sensitivity or uniformity. Using the further deep-sequenced whole genome sequencing (WGS) data amplified by the chosen kits, we further investigated the amplification bias and variations detection ability. In this way, we found that DOP-PCR methods had the highest duplication ratio and limited mapping efficiency and genome recovery - presumably as a result of the PCR process - but also that DOP-PCR methods had the best reproducibility and accuracy for detection of CNVs. In addition, we found that MDA and MALBAC had comparable genome recovery sensitivity, higher than that of DOP-PCR. Furthermore, we found that SCRS data from MDA also had comparable SNVs detection accuracy and CNVs detection accuracy to that of MALBAC. Our results provide a comprehensive comparison of variations detection performance at single-cell level between different WGA methods, and guidance for researchers to choose best suited WGA methods when performing variations detection at single-cell level.
For the cancer cell line data, we downloaded the MALBAC-amplified WGS data of five single cells derived from the SW480 human colon cancer cell line and corresponding bulk SW480 sequencing data from the NCBI Short Read Archive (SRA060929).
Finally, we obtained 10 single cells from a human gastric cancer cell line (called BGC823), amplified five by MALBAC and five by MDA-2, and sequenced them to ~0.5X depth on Lifetech Ion Proton Sequencer. We also obtained the WGS data of the bulk DNA of BGC823 as a control (Additional file 2: Table S2).
Comparison of low-coverage single-cell WGS performance
To gain more insights into the distinction of the mapping ratio between different WGA methods, we then investigated unmapped reads for their GC content, sequencing quality, and mapping quality. We found no significant difference in the GC content of unmapped reads between the methods (Additional file 5: Figure S1). However, we found a significantly different N ratio for the unmapped reads among the three WGA methods, with that for MALBAC being the highest (Bonferroni-corrected Mann–Whitney-Wilcoxon test, p < 0.001) and that for MDA being the lowest (Bonferroni-corrected Mann–Whitney-Wilcoxon test, p < 0.001) (Additional file 6: Figure S2). The lowest N ratio seen in MDA-amplified data could be explained by the high fidelity of the Phi29 polymerase. Also, the different amplification primers and the different sequencing quality may cause the N ratio distinction, either.
We compared the read distribution uniformity using 0.1X extracted data from all the YH single cells mentioned above. We simulated the theoretic sequencing depth distribution which followed the Poisson distribution (1000 dots, λ = 30) and was normalized by dividing by 30. We then found that the mean normalized sequencing depth distribution of DOP-1 data was most similar with the theoretic one, whereas all other amplification kits had observed bias (Fig. 2b). Overall, the mean normalized depth distribution biases for DOP-PCR methods, MDA-2 and MALBAC were lower than those of MDA-1 and MDA-3. DOP-PCR and MALBAC showed higher reproducibility than MDA (Additional file 7: Figure S3, Additional file 8: Table S5, Bonferroni-corrected Mann–Whitney-Wilcoxon test, p < 0.05).
Using the 0.1X extracted data, we then assessed the regional reads distribution in one genomic region in which there were no copy-number alterations in the YH-mix data (chr15: q11.1-q26.3). The read distribution for DOP-PCR, MDA-2, and MALBAC had better evenness and reproducibility than other WGA kits, and the MDA-1 read distribution demonstrated the highest bias (Bonferroni-corrected Mann–Whitney-Wilcoxon test, p < 0.001), as also found in a previous report  (Fig. 2c, Additional file 9: Figure S4 and Additional file 10: Figure S5).
In summary, SCRS data amplified by MDA or MALBAC had a lower duplication ratio, a higher mapping ratio, and a higher genome recovery than that from DOP-PCR. DOP-PCR, MDA-2 and MALBAC amplified data showed high uniformity and reproducibility. All three amplification strategies could potentially provide a uniform distribution of sequencing reads, which is important for CNVs analysis at the single-cell level.
Deep single-cell WGS and bias evaluation
Deep-sequencing statistics of single cells amplified by different kits
Number of mapped bases (bp)
Read mapping ratio (%)
Read duplication ratio (%)
GC content (%)
Mean depth (X)
Genome coverage (%)
To further determine the specific regional bias and GC bias introduced by WGA, we next used the deep-sequenced data to evaluate the normalized depth distribution in Alu and L1 repeat regions and regions with different GC content. We plotted the distributions of normalized depth in each Alu and L1 region, compared with the distribution in entire genome split into 100 kb windows. We observed that the normalized depth distribution of DOP-1 amplified data in Alu and L1 regions was significantly lower than that at whole-genome level, and DOP-1 amplified data had the greatest difference of normalized depth distribution between the repeat regions and whole genome among different WGA methods (Fig. 3b, Bonferroni-corrected Mann–Whitney-Wilcoxon test, p < 0.001). In addition, the read distribution of SCRS data from DOP-1 was influenced slightly by GC content, as the result of the unamplified control of YH-mix (Fig. 3c), whereas high GC content influenced the read distribution of the MDA-3 data (Bonferroni-corrected Mann–Whitney-Wilcoxon test, p < 0.001).
Using the deep-sequenced data, we performed extra comparison of assembly performance between MDA and MALBAC amplified data, and found that MALBAC may have comparable assembly quality as MDA but lower stability of the assembly than MDA by mitochondrial assembly (See details in Additional file 12: Supplementary Note).
Assessment of artifacts introduced by different WGA methods
Comparison of consensus genotypes and SNVs detection accuracy of deep-sequenced data amplified by MDA and MALBAC
Golden control for SW480 cells
Golden control for YH cells
To further investigate the potential biological impact of these discordant genotypes sites (present in single cells but not in the golden control) in SCRS data introduced by the WGA, we sorted out the discordant SNVs among these discordant genotype sites in three deep-sequenced single cells amplified by MDA-2, and then annotated these discordant SNVs using ANNOVAR  (Additional file 14: Table S8). We found that most of the altered genes that contained discordant SNVs occurred only in one of the three cells, and only ~ 4 % of the altered genes were shared among all of the three cells (Additional file 15: Figure S6), indicating that the artifacts introduced by the MDA-1 were unlikely to influence the gene category analysis.
Because MDA frequently introduced chimeras , we used deep WGS data of two single YH cells (MDA-2_47 and MDA-2_66) and another two sets of 10–20 single YH cells (MDA-2_M6 and MDA-2_M16, Additional file 3: Table S3) to evaluate the amplification chimeras. We performed breakpoints identification using CREST  in these samples as well as YH-mix as a control (Methods). We defined the chimeras as the breakpoints appeared only in the single-cell data rather than in YH-mix. Of the different types of breakpoints such as the insertion (INS), deletion (DEL), inversion (INV), intra-chromosomal translocation (ITX) and inter-chromosomal translocation (CTX), we found that chimeric ITX (Additional file 16: Figure S7) was the dominant chimera type (82.08 %, Fig. 3e). In addition, we found a significant difference of length distribution between true ITXs (shared by the YH-mix) and chimeric ITXs in single cells (Fig. 3f), suggesting that the chimeras tended to be produced by neighboring amplicons randomly connecting on the same chromosome, as previously reported . The percentages of other chimera types, such as chimeric CTX (Additional file 17: Figure S8), deletion, insertion and inversion, were 1.13 %, 8.09 %, 5.07 %, and 3.68 %, respectively.
Single-cell SNVs and CNVs detection accuracy of the WGA methods
Comparison of consensus genotypes and SNVs detection accuracy of deep-sequenced data amplified by MDA and MALBAC
YH-mix (Unamplified control)
777,908 (5563/390,038/37 %)
1,747,004 (390,107/84 %)
2,524,912 (395,670/58 %)
1,807,282 (6517/14,124/87 %)
1,562,036 (14,177/96 %)
3,369,318 (20,694/91 %)
1,651,733 (6347/55,158/80 %)
1,587,456 (55,195/95 %)
3,239,189 (61,542/87 %)
To further investigate the power of CNVs detection using real SCRS data amplified by MALBAC and MDA-2, we amplified 10 additional cells from a human gastric adenocarcinoma cell line (BGC823) using MALBAC (5 cells) and MDA-2 (5 cells) respectively, and sequenced them on Lifetech Ion Proton sequencer. BGC823 bulk sequencing data was used as the unamplified data control. We also introduced 7 YH single cells data which were amplified by MALBAC (3 cells) or MDA-2 (4 cells) and then sequenced on Lifetech Ion Proton sequencer. We found no recurrent CNVs (≥1 Mb) in the single YH cells and YH-mix, and did not identify any obvious different CNVs compared with YH cells sequenced on Illumina platform (Additional file 2: Table S2), indicating that different sequencing platforms (Illumina and Lifetech Ion Proton Sequencers) made few impacts on the CNVs detection comparison. We observed 213 major CNVs larger than 1 Mb in the bulk sequencing data of BGC823 (Additional file 22: Table S11), and most of the major CNVs in bulk sequencing data of BGC823 overlapped with CNVs in single BGC823 cells, including amplification regions that include the oncogene KRAS (12p11.22-p11.21) and the recently reported recurrent amplification at 9p24.1 at the locus containing JAK2, CD274, and PDCD1LG2 (which augments the anti-tumor immune response)  (Fig. 4b, Additional file 23: Figure S11a, Additional file 24: Figure S11b and Additional file 22: Table S11). Treating the bulk sequencing data of BGC823 as control, we estimated that the MALBAC-amplified BGC823 SCRS data achieved a mean sensitivity of 84.72 % (SD 0.82 %) and a mean specificity of 85.18 % (SD 1.61 %), while MDA-2 amplified BGC823 SCRS data achieved a mean sensitivity of 85.86 % (SD 10.27 %) and a mean specificity of 81.18 % (SD 8.90 %), indicating that MALBAC provided a higher specificity and slightly lower sensitivity than MDA-2 (Additional file 25: Table S12). This result is different with our simulated data, which may be caused by difference in CNVs complexity between different cancer cell lines. In addition, MALBAC showed a higher reproducibility among replicates than MDA-2 in CNVs detection (Additional file 26: Table S13, Mann–Whitney-Wilcoxon test, p < 0.01), which is consistent with simulation data result. However, taking our findings on CNVs together, we concluded that the SCRS data from both MALBAC and MDA-2 could robustly identify CNVs larger than 1 Mb.
Here, we provided a comprehensive comparison of single-cell variations detection performance basing on different WGA methods. We first performed LWGS analysis of single cells using three major WGA methods: MDA, DOP-PCR, and MALBAC. The results indicated that SCRS data generated by MDA-2 (MDA using the Qiagen REPLI-g Single Cell Kit) presented higher genome recovery sensitivity than those generated by MALBAC and DOP-PCR with the same sequencing depth. SCRS data from DOP-PCR had the lowest amplification bias along the entire genome, as well as high reproducibility and the highest single-cell CNVs detection accuracy (>90 %). In contrast to previous reports , our analysis showed that MDA-2 and MALBAC had similarly favorable detection accuracy and efficiency for single-cell SNVs and CNVs detection, although MDA and MALBAC introduced FP sites, ADO sites, and amplification bias.
DOP-PCR based sequencing data showed high duplication ratio and limited genome recovery sensitivity in our study, indicating that this method may not be suitable for detecting additional SNVs and structural variations at deep sequencing depth. However, DOP-PCR has also been reported to accurately detect aneuploidy and unbalanced chromosomal rearrangements, achieving 99.63 % sensitivity and 97.71 % specificity for detecting CNVs larger than 1 Mb . Considering our result together, we suggest that DOP-PCR methods are suitable for studies focusing on the analysis involving number of sequencing reads, such as CNVs or aneuploidy detection in pre-implantation screening/diagnosis, cancer research or other disease research.
We found that MALBAC sequencing data had intermediate genome recovery sensitivity, and uniformity for CNVs detection. A previous study showed that MALBAC was advantageous for SNVs and CNVs detection in SCRS data compared with MDA (based on the kit called MDA-1 here) . However, when we compared the SNVs and CNVs detection performance of the MDA-2 kit (an optimized version of the MDA-1 kit), we found that the MDA-2 data had higher genome recovery than the MALBAC data with the same sequencing depth (Additional file 4: Table S4, Additional file 27: Table S14). More importantly, we found that the MDA-2 data had a comparable SNVs detection accuracy and CNVs detection accuracy with those of the MALBAC data; and this accuracy was greater than that indicated by a previous report for MDA-1 . Taken together, these data suggest that optimization of MDA experimental protocols may significantly improve SNVs and CNVs detection in SCRS data. Thus, we conclude that both MDA and MALBAC can be used for research that require low duplication ratio and high genome coverage, for example, the detection of SNVs in disease research. In addition, if researchers need to perform SNVs and CNVs detection at the same time in some fields like tumor heterogeneity and evolution research, we recommend using MDA-2 and/or MALBAC because of their higher efficiency and accuracy in variations detection. However, MALBAC may have higher reproducibility of uniformity and of CNVs detection performance than MDA-2, making MALBAC more conducive to the heterogeneity research related to variations detection.
Although MDA method would introduce chimeras during WGS, our analysis indicated that chimeras of MDA-2 had potential to detect the breakpoints of structural variations for specific types of structural variations at the single-cell level, such as inter-chromosomal structural variations, with the possibility of increasing the specificity by reducing the number of random chimeras in an increasing number of replicate cells.
A remaining challenge for variations detection at the single-cell level is the cost. Unlike bulk sequencing, single-cell analysis needs to amplify the whole genome of the single cell first. The cost, especially for a large number of cells to be amplified before sequencing, will be considerable when taking the failure ratio into consideration. The MDA and DOP-PCR are the most widely used WGA methods even before the single-cell sequencing occurs, and their costs are relatively low, especially if using homemade reagents following the freely available protocol. However, MALBAC is a new method with more complex experimental procedure that was developed especially for single-cell sequencing, and thus the cost will be higher than that of MDA and DOP. We believe that more detailed published protocols and more users will help further reduce the cost of MALBAC for single-cell amplification. Another approach that may reduce the cost significantly for all three amplification methods could be microfluidics, which would limit the reaction into a very small volume (several nanoliters) for a large number of amplified single cells .
Our results provide a comprehensive comparison of variations detection performance in SCRS with different WGA methods. It will guide researchers to choose the most optimal WGA method to perform specific single cell sequencing project in research areas such as analysis of circulating tumor cells and tumor evolution, and pre-implantation screening and diagnosis.
Sample preparation before WGA
A total of 39 single cells were collected in our study, 29 from a lymphoblastoid cell line (YH cell line) established from the first Asian genome donor , the rest from a widely known gastric cancer cell line, BGC823. Corresponding bulk DNA was extracted as an unamplified control. The BGC823 cell line was provided by Youyong Lv at Beijing Cancer Hospital. All samples and experimental protocols were approved by the Institutional Review Board of BGI-Shenzhen.
Single cells were isolated as described previously . Briefly, following sufficient dissociation and dilution of cells, single cells were randomly picked up using a mouth pipette under a microscope and washed three times in phosphate-buffered saline to avoid exogenous DNA contamination, then transferred into a PCR tube. Single-cell isolation was confirmed by microscopy to ensure that only one cell was inside each tube.
WGA of single-cell genomic DNA with different WGA methods
WGA was performed using seven different commercial kits based on MDA, DOP-PCR or MALBAC strategies. The kits used were Qiagen REPLI-g Mini Kit (MDA-1), Qiagen REPLI-g Single Cell Kit (MDA-2), GE Healthcare illustra GenomiPhi V2 DNA Amplification Kit (MDA-3), GenomePlex® Single Cell WGA Kit (DOP-1), Silicon Biosystem Ampli1™ WGA Kit (DOP-2), NEB Single Cell WGA Kit (DOP-3), and Yikon Genomics Single Cell WGA Kit (MALBAC). All experimental operations followed the manufacturers’ protocols strictly and without any modification.
Library construction and whole-genome DNA sequencing
The Illumina sequencer and LifeTech Ion Proton sequencer were used as the sequencing platforms in this study. To construct the library for each cell on the Illumina platform, 1–2 μg amplified genomic DNA was used. After fragmentation, the ‘A’ adaptor was ligated to each fragment. Next, 10 cycles of PCR using 8-base barcode primers was performed. After the DNA concentration and insert size measurement, the libraries were processed for paired-end high-throughput sequencing on Illumina HiSeq2000/HiSeq2500/MiSeq sequencer with a mean depth of ~ 0.5X. Libraries with outstanding performance in either recovery sensitivity or evenness of low-coverage sequencing were further deeply sequenced to around 30X. For LifeTech Ion Proton sequencing, a Bioruptor instrument was used to fragment DNA. The desired size of DNA fragments were obtained and ligated with Ion Proton A and P1 adaptors at each end, and then selected using E-Gel EX 2 % Gel (Invitrogen, Carlsbad, CA) for 150- to 200-bp fragments. The fragments were amplified, and the DNA was purified with Agencourt AMPure XP beads (Beckman Coulter Genomics, High Wycombe, UK). As assessed by the BioAnalyzer High Sensitivity LabChip Agilent, the resulting library had a median fragment size of 180 bp. After dilution, emulsion PCR reactions were set up for each nanoball in the library. Before the nanoballs were placed onto the ION PI chip, a sequencing primer and polymerase were added to the final enriched spheres.
Read processing and mapping
Paired-ended reads generated by Illumina sequencer
Published data  of SW480 SCRS data and bulk SW480 sequencing data were downloaded from the NCBI Short Read Archive with accession no. SRA060929. The WGA primer was trimmed by Trimmomatic  from the 5′ ends of each read: 30 bases for YH cells amplified by DOP-PCR [11, 12] and 35 bases for SW480 and YH cells amplified by MALBAC [7, 18, 19]. Reads of YH cells amplified by MDA [13–15] did not need to be trimmed. Reads were then mapped to the human genome reference (Hg19, Build37) by BWA  (version 62) and SAMtools  (version 0.1.18), and sorted and marked as duplicates by Picard  (version 1.72). 0.1X data was then randomly down-sampled from the alignment results by Picard for each sample.
Single-ended reads generated by LifeTech Ion Proton sequencer
Thirty five bases of WGA primer were trimmed by Trimmomatic from the 5′ ends of each read of BGC and YH single cells amplified by MALBAC . Reads were then mapped to the human genome reference (Hg19, Build37) by TMAP  (version 3.6.40) and SAMtools (version 0.1.18), and sorted and marked as duplicates by Picard (version 1.72).
The alignment result was checked for quality by Qualimap  (version 0.6).
For each deeply sequenced sample, low-quality alignments (mapping quality less than 1, unmapped, duplicates, and non-unique) were filtered using BamTools . Filtered alignments were then processed by GATK  (version 2.3–9) with the options ‘Local Realignment around Indels’ and ‘Base Quality Score Recalibration’. SNVs were called at any callable sites by UnifiedGenotyper (a variation caller of GATK), and trained by a Gaussian mixture model using GATK. All the low-quality SNVs and false-positive SNVs were identified and then filtered based on the log odds ratio under the Gaussian mixture model.
Excluding sequencing errors
In the LWGS study, we only used the Illumina sequencing data to perform the comparison. The sequencing error rates from the Illumina Miseq and Hiseq sequencer have been reported previously . In the comparison, we directly mapped the sequencing reads to the hg19 human reference genome using BWA  with mismatches allowed. As with most variations calling that used resequencing data, we did not correct the sequencing errors of the raw sequencing reads; instead, we excluded the low-quality reads, sorted the mapping data and directly calculated the mapping ratio. To evaluate the bias in the comparison caused by the correctable sequencing errors from Hiseq and Miseq, we extracted the same amount of the sequencing reads amplified by the same kit but sequenced on Hiseq 2000 or Miseq, respectively. We found that there was no significant difference in the mapping ratio or duplication ratio between the cells sequenced by the Hiseq 2000 or Miseq (Additional file 28: Table S15). Thus, we inferred that the conclusions we generated from the LWGS data were not significantly biased by the correctable sequencing errors.
For each deeply sequenced sample, low-quality alignments (mapping quality less than 1, unmapped, duplicates, and non-unique) were filtered using BamTools ;
Alignments were processed by GATK  (version 2.3–9) with the options ‘Local Realignment around Indels’ and ‘Base Quality Score Recalibration’;
SNVs were called at any callable sites by UnifiedGenotyper (a Bayesian model based variation caller of GATK), and trained by a Gaussian mixture model using GATK (this step filtered out the influence of the sequencing errors and the mapping errors);
All the low-quality SNVs and false-positive SNVs were identified and then filtered based on the log odds ratio under the Gaussian mixture model;
The ADO ratio and false-positive ratio were calculated, by comparing the genotypes of single cells with those of the corresponding bulk sequencing data sequenced on the same sequencer. In this way, after the SNVs calling and filtering by the Bayesian model and Gaussian mixture model, we ensured that the sequencing errors did not bias the comparison results.
Single-nucleotide artifacts analysis
We defined different ‘golden controls’ for different cell type data. For the YH single cells, the ‘golden control’ was defined as the concordant genotypes set overlapped between YH-mix data and a commercial 2.5 M Illumina Omni SNP Chip. And for the SW480 single cells, we first obtained an overlap set of concordant genotypes between the two SW480 bulk (SW480-SCD and SW480-HEC) sequencing data to reduce sequencing errors, and defined the ‘golden control’ as the intersection set of genotypes between the ‘overlap set’ and the commercial 2.5 M Illumina Omni SNP Chip. We clustered the genotyped alleles of both the ‘golden control’ and corresponding single cells into three categories: HOMref (homozygotes where both alleles were identical to the hg19 reference genome), HOMmut (homozygotes where both alleles were different with the hg19 reference genome), and HETref (heterozygotes where only one allele was identical to the hg19 reference genome). We formulated the counts of genotyped alleles of single cell sequencing sites that were consistent with ‘golden control’ at both alleles, at one allele, or that were inconsistent at both alleles as 2, 1, and 0, respectively.
For each category (HOMref, HOMmut and HETref), we calculated the consensus genotypes detection efficiency (CGDE) as the ratio of counts of consensus genotypes detected in single cell to those detected in corresponding control. Concordant ratio was defined as the ratio of counts of genotypes which both alleles were identical to the golden control to the genotypes detected in single cell for each category. We then calculated the mean CGDE and concordant ratio of all categories for each single cell.
SNVs detection efficiency, ADO, and false-positive ratio calculation
SNVs detection efficiency was calculated as the ratio of the count of detected SNVs in a given single cell (minus the number of false-positive SNVs) to those in the bulk DNA. The ADO was defined as the non-amplification occurred in alternative alleles present in heterozygous sites. The false positive was defined as the SNVs in single cell sequencing data but not present in the bulk sequencing data. Both the ADO and false positive ratio were calculated by comparing the single cell sequencing data with bulk control sequencing data.
Analysis of the chimera effect
To identify the chimeras at single-cell level, we identified breakpoints using CREST  both in the MDA-2 amplified samples and YH-mix. Taking the YH-mix as the control, the true breakpoints in MDA-2 amplified samples were defined as those overlapped with YH-mix if they were of the same type and were not further apart than a threshold of 100 bp: the rest were considered as chimeras.
CNVs simulation on the YH samples in silicon
Shared regions (≥1 Mb) between SW480-SCD and SW480-HEC with concordant CNVs (the copy number was assumed to be N) were selected as candidate regions for further CNVs simulation. The copy number ratio (assumed to be R) of the candidate region was formulated as the copy number of the region divided by 2 (R = N/2). For each YH sample (YH single cells and YH-mix control data), the simulated reads count (Ks) was defined as the product of the reads count of a bin (Kr) and the copy number ratio (R) of the corresponding candidate region. (Ks = Kr × R). The modified pipeline was then used to call CNVs in the simulated data for each sample.
Data simulation and CNVs calling
Simulated single-ended reads (50 bp) from hg19 were mapped to hg19 by bowtie  (version 1.0.0). 10,000 genomics bins were used in the analysis.
Reads from the 0.1X LWGS alignments (BAM format) were converted to FASTQ format through the single-ended mode by BEDTools , and then re-mapped to hg19 reference genome by bowtie. Bases were trimmed from the 5′ end of each read to ensure that each read was 50 bp long. Raw reads generated by Lifetech Ion Proton sequencer (BAM format) were converted to FASTQ format by BEDTools, and then were trimmed by Trimmomatic to an effective length (50 bp plus length of WGA primer) from the 3′ end of the reads. The resulting alignments were re-mapped to hg19 reference genome by bowtie, and then bases were again trimmed from the 5′ end of each read to ensure each read was 50 bp long.
For each sample, segments were detected by DNAcopy , a circular binary segmentation (CBS) algorithm based CNVs detection tool. The density of the segment ratio of all bins within autosomes was plotted, and the mode of the segment ratio was set corresponding to a copy number of two.
Sensitivity and specificity were calculated (following ) as following:
Simulation on the YH genome in silicon
Shared regions (≥1 Mb) between SW480-SCD and SW480-HEC with concordant CNV (the copy number was assumed to be N) were selected as candidate regions for further CNV simulation. The copy number ratio (assumed to be R) of the candidate region is the copy number of the region divided by 2 (R = N/2). For each of the YH samples, to get the simulated reads count (Ks), the reads count of a bin (Kr) was multiplied by the copy number ratio (R) of the corresponding candidate region. (Ks = Kr × R). The modified pipeline was then used to call CNVs in the simulated data for each sample.
We performed the Mann–Whitney-Wilcoxon test to assess the variations in cases of comparisons between two groups. Pearson correlations were calculated to investigate the similarity between metrics. To control the family-wise error rate, we performed the Bonferroni correction when multiple comparisons were conducted simultaneously.
Availability of supporting data
GenomePlex® Single Cell WGA Kit (St. Louis, MO, USA)
Silicon Biosystem Ampli™ WGA Kit (Silicon Biosystems, Bologna, Italy)
NEB Single Cell WGA Kit (New England Biolabs, Ipswich, MA, USA)
Qiagen REPLI-g Mini Kit (Qiagen, Düsseldorf, Germany)
Qiagen REPLI-g Single Cell Kit (Qiagen, Düsseldorf, Germany)
GE Healthcare illustra GenomiPhi V2 DNA Amplification Kit (GE Healthcare, Little Chalfont, Buckinghamshire, England)
Yikon Genomics Single Cell Whole Genome Amplification Kit
Copy number variations
Degenerate oligonucleotide-primed PCR
Heterozygotes where only one allele is identical to the reference
Homozygotes where both alleles differ from the reference
Homozygotes where both alleles are identical to the reference
Low-coverage whole-genome sequencing
Multiple annealing and looping-based amplification cycles
Multiple displacement amplification
Whole genome sequencing
We thank L Goodman for revising the manuscript and Youyong Lv for providing the BGC823 gastric cancer cell line. This work was supported by the National High Technology Research and Development Program of China - 863 Program (NO.2012AA02A201), the Guangdong Innovative Research Team Program (2009010016), the Guangdong Enterprise Key Laboratory of Human Disease Genomics (No. 2011A060906007), National Basic Research Program of China (973 program No. 2011CB809202 and 2011CB809203), the Major Industrial Technology Research Program of Shenzhen (program number BGI20100001), the Key Laboratory Project Supported by Shenzhen City (grants CXB201108250096A and ZDSYS20140509153457495) and the Shenzhen Key Laboratory of China National GeneBank-Shenzhen. This project was also supported by the National Natural Science Fund (81272899 and 81172510) and Discipline booster plan of Xi Jing Hospital (XJZT12Z07). We also acknowledge the Ole Rømer grant from the Danish Natural Science Research Council, the Danish National Research Foundation, National Natural Science Foundation of China, and funds from the Shenzhen Municipal Government and the Local Government of Yantian District of Shenzhen.
- Macaulay IC, Voet T. Single cell genomics: advances and future perspectives. PLoS Genet. 2014;10(1):e1004126. doi:10.1371/journal.pgen.1004126.View ArticlePubMedPubMed CentralGoogle Scholar
- Navin N, Kendall J, Troge J, Andrews P, Rodgers L, McIndoo J, et al. Tumour evolution inferred by single-cell sequencing. Nature. 2011;472(7341):90–4. doi:10.1038/nature09807.View ArticlePubMedPubMed CentralGoogle Scholar
- Hou Y, Song L, Zhu P, Zhang B, Tao Y, Xu X, et al. Single-cell exome sequencing and monoclonal evolution of a JAK2-negative myeloproliferative neoplasm. Cell. 2012;148(5):873–85. doi:10.1016/j.cell.2012.02.028.View ArticlePubMedGoogle Scholar
- Xu X, Hou Y, Yin X, Bao L, Tang A, Song L, et al. Single-cell exome sequencing reveals single-nucleotide mutation characteristics of a kidney tumor. Cell. 2012;148(5):886–95. doi:10.1016/j.cell.2012.02.025.View ArticlePubMedGoogle Scholar
- Caldas C. Cancer sequencing unravels clonal evolution. Nat Biotechnol. 2012;30(5):408–10. doi:10.1038/nbt.2213.View ArticlePubMedGoogle Scholar
- Wang J, Fan HC, Behr B, Quake SR. Genome-wide single-cell analysis of recombination activity and de novo mutation rates in human sperm. Cell. 2012;150(2):402–12. doi:10.1016/j.cell.2012.06.030.View ArticlePubMedPubMed CentralGoogle Scholar
- Lu S, Zong C, Fan W, Yang M, Li J, Chapman AR, et al. Probing meiotic recombination and aneuploidy of single sperm cells by whole-genome sequencing. Science. 2012;338(6114):1627–30. doi:10.1126/science.1229112.View ArticlePubMedPubMed CentralGoogle Scholar
- Evrony GD, Cai X, Lee E, Hills LB, Elhosary PC, Lehmann HS, et al. Single-neuron sequencing analysis of L1 retrotransposition and somatic mutation in the human brain. Cell. 2012;151(3):483–96. doi:10.1016/j.cell.2012.09.035.View ArticlePubMedPubMed CentralGoogle Scholar
- Van der Aa N, Zamani Esteki M, Vermeesch JR, Voet T. Preimplantation genetic diagnosis guided by single-cell genomics. Genome Med. 2013;5(8):71. doi:10.1186/gm475.View ArticlePubMedPubMed CentralGoogle Scholar
- Yin X, Tan K, Vajta G, Jiang H, Tan Y, Zhang C, et al. Massively parallel sequencing for chromosomal abnormality testing in trophectoderm cells of human blastocysts. Biol Reprod. 2013;88(3):69. doi:10.1095/biolreprod.112.106211.View ArticlePubMedGoogle Scholar
- Telenius H, Carter NP, Bebb CE, Nordenskjold M, Ponder BA, Tunnacliffe A. Degenerate oligonucleotide-primed PCR: general amplification of target DNA by a single degenerate primer. Genomics. 1992;13(3):718–25.View ArticlePubMedGoogle Scholar
- Cheung VG, Nelson SF. Whole genome amplification using a degenerate oligonucleotide primer allows hundreds of genotypes to be performed on less than one nanogram of genomic DNA. Proc Natl Acad Sci USA. 1996;93(25):14676–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Paez JG, Lin M, Beroukhim R, Lee JC, Zhao X, Richter DJ, et al. Genome coverage and sequence fidelity of phi29 polymerase-based multiple strand displacement whole genome amplification. Nucleic Acids Res. 2004;32(9):e71. doi:10.1093/nar/gnh069.View ArticlePubMedPubMed CentralGoogle Scholar
- Dean FB, Hosono S, Fang L, Wu X, Faruqi AF, Bray-Ward P, et al. Comprehensive human genome amplification using multiple displacement amplification. Proc Natl Acad Sci USA. 2002;99(8):5261–6. doi:10.1073/pnas.082089499.View ArticlePubMedPubMed CentralGoogle Scholar
- Spits C, Le Caignec C, De Rycke M, Van Haute L, Van Steirteghem A, Liebaers I, et al. Whole-genome multiple displacement amplification from single cells. Nat Protoc. 2006;1(4):1965–70. doi:10.1038/nprot.2006.326.View ArticlePubMedGoogle Scholar
- Langmore JP. Rubicon Genomics, Inc. Pharmacogenomics. 2002;3(4):557–60. doi:10.1517/146224126.96.36.1997.View ArticlePubMedGoogle Scholar
- Leung K, Zahn H, Leaver T, Konwar KM, Hanson NW, Page AP, et al. A programmable droplet-based microfluidic device applied to multiparameter analysis of single microbes and microbial communities. Proc Natl Acad Sci USA. 2012;109(20):7665–70. doi:10.1073/pnas.1106752109.View ArticlePubMedPubMed CentralGoogle Scholar
- Zong C, Lu S, Chapman AR, Xie XS. Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science. 2012;338(6114):1622–6. doi:10.1126/science.1229164.View ArticlePubMedPubMed CentralGoogle Scholar
- Hou Y, Fan W, Yan L, Li R, Lian Y, Huang J, et al. Genome analyses of single human oocytes. Cell. 2013;155(7):1492–506. doi:10.1016/j.cell.2013.11.040.View ArticlePubMedGoogle Scholar
- de Bourcy CF, De Vlaminck I, Kanbar JN, Wang J, Gawad C, Quake SR. A quantitative comparison of single-cell whole genome amplification methods. PLoS One. 2014;9(8):e105585. doi:10.1371/journal.pone.0105585.View ArticlePubMedPubMed CentralGoogle Scholar
- Ning L, Wang G, Li Z, Hu W, Hou Q, Tong Y, et al. Quantitative comparison of single-cell sequencing methods using hippocampal neurons. 2014. doi:10.1101/004291.
- Voet T, Kumar P, Van Loo P, Cooke SL, Marshall J, Lin ML, et al. Single-cell paired-end genome sequencing reveals structural variation per cell cycle. Nucleic Acids Res. 2013;41(12):6119–38. doi:10.1093/nar/gkt345.View ArticlePubMedPubMed CentralGoogle Scholar
- Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, et al. The diploid genome sequence of an Asian individual. Nature. 2008;456(7218):60–5. doi:10.1038/nature07484.View ArticlePubMedPubMed CentralGoogle Scholar
- Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–60. doi:10.1093/bioinformatics/btp324.View ArticlePubMedPubMed CentralGoogle Scholar
- Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38(16):e164. doi:10.1093/nar/gkq603.View ArticlePubMedPubMed CentralGoogle Scholar
- Lasken RS, Stockwell TB. Mechanism of chimera formation during the Multiple Displacement Amplification reaction. BMC Biotechnol. 2007;7:19. doi:10.1186/1472-6750-7-19.View ArticlePubMedPubMed CentralGoogle Scholar
- Wang J, Mullighan CG, Easton J, Roberts S, Heatley SL, Ma J, et al. CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nat Methods. 2011;8(8):652–4. doi:10.1038/nmeth.1628.View ArticlePubMedPubMed CentralGoogle Scholar
- Baslan T, Kendall J, Rodgers L, Cox H, Riggs M, Stepansky A, et al. Genome-wide copy number analysis of single cells. Nat Protoc. 2012;7(6):1024–41. doi:10.1038/nprot.2012.039.View ArticlePubMedGoogle Scholar
- Cancer Genome Atlas Research N. Comprehensive molecular characterization of gastric adenocarcinoma. Nature. 2014;513(7517):202–9. doi:10.1038/nature13480.View ArticleGoogle Scholar
- Zhang C, Zhang C, Chen S, Yin X, Pan X, Lin G, et al. A single cell level based method for copy number variation analysis by low coverage massively parallel sequencing. PLoS One. 2013;8(1):e54236. doi:10.1371/journal.pone.0054236.View ArticlePubMedPubMed CentralGoogle Scholar
- Yu Z, Lu S, Huang Y. Microfluidic whole genome amplification device for single cell sequencing. Anal Chem. 2014;86(19):9386–90. doi:10.1021/ac5032176.View ArticlePubMedGoogle Scholar
- Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114–20. doi:10.1093/bioinformatics/btu170.View ArticlePubMedPubMed CentralGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–9. doi:10.1093/bioinformatics/btp352.View ArticlePubMedPubMed CentralGoogle Scholar
- Picard: http://broadinstitute.github.io/picard/.
- TMAP: https://github.com/iontorrent/TS/tree/master/Analysis/TMAP/.
- Garcia-Alcalde F, Okonechnikov K, Carbonell J, Cruz LM, Gotz S, Tarazona S, et al. Qualimap: evaluating next-generation sequencing alignment data. Bioinformatics. 2012;28(20):2678–9. doi:10.1093/bioinformatics/bts503.View ArticlePubMedGoogle Scholar
- Barnett DW, Garrison EK, Quinlan AR, Stromberg MP, Marth GT. BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics. 2011;27(12):1691–2. doi:10.1093/bioinformatics/btr174.View ArticlePubMedPubMed CentralGoogle Scholar
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297–303. doi:10.1101/gr.107524.110.View ArticlePubMedPubMed CentralGoogle Scholar
- Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, Hegarty R, et al. Characterizing and measuring bias in sequence data. Genome Biol. 2013;14(5):R51. doi:10.1186/gb-2013-14-5-r51.View ArticlePubMedPubMed CentralGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25. doi:10.1186/gb-2009-10-3-r25.View ArticlePubMedPubMed CentralGoogle Scholar
- Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–2. doi:10.1093/bioinformatics/btq033.View ArticlePubMedPubMed CentralGoogle Scholar
- DNAcopy: http://www.bioconductor.org/packages/2.12/bioc/html/DNAcopy.html.
- Hou Y WK, Shi X, Li F, Song L, Wu H, et al. Single-cell sequencing data using DOP-PCR, MDA and MALBAC whole genome amplification methods. GigaScience Database. 2015. http://dx.doi.org/10.5524/100115.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.