Case report and sample collection
The patient was a 57-year-old male, diagnosed with primary muscle-invasive transitional cell carcinoma (MI-TCC) of the bladder according to the 2004 World Health Organization (WHO)/ International Society of Urological Pathology (ISUP) grading systems. The details of diagnosis were as follows ( Additional file 1: Figures S1 and Additional file 3: Figure S2): a 7 × 4 × 4 cm papillary tumor in the trigone of the urinary bladder, invaded the bladder wall – the cut surface was grey-white and brittle, and classified as stage II (T2-N0M0) under microscopic examination. Tumor invasion into shallow muscularis of bladder wall was observed, but it did not penetrate deep into the muscularis and serous layer. Furthermore the tumor did not invade the prostate tissue or cause a bilateral ureteral obstruction, and there was no metastasis in six pelvic lymph nodes. The hematoxylin eosin-stained sections prepared using the cancerous tissues were microscopically evaluated by two independent pathologists ( Additional file 3: Figure S2). Afterward the primary tumor sample and matched para-carcinoma tissue were obtained from a surgery-resected sample from this MI-TCC patient at the Peking University Shenzhen Hospital. Informed written consent was obtained from the study participant. The studies were conducted in accordance with the Declaration of Helsinki II and were approved by the local Ethical Committees.
Single-cell isolation and WGA
Every step during the experiment was reduced to a strict minimum. With sufficient dispersion and cascade dilution of cells, single cells were randomly isolated from collagenase I and IV digested tumor or para-carcinoma tissues into PCR-ready tubes using an inverted microscope and a self-made mouth-controlled, fine hand-drawn micro-capillary pipetting system. The single cell isolation was visually confirmed by photograph under microscope. Afterward, WGA of single-cell DNA were performed using the REPLI-g® Mini Kit (Qiagen GmbH, Hilden, Germany) according to the manufacturer’s instruction, using a no cell reaction as a negative control, and a reaction of human tissue genomic DNA as positive control.
Quantitation and genome-integrity assessment of the WGA products
The DNA concentration of the WGA products were measured with highly sensitive fluorescence-based Quant-iTTM assays using the QubitTM Quantitation Platform (Life Technologies, Invitrogen, Carlsbad, CA, USA) according to the manufacturer’s instruction. The genomic integrity of the qualified products (> 60 ng/μl) were then assayed with their amplification represented by 10 housekeeping-gene-PCR tests, and the 10 genes (PRDX6, RPL37a, ADD1, ARHGEF7, EIF2B2, PSMD7, PSMB6, MC2R, BCAT2, and ATP5O) were interspersed across 10 different chromosomes. Afterward, WGA products of best performance from both housekeeping PCR (≥8/10) and Qubit assays (> 60 ng/μl) were selected for downstream experiments. Every step was performed along with a sample of genomic DNA from human tissue as a positive control and a no template reaction as a negative control, respectively.
Whole exome capture and sequencing of samples
High molecular weight genomic DNA was extracted from the primary tumor and matched para-carcinoma tissue, respectively. And for both DNA samples (single cells and matched tissues), whole exome capture was accomplished based on liquid phase hybridization of sonicated 2 μg genomic DNA to the bait cRNA library synthesized on magnetic beads using the SureSelect® Human All Exon 50 Mb kit (Agilent Technologies, Santa Clara, CA, USA) according to the manufacturer’s protocol. The captured targets were subjected to massive sequencing using the Illumina HiSeq2000 with the paired-end 100 bp read option, according to the manufacturer’s instruction.
Public dataset access
Human (Homo sapiens) reference genome sequence (Hg18) and its annotation files (e.g., dbSNP v128) were downloaded from the University of California Santa Cruz Genome Bioinformatics site . The targeted region files of exome capture were downloaded from the Agilent Technologies website .
Reference-guide genome assembly
SOAPaligner/SOAP2 version 2.20 was used to align all short sequencing reads to the Hg18 reference genome with a maximum of two mismatches and non gap parameters. The insert size distribution of each library was checked by Eland contained in the Illumina Pipeline, and thus the insert size range was set for SOAPaligner.
All reads uniquely mapped to exome region and 100 bp-flanking regions were selected for SNP calling. SOAPsnp version 1.03 was used for calculating the likelihood of each cell genotype and mixed tissue. In each sample, putative SNPs were filtered based on the following criteria: (1) a Q20 quality cutoff; (2) at least five reads; (3) a p-value >0.01 (that means no significant difference between the sequencing quality of the two alleles in a heterozygous genotype); (4) a 5 bp distance from each other; and (5) 1/3 ~ 3 variation between quality score of two bases in heterozygous sites. And the final SNP data of the filtered SNP in each cell was combined.
Evaluation of ADO rate
With the final SNPs, the ADO ratio is defined as the random non-amplification of one of the alleles present in a heterozygous sample.
We first defined a background dbSNP subset, which was read-covered in both tissue sequencing of the tumor and matched normal among dbSNP. We then determined the heterozygous coverage ratio per sequencing depth, which was calculated by dividing the heterozygous number of this sample by the total read-covered number in this background dbSNP subset, for single normal cell and normal tissue sequencing, respectively. We then calculated the relative FNR per depth in a single normal cell,
n the sequencing depth
T the tissue sequencing
S the single cell sequencing
HR the heterozygous rate of one kind of sequencing under n × sequencing depth
CR the read-covered rate of one kind of sequencing under n × sequencing depth
We thus calculated the ADO of a single normal cell, which was the median of all relative FNR per depth on non-outlier depths (the outlier depth was the depth when the FNR in tissue sequencing at this depth was higher than the SCS). And finally we calculated the ADO in all whole-exome SCS as the mean of ADO of all normal cells.
Evaluation of FDR
The FDR is defined as a false heterozygous site in a homozygous sample, which may be due to amplification, hybridization or sequencing errors. We chose a homozygous subset which was high-confidence (Quality-score = 99 and 95% confidence interval of distributive depth under Poisson distribution), and then we calculated the number of discrepant sites of single qualified normal cell in the homozygous subset. The FDR per cell was indicated as the number of discrepant sites divided by total covered number of high-confidence homozygous subset. And finally the FDR in all whole-exome SCS was indicated as the mean of all qualified normal cells.
Identification and experimental validation of somatic mutations
To eliminate the random errors or artifacts induced by SCS, we built two binomial tests to detect high-confidence point somatic mutations (SMs) in capture regions. The putative SMs were filtered based on the following criteria: (1) homozygous normal in all normal cells (at least read-covered in six normal cells); (2) the homozygous genotypes in normal cells were consistent in normal tissue; (3) at least mutant in three cancer cells among a total of 44 qualified cancer cells.
The criterion (1) was set by a binomial test with ADO, qualified normal cell number, and whole-exome size to eliminate the random errors.
i the normal cell number of read-covered for a specific mutation
n the total number of qualified normal cells
a the ADO
P(i) the probability under binomial distribution
S the whole-exome size
Then the criterion (1) was set according to the largest cell number i fulfills above inequation.
And the criterion (3) was set by a binomial test with FDR, qualified cancer cell number, whole-exome size to eliminate the random errors.
i the cancer cell number of mutant of a specific mutation
n the total number of qualified cancer cell
P(i) the probability under binomial distribution
S the whole-exome size
Then the criterion (3) was set according to the largest cell number fulfills above inequation.
High confident SMs in BC cells were randomly selected for experimental validation. The loci of corresponding single cells were amplified, sequenced and analyzed by MassARRAY® Analyzer (SEQUENOM, San Diego, CA, USA).
Correlation of SMAF between SCS and whole tissue sequencing
Correlation coefficient of determination is a goodness-of-fit measure for models based on the proportion of explained variance. The somatic mutant allele frequency in SCS was indicated as unfolded site frequency of mutant alleles, and the somatic mutant allele frequency in tissue sequencing was indicated as read frequency of mutant alleles.
Cell population genetic inferences based on called (inferred) SNPs can lead to serious biases and possibly false inferences for the high ADO ratio for each cell. We have therefore developed a series of statistical techniques that can take uncertainty in genotype calls and allele frequency estimation into account. Instead of utilizing a Bayesian estimation-based method called site-frequency spectrum (SFS) on population individuals , we used this SFS to calculate the somatic allele frequency in each SM site of all BC and BN cells respectively, based on the same methods of estimating allele frequencies from reads in one site and additional estimating sample allele frequencies.
To identify the most variable factors in classifying subgroups among single cancer cells, we utilized an R package, pcaMethods v1.12.0 , in performing the PCA based on the genotyping result at all somatic mutation sites on each single cell. Missing values were automatically estimated by probabilistic method within the R package.
Heat map 2-dimension clustering
The 2-dimension heat map clustering of mutant genes and cells is based on the somatic mutations using R language. Color indicates the presence of tendency of mutant or normal, which was showed as log10 value of the relative probability of mutant to the probability of normal (non-mutant) after Bayes calibration by FDR, ADO and a priori probability from normal tissue.
For each mutant point site which is covered by sequencing reads in each cell, we first derived the mutant posterior probability by Bayes calibration with FDR, ADO and a priori
probability from normal tissue. The values in heat map array and corresponding posterior mutant probability were calculated by the following formulas:
is one kind of the three genotypes (NN, NC, CC for homozygous normal, heterozygous mutant, homozygous mutant, respectively) in the site of each cell.
is one observed genotype (NN or NC or CC).
is the mutant posterior probability of A
on the observation of O
(Ai) is the priori probability of A
calculated from the corresponding site at normal tissue.
is the probability of A
calibrated by ADO and FDR.
H is the logarithmic value of the relative probability of mutant to the probability of normal used for heat map 2-dimension clustering.
The priori probability of A
calculated as following formulas:
Note: Quality score and rank sum test p-value are the values from the files after sequencing read alignment by SOAP.
The probability of A
was calibrated by ADO and FDR as following formulas:
is the ADO in homozygous genotype.
is the ADO in heterozygous genotype.
is the FDR in single cell sequencing.
If a site is not covered by reads in one cell, then we estimate its log10 value by averaging all the log10 values in other cells at this site.
The columns and the rows were also ordered as in the tree with sub clusters. And certain suspicious genes illustrated as red (likely mutant) in BN cells, which were defined as ≥1 BN cell containing value ≥0.5 (light red), in our raw data were filtered in heat map clustering. Having been tackled with FDR, ADO and other sources of error, we suggested that the remaining suspicious genes may due to paralogous alignment or false negatives under our model. Under extreme low FDR value, the false negatives here might derive from super high sensitivity of our heat map model to even a low number of error sequencing reads in a genotype, that is, if there exist two mutant reads in a total of 30 sequencing reads in BN cells, its adjusted heat map-array value was likely to higher than 1 (red) in our model. However, two error sequencing reads out of 30 reads is understandable in a current second-generation sequencing platform. Finally, the remaining 51 genes (yellow-colored genes in Additional file 5: Table S2B) were used for the heat map and its matched cell-lineage tree analyses.
Concurrent and exclusive mutation analysis
Concurrent and exclusive mutation analysis was performed with two Perl packages [41, 42]. The depths of color in array were the concurrent and exclusive p-value between each two selected genes.
Function analysis of key genes
To analyze the functional impact of the somatic mutations in key genes, we first investigated the protein sequence of each key gene from the UniProtKB/Swiss-Prot database . The amino acid changes were determined by the human genome annotation files from the University of California Santa Cruz Genome Bioinformatics site . We then mapped the abnormal amino acid changes to the protein sequence of each gene and determined which domain had been altered. Further, altered signaling pathways were determined by mapping the key genes to the Kyoto Encyclopedia of Genes and Genomes (KEGG) database .