- Data Note
Whole genome sequence analysis of BT-474 using Complete Genomics’ standard and Long Fragment Read technologies
GigaScience volume 5, Article number: 8 (2016)
The cell line BT-474 is a popular cell line for studying the biology of cancer and developing novel drugs. However, there is no complete, published genome sequence for this highly utilized scientific resource. In this study we sought to provide a comprehensive and useful data set for the scientific community by generating a whole genome sequence for BT-474.
Five μg of genomic DNA, isolated from an early passage of the BT-474 cell line, was used to generate a whole genome sequence (114X coverage) using Complete Genomics’ standard sequencing process. To provide additional variant phasing and structural variation data we also processed and analyzed two separate libraries of 5 and 6 individual cells to depths of 99X and 87X, respectively, using Complete Genomics’ Long Fragment Read (LFR) technology.
BT-474 is a highly aneuploid cell line with an extremely complex genome sequence. This ~300X total coverage genome sequence provides a more complete understanding of this highly utilized cell line at the genomic level.
Utility of the dataset
The cell line BT-474 was isolated by Lasfargues et al. [1] in 1978 from a biopsy of invasive ductal carcinoma from a 60-year-old Caucasian female. Since that time it has become one of the most heavily utilized cell lines for breast cancer research. At the time of writing, entering the search term “BT-474 OR BT474” into PubMed returned 973 unique articles. Surprisingly, the complete genome sequence of this cell line has yet to be published. In this paper, we fill that void in the collective scientific knowledge by providing high coverage whole genome data for BT-474.
Previous studies have shown that BT-474 has a modal number of chromosomes approximating tetraploidy, and most of these chromosomes are covered with megabase-sized amplifications, deletions, and other structural rearrangements. In an effort to provide better coverage of these complex rearranged regions, and to provide variant phasing and error correcting information, we generated high coverage libraries from long genomic DNA (~40 kb) using Long Fragment Read (LFR) technology [3, 4], and supplemented those libraries with a standard (STD) short mate pair library (~500 bp) for a combined total coverage of ~300X. We hope the freely available resource provided in this paper will benefit our understanding of the biology of cancer, and ultimately help to improve therapies for patients.
DNA was isolated from low passage number BT-474 cells, procured from the American Type Culture Collection (ATCC, Manassas, VA, USA), using a RecoverEase dialysis kit (Agilent, Santa Clara, CA, USA). This material was further fragmented to 300–800 base pairs using a Covaris E220 (Covaris, Woburn, MA, USA), and processed using Complete Genomics’ proprietary standard library construction. For LFR libraries, approximately 5 cells were collected and deposited into a 1.5 ml microtube with 10 μl of distilled water. Cells were lysed, and DNA was denatured using 1 μl of 20 mM KOH and 0.5 mM EDTA. Denatured genomic DNA was dispersed across a 384-well plate. In each well, long genomic fragments (~40 kb) were amplified, fragmented, and tagged with a unique barcode adapter as previously described. All libraries were sequenced using Complete Genomics’ nanoarray sequencing platform.
BT-474 genome analysis
Read data of 343, 298, and 261 Gb from the STD, LFR1, and LFR2 libraries, respectively, were mapped to the NCBI human reference genome (build 37) using Complete Genomics’ pipeline [3, 5, 6] (Table 1), resulting in close to 100X coverage in each of the libraries. The high coverage allowed more than 90 % of the genome and exome of each library to be called (Table 1). Plotting the number of reads falling within consecutive 100 kb windows for the BT-474 standard library resulted in the expected complex pattern of amplifications affecting almost all chromosomes (Fig. 1). Known amplifications of ERBB2 and the HOX gene cluster on chromosome 17 are readily identifiable from this plot, as well as many other megabase-sized highly amplified regions. Analysis of both standard and LFR libraries resulted in the discovery of 110, 175, and 145 interchromosomal translocations in the STD, LFR1, and LFR2 libraries, respectively (Table 2). Clustering these translocations, based on windows of 5 kb around the breakpoints, showed that many translocations overlapped within and between libraries, and reduced the total number of translocations to 291 (Table 2 and Fig. 2). Additionally, comparing our results to a published RNA sequencing analysis of BT-474 [7, 8] demonstrated that three of the five coding interchromosomal translocations were called in our data (Table 3). For the remaining two translocations that were not called by our algorithms, raw reads were found to support their existence in our libraries; for the STARD3-DOK5 translocation, improved algorithms would most likely detect this event. In the case of the TRPC4AP-MRPL45 translocation, only one mate pair read in the STD library was found in support, making it unlikely to have been called even with modifications to our algorithms.
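The breakpoint clustering step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in this study: the `Junction` record, the greedy clustering strategy, and the toy coordinates are all assumptions introduced for the example; only the 5 kb window comes from the text. It also assumes that each call has already been normalized so that the two breakpoints are listed in a consistent order.

```python
from collections import namedtuple

# Hypothetical record for one interchromosomal junction call from one library.
Junction = namedtuple("Junction", "library chrom_a pos_a chrom_b pos_b")

WINDOW = 5_000  # 5 kb tolerance around each breakpoint, as stated in the text


def same_event(x, y, window=WINDOW):
    """True if both breakpoints of x and y lie within `window` bp of each other."""
    return (
        x.chrom_a == y.chrom_a
        and x.chrom_b == y.chrom_b
        and abs(x.pos_a - y.pos_a) <= window
        and abs(x.pos_b - y.pos_b) <= window
    )


def cluster_junctions(junctions, window=WINDOW):
    """Greedy clustering: each call joins the first cluster with a matching member."""
    clusters = []
    ordered = sorted(junctions, key=lambda j: (j.chrom_a, j.chrom_b, j.pos_a, j.pos_b))
    for call in ordered:
        for cluster in clusters:
            if any(same_event(call, member, window) for member in cluster):
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return clusters


# Toy coordinates, invented for illustration only.
calls = [
    Junction("STD", "chr17", 37_800_000, "chr20", 46_000_000),
    Junction("LFR1", "chr17", 37_802_500, "chr20", 46_001_000),
    Junction("LFR2", "chr17", 37_900_000, "chr20", 46_000_500),
    Junction("LFR2", "chr1", 1_000_000, "chr4", 2_000_000),
]
clusters = cluster_junctions(calls)
for cluster in clusters:
    print(sorted({c.library for c in cluster}), cluster[0].chrom_a, cluster[0].chrom_b)
```

Because the clustering is greedy, two clusters that are later bridged by a third call are not merged; a production implementation would probably use a union-find structure or an interval tree instead.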
A total of 3.24 million single nucleotide variants (SNVs) were called in the STD library, and over 2.85 million in each of the LFR libraries. Of these, 2.84 million were called in all libraries (Fig. 3), demonstrating good reproducibility between different methods of library construction. For all libraries the ratio of heterozygous to homozygous SNVs was close to 1, much lower than the expected ~1.6 for Caucasian genomes. This is most likely the result of loss of heterozygosity (LOH) from the deletion or multi-copy amplification of large portions, and/or a complete parental copy, of almost all chromosomes in the BT-474 genome, as seen in our data (Fig. 1 and Fig. 4), and as previously described. This was confirmed by estimating what would happen to heterozygous variants in the NA12878 genome (the sample used by the ‘Genome in a Bottle’ Consortium [9]) in two scenarios: if the same percentage of the genome was LOH based on 1) the percentage of the genome lost, or 2) the percentage of variants lost (22.2 % and 20.3 %, respectively, Table 4). In both cases the ratio of heterozygous to homozygous variants was reduced to close to 1 (Table 5).
Analysis of the coding regions of a comprehensive list of known cancer-causing genes [10, 11] identified 67 small variants (<50 base pairs, Additional file 1). Most of these are probably inherited variants with no involvement in tumor formation; however, variants in TP53 and PIK3CA, previously found as somatic mutations in many tumors, were found in this cell line (Additional file 1). We also identified a potentially inherited variant in CHEK2, listed as ‘likely to be pathogenic’ in the ClinVar database [13]. To demonstrate the quality of our variant calls we compared them to a list generated by targeted sequencing of BT-474 as part of the Cancer Cell Line Encyclopedia (CCLE) project [14]. When the data from all three libraries were combined, 92 % of the variants found in CCLE were also called in our data, suggesting that our BT-474 genome is of good quality (Fig. 5 and Additional file 2). Further, 130 variants were found in two or more of our libraries that were not found in the CCLE data. This is either because the exons in which these variants were found were not covered as part of the CCLE target set, because these variants were missed in the CCLE sequencing analysis, or, to a lesser extent, because they are false positives in our dataset (Additional file 2).
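The direction and rough size of this effect can be checked with a back-of-the-envelope model. The sketch below is not the NA12878-based procedure used for Tables 4 and 5; it is a simplified calculation under two stated assumptions: variants are spread uniformly across the genome, and in an LOH region a heterozygous site keeps the variant allele with probability `retained` (0.5 here), in which case it becomes homozygous, and otherwise reverts to the reference and is no longer called. The function name and parameters are hypothetical.

```python
def het_hom_ratio_after_loh(baseline_ratio, loh_fraction, retained=0.5):
    """Estimate the het:hom ratio after LOH affects a fraction of the genome.

    baseline_ratio: het:hom ratio before LOH (about 1.6 in the text)
    loh_fraction:   fraction of the genome (or of variants) affected by LOH
    retained:       assumed chance that the variant allele is the one kept
    """
    # Heterozygous sites outside LOH regions are unchanged.
    het = baseline_ratio * (1.0 - loh_fraction)
    # Homozygous sites (normalized to 1) gain the het sites that kept the variant.
    hom = 1.0 + retained * baseline_ratio * loh_fraction
    return het / hom


ratio_genome_lost = het_hom_ratio_after_loh(1.6, 0.222)    # scenario 1 in the text
ratio_variants_lost = het_hom_ratio_after_loh(1.6, 0.203)  # scenario 2 in the text
print(round(ratio_genome_lost, 2), round(ratio_variants_lost, 2))
```

Under these assumptions both scenarios bring the ratio from 1.6 down to between 1.0 and 1.1, in line with the observation that the ratio falls to close to 1.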
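The comparison against CCLE amounts to set operations on variant calls. The following sketch shows one way such a comparison could be computed; the variant representation `(chromosome, position, reference allele, alternate allele)`, the function names, and the toy call sets are assumptions made for illustration and do not reproduce the analysis in Additional file 2.

```python
from collections import Counter


def concordance(reference_calls, library_calls):
    """Fraction of reference calls found in the combined calls of all libraries."""
    combined = set().union(*library_calls.values())
    found = reference_calls & combined
    return len(found) / len(reference_calls), found


def novel_in_multiple(reference_calls, library_calls, min_libraries=2):
    """Variants seen in at least `min_libraries` libraries but absent from the reference set."""
    counts = Counter(v for calls in library_calls.values() for v in calls)
    return {v for v, n in counts.items() if n >= min_libraries and v not in reference_calls}


# Toy variants with invented coordinates, for illustration only.
v1 = ("chr1", 100, "A", "C")
v2 = ("chr2", 200, "G", "T")
v3 = ("chr3", 300, "C", "T")
v4 = ("chr4", 400, "T", "G")
v5 = ("chr5", 500, "G", "A")

ccle_like = {v1, v2, v3, v4}
libraries = {
    "STD": {v1, v2, v5},
    "LFR1": {v1, v3, v5},
    "LFR2": {v2},
}

fraction, found = concordance(ccle_like, libraries)
extra = novel_in_multiple(ccle_like, libraries)
print(f"{fraction:.0%} of reference calls recovered; {len(extra)} extra call(s) in >=2 libraries")
```

Requiring support from two or more libraries, as in the text, is a simple guard against library-specific false positives.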
Availability of supporting data
Complete Genomics data formats
The entire data set from Complete Genomics, provided here, consists of a series of files and directories covering various categories of whole genome analysis (Fig. 6). A complete description of all files and the methods used to generate them can be found in the “Standard Sequencing Service Data File Formats v2.5” document provided by Complete Genomics, Inc. (available in Additional file 3 [15]).
Data packages from LFR do not include directories for structural variation (SV) or mobile element insertion (MEI); for more information on the content of these directories, see the “Standard Sequencing Service Data File Formats v2.5” file mentioned above. In addition, one of the fields in the variant file (hapLink) is modified and there are six new fields, described below:
hapLink: LFR phased variants have an ID with the pattern: “Phased_#_#_#”, where # is an integer, the first two #s describe unique contigs, and the last # in the series is either 1 or 0 and represents the two possible haplotypes for each contig. All SNPs sharing the same “Phased_#_#_#” are from the same haplotype.
wellCount: total number of LFR wells (out of 384) containing sequence reads calling the variant or reference allele. This metric is used to identify polymerase-induced false positive calls, since it is unlikely that random polymerase errors will occur in multiple different wells. A complete explanation of this concept can be found in Peters et al.
wellIDs: contains the IDs of the specific wells from which reads calling the variant come.
exclusiveWellCount: this is the number of wells at each locus that have reads calling only the variant or the reference allele (not both). For true heterozygous variants this number should be close to that obtained for “wellCount”.
SharedWellCount: at each locus this is the number of wells that contain reads calling both alleles. For true heterozygous variants this should be low; having a high number here suggests mapping errors. For homozygous variants almost all of the well counts should be in this field.
MinExclusiveWellCountInThisLocus: this is the minimum number of exclusive wells (non-shared well counts) at each locus.
MaxExculisveWellCountInThisLocus: this is the maximum number of exclusive wells (non-shared well counts) at each locus.
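As an illustration of how these fields might be used, the sketch below parses a hapLink identifier and derives the three basic well counts from the sets of wells supporting each allele. It follows one reading of the definitions above and is not taken from the Complete Genomics software: the function names, the returned dictionary keys, and the two-allele simplification are assumptions.

```python
import re

# Pattern from the text: "Phased_#_#_#", where the last integer is 0 or 1.
HAPLINK_RE = re.compile(r"^Phased_(\d+)_(\d+)_([01])$")


def parse_haplink(value):
    """Return ((contig_id_1, contig_id_2), haplotype) or None if not an LFR phased ID."""
    match = HAPLINK_RE.match(value)
    if match is None:
        return None
    contig = (int(match.group(1)), int(match.group(2)))
    return contig, int(match.group(3))


def well_counts(variant_wells, reference_wells):
    """Derive well counts from the sets of well IDs supporting each allele."""
    variant_wells = set(variant_wells)
    reference_wells = set(reference_wells)
    return {
        # Wells with reads calling either allele.
        "well_count": len(variant_wells | reference_wells),
        # Wells with reads calling only one of the two alleles.
        "exclusive_well_count": len(variant_wells ^ reference_wells),
        # Wells with reads calling both alleles.
        "shared_well_count": len(variant_wells & reference_wells),
    }


def same_haplotype(haplink_a, haplink_b):
    """True if two variants carry the same phased ID, i.e. lie on the same haplotype."""
    parsed_a, parsed_b = parse_haplink(haplink_a), parse_haplink(haplink_b)
    return parsed_a is not None and parsed_a == parsed_b


print(parse_haplink("Phased_12_3_1"))
print(well_counts({1, 2, 3}, {3, 4}))
```

For a true heterozygous call, the exclusive count should be close to the total and the shared count low, which is the pattern the example well sets are meant to show.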
LFR structural variant analysis files
Each LFR genome contains an LFR-specific structural variant file in the ASM directory (see Fig. 6 for directory tree). This file is generated using a novel algorithm that identifies unexpected mate-pairs that are found in more than one compartment of an LFR library (manuscript in preparation). A full description of the headers can be found within each file under the Excel tab labeled “Header Description”.
Read and mapping data for all genomes reported here are available at the European Nucleotide Archive (ENA) under study accession number PRJEB10587. Sample accession numbers for each sequence library can be found in Table 1. Supporting data are also available from the GigaScience GigaDB database [16].
Abbreviations
LFR: Long Fragment Read technology
STD: Complete Genomics standard library
ENA: European Nucleotide Archive
SNV: Single nucleotide variant
CCLE: Cancer Cell Line Encyclopedia
References
1. Lasfargues EY, Coutinho WG, Redfield ES. Isolation of two human tumor epithelial cell lines from solid breast carcinomas. J Natl Cancer Inst. 1978;61(4):967–78.
2. Rondon-Lagos M, Verdun Di Cantogno L, Marchio C, Rangel N, Payan-Gomez C, Gugliotta P, et al. Differences and homologies of chromosomal alterations within and between breast cancer cell lines: a clustering analysis. Mol Cytogenet. 2014;7(1):8. doi:10.1186/1755-8166-7-8.
3. Peters BA, Kermani BG, Sparks AB, Alferov O, Hong P, Alexeev A, et al. Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells. Nature. 2012;487(7406):190–5. doi:10.1038/nature11236.
4. Peters BA, Kermani BG, Alferov O, Agarwal MR, McElwain MA, Gulbahce N, et al. Detection and phasing of single base de novo mutations in biopsies from human in vitro fertilized embryos by advanced whole-genome sequencing. Genome Res. 2015;25(3):426–34. doi:10.1101/gr.181255.114.
5. Drmanac R, Sparks AB, Callow MJ, Halpern AL, Burns NL, Kermani BG, et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science. 2010;327(5961):78–81. doi:10.1126/science.1181498.
6. Carnevali P, Baccash J, Halpern AL, Nazarenko I, Nilsen GB, Pant KP, et al. Computational techniques for human genome resequencing using mated gapped reads. J Comput Biol. 2012;19(3):279–92. doi:10.1089/cmb.2011.0201.
7. Edgren H, Murumagi A, Kangaspeska S, Nicorici D, Hongisto V, Kleivi K, et al. Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biol. 2011;12(1):R6. doi:10.1186/gb-2011-12-1-r6.
8. Kangaspeska S, Hultsch S, Edgren H, Nicorici D, Murumagi A, Kallioniemi O. Reanalysis of RNA-sequencing data reveals several additional fusion genes with multiple isoforms. PLoS One. 2012;7(10):e48745. doi:10.1371/journal.pone.0048745.
9. Zook JM, Chapman B, Wang J, Mittelman D, Hofmann O, Hide W, et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol. 2014;32(3):246–51. doi:10.1038/nbt.2835.
10. Lawrence MS, Stojanov P, Mermel CH, Robinson JT, Garraway LA, Golub TR, et al. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature. 2014;505(7484):495–501. doi:10.1038/nature12912.
11. Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz Jr LA, Kinzler KW. Cancer genome landscapes. Science. 2013;339(6127):1546–58. doi:10.1126/science.1235122.
12. COSMIC. http://cancer.sanger.ac.uk/cosmic. Accessed 10/01/2015.
13. Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 2014;42(Database issue):D980–5. doi:10.1093/nar/gkt1113.
14. Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, Kim S, et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012;483(7391):603–7. doi:10.1038/nature11003.
15. Complete Genomics. http://www.completegenomics.com/customer-support/documentation/100357139-2/.
16. Serban Ciotlos; Qing Mao; Rebecca Yu Zhang; Zhenyu Li; Robert Chin; Natali Gulbahce; Sophie Jia Liu; Radoje Drmanac; Brock A Peters (2016): Supporting materials for “Whole genome sequence analysis of BT-474 using Complete Genomics’ standard and Long Fragment Read technologies”. GigaScience Database. doi.org/10.5524/100188
We would like to acknowledge the ongoing contributions and support of all Complete Genomics employees, in particular the many highly skilled individuals working in the libraries, reagents, and sequencing groups, who make it possible to generate high quality whole genome data.
The authors are shareholders in BGI holdings and Complete Genomics. BGI derives income from whole genome sequencing.
BAP, SJL, and RD conceived the study. RYZ and RC cultured cells, isolated genomic DNA, and generated all of the sequencing libraries. SC, NG, QM, and ZL processed and analyzed the data. All authors read and approved the final manuscript.