
Whole genome sequence analysis of BT-474 using Complete Genomics’ standard and Long Fragment Read technologies



The cell line BT-474 is a popular cell line for studying the biology of cancer and developing novel drugs. However, there is no complete, published genome sequence for this highly utilized scientific resource. In this study we sought to provide a comprehensive and useful data set for the scientific community by generating a whole genome sequence for BT-474.


Five μg of genomic DNA, isolated from an early passage of the BT-474 cell line, was used to generate a whole genome sequence (114X coverage) using Complete Genomics’ standard sequencing process. To provide additional variant phasing and structural variation data we also processed and analyzed two separate libraries of 5 and 6 individual cells to depths of 99X and 87X, respectively, using Complete Genomics’ Long Fragment Read (LFR) technology.


BT-474 is a highly aneuploid cell line with an extremely complex genome sequence. This ~300X total coverage genome sequence provides a more complete understanding of this highly utilized cell line at the genomic level.


Data description

Utility of the dataset

The cell line BT-474 was isolated by Lasfargues et al. [1] in 1978 from a biopsy of invasive ductal carcinoma from a 60-year-old Caucasian female. Since that time it has become one of the most heavily utilized cell lines for breast cancer research. At the time of writing, entering the search term “BT-474 OR BT474” into PubMed returned 973 unique articles. Surprisingly, the complete genome sequence of this cell line has yet to be published. In this paper, we fill that gap by providing high coverage whole genome data for BT-474.

Previous studies have shown that BT-474 has a modal number of chromosomes approximating tetraploidy, and that most of these chromosomes are covered with megabase-sized amplifications, deletions, and other structural rearrangements [2]. In an effort to provide better coverage of these complex rearranged regions, and to provide variant phasing and error correcting information, we generated high coverage libraries from long genomic DNA (~40 kb) using Long Fragment Read (LFR) technology [3, 4], and supplemented those libraries with a standard (STD) short mate pair library (~500 bp) [5] for a combined total coverage of ~300X. We hope the freely available resource provided in this paper will benefit our understanding of the biology of cancer, and ultimately help to improve therapies for patients.

Library generation

DNA was isolated from low passage number BT-474 cells, procured from the American Type Culture Collection (ATCC, Manassas, VA, USA), using a RecoverEase dialysis kit (Agilent, Santa Clara, CA, USA). This material was further fragmented to 300–800 base pairs using a Covaris E220 (Covaris, Woburn, MA, USA), and processed using Complete Genomics’ proprietary standard library construction [5]. For LFR libraries, approximately 5 cells were collected and deposited into a 1.5 ml microtube with 10 μl of distilled water. Cells were lysed, and DNA was denatured using 1 μl of 20 mM KOH and 0.5 mM EDTA. Denatured genomic DNA was dispersed across a 384-well plate. In each well, long genomic fragments (~40 kb) were amplified, fragmented, and tagged with a unique barcode adapter as previously described [3]. All libraries were sequenced using Complete Genomics’ nanoarray sequencing platform [5].

BT-474 genome analysis

Read data of 343, 298, and 261 Gb from the STD, LFR1, and LFR2 libraries, respectively, were mapped to the NCBI human reference genome (build 37) using Complete Genomics’ pipeline [3, 5, 6] (Table 1), resulting in roughly 100X coverage in each library. The high coverage allowed more than 90 % of the genome and exome of each library to be called (Table 1). Plotting reads falling within consecutive 100 kb windows for the BT-474 standard library resulted in the expected complex pattern of amplifications affecting almost all chromosomes [2] (Fig. 1). Known amplifications of ERBB2 and the HOX gene cluster on chromosome 17 [2] are readily identifiable from this plot, as are many other megabase-sized highly amplified regions. Analysis of the standard and LFR libraries identified 110, 175, and 145 interchromosomal translocations in the STD, LFR1, and LFR2 libraries, respectively (Table 2). Clustering these translocations, based on windows of 5 kb around the breakpoints, merged many overlapping translocations within and between libraries and reduced the total number of translocations to 291 (Table 2 and Fig. 2). Additionally, comparing our results to a published RNA sequencing analysis of BT-474 [7, 8] showed that three of the five coding interchromosomal translocations were called in our data (Table 3). For the remaining two translocations, which were not called by our algorithms, raw reads supporting their existence were found in our libraries. Improved algorithms would most likely detect the STARD3-DOK5 translocation. In the case of the TRPC4AP-MRPL45 translocation, only one mate pair read in the STD library was found in support, making it unlikely to have been called even with modifications to our algorithms.
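As a rough cross-check of the coverage figures, the read yield of each library can be divided by the size of the reference genome. The sketch below is illustrative only: the 3.0 Gb genome size is a round-number assumption made here rather than a value taken from the pipeline, and the function name is hypothetical.

```python
# Rough depth-of-coverage estimate: sequenced bases divided by genome size.
# GENOME_SIZE_GB is an assumed round figure, not a value stated in the text.
GENOME_SIZE_GB = 3.0


def estimated_coverage(read_data_gb: float, genome_size_gb: float = GENOME_SIZE_GB) -> float:
    """Return the approximate fold coverage for one library."""
    return read_data_gb / genome_size_gb


# Read yields in Gb as reported for the three libraries.
libraries = {"STD": 343, "LFR1": 298, "LFR2": 261}
coverage = {name: estimated_coverage(gb) for name, gb in libraries.items()}
total = sum(coverage.values())

for name, depth in coverage.items():
    print(f"{name}: ~{depth:.0f}X")
print(f"Combined: ~{total:.0f}X")
```

Under this assumption the per-library estimates land close to the 114X, 99X, and 87X quoted for the STD, LFR1, and LFR2 libraries, and the combined figure is close to 300X.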

Table 1 BT-474 genome statistics
Fig. 1

100 kb read coverage. For the standard library of BT-474, reads were averaged across consecutive 100 kb bins, normalized to a tetraploid copy number, and plotted such that each dot represents the coverage of a single 100 kb region of the genome. The y-axis shows haploid copy number; the x-axis shows genome position, increasing from left to right by chromosome and position.

Table 2 Potential translocations identified in BT-474
Fig. 2

Overlap of inter-chromosomal translocations between libraries. Interchromosomal translocations were called for each library. The overlap between libraries was determined by considering translocations found within 5 kb of each other, and in the same orientation, to be the same event. This also resulted in the aggregation of multiple close translocations within the same sample into a single event. In total, 109 translocations were found in the standard library (black), and 147 and 133 were found in LFR libraries 1 (blue) and 2 (green), respectively. Of these, 85 interchromosomal translocations are shared between at least two libraries
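The merging rule described in this caption (same orientation, breakpoints within 5 kb) can be sketched as follows. This is a minimal illustration and not the algorithm used to produce Table 2 or Fig. 2: the `Junction` record, the greedy seed-based clustering, and the example coordinates are all assumptions, and the two sides of each junction are assumed to be listed in a consistent order.

```python
from dataclasses import dataclass

WINDOW_BP = 5_000  # breakpoints no further apart than this count as the same event


@dataclass(frozen=True)
class Junction:
    """Hypothetical record for one interchromosomal translocation call."""
    library: str
    chrom_a: str
    pos_a: int
    strand_a: str
    chrom_b: str
    pos_b: int
    strand_b: str


def same_event(x: Junction, y: Junction, window: int = WINDOW_BP) -> bool:
    """True if chromosomes and orientations agree and both breakpoints lie within `window` bp."""
    return (
        x.chrom_a == y.chrom_a and x.chrom_b == y.chrom_b
        and x.strand_a == y.strand_a and x.strand_b == y.strand_b
        and abs(x.pos_a - y.pos_a) <= window
        and abs(x.pos_b - y.pos_b) <= window
    )


def cluster_junctions(junctions: list[Junction]) -> list[list[Junction]]:
    """Greedy clustering: each call joins the first cluster whose seed it matches."""
    clusters: list[list[Junction]] = []
    ordered = sorted(junctions, key=lambda j: (j.chrom_a, j.pos_a, j.chrom_b, j.pos_b))
    for junction in ordered:
        for cluster in clusters:
            if same_event(cluster[0], junction):
                cluster.append(junction)
                break
        else:
            clusters.append([junction])
    return clusters


# Made-up coordinates, for illustration only.
calls = [
    Junction("STD", "chr17", 37_850_000, "+", "chr20", 52_100_000, "-"),
    Junction("LFR1", "chr17", 37_851_200, "+", "chr20", 52_098_900, "-"),
    Junction("LFR1", "chr17", 37_853_000, "+", "chr20", 52_101_000, "-"),
    Junction("LFR2", "chr1", 1_000_000, "+", "chr4", 2_000_000, "+"),
]
clusters = cluster_junctions(calls)
shared = [c for c in clusters if len({j.library for j in c}) >= 2]
print(len(clusters), "events;", len(shared), "seen in at least two libraries")
```

Note that the two nearby LFR1 calls collapse into one event, which mirrors the within-sample aggregation mentioned in the caption.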

Table 3 Translocations confirmed by published RNA sequencing data

Single nucleotide variants (SNVs) numbering 3.24 million were called in the STD library, and over 2.85 million in each of the LFR libraries. Of these, 2.84 million were called in all libraries (Fig. 3), demonstrating good reproducibility between different methods of library construction. For all libraries the ratio of heterozygous to homozygous SNVs was close to 1, much lower than the ~1.6 expected for Caucasian genomes. This is most likely the result of loss of heterozygosity (LOH) caused by the deletion or multi-copy amplification of large portions, and/or of a complete parental copy, of almost all chromosomes in the BT-474 genome, as seen in our data (Fig. 1 and Fig. 4) and as previously described [2]. This interpretation was supported by estimating what would happen to heterozygous variants in the NA12878 genome (the sample used by the ‘Genome in a Bottle’ Consortium [9]) if the same percentage of its genome were subject to LOH, in two scenarios: based on 1) the percentage of the genome lost, or 2) the percentage of variants lost (22.2 % and 20.3 %, respectively, Table 4). In both cases the ratio of heterozygous to homozygous variants was reduced to close to 1 (Table 5).
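The effect of LOH on the heterozygous to homozygous ratio can be illustrated with a simplified model. This is not the calculation behind Table 5. It assumes, purely for illustration, that each heterozygous site affected by LOH keeps the variant allele half of the time (becoming homozygous) and otherwise reverts to the reference allele and is no longer called; the function and parameter names are hypothetical.

```python
def ratio_after_loh(het: float, hom: float, loh_fraction: float,
                    retained_variant_fraction: float = 0.5) -> float:
    """Het:hom ratio after a fraction of heterozygous sites lose one allele.

    Assumption (not from the text): at each affected site the variant allele
    is retained with probability `retained_variant_fraction`, making the site
    homozygous; otherwise the site becomes reference and drops out of the calls.
    """
    lost = het * loh_fraction
    het_after = het - lost
    hom_after = hom + lost * retained_variant_fraction
    return het_after / hom_after


# Start from the ~1.6 het:hom ratio quoted above, with hom normalized to 1.
for loh in (0.222, 0.203):
    print(f"LOH {loh:.1%}: het/hom = {ratio_after_loh(1.6, 1.0, loh):.2f}")
```

Even under this crude model, LOH over roughly a fifth of the heterozygous sites pulls a starting ratio of 1.6 down to about 1.1, consistent in direction and magnitude with the observation above.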

Fig. 3

Overlap of called variations between libraries. Single nucleotide variants (SNVs) numbering 3.24 million were called in the STD library, and over 2.85 million in each of the LFR libraries. The overlap between each library was compared and plotted. The standard library (black), and LFR libraries 1 (blue) and 2 (green) are highly overlapping, demonstrating that the majority of the variant calls are highly reproducible between separately processed sequencing libraries

Fig. 4

Circos plot of the BT-474 genome. Chromosome number (in large bold numbers and letters), chromosome position (in small numbers), and a karyotype ideogram form the outer circle of the plot. The remaining circles are described in order from outermost to innermost: called ploidy (the copy number of the region; blue-gray), Lesser Allele Fraction (LAF, the fraction of the lesser allele, 0.5 for a heterozygous SNP, 0 for a homozygous SNP; green), density of heterozygous SNPs (orange), and density of homozygous SNPs (blue). Lines in the center of the plot represent interchromosomal junctions
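The LAF track in this plot can be read with the help of a small worked example. The sketch below assumes that LAF is computed from the read counts supporting each of two alleles at a site; the caption only defines LAF as the fraction of the lesser allele, so the read-count formulation and the function name are assumptions.

```python
def lesser_allele_fraction(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads supporting the less frequent of two alleles at a site."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads at this site")
    return min(ref_reads, alt_reads) / total


# Made-up read counts, for illustration only.
print(lesser_allele_fraction(50, 50))  # balanced heterozygous site
print(lesser_allele_fraction(0, 80))   # homozygous site
print(lesser_allele_fraction(75, 25))  # e.g. a 3:1 allele imbalance in an amplified region
```

Values between 0 and 0.5 are what one would expect in regions where the two parental copies are present in unequal numbers.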

Table 4 Calculation of the amount of LOH in BT-474
Table 5 NA12878 simulation of LOH event in BT-474

Analysis of the coding regions of a comprehensive list of known cancer-causing genes [10, 11] identified 67 small variants (<50 base pairs, Additional file 1). Most of these are probably inherited variants with no involvement in tumor formation; however, variants in TP53 and PIK3CA, previously found as somatic mutations in many tumors [12], were found in this cell line (Additional file 1). Our data also contain a potentially inherited variant in CHEK2, listed as ‘likely to be pathogenic’ in the ClinVar database [13]. To demonstrate the quality of our variant calls we compared them to a list generated by targeted sequencing of BT-474 as part of the Cancer Cell Line Encyclopedia (CCLE) project [14]. When the data from all three libraries were combined, 92 % of the variants found in CCLE were also called in our data, suggesting that our BT-474 genome is of good quality (Fig. 5 and Additional file 2). Further, 130 variants that were not found in the CCLE data were found in two or more of our libraries. This is because the exons containing these variants were not covered by the CCLE target set, because the variants were missed in the CCLE sequencing analysis, or, to a lesser extent, because they are false positives in our dataset (Additional file 2).
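A comparison of this kind reduces to set operations on variant calls. The sketch below shows one way to compute the fraction of reference calls recovered by the combined libraries and the calls seen in two or more libraries but absent from the reference set. The variant representation, function names, and toy data are assumptions for illustration, not the procedure used to produce Additional file 2.

```python
from collections import Counter

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = tuple[str, int, str, str]


def concordance(reference_calls: set[Variant], *libraries: set[Variant]) -> float:
    """Fraction of reference calls present in at least one library."""
    if not reference_calls:
        raise ValueError("reference call set is empty")
    combined = set().union(*libraries)
    return len(reference_calls & combined) / len(reference_calls)


def novel_in_two_or_more(reference_calls: set[Variant], *libraries: set[Variant]) -> set[Variant]:
    """Calls found in at least two libraries but absent from the reference set."""
    counts = Counter(v for library in libraries for v in library)
    return {v for v, n in counts.items() if n >= 2 and v not in reference_calls}


# Toy data with made-up coordinates.
a, b, c, d = ("chr1", 100, "A", "G"), ("chr2", 200, "C", "T"), ("chr3", 300, "G", "A"), ("chr4", 400, "T", "C")
e, f = ("chr5", 500, "A", "C"), ("chr6", 600, "G", "T")
ccle = {a, b, c, d}
std, lfr1, lfr2 = {a, b, e}, {a, c, e}, {a, e, f}

print(f"{concordance(ccle, std, lfr1, lfr2):.0%} of reference calls recovered")
print(sorted(novel_in_two_or_more(ccle, std, lfr1, lfr2)))
```

In practice both call sets would first need to be restricted to the same target regions and normalized to a common variant representation, which this sketch does not attempt.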

Fig. 5

Data directory tree. The output from the LFR process consists of a series of files and folders. A complete description of everything contained within the Complete Genomics data package can be found in the Additional file 3

Availability of supporting data

Complete Genomics data formats

The entire data set from Complete Genomics, provided here, consists of a series of files and directories covering various categories of whole genome analysis (Fig. 6). A complete description of all files and the methods used to generate them can be found in the “Standard Sequencing Service Data File Formats v2.5” document provided by Complete Genomics, Inc. (available in Additional file 3 [15]).

Fig. 6

Overlap of called variations between libraries and CCLE data. Variants in the standard (black), and LFR 1 (blue) and 2 (green) libraries from this study found in the genes analyzed in the CCLE study, were compared to those variants called for BT-474 by the CCLE (orange). Eighty-nine percent of variants found by CCLE (orange) in BT-474 were also found within at least one of our libraries

LFR-specific files

Data packages from LFR do not include directories for structural variation (SV) or mobile element insertion (MEI); for more information on the content of these directories, see the “Standard Sequencing Service Data File Formats v2.5” file mentioned above. In addition, one of the fields in the variant file (hapLink) is modified and six new fields are added, as described below:

  • hapLink: LFR phased variants have an ID with the pattern: “Phased_#_#_#”, where # is an integer, the first two #s describe unique contigs, and the last # in the series is either 1 or 0 and represents the two possible haplotypes for each contig. All SNPs sharing the same “Phased_#_#_#” are from the same haplotype.

  • wellCount: total number of LFR wells (out of 384) containing sequence reads calling the variant or reference allele. This metric is used to identify polymerase-induced false positive calls, since it is unlikely that random polymerase errors will occur in multiple different wells. A complete explanation of this concept can be found in Peters et al. [4].

  • wellIDs: contains the IDs of the specific wells from which reads calling the variant come.

  • exclusiveWellCount: this is the number of wells at each locus that have reads calling only the variant or the reference allele (not both). For true heterozygous variants this number should be close to that obtained for “wellCount”.

  • SharedWellCount: at each locus this is the number of wells that contain reads calling both alleles. For true heterozygous variants this should be low; having a high number here suggests mapping errors. For homozygous variants almost all of the well counts should be in this field.

  • MinExclusiveWellCountInThisLocus: this is the minimum number of exclusive wells (non-shared well counts) at each locus.

  • MaxExclusiveWellCountInThisLocus: this is the maximum number of exclusive wells (non-shared well counts) at each locus.
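To show how the hapLink and wellCount fields might be used together, the sketch below parses the “Phased_#_#_#” pattern and groups variants by haplotype after dropping calls supported by too few wells. The field names hapLink and wellCount come from the list above; everything else is an assumption for illustration, including the dictionary record layout, the `locus` key, the treatment of an empty hapLink as unphased, and the threshold of two wells.

```python
import re
from collections import defaultdict

# "Phased_#_#_#": two integers identifying the contig, then haplotype 0 or 1.
HAPLINK_RE = re.compile(r"^Phased_(\d+)_(\d+)_([01])$")


def parse_haplink(value: str):
    """Return ((contig_id_1, contig_id_2), haplotype), or None if the value is not phased."""
    match = HAPLINK_RE.match(value)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2))), int(match.group(3))


def group_by_haplotype(records, min_well_count=2):
    """Group loci by (contig, haplotype), skipping calls seen in fewer than `min_well_count` wells."""
    phased = defaultdict(list)
    for record in records:
        if record["wellCount"] < min_well_count:
            continue  # weakly supported call, possibly a polymerase-induced error
        parsed = parse_haplink(record["hapLink"])
        if parsed is not None:
            phased[parsed].append(record["locus"])
    return dict(phased)


# Made-up records; "locus" is a hypothetical identifier.
records = [
    {"locus": 101, "hapLink": "Phased_12_3_0", "wellCount": 9},
    {"locus": 102, "hapLink": "Phased_12_3_0", "wellCount": 7},
    {"locus": 103, "hapLink": "Phased_12_3_1", "wellCount": 8},
    {"locus": 104, "hapLink": "Phased_12_3_1", "wellCount": 1},
    {"locus": 105, "hapLink": "", "wellCount": 6},
]
haplotypes = group_by_haplotype(records)
for key, loci in haplotypes.items():
    print(key, loci)
```

The actual column layout of the variant file is documented in the data file formats document, so a real parser should follow that description rather than this simplified record structure.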

LFR structural variant analysis files

Each LFR genome contains an LFR-specific structural variant file in the ASM directory (see the data directory tree figure). This file is generated using a novel algorithm that identifies unexpected mate-pairs that are found in more than one compartment of an LFR library (manuscript in preparation). A full description of the headers can be found within each file under the Excel tab labeled “Header Description”.

Read and mapping data for all genomes reported here are available at the European Nucleotide Archive (ENA) under study accession number PRJEB10587. Sample accession numbers for each sequence library can be found in Table 1. Supporting data is also available from the GigaScience GigaDB database [16].



Abbreviations

LFR: Long Fragment Read technology

STD: Complete Genomics standard library

SV: Structural variations

ENA: European Nucleotide Archive

SKY: Spectral karyotype

SNV: Single nucleotide variant

CCLE: Cancer Cell Line Encyclopedia

References

  1. Lasfargues EY, Coutinho WG, Redfield ES. Isolation of two human tumor epithelial cell lines from solid breast carcinomas. J Natl Cancer Inst. 1978;61(4):967–78.

  2. Rondon-Lagos M, Verdun Di Cantogno L, Marchio C, Rangel N, Payan-Gomez C, Gugliotta P, et al. Differences and homologies of chromosomal alterations within and between breast cancer cell lines: a clustering analysis. Mol Cytogenet. 2014;7(1):8. doi:10.1186/1755-8166-7-8.

  3. Peters BA, Kermani BG, Sparks AB, Alferov O, Hong P, Alexeev A, et al. Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells. Nature. 2012;487(7406):190–5. doi:10.1038/nature11236.

  4. Peters BA, Kermani BG, Alferov O, Agarwal MR, McElwain MA, Gulbahce N, et al. Detection and phasing of single base de novo mutations in biopsies from human in vitro fertilized embryos by advanced whole-genome sequencing. Genome Res. 2015;25(3):426–34. doi:10.1101/gr.181255.114.

  5. Drmanac R, Sparks AB, Callow MJ, Halpern AL, Burns NL, Kermani BG, et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science. 2010;327(5961):78–81. doi:10.1126/science.1181498.

  6. Carnevali P, Baccash J, Halpern AL, Nazarenko I, Nilsen GB, Pant KP, et al. Computational techniques for human genome resequencing using mated gapped reads. J Comput Biol. 2012;19(3):279–92. doi:10.1089/cmb.2011.0201.

  7. Edgren H, Murumagi A, Kangaspeska S, Nicorici D, Hongisto V, Kleivi K, et al. Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biol. 2011;12(1):R6. doi:10.1186/gb-2011-12-1-r6.

  8. Kangaspeska S, Hultsch S, Edgren H, Nicorici D, Murumagi A, Kallioniemi O. Reanalysis of RNA-sequencing data reveals several additional fusion genes with multiple isoforms. PLoS One. 2012;7(10):e48745. doi:10.1371/journal.pone.0048745.

  9. Zook JM, Chapman B, Wang J, Mittelman D, Hofmann O, Hide W, et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol. 2014;32(3):246–51. doi:10.1038/nbt.2835.

  10. Lawrence MS, Stojanov P, Mermel CH, Robinson JT, Garraway LA, Golub TR, et al. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature. 2014;505(7484):495–501. doi:10.1038/nature12912.

  11. Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz Jr LA, Kinzler KW. Cancer genome landscapes. Science. 2013;339(6127):1546–58. doi:10.1126/science.1235122.

  12. COSMIC. Accessed 10/01/2015.

  13. Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 2014;42(Database issue):D980–5. doi:10.1093/nar/gkt1113.

  14. Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, Kim S, et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012;483(7391):603–7. doi:10.1038/nature11003.

  15. Complete Genomics.

  16. Ciotlos S, Mao Q, Zhang RY, Li Z, Chin R, Gulbahce N, et al. Supporting materials for “Whole genome sequence analysis of BT-474 using Complete Genomics’ standard and Long Fragment Read technologies”. GigaScience Database. 2016.


Acknowledgements

We would like to acknowledge the ongoing contributions and support of all Complete Genomics employees, in particular the many highly skilled individuals working in the libraries, reagents, and sequencing groups, who make it possible to generate high quality whole genome data.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Brock A. Peters.

Additional information

Competing interests

The authors are shareholders in BGI holdings and Complete Genomics. BGI derives income from whole genome sequencing.

Authors’ contributions

BAP, SJL, and RD conceived the study. RYZ and RC cultured cells, isolated genomic DNA, and generated all of the sequencing libraries. SC, NG, QM, and ZL processed and analyzed the data. All authors read and approved the final manuscript.

Additional files

Additional file 1:

Cancer-associated genes with variants in BT-474. (PDF 123 kb)

Additional file 2:

Comparison of calls to CCLE. (PDF 1347 kb)

Additional file 3:

Standard Sequencing Service Data File Formats v2.5. (PDF 6212 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.


About this article


Cite this article

Ciotlos, S., Mao, Q., Zhang, R.Y. et al. Whole genome sequence analysis of BT-474 using Complete Genomics’ standard and Long Fragment Read technologies. GigaSci 5, 8 (2016).
