The whole genome sequences and experimentally phased haplotypes of over 100 personal genomes
- Qing Mao†1,
- Serban Ciotlos†1,
- Rebecca Yu Zhang†1,
- Madeleine P. Ball†2, 3,
- Robert Chin1,
- Paolo Carnevali1,
- Nina Barua1,
- Staci Nguyen1,
- Misha R. Agarwal1,
- Tom Clegg2, 4,
- Abram Connelly2, 4,
- Ward Vandewege2, 4,
- Alexander Wait Zaranek2, 4,
- Preston W. Estep2,
- George M. Church2,
- Radoje Drmanac1, 5 and
- Brock A. Peters1, 5
© The Author(s). 2016
Received: 20 May 2016
Accepted: 19 September 2016
Published: 11 October 2016
Since the completion of the Human Genome Project in 2003, it is estimated that more than 200,000 individual whole human genomes have been sequenced, a stunning accomplishment in such a short period of time. However, most of these were sequenced without experimental haplotype data and are therefore missing an important aspect of genome biology. In addition, much of the genomic data is not available to the public and lacks phenotypic information.
As part of the Personal Genome Project, blood samples from 184 participants were collected and processed using Complete Genomics’ Long Fragment Read technology. Here, we present the experimental whole genome haplotyping and sequencing of these samples to an average read coverage depth of 100X. This is approximately three-fold higher than the read coverage applied to most whole human genome assemblies and ensures the highest quality results. Currently, 114 genomes from this dataset are freely available in the GigaDB repository and are associated with rich phenotypic data; the remaining 70 should be added in the near future as they are approved through the PGP data release process. For reproducibility analyses, 20 genomes were sequenced at least twice using independent LFR barcoded libraries. Seven genomes were also sequenced using Complete Genomics’ standard non-barcoded library process. In addition, we report 2.6 million high-quality, rare variants not previously identified in the Single Nucleotide Polymorphism database (dbSNP) or the 1000 Genomes Project Phase 3 data.
These genomes represent a unique source of haplotype and phenotype data for the scientific community and should help to expand our understanding of human genome evolution and function.
Keywords: Complete Genomics, Haplotypes, Long fragment read, LFR, Personal Genome Project, PGP, Whole genome sequencing
Utility of the dataset
In 2003, after 13 years of dedicated research and the public release of the reference human genome – a high-quality genome against which all later genome sequences would be compared – the Human Genome Project was officially declared complete, representing a stunning achievement for science and humanity. Since then, DNA sequencing technologies have rapidly improved and the cost of sequencing has outpaced Moore’s Law for almost the past 10 years. More than 200,000 human genomes have been sequenced during this time but unfortunately, because an individual can be identified from their genome sequence, issues of anonymity have caused much of this data to sit in limited access databases. In addition, many datasets that are available to the public lack rich phenotypic data and so are of limited use. Finally, haplotype data has only been resolved for a small number of genomes and this important aspect of biology has been almost completely ignored in many studies. The genomic data published in this dataset represents the largest set of freely accessible whole individual genome sequences with phenotype and experimental haplotype information and should help to further our understanding of human biology.
As part of the Personal Genome Project (PGP), blood samples from 184 participants were collected and processed using Complete Genomics’ Long Fragment Read (LFR) technology (Additional file 1). These participants gave full consent to have their genotypic and phenotypic data (Additional file 2) made freely and publicly available. Documents reviewed and signed by each participant can be found at . Each PGP participant is given an opportunity to review their genome data and decide if they still wish to make it public. This process increases the time and uncertainty of data release and is the reason the complete set is currently not available. However, it is expected that the majority of these datasets will be released soon.
Blood was thawed at Complete Genomics and 3 ml was used for DNA isolation with a RecoverEase dialysis kit (Agilent, Santa Clara, CA, USA). The remaining ~2 ml was frozen for later sample identification confirmation. High molecular weight genomic DNA was lightly fragmented by pipetting in and out of a P1000 pipette tip (Rainin Instruments LLC, Oakland, CA) 20 to 40 times. The final DNA concentration was measured using a Quant-iT™ Broad-Range dsDNA Assay Kit (Thermo Fisher Scientific, Waltham, MA). DNA samples were normalized to 10 ng/μl and stored at 4 °C.
Library preparation and sequencing
Approximately 200 pg of genomic DNA (equivalent to the amount of DNA present in fewer than 30 cells) from each sample was used to make an LFR library for sequencing. To make these libraries, about 5 ng of the genomic DNA was first denatured in 160 mM KOH and 1.0 mM EDTA in a total volume of 100 μl. After a 5-min incubation at 20 °C, 7 μl of this material was transferred to 26 μl of a 1 mM concentration of random DNA 8-mers in dH2O. After 2 min, an additional 32 μl of dH2O was added to the 8-mer solution, resulting in a concentration of ~5.4 pg/μl of DNA and 400 μM of 8-mer in a final volume of 65 μl. 100 nl of this mixture was dispensed into each well of a 384-well plate using a Mosquito® HTS Nanoliter Liquid Handler (TTP Labtech, Cambridge, UK) such that the final amount of DNA in each well was ~0.54 pg (~200 pg per LFR library). Multiple displacement amplification (MDA) mix was added to a final volume of 1 μl to amplify the long genomic DNA fragments. After amplification, controlled random enzymatic (CoRE) fragmenting was performed and 300–1500 base-pair fragments were ligated to barcoded adapters unique to each well as previously described. All libraries were sequenced using Complete Genomics’ nanoarray sequencing platform. Both whole genome sequences and experimental phasing were obtained from each single “co-barcoded” library generated from the LFR process. For seven samples, an additional 5 μg of genomic DNA was used for standard libraries. Standard libraries were processed and sequenced as previously described.
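The dilution figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the quantities come from the protocol text, while the variable names are illustrative and not part of any published pipeline.

```python
# Quantities from the protocol text; variable names are illustrative only.
input_dna_pg = 5_000.0                 # about 5 ng of genomic DNA
denature_volume_ul = 100.0             # KOH/EDTA denaturation volume
transfer_ul = 7.0                      # aliquot moved into the 8-mer solution
octamer_volume_ul = 26.0               # 1 mM random 8-mer solution
water_ul = 32.0                        # additional dH2O
dispense_ul = 0.1                      # 100 nl dispensed per well
wells = 384

final_volume_ul = transfer_ul + octamer_volume_ul + water_ul   # 65 ul
stock_conc = input_dna_pg / denature_volume_ul                 # 50 pg/ul
dna_transferred_pg = stock_conc * transfer_ul                  # 350 pg
final_conc = dna_transferred_pg / final_volume_ul              # ~5.4 pg/ul
per_well_pg = final_conc * dispense_ul                         # ~0.54 pg
per_library_pg = per_well_pg * wells                           # ~207 pg
octamer_um = 1000.0 * octamer_volume_ul / final_volume_ul      # 400 uM

print(f"{final_conc:.2f} pg/ul, {per_well_pg:.2f} pg/well, "
      f"{per_library_pg:.0f} pg/library, {octamer_um:.0f} uM 8-mer")
```

The per-library total of roughly 207 pg agrees with the "~200 pg per LFR library" stated in the text.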
Complete Genomics’ standard analysis pipeline and data formats
LFR analysis pipeline
The unphased fragment map can be obtained from this coverage map using a simple algorithm that looks for streaks of several consecutive, well-populated 1 kb bins in the same well. This process allows for some missing coverage within a fragment due to non-unique regions of the genome that did not generate unique mappings. Fragments shorter than 10 kb are discarded. When this is done, we have a list of start and end positions for each fragment in each well – that is, an unphased fragment map. However, we do not yet know which wells, at a given reference position, contain the mate-pair reads that originated in each of the alleles.
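The streak-finding step described above might be sketched as follows for a single well. This is an illustration under stated assumptions, not the pipeline's actual code: the function name `find_fragments`, the read-count threshold `min_reads`, and the gap tolerance `max_gap_bins` are hypothetical; only the 1 kb bin size and the 10 kb minimum fragment length come from the text.

```python
def find_fragments(bin_counts, min_reads=2, max_gap_bins=2, min_length_kb=10):
    """Return (start_kb, end_kb) half-open intervals for one well.

    bin_counts[i] is the number of reads mapped to the i-th 1 kb bin.
    min_reads and max_gap_bins are assumed thresholds for illustration.
    """
    fragments = []
    start = None   # first well-populated bin of the current streak
    last = None    # most recent well-populated bin
    for i, count in enumerate(bin_counts):
        if count < min_reads:
            continue                      # poorly populated bin: possible gap
        if start is None:
            start = last = i              # open a new streak
        elif i - last - 1 <= max_gap_bins:
            last = i                      # short gap tolerated: extend streak
        else:
            if last - start + 1 >= min_length_kb:
                fragments.append((start, last + 1))
            start = last = i              # gap too long: start a new streak
    if start is not None and last - start + 1 >= min_length_kb:
        fragments.append((start, last + 1))
    return fragments


# One 13 kb streak with a 2 kb gap, then a 4 kb streak that is discarded.
counts = [0] * 5 + [3] * 6 + [0] * 2 + [4] * 5 + [0] * 10 + [5] * 4 + [0] * 3
print(find_fragments(counts))
```

Running the function over every well gives a list of start and end positions per well, which is the shape of the unphased fragment map described in the text.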
We can use this information to construct a fragment-length distribution, which can be closely approximated by a decaying exponential with a decay length typically between 20 and 50 kb. This is consistent with simple DNA fragmentation models (e.g. breakage occurs with fixed probability at each location and in an uncorrelated fashion), which invariably predict exponential distributions of fragment lengths. A fragment coverage histogram – i.e. the distribution of the number of fragments covering each fragment location – can also be computed from this data. This is well-approximated by a Poisson distribution.
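As a rough illustration of these two distributions, the sketch below simulates exponentially distributed fragment lengths and recovers the decay length after the 10 kb cut-off. It relies on a standard property of the exponential distribution (memorylessness): for lengths truncated below at a threshold, the maximum-likelihood decay length is the mean of the retained lengths minus the threshold. The function names and the simulated decay length of 30 kb are assumptions chosen for the example, not values taken from the dataset.

```python
import math
import random


def estimate_decay_length(lengths_kb, min_length_kb=10.0):
    """Estimate the exponential decay length from lengths >= min_length_kb."""
    kept = [x for x in lengths_kb if x >= min_length_kb]
    return sum(kept) / len(kept) - min_length_kb


def poisson_pmf(k, mean_coverage):
    """Probability that exactly k fragments cover a location."""
    return math.exp(-mean_coverage) * mean_coverage ** k / math.factorial(k)


rng = random.Random(0)                 # fixed seed for repeatability
true_decay_kb = 30.0                   # assumed value within the 20-50 kb range
simulated = [rng.expovariate(1.0 / true_decay_kb) for _ in range(20000)]
estimate = estimate_decay_length(simulated)
print(f"estimated decay length: {estimate:.1f} kb")
print(f"P(coverage = 0) at mean coverage 5: {poisson_pmf(0, 5.0):.4f}")
```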
Next, the unphased fragment map is turned into a phased fragment map. This requires a second mapping step to be performed. Instead of mapping to the entire reference genome, we only map mate-pair reads in each well to the subset of the reference that is covered by the fragments in that well as determined by the unphased fragment map. In this second mapping step, only unique mappings with a single mismatch in each arm are allowed. All discordances between the reference and the mate-pair reads are tracked. Most of these are due to errors, but some of them are caused by heterozygous or homozygous single nucleotide polymorphisms (SNPs). To phase the fragments, a set of genomic positions that have strong support for two different base calls is collected. These are locations where a heterozygous SNP is highly likely to be present. Each of these “strong” heterozygous SNPs provides the two bases that correspond to the two alleles. From the second mapping step, we know which wells/fragments contain each of the two bases, and we can use this information to assign each fragment/well to an allele, meaning that we have turned the unphased fragment map into a phased fragment map. In an ideal situation, for a given SNP, all of the fragment/wells containing each of the two bases would be assigned to the same allele. However, because of errors and other artifacts, the well assignments to alleles will include contradictions. The algorithm that performs the allele assignment attempts to minimize the number of contradictions. If too many contradictions are present for a given fragment/well, the fragment is not assigned to any allele. In addition, in regions of low heterozygosity there are not enough SNPs to perform this assignment reliably. This can cause breaks in the allele assignment where it is not possible to relate the allele assignments at a given location to those some distance away. 
Each of the regions where it is possible to reliably assign the fragments to alleles without breaks is called a phased contig. Phased contigs are typically from a fraction of an Mb to a few Mbs in length.
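The idea of assigning fragments to alleles while minimizing contradictions can be illustrated with a toy example. The sketch below is not the algorithm used in the pipeline, which is not specified here; it is an exhaustive search that is only practical for a handful of SNPs. The function name `phase_fragments`, the input layout, and the `max_contradictions` threshold are all hypothetical. At each strong heterozygous SNP the two observed bases are labelled 0 and 1.

```python
from itertools import product


def phase_fragments(observations, max_contradictions=1):
    """Toy exhaustive phasing.

    observations: {fragment_id: {snp_position: base_label (0 or 1)}}
    Returns (total_contradictions, phase, assignment), where phase maps each
    SNP to the base label carried by allele 0 and assignment maps each
    fragment to allele 0, allele 1, or None when it is too contradictory.
    """
    snps = sorted({pos for obs in observations.values() for pos in obs})
    if not snps:
        return 0, {}, {frag: None for frag in observations}
    best = None
    # Fix the first SNP to remove the trivial allele-swap symmetry.
    for rest in product((0, 1), repeat=len(snps) - 1):
        phase = dict(zip(snps, (0,) + rest))
        total = 0
        assignment = {}
        for frag, obs in observations.items():
            mism_a0 = sum(1 for pos, base in obs.items() if base != phase[pos])
            mism_a1 = len(obs) - mism_a0
            allele, mism = (0, mism_a0) if mism_a0 <= mism_a1 else (1, mism_a1)
            assignment[frag] = allele if mism <= max_contradictions else None
            total += mism
        if best is None or total < best[0]:
            best = (total, phase, assignment)
    return best


fragments = {
    "A": {100: 0, 200: 0, 300: 1},
    "B": {100: 1, 200: 1, 300: 0},
    "C": {200: 0, 300: 1, 400: 0},
    "D": {300: 0, 400: 1},
    "E": {100: 0, 200: 0, 300: 0, 400: 1},   # internally inconsistent fragment
}
total, phase, assignment = phase_fragments(fragments)
print(total, phase, assignment)
```

In this example fragments A and C fall on one allele, B and D on the other, and E is left unassigned because it contradicts the consensus at two positions.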
Here, μ is small (we use μ = 0.01) and reflects the uncertainty of the fragment phasing process. As a result of this change, it is no longer the case that the probability of a hypothesis is invariant when the two alleles are swapped – in other words, hypothesis evaluation automatically takes phasing into account. The sequence optimization process works as in the standard formulation, taking into account this phasing-sensitive hypothesis probability at every step in the process. The benefit of this is two-fold: variant calls are intrinsically phased within each phased contig and the accuracy of variant calls is improved because of the additional information provided by the phased fragment map.
Long Fragment Read-specific fields
LFR phased variants have an ID with this pattern “Phased_#_#_#”, where # is an integer, the first two #s describe unique contigs, and the last # in the series is either 1 or 0 and represents the two possible haplotypes for each contig. All SNPs sharing the same “Phased_#_#_#” are from the same haplotype.
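A short sketch of how such IDs could be parsed and used to group variants by haplotype is given below. The pattern follows the description above; the helper names `parse_phased_id` and `group_by_haplotype`, and the example IDs and positions, are illustrative assumptions rather than part of any Complete Genomics tool.

```python
import re
from collections import defaultdict

# "Phased_#_#_#": two integers naming the contig, then a haplotype of 0 or 1.
PHASED_ID = re.compile(r"^Phased_(\d+)_(\d+)_([01])$")


def parse_phased_id(variant_id):
    """Return ((contig_a, contig_b), haplotype), or None if not a phased ID."""
    match = PHASED_ID.match(variant_id)
    if match is None:
        return None
    contig = (int(match.group(1)), int(match.group(2)))
    return contig, int(match.group(3))


def group_by_haplotype(variants):
    """Group (variant_id, position) pairs by (contig, haplotype)."""
    groups = defaultdict(list)
    for variant_id, position in variants:
        parsed = parse_phased_id(variant_id)
        if parsed is not None:
            groups[parsed].append(position)
    return dict(groups)


example = [
    ("Phased_3_12_1", 10500),
    ("Phased_3_12_0", 10750),
    ("Phased_3_12_1", 11200),
    ("unphased", 12000),
]
print(group_by_haplotype(example))
```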
Total number of LFR wells (out of 384) containing sequence reads calling the variant or reference allele. This metric is used to filter polymerase-induced false positive calls as it is unlikely that random polymerase errors will occur in several different wells. A complete explanation of this concept can be found in Peters et al.
Contains the IDs of the specific wells from which reads calling the variant originate.
At each locus, this is the number of wells that have reads only calling the variant or the reference allele, not both; for true heterozygous variants, this number should be close to “WellCount”.
At each locus, this is the number of wells that contain reads calling both alleles; for true heterozygous variants, this should be low. A high number here suggests mapping errors. For homozygous variants, almost all of the well counts should be in this field.
At each locus, this is the minimum number of exclusive wells (non-shared well counts).
At each locus, this is the maximum number of exclusive wells (non-shared well counts).
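The relationship between these well-based metrics can be sketched from two sets of well IDs, one per allele. Everything in this example is an assumption made for illustration: the function name, the dictionary keys, and in particular the reading of the minimum and maximum exclusive counts as the smaller and larger of the two per-allele exclusive well counts. It does not reproduce the pipeline's own calculation.

```python
def well_metrics(variant_wells, reference_wells):
    """Summarise well support at one locus (hypothetical key names).

    variant_wells / reference_wells: sets of well IDs (1..384) whose reads
    call the variant or the reference allele, respectively.
    """
    variant_only = variant_wells - reference_wells
    reference_only = reference_wells - variant_wells
    shared = variant_wells & reference_wells
    return {
        "total_wells": len(variant_wells | reference_wells),
        "exclusive_wells": len(variant_only) + len(reference_only),
        "shared_wells": len(shared),
        # Assumed interpretation: smaller and larger per-allele exclusive counts.
        "min_exclusive_wells": min(len(variant_only), len(reference_only)),
        "max_exclusive_wells": max(len(variant_only), len(reference_only)),
    }


metrics = well_metrics({1, 2, 3, 7}, {3, 4, 5, 6, 8})
print(metrics)
```

For a true heterozygous variant under this reading, the exclusive count is close to the total and the shared count is low, as in the example where only well 3 contains both alleles.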
Genomic data quality
LFR: Long fragment read technology
PGP: Personal Genome Project
SNP: Single nucleotide polymorphism
We would like to acknowledge the ongoing contributions and support of all Complete Genomics employees, in particular the many highly skilled individuals who work in the libraries, reagents, and sequencing groups that make it possible to generate high-quality whole genome data.
Availability of data and materials
Read and mapping data for all genomes reported here are available at the database of Genotypes and Phenotypes (dbGaP) under study accession number phs000905.v1.p1. In addition, this data will be available at the Short Read Archive (SRA) once it is approved through the PGP data release process. The full data package, minus reads and mappings, associated with this publication is accessible through the GigaScience repository, GigaDB, and on the PGP website. Additional file 1 lists the assembly name, PGP participant ID, and GigaDB file name for each genome made available through this data release. Currently 114 genomes are available, and the remaining 70 will be added to this data archive as they are approved through the PGP data release process.
GMC and RD conceived the study. RYZ and RC isolated genomic DNA and generated all of the sequencing libraries. NB managed the sequencing of the libraries and PC developed the algorithms and pipeline necessary for processing LFR libraries. SC, QM, SN, and BAP processed the data. QM and BAP performed genome analyses. BAP, PWE and MPB coordinated the study. AWZ, TC, AC, WV, MRA and MPB managed data loading into the PGP database. BAP wrote the manuscript. All authors read and approved the final manuscript.
Employees of Complete Genomics are shareholders in BGI holdings and Complete Genomics and BGI derive income from whole genome sequencing.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Hayden EC. Technology: the $1,000 genome. Nature. 2014;507(7492):294–5. doi:10.1038/507294a.
- Personal Genome Project. http://www.personalgenomes.org/harvard/sign-up.
- Peters BA, Kermani BG, Sparks AB, Alferov O, Hong P, Alexeev A, et al. Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells. Nature. 2012;487(7406):190–5. doi:10.1038/nature11236.
- Dean FB, Hosono S, Fang L, Wu X, Faruqi AF, Bray-Ward P, et al. Comprehensive human genome amplification using multiple displacement amplification. Proc Natl Acad Sci U S A. 2002;99(8):5261–6. doi:10.1073/pnas.082089499.
- Drmanac R, Sparks AB, Callow MJ, Halpern AL, Burns NL, Kermani BG, et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science. 2010;327(5961):78–81. doi:10.1126/science.1181498.
- Peters BA, Liu J, Drmanac R. Co-barcoded sequence reads from long DNA fragments: a cost-effective solution for “perfect genome” sequencing. Front Genet. 2014;5:466. doi:10.3389/fgene.2014.00466.
- Complete Genomics Support Documentation. http://www.completegenomics.com/customer-support/documentation/100357139-2/.
- Carnevali P, Baccash J, Halpern AL, Nazarenko I, Nilsen GB, Pant KP, et al. Computational techniques for human genome resequencing using mated gapped reads. J Comput Biol. 2012;19(3):279–92. doi:10.1089/cmb.2011.0201.
- 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. doi:10.1038/nature15393.
- Genome of the Netherlands Consortium. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat Genet. 2014;46(8):818–25. doi:10.1038/ng.3021.
- UK10K Consortium, Walter K, Min JL, Huang J, Crooks L, Memari Y, et al. The UK10K project identifies rare variants in health and disease. Nature. 2015;526(7571):82–90. doi:10.1038/nature14962.
- Kong A, Frigge ML, Masson G, Besenbacher S, Sulem P, Magnusson G, et al. Rate of de novo mutations and the importance of father’s age to disease risk. Nature. 2012;488(7412):471–5. doi:10.1038/nature11396.
- Jiang YH, Yuen RK, Jin X, Wang M, Chen N, Wu X, et al. Detection of clinically relevant genetic variants in autism spectrum disorder by whole-genome sequencing. Am J Hum Genet. 2013;93(2):249–63. doi:10.1016/j.ajhg.2013.06.012.
- Francioli LC, Polak PP, Koren A, Menelaou A, Chun S, Renkens I, et al. Genome-wide patterns and properties of de novo mutations in humans. Nat Genet. 2015;47(7):822–6. doi:10.1038/ng.3292.
- Wong WS, Solomon BD, Bodian DL, Kothiyal P, Eley G, Huddleston KC, et al. New observations on maternal age effect on germline de novo mutations. Nat Commun. 2016;7:10486. doi:10.1038/ncomms10486.
- Peters BA, Kermani BG, Alferov O, Agarwal MR, McElwain MA, Gulbahce N, et al. Detection and phasing of single base de novo mutations in biopsies from human in vitro fertilized embryos by advanced whole-genome sequencing. Genome Res. 2015;25(3):426–34. doi:10.1101/gr.181255.114.
- Telenti A, Pierce LT, Biggs WH, di Iulio J, Wong EHM, Fabani MM et al. Deep Sequencing of 10,000 Human Genomes. bioRxiv. 2016. doi:10.1101/061663.
- dbGaP accession. http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000905.v1.p1.
- Mao Q, Ciotlos S, Zhang RY, Ball MP, Chin R, Carnevali P, Barua N, Nguyen S, Agarwal MR, Clegg T, Connelly A, Vandewege W, Zaranek AW, Estep PW, Church GM, Drmanac R, Peters BA. Supporting data for “The Whole Genome Sequences and Experimentally Phased Haplotypes of over 100 Personal Genomes” GigaScience Database. 2016. http://dx.doi.org/10.5524/100242
- PGP data. https://my.pgp-hms.org/public_genetic_data.
- Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics. 2012;28(24):3326–8. doi:10.1093/bioinformatics/bts606.