Secondgeneration PLINK: rising to the challenge of larger and richer datasets
 Christopher C Chang^{1, 2}Email author,
 Carson C Chow^{3},
 Laurent CAM Tellier^{2, 4},
 Shashaank Vattikuti^{3},
 Shaun M Purcell^{5, 6, 7, 8} and
 James J Lee^{3, 9}
DOI: 10.1186/s1374201500478
© Chang et al.; licensee BioMed Central. 2015
Received: 16 October 2014
Accepted: 26 January 2015
Published: 25 February 2015
Abstract
Background
PLINK 1 is a widely used opensource C/C++ toolset for genomewide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and wholegenome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and populationgenetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format.
Findings
To address these issues, we are developing a secondgeneration codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bitlevel parallelism, \(O\left (\sqrt {n}\right)\)time/constantspace HardyWeinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 14 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds lowoverhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0).
Conclusions
The secondgeneration versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to highend computing resources can perform several essential analyses of the featurerich and very large genetic datasets coming into use.
Keywords
GWAS Population genetics Wholegenome sequencing Highdensity SNP genotyping Computational statisticsFindings
Because of its broad functionality and efficient binary file format, PLINK is widely employed in dataprocessing pipelines that are established for genetrait mapping and populationgenetic studies. However, the five years since the final firstgeneration update (v1.07), however, have witnessed the introduction of new algorithms and analytical approaches, the growth in size of typical datasets, as well as wide deployment of multicore processors.
In response, we have developed PLINK 1.9, a comprehensive performance, scaling, and usability update. Our data indicate that its speedups frequently exceed two, and sometimes even three, orders of magnitude for several commonly used operations. PLINK 1.9’s core functional domains are unchanged from that of its predecessor—data management, summary statistics, population stratification, association analysis, identitybydescent estimation [1] —and it is usable as a dropin replacement in most cases, requiring no changes to existing scripts. To support easier interoperation with newer software, for example BEAGLE 4 [2], IMPUTE2 [3], GATK [4], VCFtools [5], BCFtools [6] and GCTA [7], features such as the import/export of VCF and Oxfordformat files and an efficient crossplatform genomic relationship matrix (GRM) calculator have been introduced. Most pipelines currently employing PLINK 1.07 can expect to benefit from upgrading to PLINK 1.9.
A major problem remains: PLINK’s core file format can only represent unphased, biallelic data; however we are developing a second update, PLINK 2.0, to address this.
Improvements in PLINK 1.9
Bitlevel parallelism
Modern ×86 processors are designed to operate on data in (usually 64bit) machine word or (≥ 128bit) vector chunks. The PLINK 1 binary file format supports this well: the format’s packed 2bit data elements can, with the use of bit arithmetic, easily be processed 32 or 64 at a time. However, most existing programs fail to exploit opportunities for bitlevel parallelism; instead their loops painstakingly extract and operate on a single data element at a time. Replacement of these loops with bitparallel logic is, by itself, enough to speed up numerous operations by more than one order of magnitude.

If a _{ i }=ϕ or b _{ i }=ϕ, skip

otherwise, if a _{ i }=b _{ i }, increment IBS2

otherwise, if (a _{ i }=2 and b _{ i }=0), or (a _{ i }=0 and b _{ i }=2), increment IBS0

otherwise, increment IBS1
Return \(\frac {0\cdot \text {IBS}0 + 1\cdot \text {IBS}1 + 2\cdot \text {IBS}2}{2\cdot (\text {IBS}0 + \text {IBS}1 + \text {IBS}2)}\)

E:=A _{ i..i+959} XOR B _{ i..i+959}

F:=C _{ i..i+959} AND D _{ i..i+959}

diff := diff + popcount(E AND F)

obs := obs + popcount(F)
Return \(\frac {\text {obs}  \text {diff}}{\text {obs}}\).
The idea is that ({C _{ i }}AND {D _{ i }}) yields a bit vector with two ones for every marker where genotype data is present for both samples, and two 0 s elsewhere, so 2I _{ a,b } is equal to the number of ones in that bit vector; while (({A _{ i }}XOR {B _{ i }})AND {C _{ i }}AND {D _{ i }}) yields a bit vector with a 1 for every nucleotide difference. Refer to Additional file 1 [8] for more computational details. Our timing data (see “Performance comparisons” below) indicate that this algorithm takes less than twice as long to handle a 960marker block as PLINK 1.07 takes to handle a single marker.
Bit population count
The “popcount” function above, defined as the number of ones in a bit vector, merits further discussion. Post2008 x86 processors support a specialized instruction that directly evaluates this quantity. However, thanks to 50 years of work on the problem, algorithms exist which evaluate bit population count nearly as quickly as the hardware instruction while sticking to universally available operations. Since PLINK is still used on some older machines, we took one such algorithm (previously discussed and refined by [9]), and developed an improved SSE2based implementation. (Note that SSE2 vector instructions are supported by even the oldest x8664 processors).
 1.
Set Z := (X OR Y) AND 01010101… _{2}
 2.Evaluate

popcount2(((X XOR Y) AND (10101010… _{2}  Z)) OR Z),
where popcount2() sums 2bit quantities instead of counting set bits. (This is actually cheaper than PLINK’s regular population count; the first step of software popcount() is reduction to a popcount2() problem).

 3.
Subtract the latter quantity from n.
The key insight behind this implementation is that each v _{ i } w _{ i } term is in {−1,0,1}, and can still be represented in 2 bits in an additionfriendly manner. (This is not strictly necessary for bitwise parallel processing—the partial sum lookup algorithm discussed later handles 3bit outputs by padding the raw input data to 3 bits per genotype call—but it allows for unusually high efficiency). The exact sequence of operations that we chose to evaluate the dotproduct terms in a bitwise parallel fashion is somewhat arbitrary.
We note that when computing a matrix of correlation coefficients between all pairs of variants, if no genotype data is absent, then I _{ x,y } is invariant, \(\overline {v}\) and \(\overline {v^{2}}\) do not depend on y, and \(\overline {w}\) and \(\overline {w^{2}}\) do not depend on x. Thus, these five values would not need to be recomputed for each variant pair at O(m ^{2} n) total time cost; they could instead be precomputed outside the main loop at a total cost of O(m n) time and O(m) space. PLINK 1.9 optimizes this common case.
See popcount_longs() in plink_common.c for our primary bit population count function, and plink_ld.c for several correlation coefficient evaluation functions.
Multicore and cluster parallelism
Modern x86 processors also contain increasing numbers of cores, and computational workloads in genetic studies tend to contain large “embarrassingly parallel” steps which can easily exploit additional cores. Therefore, PLINK 1.9 autodetects the number of cores present in the machine it is running on, and many of its heavyduty operations default to employing roughly that number of threads. (This behavior can be manually controlled with the –threads flag.) Most of PLINK 1.9’s multithreaded computations use a simple set of crossplatform C functions and macros, which compile to pthread library idioms on Linux and OS X, and OSspecific idioms like _beginthreadex() on Windows.
PLINK 1.9 also contains improved support for distributed computation: the –parallel flag makes it easy to split large matrix computations across a cluster, while –writevarranges simplifies splitting of pervariant computations.
Graphics processing units (GPUs) remain as a major unexploited computational resource. We have made the development of GPUspecific code a low priority since their installed base is much smaller than that of multicore processors, and the speedup factor over wellwritten multithreaded code running on similarcost, less specialized hardware is usually less than 10x [10,11]. However, we do plan to build out GPU support for the heaviestduty computations after most of our other PLINK 2 development goals are achieved.
Memory efficiency
To make it possible for PLINK 1.9 to handle the huge datasets that benefit the most from these speed improvements, the program core no longer keeps the main genomic data matrix in memory; instead, most of its functions only load data for a single variant, or a small window of variants, at a time. Sample × sample matrix computations still normally require additional memory proportional to the square of the sample size, but –parallel gets around this:
calculates 1/40th of the genomic relationship matrix per run, with correspondingly reduced memory requirements.
Other noteworthy algorithms
Partial sum lookup
 1.
Both genotypes are homozygous for the major allele.
 2.
One is homozygous major, and the other is heterozygous.
 3.
One is homozygous major, and the other is homozygous minor.
 4.
Both are heterozygous.
 5.
One is heterozygous, and the other is homozygous minor.
 6.
Both are homozygous minor.
 7.
At least one genotype is missing.
 1.
\(\frac {(22q)(22q)}{2q(1q)}\)
 2.
\(\frac {(22q)(12q)}{2q(1q)}\)
 3.
\(\frac {(22q)(02q)}{2q(1q)}\)
 4.
\(\frac {(12q)(12q)}{2q(1q)}\)
 5.
\(\frac {(12q)(02q)}{2q(1q)}\)
 6.
\(\frac {(02q)(02q)}{2q(1q)}\)
 7.
0; subtract 1 from the final denominator instead, in another loop
 1.
Initialize all distance/relationship partial sums to zero.
 2.
For each marker, calculate and save the seven possible increments in a lookup table, and then refer to the table when updating partial sums. This replaces several floating point adds/multiplies in the inner loop with a single addition operation.
We can substantially improve on this by handling multiple markers at a time. Since seven cases can be distinguished by three bits, we can compose a sequence of operations which maps a pair of padded 2bit genotypes to seven different 3bit values in the appropriate manner. On 64bit machines, 20 3bit values can be packed into a machine word—for example, let bits 02 describe the relation at marker #0, bits 35 describe the relation at marker #1, and so forth, all the way up to bits 5759 describing the relation at marker #19—so this representation lets us instruct the processor to act on 20 markers simultaneously.
HardyWeinberg equilibrium and Fisher’s exact tests

Its size O(n) memory allocation (where n is the sum of all contingency table entries) could be avoided by reordering the calculation; it is only necessary to track a few partial sums.

Since likelihoods decay supergeometrically as one moves away from the most probable table, only \(O(\sqrt {n})\) of the likelihoods can meaningfully impact the partial sums; the sum of the remaining terms is too small to consistently affect even the 10th significant digit in the final pvalue. By terminating the calculation when all the partial sums stop changing (due to the newest term being too tiny to be tracked by IEEE754 doubleprecision numbers), computational complexity is reduced from O(n) to \(O(\sqrt {n})\) with no loss of precision. See Figure 1 for an example.
Standalone source code for earlytermination SNPHWE and Fisher’s 2×2/ 2×3 exact test is posted at [18]. Due to recent calls for use of midp adjustments in biostatistics [19,20], all of these functions have midp modes, and PLINK 1.9 exposes them.
We note that, while the HardyWeinberg equilibrium exact test is only of interest to geneticists, Fisher’s exact test has wider application. Thus, we are preparing another paper which discusses these algorithms in more detail, with proofs of numerical error bounds and a full explanation of how the Fisher’s exact test algorithm extends to larger tables.
Haplotype block estimation
It can be useful to divide the genome into blocks of variants which appear to be inherited together most of the time, since observed recombination patterns are substantially more “blocklike” than would be expected under a model of uniform recombination [21]. PLINK 1.0’s –blocks command implements a method of identifying these haplotype blocks by Gabriel et al. [22]. (More precisely, it is a restricted port of Haploview’s [23] implementation of the method).
This method is based on 90% confidence intervals (as defined by Wall and Pritchard [21]) for Lewontin’s D ^{′} disequilibrium statistic for pairs of variants. Depending on the confidence interval’s boundaries, a pair of variants is classified as “strong linkage disequilibrium (LD)”, “strong evidence for historical recombination”, or “inconclusive”; then, contiguous groups of variants where “strong LD” pairs outnumber “recombination” pairs by more than 19 to 1 are greedily selected, starting with the longest basepair spans.

Estimation of diplotype frequencies and maximumlikelihood D ^{′} has been streamlined. Bit population counts are used to fill the contingency table; then we use the analytic solution to Hill’s diplotype frequency cubic equation [24,25] and only compute and compare log likelihoods in this step when multiple solutions to the equation are in the valid range.

90% confidence intervals were originally estimated by computing relative likelihoods at 101 points (corresponding to D ^{′}=0,D ^{′}=0.01,…,D ^{′}=1) and checking where the resulting cumulative distribution function (cdf) crossed 5% and 95%. However, the likelihood function rarely has more than one extreme point in (0,1) (and the full solution to the cubic equation reveals the presence of additional extrema); it is usually possible to exploit this unimodality to establish good bounds on key cdf values after evaluating just a few likelihoods. In particular, many confidence intervals can be classified as “recombination” after inspection of just two of the 101 points; see Figure 3.

Instead of saving the classification of every variant pair and looking up the resulting massive table at a later point, we just update a small number of “strong LD pairs within last k variants” and “recombination pairs within last k variants” counts while processing the data sequentially, saving only final haploblock candidates. This reduces the amount of time spent looking up outofcache memory, and also allows much larger datasets to be processed.

Since “strong LD” pairs must outnumber “recombination” pairs by 19 to 1, it does not take many “recombination” pairs in a window before one can prove no haploblock can contain that window. When this bound is crossed, we take the opportunity to entirely skip classification of many pairs of variants.
Most of these ideas are implemented in haploview_ blocks_classify() and haploview_blocks() in plink_ld.c. The last two optimizations were previously implemented in Taliun’s “LDExplorer” R package [26].
Coordinatedescent LASSO
PLINK 1.9 includes a basic coordinatedescent LASSO implementation [27] (–lasso), which can be useful for phenotypic prediction and related applications. See Vattikuti et al. for discussion of its theoretical properties [28].
Newly integrated thirdparty software
PLINK 1.0 commands

Pahl et al.’s PERMORY algorithm for fast permutation testing [29],

Wan et al.’s BOOST software for fast epistasis testing [30],

Ueki, Cordell, and Howey’s –fastepistasis variance correction and jointeffects test [31,32],

Taliun, Gamper, and Pattaro’s optimizations to Gabriel et al.’s haplotype block identification algorithm (discussed above) [26], and

Pascal Pons’s winning submission to the GWAS Speedup logistic regression crowdsourcing contest [33]. (The contest was designed by PoRu Loh, run by Babbage Analytics & Innovation and TopCoder, and subsequent analysis and code preparation were performed by Andrew Hill, Ragu Bharadwaj, and Scott Jelinsky. A manuscript is in preparation by these authors and Iain Kilty, Kevin Boudreau, Karim Lakhani and Eva Guinan.)
In all such cases, PLINK’s citation instructions direct users of the affected functions to cite the original work.
Multithreaded gzip
For many purposes, compressed text files strike a good balance between ease of interpretation, loading speed, and resource consumption. However, the computational cost of generating them is fairly high; it is not uncommon for data compression to take longer than all other operations combined. To make a dent in this bottleneck, we have written a simple multithreaded compression library function based on Mark Adler’s excellent pigz program [34], and routed most of PLINK 1.9’s gzipping through it. See parallel_compress() in pigz.c for details.
Convenience features
Import and export of Variant Call Format (VCF) and Oxfordformatted data

With Oxfordformat files, genotype likelihoods smaller than 0.9 are normally treated as missing calls, and the rest are treated as hard calls. –hardcallthreshold can be used to change the threshold, or request independent pseudorandom calls based on the likelihoods in the file.

Phase is discarded.

By default, when a VCF variant has more than one alternate allele, only the most common alternate is retained; all other alternate calls are converted to missing. –bialleliconly can be used to skip variants with multiple alternate alleles.
Export to these formats is also possible, via –recode vcf and –recode oxford.
Unplaced contig and nonhuman species support
When the –allowextrachr or –aec flag is used, PLINK 1.9 allows datasets to contain unplaced contigs or other arbitrary chromosome names, and most commands will handle them in a reasonable manner. Also, arbitrary nonhuman species (with haploid or diploid genomes) can now be specified with –chrset.
Commandline help
To improve the experience of using PLINK interactively, we have expanded the –help flag’s functionality. When invoked with no parameters, it now prints an entire minimanual. Given keyword(s), it instead searches for and prints minimanual entries associated with those keyword(s), and handles misspelled keywords and keyword prefixes in a reasonable manner.
A comment on withinfamily analysis
Most of our discussion has addressed computational issues. However, there is one methodological issue that deserves a brief comment. The online documentation of PLINK 1.07 weighed the pros and cons of its permutation procedure for withinfamily analysis of quantitative traits (QFAM) with respect to the standard quantitative transmission disequilibrium test (QTDT) [35]. It pointed out that likelihoodbased QTDT enjoyed the advantages of computational speed and increased statistical power. However, a comparison of statistical power is only meaningful if both procedures are anchored to the same Type 1 error rate with respect to the null hypothesis of no linkage with a causal variant, and Ewens et al. has shown that the QTDT is not robust against certain forms of confounding (population stratification) [36]. On the other hand, the validity of a permutation procedure such as QFAM only depends on the applicability of Mendel’s laws. When this nicety is combined with the vast speedup of permutation in PLINK 1.9, a given user may now decide to rate QFAM more highly relative to QTDT when considering available options for withinfamily analysis.
Performance comparisons

“Mac2” denotes a MacBook Pro with a 2.8 Ghz Intel Core 2 Duo processor and 4GB RAM running OS X 10.6.8.

“Mac12” denotes a Mac Pro with two 2.93 Ghz Intel 6core Xeon processors and 64GB RAM running OS X 10.6.8.

“Linux322” denotes a machine with a 2.4 Ghz Intel Core 2 Duo E6600 processor and 1GB RAM running 32bit Ubuntu Linux.

“Linux328” denotes a machine with a 3.4 Ghz Intel Core i73770 processor (8 cores) and 8GB RAM running 32bit Ubuntu Linux.

“Linux64512” denotes a machine with sixtyfour AMD 8core Opteron 6282 SE processors and 512GB RAM running 64bit Linux.

“Win322” denotes a laptop with a 2.4 Ghz Intel Core i52430 M processor (2 cores) and 4GB RAM running 32bit Windows 7 SP1.

“Win642” denotes a machine with a 2.3 Ghz Intel Celeron G1610T processor (2 cores) and 8GB RAM running 64bit Windows 8.

“synth1” refers to a 1000 sample, 100000 variant synthetic dataset generated with HAPGEN2 [37], while “synth1p” refers to the same dataset after one round of –indeppairwise 50 5 0.5 pruning (with 76124 markers remaining). For case/control tests, PLINK 1.9’s –tailpheno 0 command was used to downcode the quantitative phenotype to case/control.

“synth2” refers to a 4000 case, 6000 control synthetic dataset with 88025 markers on chromosomes 1922 generated by resampling HapMap and 1000 Genomes data with simuRare [38] and then removing monomorphic loci. “synth2p” refers to the same dataset after one round of –indeppairwise 700 70 0.7 pruning (with 71307 markers remaining).

“1000g” refers to the entire 1092 sample, 39637448 variant 1000 Genomes project phase 1 dataset [39]. “chr1” refers to chromosome 1 from this dataset, with 3001739 variants. “chr1snp” refers to chromosome 1 after removal of all nonSNPs and one round of –indeppairwise 20000 2000 0.5 pruning (798703 markers remaining). Pedigree information was not added to these datasets before our tests.
All times are in seconds. To reduce diskcaching variance, timing runs are preceded by “warmup” commands like plink –freq. PLINK 1.07 was run with the –noweb flag. “nomem” indicates that the program ran out of memory and there was no lowmemory mode or other straightforward workaround. A tilde indicates that runtime was extrapolated from several smaller problem instances.
Initialization and basic I/O
–freq times (sec)
Dataset  Machine  PLINK 1.07  PLINK 1.90  Ratio 

synth1  Mac2  7.3  0.24  30 
Mac12  6.2  0.18  34  
Linux322  13.1  0.56  23  
Linux328  4.3  0.18  24  
Linux64512  5.4  0.18  27  
Win322  14.3  0.68  21  
Win642  9.6  0.33  29  
synth2  Mac2  43.3  0.84  52 
Mac12  38.2  0.34  110  
Linux322  80.1  1.9  42  
Linux328  25.2  0.53  48  
Linux64512  34.1  0.40  85  
Win322  83.6  1.3  64  
Win642  70.8  0.55  130  
chr1snp  Mac2  52.5  3.5  15 
Mac12  40.5  1.3  31  
Linux322  72.9  10.2  7.15  
Linux328  29.7  1.4  21  
Linux64512  36.8  1.4  26  
Win322  104.3  4.5  23  
Win642  76.8  2.2  35  
chr1  Mac2  403.9  35.0  11.5 
Mac12  163.9  5.3  31  
Linux322  nomem  65.3  
Linux328  134.1  12.8  10.5  
Linux64512  144.7  5.4  27  
Win322  389.2  21.4  18.2  
Win642  285.3  8.1  35 
Identitybystate matrices, complete linkage clustering
The PLINK 1.0 –cluster –matrix flag combination launches an identitybystate matrix calculation and writes the result to disk, and then performs complete linkage clustering on the data; when –ppc is added, a pairwise population concordance constraint is applied to the clustering process. As discussed earlier, PLINK 1.9 employs an XOR/bit population count algorithm which speeds up the matrix calculation by a large constant factor; the computational complexity of the clustering algorithm has also been reduced, from O(n ^{3}) to O(n ^{2} logn). (Further improvement of clustering complexity, to O(n ^{2}), is possible in some cases [40].)
Identitybystate (Hamming distance) and complete linkage clustering times (sec)
Calculation  Dataset  Machine  PLINK 1.07  PLINK 1.90  Ratio 

IBS matrix only  synth1p  Mac2  2233.6  1.9  1.2 k 
Mac12  1320.4  1.2  1.1 k  
Linux328  1937.2  2.8  690  
Linux64512  1492  3.7  400  
Win322  3219.0  7.2  450  
Win642  2674.4  1.5  1.8 k  
synth2p  Mac2  ∼190 k  118.8  1.6 k  
Mac12  ∼99 k  23.5  4.2 k  
Linux328  152.5 k  214.3  710  
Linux64512  ∼98 k  25.3  3.9 k  
Win322  ∼270 k  654.5  410  
Win642  ∼200 k  104.6  1.9 k  
chr1snp  Mac2  ∼26 k  17.5  1.5 k  
Mac12  13.4 k  12.6  1.06 k  
Linux328  18.4 k  30.9  600  
Linux64512  ∼14 k  43.1  320  
Win322  32.7 k  95.9  341  
Win642  ∼26 k  15.3  1.7 k  
Basic clustering  synth1p  Mac2  2315.7  2.7  860 
Mac12  1317.9  2.0  660  
Linux328  1898.7  4.1  460  
Linux64512  1496  4.5  330  
Win322  3301.7  9.1  360  
Win642  2724.5  1.9  1.4 k  
synth2p  Mac2  ∼230 k  245.6  940  
Mac12  ∼140 k  123.9  1.1 k  
Linux328  197.1 k  395.6  498  
Linux64512  ∼125 k  143.3  872  
Win322  ∼440 k  976.7  450  
Win642  ∼270 k  127.9  2.1 k  
chr1snp  Mac2  ∼26 k  18.4  1.4 k  
Mac12  13.6 k  13.5  1.01 k  
Linux328  18.5 k  33.4  554  
Linux64512  ∼14 k  44.2  320  
Win322  33.2 k  95.0  349  
Win642  ∼26 k  15.8  1.6 k  
IBD report  synth1p  Mac2  2230.1  12.4  180 
Mac12  1346.2  2.4  560  
Linux328  2019.9  12.4  163  
Linux64512  1494  5.0  300  
Win322  3446.3  42.2  81.7  
Win642  2669.8  15.1  177  
synth2p  Mac2  ∼190 k  447.1  420  
Mac12  ∼99 k  50.3  2.0 k  
Linux328  161.4 k  618.7  261  
Linux64512  ∼98 k  57.4  1.7 k  
Win322  ∼270 k  1801.1  150  
Win642  ∼200 k  541.0  370  
IBD report  chr1snp  Mac2  ∼26 k  24.8  1.0 k 
Mac12  13.4 k  14.6  918  
Linux328  18.5 k  53.5  346  
Linux64512  ∼14 k  46.5  300  
Win322  33.1 k  199.2  166  
Win642  ∼26 k  25.1  1.0 k 
(Note that newer algorithms such as BEAGLE’s fastIBD [41] generate more accurate IBD estimates than PLINK –Zgenome. However, the –Zgenome report contains other useful information.)
Genomic relationship matrices
GCTA’s –makegrmbin command (–makegrm in early versions) calculates the variancestandardized genomic relationship matrix used by many of its other commands. The latest implementation as of this writing (v1.24) is very fast, but only runs on 64bit Linux, uses single instead of doubleprecision arithmetic, and has a high memory requirement.
Genomic relationship matrix calculation times (sec)
Dataset  Machine  GCTA  PLINK 1.90  Ratio 

synth1p  Mac2  222.2  7.2  31 
Mac12  184.7  5.0  37  
Linux328  248.4  10.9  22.8  
Linux64512  4.4  9.6  0.46  
Win322  373.1  39.3  9.5  
Win642  367.2  6.6  56  
synth2p  Mac2  nomem  805.8  
Mac12  17.0 k  138.3  123  
Linux328  nomem  1153.4  
Linux64512  65.1  318.9  0.20  
Win322  nomem  2007.2  
Win642  nomem  450.1  
chr1snp  Mac2  nomem  87.1  
Mac12  2260.9  50.9  44.4  
Linux328  nomem  94.3  
Linux64512  58.3  91.6  0.64  
Win322  nomem  317.5  
Win642  nomem  65.7 
Linkage disequilibriumbased variant pruning
–indeppairwise runtimes (sec)
Parameters  Dataset  Machine  PLINK 1.07  PLINK 1.90  Ratio 

50 5 0.5  synth1  Mac2  701.3  0.63  1.1 k 
Mac12  569.4  0.55  1.0 k  
Linux328  572.7  0.95  600  
Linux64512  462  0.60  770  
Win322  1163.9  3.2  360  
Win642  1091.9  1.0  1.1 k  
700 70 0.7  synth2  Mac2  ∼120 k  31.9  3.8 k 
Mac12  63.0 k  20.6  3.06 k  
Linux328  57.4 k  66.0  870  
Linux64512  ∼120 k  26.4  4.5 k  
Win322  139.3 k  127.3  1.09 k  
Win642  ∼200 k  22.9  8.7 k  
20000 2000 0.5  chr1  Mac2  nomem  1520.1  
Mac12  nomem  1121.7  
Linux328  nomem  4273.9  
Linux64512  ∼950 k  1553.3  610  
Win322  nomem  4912.7  
Win642  nomem  1205.1  
1000g  Mac2  nomem  20.5 k  
Mac12  nomem  14.5 k  
Linux328  nomem  54.5 k  
Linux64512  ∼13000 k  20.2 k  640  
Win322  nomem  64.5 k  
Win642  nomem  14.7 k 
Haplotype block estimation
–blocks runtimes (sec)
Parameters  Dataset  Machine  Haploview 4.2  PLINK 1.07  PLINK 1.90 

–ldwindowkb 500  synth1  Mac2  nomem  3198.4  1.7 
Mac12  ∼45 k  3873.0  1.3  
Linux322  nomem  5441.1  3.4  
Linux64512  ∼57 k  2323.4  2.9  
Win322  nomem  9803.4  8.9  
Win642  ∼51 k  5513.4  2.8  
–ldwindowkb 1000  synth1  Mac2  nomem  6185.7  2.2 
Mac12  ∼45 k  7394.4  9.8  
Linux322  nomem  9876.8  10.0  
Linux64512  ∼57 k  4462.1  3.9  
Win322  nomem  18925.7  17.3  
Win642  ∼51 k  10.3 k  3.6  
–ldwindowkb 500  chr1  Mac2  nomem  ∼2700 k?  550.9 
Mac12  nomem  ∼3600 k?  426.0  
Linux322  nomem  ∼4300 k?  1288.4  
Linux64512  ∼440 k?  ∼2600 k?  1119.7  
Win322  nomem  ∼17000 k?  4535.8  
Win642  nomem  ∼5700 k?  1037.2 
Association analysis max(T) permutation tests
Association analysis max(T) permutation test times (sec)
Other parameter(s)  Dataset  Machine  PLINK 1.07  PERMORY 1.1  PLINK 1.90  Ratio 

–trend (C/C)  synth1  Mac2  ∼20 k  18.7  1.1 k  
Mac12  ∼16 k  2.8  5.7 k  
Linux322  ∼21 k  65.0  320  
Linux64512  ∼17 k  285.0  2.8  
Win322  ∼35 k  1444.2  61.5  
Win642  ∼25 k  889.7  14.4  
synth2  Mac2  ∼170 k  42.4  4.0 k  
Mac12  ∼180 k  6.4  28 k  
Linux322  ∼410 k  391.0  1.0 k  
Linux64512  ∼200 k  580.9  13.7  
Win322  ∼1100 k  2362.5  198.0  
Win642  ∼370 k  1423.6  34.0  
–fisher (C/C)  synth1  Mac2  ∼150 k  21.9  6.9 k  
Mac12  ∼150 k  3.7  41 k  
Linux322  ∼170 k  57.8  2.9 k  
Linux64512  ∼120 k  3.4  35 k  
Win322  ∼440 k  64.9  6.8 k  
Win642  ∼200 k  22.0  9.1 k  
synth2  Mac2  ∼890 k  49.8  18 k  
Mac12  ∼690 k  7.6  91 k  
Linux322  ∼1300 k  393.7  3.3 k  
Linux64512  ∼720 k  13.0  55 k  
Win322  ∼3600 k  208.3  17 k  
Win642  ∼1700 k  35.6  48 k  
–assoc (QT)  synth1  Mac2  ∼30 k  148.0  200  
Mac12  ∼22 k  22.6  970  
Linux322  ∼68 k  847.2  80  
Linux64512  ∼29 k  29.2  990  
Win322  ∼58 k  896.1  65  
Win642  ∼36 k  264.2  140  
–assoc lin (QT)  synth1  Mac2  606.8  
Mac12  34.7  
Linux322  3212.6  
Linux64512  1259.8  46.4  27.2  
Win322  2115.7  3062.7  0.69  
Win642  972.6  336.6  2.89 
PLINK 2.0 design
Despite its computational advances, we recognize that PLINK 1.9 can ultimately still be an unsatisfactory tool for working with imputed genomic data, due to the limitations of the PLINK 1 binary file format. To address this, we designed a new core file format capable of representing most of the information emitted by modern imputation tools, which is the cornerstone of our plans for PLINK 2.0.
Multiple data representations
As discussed earlier, PLINK 1 binary is inadequate in three ways: likelihoods strictly between 0 and 1 cannot be represented, phase information cannot be stored, and variants are limited to two alleles. This can be addressed by representing all calls probabilistically, and introducing a few other extensions. Unfortunately, this would make PLINK 2.0’s representation of PLINK 1format data so inefficient that it would amount to a serious downgrade from PLINK 1.9 for many purposes.
Therefore, our new format defines several data representations, one of which is equivalent to PLINK 1 binary, and allows different files, or even variants within a single file, to use different representations. To work with this, PLINK 2.0 will include a translation layer which allows individual functions to assume a specific representation is used. As with the rest of PLINK’s source code, this translation layer will be GPLv3licensed open source; and unlike most of the other source code, we are explicitly designing it to be usable as a standalone library. PLINK 2.0 will also be able to convert files/variants from one data representation to another, making it practical for thirdparty tools lacking access to the library to demand a specific representation.
Reference vs. alternate alleles
The nowubiquitous VCF file format requires reference alleles to be distinguished from alternate alleles, and an increasing number of software tools and pipelines do not tolerate scrambling of the two. This presents an interoperability problem for PLINK: while it was theoretically possible to handle binary data with PLINK 1.0 in a manner that preserved the reference vs. alternate allele distinction when it was originally present, with constant use of –keepalleleorder and related flags, doing so was inconvenient and errorprone, especially since the accompanying native.ped/.map and.tped/.tfam text formats had no place to store that information. PLINK 1.9’s –a2allele flag, which can import that information from a VCF file, provides limited relief, but it is still necessary for users to fight against the program’s major/minorallele based design.
We aim to solve this problem for good in PLINK 2.0. The file format explicitly defines reference vs. alternate alleles, and this information will be preserved across runs by default. In addition, the file format will include a flag distinguishing provisional reference allele assignments from those derived from an actual reference genome. When PLINK 2.0 operates on.ped/.map or similar data lacking a reference vs. alternate distinction, it will treat a highestfrequency allele as the reference, while flagging it as a provisional assignment. When a file with flaggedasprovisional reference alleles is merged with another file with unflagged reference alleles, the unflagged reference allele assignments take precedence. (Merges involving conflicting unflagged reference alleles will fail unless the user specifies which source file takes precedence.) It will also be straightforward to import real reference allele assignments with an analogue of –a2allele.
Data compression
PLINK 1.9 demonstrates the power of a weak form of compressive genomics [43]: by using bit arithmetic to perform computation directly on compressed genomic data, it frequently exhibits far better performance than programs which require an explicit decompression step. But its “compressed format” is merely a tight packing which does not support the holy grail of true sublinear analysis.
To do our part to make “strong” sublinear compressive genomics a reality, the PLINK 2 file format will introduce support for “deviations from most common value” storage of lowMAF variants. For datasets containing many samples, this captures much of the storage efficiency benefit of having real reference genomes available, without the drawback of forcing all programs operating on the data to have access to a library of references. Thanks to PLINK 2.0’s translation layer and file conversion facilities, programmers will be able to ignore this feature during initial development of a tool, and then work to exploit it after basic functionality is in place.
We note that LDbased compression of variant groups is also possible, and Sambo’s SNPack software [44] applies this to the PLINK 1 binary format. We do not plan to support this in PLINK 2.0 due to the additional software complexity required to handle probabilistic and multiallelic data, but we believe this is a promising avenue for development and look forward to integrating it in the future.
Remaining limitations
PLINK 2.0 is designed to meet the needs of tomorrow’s genomewide association studies and populationgenetics research; in both contexts, it is appropriate to apply a single genomic coordinate system across all samples, and preferred sample sizes are large enough to make computational efficiency a serious issue.
Wholeexome and wholegenome sequencing also enables detailed study of structural variations which defy clean representation under a single coordinate system; and the number of individuals in such studies is typically much smaller than the tens or even hundreds of thousands which are sometimes required for effective GWAS. There are no plans to make PLINK suitable for this type of analysis; we strongly recommend the use of another software package, such as PLINK/SEQ [45], which is explicitly designed for it. This is why the PLINK 2 file format will still be substantially less expressive than VCF.
An important consequence is that, despite its ability to import and export VCF files, PLINK should not be used for management of genomic data which will be subject to both types of analysis, because it discards all information which is not relevant for its preferred type. However, we will continue to extend PLINK’s ability to interpret VCFlike formats and interoperate with other popular software.
Availability and requirements

Project name: Secondgeneration PLINK

Project (source code) home page: https://www.coggenomics.org/plink2/(https://github.com/chrchang/plinkng)

Operating systems: Linux (32/64bit), OS X (64bit Intel), Windows (32/64bit)

Programming language: C, C++

Other requirements (when recompiling): GCC version 4, a few functions also require LAPACK 3.2

License: GNU General Public License version 3.0 (GPLv3)

Any restrictions to use by nonacademics: none
Availability of supporting data
The test data and the source code snapshots supporting the results of this article are available in the GigaScience repository, GigaDB [8].
Abbreviations
 PLINK:

The software toolset that is the main subject of this paper. The name was originally shorthand for “population linkage”
 BEAGLE:

A software package capable of highaccuracy haplotype phasing, genotype imputation, and identitybydescent estimation, developed by Browning [2]
 GCTA:

Genomewide Complex Trait Analysis. This refers to both the statistical method and the software implementation discussed in [7]
 VCF:

Variant Call Format [5]
 x86:

A family of backward compatible instruction set architectures based on the Intel 8086 CPU
 IBS:

Identitybystate. A simple measure of genomic similarity, equal to the number of identical alleles divided by the number of observations
 popcount:

Bit population count. The number of ’1’ bits in a bit vector
 XOR:

Exclusiveor. A binary logical operation that evaluates to true if exactly one of its arguments is true
 SSE2:

Streaming SIMD Extensions 2. A SIMD (single instruction, multiple data) processor supplementary instruction set first introduced by Intel with the initial version of the Pentium 4 in 2001
 GPU:

Graphics processing unit
 HWE:

HardyWeinberg equilibrium
 SNP:

Singlenucleotide polymorphism
 FEXACT:

A network algorithm for evaluating Fisher’s exact test pvalues, developed by Mehta et al. [15,16]
 LD:

Linkage disequilibrium
 PERMORY:

A software package designed to perform efficient permutation tests for largescale genetic data sets, developed by Pahl et al. [29]
 GWAS:

GenomeWide Association Study
 QFAM:

A familybased quantitative trait association analysis procedure, introduced by PLINK 1.0, which combines a simple linear regression of phenotype on genotype with a special permutation test which corrects for family structure
 QTDT:

Quantitative Transmission Disequilibrium Tests, developed primarily by Abecasis et al. [35]
 Ghz:

Gigahertz
 GB:

Gigabyte
 RAM:

Randomaccess memory
 I/O:

Input/output
 MAF:

Minor allele frequency. Frequency of the least common allele that is still present in a population
 GPLv3:

GNU General Public License version 3
Declarations
Acknowledgements
We thank Stephen D.H. Hsu for helpful discussions. We also continue to be thankful to PLINK 1.9 users who perform additional testing of the program, report bugs, and make useful suggestions.
Christopher Chang and Laurent Tellier were supported by BGI Hong Kong and Shenzhen Municipal Government of China grant CXB201108250094A. Carson Chow and Shashaank Vattikuti were supported by the Intramural Research Program of the NIH, NIDDK.
Authors’ Affiliations
References
 Purcell S, Neale B, ToddBrown K, Thomas L, Ferreira M, Bender D, et al. Plink: A tool set for wholegenome association and populationbased linkage analyses. Am J Hum Genet. 2007; 81:559–75.View ArticlePubMed CentralPubMedGoogle Scholar
 Browning B, Browning S. Improving the accuracy and efficiency of identity by descent detection in population data. Genetics. 2013; 194:459–71.View ArticlePubMed CentralPubMedGoogle Scholar
 Howie B, Donnelly P, Marchini J. A flexible and accurate genotype imputation method for the next generation of genomewide association studies. PLoS Genet. 2009; 5:1000529.View ArticleGoogle Scholar
 McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The genome analysis toolkit: A mapreduce framework for analyzing nextgeneration dna sequencing data. Genome Res. 2010; 20:1297–303.View ArticlePubMed CentralPubMedGoogle Scholar
 Danecek P, Auton A, Abecasis G, Albers C, Banks E, DePristo M, et al. The variant call format and vcftools. Bioinformatics. 2011; 27:2156–8.View ArticlePubMed CentralPubMedGoogle Scholar
 Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, 1000 Genome Project Data Processing Subgroup, et al. The sequence alignment/map format and samtools. Bioinformatics. 2009; 25:2078–9.View ArticlePubMed CentralPubMedGoogle Scholar
 Yang J, Lee S, Goddard M, Visscher P. Gcta: A tool for genomewide complex trait analysis. Am J Hum Genet. 2011; 88:76–82.View ArticlePubMed CentralPubMedGoogle Scholar
 Chang C, Chow C, Tellier L, Vattikuti S, Purcell S, Lee J. Software and Supporting Material for "Secondgeneration PLINK: Rising to the Challenge of Larger and Richer Datasets". GigaScience Database. http://dx.doi.org/10.5524/100116.
 Dalke A. Update: Faster Population Counts. http://www.dalkescientific.com/writings/diary/archive/2011/11/02/faster_popcount_update.html.
 Lee V, Kim C, Chhugani J, Deisher M, Kim D, Nguyen A, et al. Debunking the 100x gpu vs. cpu myth: an evaluation of throughput computing on cpu and gpu. In: Proceedings of the 37th Annual International Symposium on Computer Architecture: 1923 June 2010. SaintMalo, France,: ACM: 2010. p. 451–460.Google Scholar
 Haque I, Pande V, Walters W. Anatomy of highperformance 2d similarity calculations. J Chem Inf Model. 2011; 51:2345–51.View ArticlePubMedGoogle Scholar
 Hardy H. Mendelian proportions in a mixed population. Science. 1908; 28:49–50.View ArticlePubMedGoogle Scholar
 Wigginton J, Cutler D, Abecasis G. A note on exact tests of hardyweinberg equilibrium. Am J Hum Genet. 2005; 76:887–93.View ArticlePubMed CentralPubMedGoogle Scholar
 Guo S, Thompson E. Performing the exact test of hardyweinberg proportion for multiple alleles. Biometrics. 1992; 48:361–72.View ArticlePubMedGoogle Scholar
 Mehta C, Patel N. Algorithm 643: Fexact: a fortran subroutine for fisher’s exact test on unordered r ×c contingency tables. ACM Trans Math Softw. 1986; 12:154–61.View ArticleGoogle Scholar
 Clarkson D, Fan Y, Joe H. A remark on algorithm 643: Fexact: an algorithm for performing fisher’s exact test in r x c contingency tables. ACM Trans Math Softw. 1993; 19:484–8.View ArticleGoogle Scholar
 Requena F, Martín Ciudad N. A major improvement to the network algorithm for fisher’s exact test in 2 ×c contingency tables. J Comp Stat & Data Anal. 2006; 51:490–8.View ArticleGoogle Scholar
 Chang C. Standalone C/C++ Exact Statistical Test Functions. https://github.com/chrchang/stats.
 Lydersen S, Fagerland M, Laake P. Recommended tests for association in 2 ×2 tables. Statist Med. 2009; 28:1159–75.View ArticleGoogle Scholar
 Graffelman J, Moreno V. The mid pvalue in exact tests for hardyweinberg equilibrium. Stat Appl Genet Mol Bio. 2013; 12:433–48.Google Scholar
 Wall J, Pritchard J. Assessing the performance of the haplotype block model of linkage disequilibrium. Am J Hum Genet. 2003; 73:502–15.View ArticlePubMed CentralPubMedGoogle Scholar
 Gabriel S, Schaffner S, Nguyen H, Moore J, Roy J, Blumenstiel B, et al. The structure of haplotype blocks in the human genome. Science. 2002; 296:2225–9.View ArticlePubMedGoogle Scholar
 Barrett J, Fry B, Maller J, Daly M. Haploview: analysis and visualization of ld and haplotype maps. Bioinformatics. 2005; 21:263–5.View ArticlePubMedGoogle Scholar
 Hill W. Estimation of linkage disequilibrium in randomly mating populations. Heredity. 1974; 33:229–39.View ArticlePubMedGoogle Scholar
 Gaunt T, Rodríguez S, Day I. Cubic exact solutions for the estimation of pairwise haplotype frequencies: implications for linkage disequilibrium analyses and a web tool ’cubex’. BMC Bioinformatics. 2007; 8:428.View ArticlePubMed CentralPubMedGoogle Scholar
 Taliun D, Gamper J, Pattaro C. Efficient haplotype block recognition of very long and dense genetic sequences. BMC Bioinformatics. 2014; 15:10.View ArticlePubMed CentralPubMedGoogle Scholar
 Friedman J, Hastie T, Höfling H, Tibshirani R. Pathwise coordinate optimization. Ann Appl Stat. 2007; 1:302–32.View ArticleGoogle Scholar
 Vattikuti S, Lee J, Chang C, Hsu S, Chow C. Applying compressed sensing to genomewide association studies. GigaScience. 2014; 3:10.View ArticlePubMed CentralPubMedGoogle Scholar
 Steiß V, Letschert T, Schäfer H, Pahl R. Permorympi: A program for highspeed parallel permutation testing in genomewide association studies. Bioinformatics. 2012; 28:1168–9.View ArticlePubMedGoogle Scholar
 Wan X, Yang C, Yang Q, Xue H, Fan X, Tang N, et al. Boost: A fast approach to detecting genegene interactions in genomewide casecontrol studies. Am J Hum Genet. 2010; 87:325–40.View ArticlePubMed CentralPubMedGoogle Scholar
 Ueki M, Cordell H. Improved statistics for genomewide interaction analysis. PLoS Genet. 2012; 8:1002625.View ArticleGoogle Scholar
 Howey R. CASSI: GenomeWide Interaction Analysis Software. http://www.staff.ncl.ac.uk/richard.howey/cassi.
 GWASSpeedup Problem Statement. http://community.topcoder.com/longcontest/?module=ViewProblemStatement&rd=15637&pm=12525.
 Adler M. Pigz: Parallel Gzip. http://zlib.net/pigz/.
 Abecasis G, Cardon L, Cookson W. A general test of association for quantitative traits in nuclear families. Am J Hum Genet. 2000; 66:279–92.View ArticlePubMed CentralPubMedGoogle Scholar
 Ewens W, Li M, Spielman R. A review of familybased tests for linkage disequilibrium between a quantitative trait and a genetic marker. PLoS Genet. 2008; 4:1000180.View ArticleGoogle Scholar
 Su Z, Marchini J, Donnelly P. Hapgen2: Simulation of multiple disease snps. Bioinformatics. 2011; 27:2304–5.View ArticlePubMed CentralPubMedGoogle Scholar
 Xu Y, Wu Y, Song C, Zhang H. Simulating realistic genomic data with rare variants. Genet Epidemiol. 2013; 37:163–72.View ArticlePubMed CentralPubMedGoogle Scholar
 The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012; 491:56–65.View ArticlePubMed CentralGoogle Scholar
 Defays D. An efficient algorithm for a complete link method. Comput J. 1977; 20:364–6.View ArticleGoogle Scholar
 Browning B, Browning S. A fast, powerful method for detecting identity by descent. Am J Hum Genet. 2011; 88:173–82.View ArticlePubMed CentralPubMedGoogle Scholar
 Browning B. Presto: rapid calculation of order statistic distributions and multipletesting adjusted pvalues via permutation for one and twostage genetic association studies. BMC Bioinformatics. 2008; 9:309.View ArticlePubMed CentralPubMedGoogle Scholar
 Loh P, Baym M, Berger B. Compressive genomics. Nat Biotechnol. 2012; 30:627–30.View ArticlePubMedGoogle Scholar
 Sambo F, Di Camillo B, Toffolo G, Cobelli C. Compression and fast retrieval of snp data. Bioinformatics. 2014; 30:495.View ArticleGoogle Scholar
 PLINK/SEQ: A Library for the Analysis of Genetic Variation Data. https://atgu.mgh.harvard.edu/plinkseq/.
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.