Applying compressed sensing to genomewide association studies
 Shashaank Vattikuti^{1},
 James J Lee^{1, 2, 5},
 Christopher C Chang^{3, 5},
 Stephen D H Hsu^{4, 5}Email author and
 Carson C Chow^{1}Email author
DOI: 10.1186/2047217X310
© Vattikuti et al.; licensee BioMed Central Ltd. 2014
Received: 8 January 2014
Accepted: 23 May 2014
Published: 16 June 2014
Abstract
Background
The aim of a genomewide association study (GWAS) is to isolate DNA markers for variants affecting phenotypes of interest. This is constrained by the fact that the number of markers often far exceeds the number of samples. Compressed sensing (CS) is a body of theory regarding signal recovery when the number of predictor variables (i.e., genotyped markers) exceeds the sample size. Its applicability to GWAS has not been investigated.
Results
Using CS theory, we show that all markers with nonzero coefficients can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability equal to one (h ^{ 2 } = 1), there is a sharp phase transition from poor performance to complete selection as the sample size is increased. For heritability below one, complete selection still occurs, but the transition is smoothed. We find for h ^{ 2 } ∼ 0.5 that a sample size of approximately thirty times the number of markers with nonzero coefficients is sufficient for full selection. This boundary is only weakly dependent on the number of genotyped markers.
Conclusion
Practical measures of signal recovery are robust to linkage disequilibrium between a true causal variant and markers residing in the same genomic region. Given a limited sample size, it is possible to discover a phase transition by increasing the penalization; in this case a subset of the support may be recovered. Applying this approach to the GWAS analysis of height, we show that 70100% of the selected markers are strongly correlated with heightassociated markers identified by the GIANT Consortium.
Keywords
GWAS Genomic selection Compressed sensing Lasso Underdetermined system Sparsity Phase transitionBackground
The search for genetic variants associated with a given phenotype in a genomewide association study (GWAS) is a classic example of what has been called a p ≫ n problem, where n is the sample size (number of subjects) and p is the number of predictor variables (genotyped markers) [1]. Estimating the partial regression coefficients of the predictor variables by ordinary least squares (OLS) requires that the sample size exceed the number of coefficients, which in the GWAS context, may be of order 10^{5} or even 10^{6}. The difficulty of assembling such large samples has been one obstacle hindering the simultaneous estimation of all regression coefficients advocated by some authors [2–4].
The typical procedure in GWAS is to estimate each coefficient by OLS independently and retain those meeting a strict threshold; this approach is sometimes called marginal regression (MR) [5]. Although the implementation of MR in GWAS has led to an avalanche of discoveries [6], it is uncertain whether it will be optimal as datasets continue to increase in size. Many genetic markers associated with a trait are likely to be missed because they do not pass the chosen significance threshold [7].
Unlike MR, which directly estimates whether each coefficient is nonzero, an L _{1}penalization algorithm, such as the lasso, effectively translates the estimates toward the origin where many are truncated out of the model [8]. If the number of variants associated with a typical complex trait is indeed far fewer than the total number of polymorphic sites [9–11], then it is reasonable to believe that L _{1} penalization will at least be competitive with MR. Methods relying on the assumption of sparsity (few nonzero coefficients relative to sample size) have in fact been adopted by workers in the field of genomic selection (GS), which uses genetic information to guide the artificial selection of livestock and crops [12–15]. Note that the aim of GS (phenotypic prediction) is somewhat distinct from that of GWAS (the identification of markers tagging causal variants). The lasso is one of the methods studied by GS investigators [16, 17], although Bayesian methods that regularize the coefficients with strong priors tend to be favored [18, 19].
In this paper we show that theoretical results from the field of compressed sensing (CS) supply a rigorous quantitative framework for the application of regularization methods to GWAS. In particular, CS theory provides a mathematical justification for the use of L _{1}penalized regression to recover sparse vectors of coefficients and highlights the difference between selection of the markers with nonzero coefficients and the fitting of the precise coefficient values. CS theory also addresses the robustness of L _{1} algorithms to the distribution of nonzero coefficient magnitudes.
Besides supplying a rule of thumb for the sample size sufficing to select the markers with true nonzero coefficients, CS gives an independent quantitative criterion for determining whether a given dataset has, in fact, attained that sample size. Whereas biological assumptions regarding the number of nonzeros do enter into the rule of thumb about sample size, these assumptions need not hold for the use of L _{1} penalization to be justified; this is because the returned results themselves inform the investigator whether the assumptions are met.
We emphasize that CS is not a method per se, but may be considered a general theory of regression that takes into account model complexity (sparsity). The theory is still valid in the classical regression domain of n > p but establishes conditions for when full recovery of nonzero coefficients is still possible when n < p[20–22]. Our work therefore should not be directly compared to recent literature proposing and evaluating GS methods [18, 19]. Rather, our goal is to elucidate properties of wellknown methods, already in use by GWAS and GS researchers, whose mathematical attributes and empirical prospects may be insufficiently appreciated.
Using more than 12,000 subjects from the Atherosclerosis Risk in Communities Study (ARIC) European American and GeneEnvironment Association Studies (GENEVA) cohorts and nearly 700,000 singlenucleotide polymorphisms (SNPs), we show that the matrix of genotypes acquired in GWAS obeys properties suitable for the application of CS theory. In particular, a given sample size determines the maximum number of nonzeros that will be fully selected using an L _{1}penalization regression algorithm. If the sample size is too small, then the complete set of nonzeros will not be selected. The transition between poor and complete selection is sharp in the noiseless case (narrowsense heritability equal to one). It is smoothed in the presence of noise (heritability less than one), but still fully detectable. Consistent with CS theory, we find in cases with realistic residual noise that the minimal sample size for full recovery is primarily determined by the number of nonzeros, depends very weakly on the number of genotyped markers [22–24], and is robust to the distribution of coefficient magnitudes [25].
Theory of compressed sensing
Where y ∈ ℝ^{ n } is the vector of phenotypes, A ∈ ℝ^{ n xp } is the matrix of standardized genotypes, x ∈ ℝ^{ p } is the vector of partial regression coefficients, and e ∈ ℝ^{ n } is the vector of residuals. In the CS literature, A is often called the sensing or measurement matrix. Standardizing A does not affect the results and makes it simpler to utilize CS theory. We suppose that x contains s nonzero coefficients (“nonzeros”) whose indices we wish to know.
Our results rely on two lines of research in the field of CS, which we summarize as two propositions.
Proposition 1[20, 24, 26, 27]Suppose that the entries of the sensing matrix A are drawn from independent normal distributions and e is the zero vector (noiseless case). Then the ρ − δ plane is partitioned by a curve $\mathit{\rho}={\mathit{\rho}}_{{\mathit{L}}_{1}}\left(\mathit{\delta}\right)$ into two phases. Below the curve the solution of $\underset{\widehat{\mathit{x}}}{\mathit{min}}\phantom{\rule{0.5em}{0ex}}{\u2225\phantom{\rule{0.5em}{0ex}}\widehat{\mathit{x}}\phantom{\rule{0.5em}{0ex}}\u2225}_{{\mathit{L}}_{1}}$ subject to $\mathit{A}\phantom{\rule{0.25em}{0ex}}\widehat{\mathit{x}}=\mathit{y}$ leads to $\widehat{\mathit{x}}=\mathit{x}$ with probability converging to one as n, p, s → ∞ in such a way that ρ and δ remain constant. Above the curve $\widehat{\mathit{x}}\ne \mathit{x}$ with similarly high probability.
Most phenotypes do not have a heritability of one and are therefore, not noiseless, but CS theory shows that selection is still possible in this situation. Before stating the relevant CS result, we need to define two quantities characterizing the genotype matrix A.
Definition 1[22]The matrix A satisfies isotropy if the expectation value of A ’ A is equal to the identity matrix.
In the context of GWAS, a matrix of gene counts is isotropic if all markers are in linkage equilibrium (LE).
Thus, a matrix of genotypes is reasonably incoherent if the magnitudes of the matrix elements do not differ greatly from each other. In the GWAS context, A will be reasonably incoherent if all markers with very low minor allele frequency (MAF) are pruned, since A is standardized and the standard deviation scales with MAF.
We can now state
where ${\mathit{\sigma}}_{\mathit{E}}^{2}$ is the variance of the residuals in e .
Two features of Proposition 2 are worth noting. First, no strong restrictions on x are required. Second, the critical threshold value of n depends linearly on s, but only logarithmically on p. For n larger than the critical value, the deviations of the estimated coefficients from the true values will follow the expected OLS scaling of $1/\sqrt{\mathit{n}}$.
These results are more powerful than they might seem from the restrictive hypotheses required for brief formulations. For example, it has been shown that a curve similar to that in Proposition 1 also demarcates a phase transition in the case of e ≠ 0 — although, as might be expected from a comparison of Propositions 1 and 2, with large residual noise the transition is to a regime of gradual improvement with n rather than to instantaneous recovery [24, 28]. A remarkable feature of this gradual improvement, however, should be noted. Proposition 2 states that the scaling of the total fitting error in the favorable regime is within a polylogarithmic factor of what would have been achieved if the identities of the s nonzeros had been revealed in advance by an oracle. This result implies that perfect selection of nonzeros can occur before the magnitudes of the coefficients are well fit. Even if the residual noise is substantial enough to prevent the sharp transition from large to negligible fitting error evident in Figure 1A, the total magnitude of the error in the favorable phase is little larger than what would be expected given perfect selection of the nonzeros.
Recent work has also generalized the sensing matrix, A, in Proposition 1 to several nonnormal distributions (although not to genotype matrices per se) [27, 29]. Furthermore, the form of Proposition 2 also holds under a weaker form of isotropy that allows the expectation of A’A to differ from the identity matrix by a small quantity (see [22] for the specification of the matrix norm). The latter generalization is promising because the covariance matrix in GWAS deviates toward blockdiagonality as a result of linkage disequilibrium (LD) among spatially proximate variants.
Whereas the penalization parameter λ in Proposition 2 is often determined empirically through crossvalidation, CS places a theoretical lower bound on its value that is based on the magnitude of the noise [22] (referred here as λ _{ min } or λ). A special feature of the GWAS context is that an estimate of the residual variance can be obtained from the genomicrelatedness method [7, 30–32], thereby enabling the substitution of a theoretical noisedependent bound for empirical crossvalidation. Such noisedependent bounds appear in other selection theories, including MR, and thus are not specific to CS [5, 33]. As noted by [33], such bounds tend to be conservative. Here, we show that the CS noisedependent bound demonstrates good selection properties. A dataspecific method, such as crossvalidation may exhibit slightly better properties, but is computationally more expensive.
 1.
Does the matrix of genotypes A in the GWAS setting fall into the class of matrices exhibiting the CS phase transition across the curve ${\mathit{\rho}}_{{\mathit{L}}_{1}}\left(\mathit{\delta}\right),$ as described by Proposition 1?
 2.
Since large residual noise is typical, we must also ask: is A sufficiently isotropic and incoherent to make the regime of good performance described by Proposition 2 practically attainable? Since log p slowly varies over the relevant range of p we can absorb γ and log p into the constant factor and phrase the question more provocatively: given that n > Cs is required for good recovery, what is C?
 3.
In practice, a measure of recovery relying on the unknown x, such as a function of ${\u2225\widehat{\mathbf{x}}\phantom{\rule{0.5em}{0ex}}\phantom{\rule{0.5em}{0ex}}\mathbf{x}\u2225}_{{\mathit{L}}_{2}}$, cannot be used. Is there a measure of recovery, then, that depends solely on observables?
The aim of the present work is to answer these three questions.
Data description
All participants gave informed consent. All studies were approved by their appropriate Research Ethics Committees.
We used the ARIC and GENEVA European American cohort. The datasets were obtained from dbGaP through dbGaP accession numbers [ARIC:phs000090] and [GENEVA:phs000091] [34]. The ARIC population consists of a large sample of unrelated individuals and some families. The population was recruited in 1987 from four centers across the United States: Forsyth County, North Carolina; Jackson, Mississippi; Minneapolis, Minnesota; and Washington County, Maryland.
The ARIC subjects were genotyped with the Affymetrix Human SNP Array 6.0. We selected biallelic autosomal markers based on a HardyWeinberg equilibrium tolerance of P < 10^{− 3}. Preprocessing was performed with PLINK 2 [35, 36].
The datasets were merged to create a SNP genotype matrix (A) consisting of 12,464 subjects and 693,385 SNPs. SNPs were coded by their minor allele, resulting in values of 0, 1, or 2. Each column of A was standardized to have zero mean and unit variance. Missing genotypes were replaced with the mean (i.e., zero) after standardization. We compared results for the phase transition for a limited number of cases when the missing genotypes were imputed based on sampling from a Binomial distribution and the respective minor allele frequency. We found no difference between the imputation methods for our datasets.
We simulated phenotypes according to Equation 1, rescaling each term to leave the phenotypic variance equal to unity and the variance of the breeding values in Ax to match the target narrowsense heritability h ^{2}, which is the proportion of phenotypic variance due to additive genetic factors. For standardized phenotypes, h ^{2} is equivalent to the additive genetic variance, which is defined to equal one in the noiseless case. We chose h ^{2} = 0.5 to represent the noisy case because many human traits show a SNPbased heritability close to this value [7, 30, 37].
The magnitudes of the s nonzeros in x were drawn from either the set {−1, 1} or hyperexponential distributions. We defined two hyperexponential distributions (Hyperexponential 1 and 2) and each was generated by summing two exponentials with the same amplitude, but different decay constants. The pair of decay constants for Hyperexponential 1 were 0.05s and p, and that of Hyperexponential 2 were 0.2s and p. The coefficients were then truncated to keep only the top s nonzero coefficients, the rest were made zero, and 50% of the nonzeros had negative signs. The hyperexponential form was motivated by [38], but the decay constants were arbitrarily chosen. For all coefficient ensembles, the nonzeros were randomly distributed among the SNPs. When examining the dependence of an outcome on n, p, and s the set p was either chosen randomly across the genome without replacement or restricted to all chromosome 22 SNPs, and n and s were randomly sampled without replacement. A single set of SNPs was used for all analyses of the genomic random p set.
We also considered a real phenotype (height) rather than a simulated one, using 12,454 subjects with measurements of height adjusted for sex. We examined different values of n and fixed p by always using all markers in our dataset. A called nonzero was counted as a true positive in the numerator of our “adjusted positive predictive value” (to be defined later) if the marker was a member of a proxy set based on heightassociated SNPs discovered by the GIANT Consortium [39]. The set was generated using the BROAD SNAP database [40]. We based our proxy criterion on basepair distance rather than LD, as we found the correlations between SNPs in our dataset to be larger in magnitude than those recorded in the SNAP database. We generated a proxy list based on a maximum basepair distance of 500 kb, which was the maximum distance that could be queried.
Analysis
Phase transition to complete selection
We first studied the case of independent markers to gain insight into the more realistic case of LD among spatially proximate markers [17, 41]. In the noiseless case (e = 0), it has been proven that there is a universal phase transition boundary between poor and complete selection in the ρ − δ plane (Proposition 1) [20, 24, 26, 27]. The existence of this boundary is largely independent of the explicit values of s, n, and p for a large class of sensing matrices, including sensing matrices generated by the multivariate normal distribution. However, the transition boundary does depend, on certain properties of the distribution describing the coefficients. For example, the boundary can depend critically on whether the coefficients are all positive or can have either sign, although the particular form of the distribution within either of these two broad classes is less important. Genetic applications typically have realvalued coefficients, which are in the same class (i.e., in terms of phase transition properties) as coefficients drawn from the set {−1, 1} [25, 42], which we used in the majority of our simulations. We also studied selection performance when the coefficients are hyperexponentially distributed (see Data Description).
The phase transition can be explored using multiple measures of selection quality. Figure 1A shows the normalized error (NE) (Equation 5) of the coefficient estimates returned by the L _{1}penalized regression algorithm in our study of a simulated phenotype and a random selection of SNPs ascertained in a real GWAS for the noiseless case. The boundary between poor and good performance, as evidenced by this measure, was well approximated by the theoretically derived curve [26], confirming that a matrix of independent SNPs ascertained in GWAS qualifies as a CS sensing matrix.
The noiseless case corresponds to a trait with a perfect narrowsense heritability (h ^{2} = 1). Although there are some phenotypes that approach this ideal situation, it is important to consider the more typical situation of h ^{2} < 1. Figure 1B shows how the NE varied in the presence of a noise level corresponding to h ^{2} = 0.5 (which is roughly the SNPbased heritability of height [7, 30]). We can see that the transition boundary was smoothed and effectively shifted downward.
In the noisy case, the transition boundary was less dependent on δ than in the noiseless case. Note that in Figure 1AB the noise variance is fixed by h ^{2}, but ρ and δ are both functions of the sample size. Fixing ρ and traversing the phase plane horizontally can be interpreted as using a sample of size n to study a particular phenotype with s nonzeros, changing the number of genotyped markers in successive assays; Figure 1B shows that in the noisy case an orderofmagnitude change in p had a negligible impact on the quality of selection.
Given this insensitivity to δ, it is instructive to increase the resolution with which the phase transition can be studied by fixing δ and then comparing the h ^{2} = 1 and h ^{2} = 0.5 cases. Figure 1C shows that the NE approached its asymptote beyond the theoretical phase transition in both cases. Moreover, the asymptote appeared to be greater than zero in the noiseless case. This behavior may suggest that the noisedependent λ _{min} prescribed by CS theory is suboptimal when noise is in fact absent; although the closeness of the theoretical and empirical phase boundaries implies that the deviation from optimality is mild. The transition was not altered in the noiseless case when λ _{min} was estimated using crossvalidation, although there was some improvement in the noisy case. A 10fold crossvalidation increased the computational time by 10 to 100fold. The similar quality of selection achieved by the theoretical λ _{min} and the use of crossvalidation supports the theoretical estimate.
In the noiseless case, when using a criterion of NE < 0.5, the phase transition to vanishing NE began at ρ ≈ 0.4. In the noisy case of h ^{2} = 0.5, the phase transition began at ρ ≈ 0.03 (n ≈ 30s). As expected, the sample size for a given number of nonzero coefficients must be larger in the presence of noise.
Measures of selection
 1.
The false positive rate (FPR), the fraction of true zerovalued coefficients that are falsely identified as nonzero.
 2.
The positive predictive value (PPV), the number of correctly selected true nonzeros divided by the total number of nonzeros returned by the selection algorithm. 1 − PPV equals the false discovery rate (FDR).
 3.
The median of the Pvalues obtained when regressing the phenotype on each of the L _{1}selected markers in turn (μ _{ P − value}). Each such Pvalue is the standard twotailed probability from the t test of the null hypothesis that a univariate regression coefficient is equal to zero. The previous measures of recovery—NE, FPR, PPV—cannot be computed in realistic applications because they depend on the unknown x, and thus it is of interest to examine whether an observable quantity such as μ _{ P − value} also undergoes a phase transition at the same critical sample size.
We hypothesized that a measure of the Pvalue distribution of the putative nonzero set may reflect the phase transition since the distribution of Pvalues of normally distributed random variables is uniform and is the basis of false discovery approaches for the multiple comparisons problem [43].
The greater robustness of the FPR, PPV and μ _{ P − value} against residual variance relative to the NE shows that accurate selection of nonzeros can occur well before the precise fitting of their coefficient magnitudes. The fact that the observable quantity μ _{ P − value} exhibits this robustness is particularly important; a steep decline in μ _{ P − value} across subsamples of increasing size drawn from a given dataset demonstrates a transition to good recovery and implies that the full dataset has sufficient power for accurate identification. This is an empirical finding that deserves further investigation.
For h ^{2} = 0.5 and across all measures of performance other than the NE, the transition appeared to be around n = 5,000. Given s = 125 and p = 8,027, this corresponds to (ρ = 0.025, δ = 0.625), which is circled in Figure 1B. This estimate of the critical ρ is consistent with our previous estimate when δ was fixed at 0.5, supporting the weak dependence on p.
Quality of selection in the presence of LD
We have shown that randomly sampled SNPs from a GWAS of Europeans have the properties of a compressed sensor. This was expected, given that randomly sampled markers will be mostly uncorrelated and therefore closely estimate an isotropic matrix.
The relatively poor performance of the PPV and FPR in the case of LD is somewhat misleading. For example, an “offbyone” (nearby) nonzero called by L _{1}penalized regression will not count toward the numerator of the PPV, even if it is in extremely strong LD with a true nonzero. At the same time, such a near miss does count toward the numerator of the FPR This standard of recovery quality seems overly stringent when we recall that picking out the causal variant from a GWAS “hit” region containing multiple marker SNPs in LD continues to be a challenge for the standard MR approach [44, 45].
Sensitivity to the distributions of coefficient magnitudes and MAF
Selection of SNPs associated with height
Motivated by the results above, we examined whether the full sample size of 12,454 subjects was sufficient to achieve the phase transition from poor to good recovery of SNPs associated with a real phenotype (height). We considered the selection measures μ _{ P − value} and adjusted the positive predictive value (PPV*); the latter extended truepositive status to any selected SNP within 500 kb of a SNP identified as a likely marker of a heightaffecting variant in the GIANT Consortium’s analysis of ~ 180,000 unrelated individuals [39]. This extension is consistent with the rule of thumb designating a 1Mb region as a “locus” for purposes of counting the number of GWAS “hits” [48]. The relative insensitivity of μ _{ P − value} to LD suggests that PPV* rewards the identification of both true nonzeros and markers tagging nonzeros; we therefore substituted PPV* for PPV in an attempt to align the phase dynamics of our precision measure with those of μ _{ P − value}. Whether a selected marker fell within 500 kb of a GIANTidentified marker was determined by consulting the Broad Institute’s SNAP database [40].
The penalization parameter λ was set using CS theory to minimize NE error based on the expected noiselevel from reported narrow sense heritability for height [7, 30]. If λ is set too low, then more false positives are expected; if λ is set too high, then true nonzeros will be missed. According to CS theory, an L _{1}penalized method can still select some of the largest coefficients from a nonuniform distribution of coefficient magnitudes even if complete recovery is out of reach [49]. We investigated whether it was possible to achieve a phase transition to low μ _{ P − value} and high PPV*, at the cost of recovering only a small fraction of all true nonzeros, by increasing the penalty parameter λ. More specifically, we set λ to a higher value consistent with h ^{2} = 0.01 rather than 0.5. In this case, the L _{1} algorithm returned 20 putative nonzeros rather than the original 403, and both μ _{ P − value} and PPV* exhibited better performance (Figure 9B). Compared to the less stringent λ, PPV* as a function of n was less smooth, but appeared to stabilize to a high recovery value after ∼ 7000 subjects. Evidently, if the sample size does not suffice to capture the full heritability, setting the penalty parameter to a value appropriate for a lower heritability can lead to a smaller set of selected markers characterized by good precision.
A search for a phase transition can be a useful approach to determining the optimal P value threshold in standard GWAS protocols employing MR. In addition to a priori assumptions regarding the likely number of true nonzeros and their coefficient magnitudes [38, 50] and agreement between studies of different designs [51], GWAS investigators might rely on whether a measure such as ${\mathit{\mu}}_{\mathit{P}\mathrm{value}}^{*}$ undergoes a clear phase transition as they take increasingly large subsamples of their data. A majority of markers surviving the most liberal significance threshold bounding the second phase are likely to be true positives.
Discussion
Our results with real European GWAS data and simulated vectors of regression coefficients demonstrate the accurate selection of those markers with nonzero coefficients, consistent with CS sample size requirements (n) for a given sparsity (s) and total number of predictors (p). We found that the matrix of standardized genotypes exhibits the theoretical phase transition between poor and complete selection of nonzeros (Proposition 1). We also found, as for Gaussian random matrices in earlier studies, that the phase transition depends on the scaling ratios ρ = s/n and δ = n/p[42].
We obtained results regarding the effect of noise (i.e., h ^{2} < 1) that are consistent with earlier empirical studies of random matrices and recently proven theorems [22, 24, 28]. Generally speaking, we show that the critical sample size is determined mainly by the ratio of s to n and only weakly sensitive to p, particularly as noise increases. For example, if h ^{2} = 0.5, which is roughly the narrowsense heritability of height and a number of other quantitative traits [7, 30, 37], we find that ρ should be less than approximately 0.03 for recovery irrespective of δ. There is no hope of recovering the complete vector of coefficients x above this threshold (i.e., smaller sample sizes). For example, if we have prior knowledge that s = 1, 200, then this means that the sample size should be no less than 40,000 subjects. We find empirically that for h ^{2} ∼ 0.5, n ∼ 30s is sufficient for selection of the nonzeros.
In real problems we cannot rely on measures of model recovery based on the unknown x. Hence, we introduced a new measure based on the median P value of the L _{1} selected nonzeros, μ _{ P − value}. We found that μ _{ P − value} provides a robust means of detecting the boundary between poor and good recovery. Proposition 2 shows that the recovery error NE in the favorable phase scales with ρ and noise; however, we observed that the recovery measures FPR, PPV and μ _{ P − value} approached zero faster than the NE, confirming that accurate identification of nonzeros can occur well before precise estimation of their magnitudes.
An L _{1} penalized regression algorithm is equivalent to linear regression with a Laplace prior distribution of coefficients, and in theory a Bayesian method invoking a prior distribution better matching the unknown true distribution of nonzero coefficients should outperform the lasso in effect estimation. However, it is by no means clear that the performance of L _{1} penalization with respect to selection can be bettered. For example, the lasso and BayesB display rather similar performance properties [17]. However, both methods clearly outperformed ridge regression (a non L _{1} method), which exhibited no phase transition away from poor performance. Furthermore, it is usually accepted by GWAS researchers that knowledge of the markers with nonzero coefficients may be quite valuable, even if the actual magnitudes of the coefficients are not well determined. Combining the advantages of different approaches by applying one of them to the L _{1} selected markers is a possibility.
Perhaps contrary to intuition, but consistent with theoretical results for CS [25, 42], we found that the phase transition to good recovery (at least as measured by μ _{ P − value}) was insensitive to the distribution of coefficient magnitudes. It is well known in CS that L _{1} penalized regression is nearly minimax optimal (minimizes the error of the worst case), and that the phase transition is robust to the distribution of coefficient magnitudes. In some cases a good prior may reduce the meansquare error and shift the location of the phase transition [52]. However, simulations supporting this notion, were performed with a much higher signaltonoise ratio (SNR) than hypothesized for realistic GWAS problems. The performance increase was attenuated as the SNR was decreased to levels still higher than usual in GWAS (10 dB or h ^{2} > 0.9 where SNR on the dB scale is given by $10\cdot {log}_{10}\left(\frac{{\mathit{\sigma}}_{\mathit{A}}^{2}}{{\mathit{\sigma}}_{\mathit{E}}^{2}}\right)$). These algorithms are currently being explored in lowerSNR regimes. We observed that crossvalidation did slightly affect the phase transition boundary in the noisy case; nevertheless the theoretical penalization parameter proved to be a good rule of thumb for initial screening. Calculating the theoretical penalty depends on knowledge of h ^{2}, which may be estimated using the genomicrelatedness method [7, 30–32].
Genomic selection methods have been criticized by researchers who doubt that the number of nonzeros (s) will typically be smaller than a practically attainable sample size (n) [19]. The application of CS theory circumvents this problem because it allows the optimization method to selfdetermine whether or not the nonzero markers are sufficiently sparse compared to the sample size. No prior assumptions are required. Furthermore, there is evidence that a number of traits satisfy the sparsity assumption in humans, at least with respect to common variants contributing to heritability [9–11].
CS theory does not provide performance guarantees in the presence of arbitrary correlations (LD) among predictor variables: it must be verified empirically, as we have done. In agreement with previous results [17], we find that the phase transition, as measured by NE, is strongly affected by LD. However, according to our simulations using all genotyped SNPs on chromosome 22, L _{1} penalized regression does select SNPs in close proximity to true nonzeros. The difficulty of finemapping an association signal to the actual causal variant is a limitation shared by all statistical genemapping approaches—including marginal regression as implemented in standard GWAS—and thus should not be interpreted as a drawback of L _{1} methods.
We found that a sample size of 12,464 was not sufficient to achieve full recovery of the nonzeros with respect to height. However, the penalization parameter λ is set by CS theory so as to minimize the NE based on the expected noiselevel. In some situations it might be desirable to tolerate a relatively large NE in order to achieve precise, but incomplete recovery (few false positives, many false negatives). By setting λ to a strict value appropriate for a lowheritability trait (in effect, looking for a subset of markers that account for only a fraction of the total heritability, with consequently higher noise), we found that a phase transition to good recovery can be achieved with smaller sample sizes, at the cost of selecting a smaller number of markers and hence suffering many false negatives.
One interesting feature of the recovery measure based on the median P value (μ _{ P − value}) is that it seemed to rise as the sample size was increased in the region of poor recovery and then fall after the sample size crossed the CSdetermined phase transition boundary. This rise and then fall was very dramatic in our simulations (Figures 2 and 4) and also appeared in our analysis of height (Figure 9). This behavior may be a consequence of the fact that as the sample size is increased, λ in the algorithm is decreased (see Methods). Hence, in the region of poor recovery, the relaxation of the penalty with increasing sample size may permit the selection of more SNPs and hence the inflation of the FPR and μ _{ P − value}. However, once the phase transition to good performance begins, the recovery measures begin their characteristic sharp decrease. This nonmonotone behavior accentuates the transition boundary and can be exploited to aid its detection.
In summary, compressed sensing utilizes properties of highdimensional systems that are surprising from the perspective of classical statistics. The regression problem faced by GWAS and GS is wellsuited to such an approach, and we have shown that the matrix of SNP genotypes formed from European GWAS data is in fact a wellconditioned sensing matrix. Consequently, we have inferred the sample sizes required to achieve accurate model recovery and demonstrated a method for determining whether the minimal sample size has in fact been obtained.
Methods
L _{1}penalized regression algorithm
where $\widehat{\mathbf{y}}$ is the estimated breeding value given by $\mathbf{A}\widehat{\mathbf{x}}$. The setting of the penalization parameter λ is described below.
We assumed convergence if the fractional change in the objective function given by Equation 2 was less than 10^{− 4}. In addition, we performed lasso with a warm start [54], using a logarithmic descent of 100 steps in λ with λ _{max} = ($\frac{\mathit{1}}{\mathit{n}}$) ∥ Ay ∥_{ L ∞}. For λ _{min} we used $\left({\mathit{\sigma}}_{\mathit{E}}^{*}/\mathit{n}\right)\phantom{\rule{0.5em}{0ex}}{\u2225\phantom{\rule{0.5em}{0ex}}\mathbf{Ae}\phantom{\rule{0.5em}{0ex}}\u2225}_{\mathit{L}\infty}$, where ${\mathit{\sigma}}_{\mathit{E}}^{*}=\sqrt{{\mathit{\sigma}}_{\mathit{E}}^{2}+\frac{1}{\mathit{n}}}$[22]. To estimate ∥A ' e∥_{ L∞} we created 1,000 sample vectors of e, each constructed with n i.i.d. normal elements with mean zero and variance one, and took the median across samples of ∥A ' e∥_{ L∞}. Estimates of $\left({\mathit{\sigma}}_{\mathit{A}}^{2},{\mathit{\sigma}}_{\mathit{E}}^{2}\right)$ with respect to the variants assayed in a given study can be obtained using the genomicrelatedness method [7, 30–32]. The algorithm can also accommodate any other covariates.
Computations
Simulations and analyses were performed using MATLAB 2013 (The MathWorks Inc., Natick, Massachusetts) and PLINK 2 [35, 36]. The L _{1} optimization algorithm was written in MATLAB and also a feature of PLINK 2. P values were estimated using MATLAB’s regstats function and PLINK 2. Colorcoded phase plane figures were generated by sampling the ρ − δ plane and interpolating between points using MATLAB’s scatteredInterpolant function. GWAS data were obtained from dbGaP as described in Data Description. Analysis scripts are available from the GigaScience GigaDB repository and maintained on GitHub [55, 56].
Statistics
The false positive rate (FPR) is the fraction of true zerovalued coefficients that are falsely identified as nonzero. The positive predictive value (PPV) is the number of correctly selected true nonzeros divided by the total number of nonzeros returned by the selection algorithm. 1 − PPV equals the false discovery rate (FDR). The adjusted positive predictive value (PPV*) is similar to the standard PPV, except that any selected nonzero coefficient falling within 500 kb of a GIANTidentified marker is counted as a true positive [39].
The median of the P values for the set of putative nonzeros (μ _{ P − value }) is obtained by: 1) regressing the phenotype on each of the L _{1} selected markers in turn, 2) estimating each P value as the standard twotailed probability from the t test of the null hypothesis that a univariate regression coefficient is equal to zero, and 3) taking the median over the independent tests. This procedure is independent of the selection algorithm and calculated after the L _{1} penalized algorithm has converged. The adjusted median P value (${\mathit{\mu}}_{\mathit{P}\mathrm{value}}^{*}$) is the median of the MR P values falling below the significance threshold divided by the threshold itself.
The LD measure (r ^{2}) is the squared estimate of the Pearson’s product–moment correlation between the standardized zeromean, unitvariance SNPs.
Analysis codes are archived in the GigaScience GigaDB repository and maintained on GitHub [55, 56].
Availability of supporting data
As noted above, the data sets supporting the results of this article are available through dbGaP accession numbers [ARIC:phs000090] and [GENEVA:phs000091], http://www.ncbi.nlm.nih.gov/gap[34]. Mock data sets supporting the results of this article are available in the GigaDB repository, doi:10.5524/100094 and http://gigadb.org/dataset/view/id/100094/[55].
Abbreviations
 ARIC:

Atherosclerosis risk in community
 CS:

Compressed sensing
 FDR:

False discovery rate
 FPR:

False positive rate
 GENEVA:

Gene environment association studies
 GIANT:

Genetic investigation of anthropometric traits
 GS:

Genomic selection
 GWAS:

Genomewide association study
 LD:

Linkage disequilibrium
 LE:

Linkage equilibrium
 MAF:

Minor allele frequency
 MR:

Marginal regression
 NE:

Normalized error
 OLS:

Ordinary least squares
 PPV:

Positive predictive value
 SNP:

Singlenucleotide polymorphism.
Declarations
Acknowledgments
We thank Nick Patterson for comments on earlier versions of this work and Phil Schniter for input on the EMGMAMP algorithm [52]. This work was supported by the Intramural Program of the NIH, National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK).
The Atherosclerosis Risk in Communities Study is carried out as a collaborative study supported by National Heart, Lung, and Blood Institute contracts (HHSN268201100005C, HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C, HHSN268201100010C, HHSN268201100011C, and HHSN268201100012C). The authors thank the staff and participants of the ARIC study for their important contributions. Funding for GENEVA was provided by National Human Genome Research Institute grant U01HG004402 (E. Boerwinkle).
Authors’ Affiliations
References
 Johnstone IM, Titterington DM: Statistical challenges of highdimensional data. Philos Trans R Soc A. 2009, 367: 42374253. 10.1098/rsta.2009.0159.View ArticleGoogle Scholar
 Hoggart CJ, Whittaker JC, De Iorio M, Balding DJ: Simultaneous analysis of all SNPs in genomewide and resequencing association studies. PLoS Genet. 2008, 4: e100013010.1371/journal.pgen.1000130.View ArticlePubMedPubMed CentralGoogle Scholar
 Goddard ME, Wray NR, Verbyla K, Visscher PM: Estimating effects and making predictions from genomewide marker data. Stat Sci. 2009, 24: 517529. 10.1214/09STS306.View ArticleGoogle Scholar
 Kemper KE, Daetwyler HD, Visscher PM, Goddard ME: Comparing linkage and association analyses in sheep points to a better way of doing GWAS. Genet Res. 2012, 94: 191203. 10.1017/S0016672312000365.View ArticleGoogle Scholar
 Genovese CR, Jin J, Wasserman L, Yao Z: A comparison of the lasso and marginal regression. J Mach Learn Res. 2012, 13: 21072143.Google Scholar
 Visscher PM, Brown MA, McCarthy MI, Yang J: Five years of GWAS discovery. Am J Hum Genet. 2012, 90: 724. 10.1016/j.ajhg.2011.11.029.View ArticlePubMedPubMed CentralGoogle Scholar
 Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, Goddard ME, Visscher PM: Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010, 42: 565569. 10.1038/ng.608.View ArticlePubMedPubMed CentralGoogle Scholar
 Tibshirani R: Regression shrinkage and selection via the lasso. J Roy Stat Soc B. 1996, 58: 267288.Google Scholar
 Park JH, Gail MH, Weinberg CR, Carroll RJ, Chung CC, Wang Z, Chanock SJ, Fraumeni JF, Chatterjee N: Distribution of allele frequencies and effect sizes and their interrelationships for common genetic susceptibility variants. Proc Natl Acad Sci U S A. 2011, 108: 1802618031. 10.1073/pnas.1114759108.View ArticlePubMedPubMed CentralGoogle Scholar
 Stahl EA, Wegmann D, Trynka G, GutierrezAchury J, Do R, Voight BF, Kraft P, Chen R, Kallberg HJ, Kurreeman FAS, Kathiresan S, Wijmenga C, Gregersen PK, Alfredsson L, Siminovitch KA, Worthington J, Bakker PIW d, Raychaudhuri S, Plenge RM, Diabetes Genetics Replication and MetaAnalysis Consortium, Myocardial Infarction Genetics Consortium: Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nat Genet. 2012, 44: 483489. 10.1038/ng.2232.View ArticlePubMedGoogle Scholar
 Ripke S, O’Dushlaine C, Chambert K, Moran JL, Kähler AK, Akterin S, Bergen SE, Collins AL, Crowley JJ, Fromer M, Kim Y, Lee SH, Magnusson PKE, Sanchez N, Stahl EA, Williams S, Wray NR, Xia K, Bettella F, Børglum AD, BulikSullivan BK, Cormican P, Craddock N, de Leeuw C, Durmishi N, Gill M, Golimbet V, Hamshere ML, Holmans P, Hougaard DM: Genomewide association analysis identifies 13 new risk loci for schizophrenia. Nat Genet. 2013, 45: 11501159. 10.1038/ng.2742.View ArticlePubMedPubMed CentralGoogle Scholar
 Meuwissen T, Hayes BJ, Goddard ME: Prediction of total genetic value using genomewide dense marker maps. Genetics. 2001, 157: 18191829.PubMedPubMed CentralGoogle Scholar
 de los Campos G, Gianola D, Allison DB: Predicting genetic predisposition in humans: the promise of wholegenome markers. Nat Rev Genet. 2010, 11: 880886. 10.1038/nrg2898.View ArticlePubMedGoogle Scholar
 Hayes BJ, Pryce J, Chamberlain AJ, Bowman PJ, Goddard ME: Genetic architecture of complex traits and accuracy of genomic prediction: coat colour, milkfat percentage, and type in Holstein cattle as contrasting model traits. PLoS Genet. 2010, 6: e100113910.1371/journal.pgen.1001139.View ArticlePubMedPubMed CentralGoogle Scholar
 Meuwissen T, Hayes BJ, Goddard ME: Accelerating improvement of livestock with genomic selection. Annu Rev Anim Biosci. 2013, 1: 221237. 10.1146/annurevanimal031412103705.View ArticlePubMedGoogle Scholar
 Usai MG, Goddard ME, Hayes BJ: LASSO with crossvalidation for genomic selection. Genet Res. 2009, 91: 427436. 10.1017/S0016672309990334.View ArticleGoogle Scholar
 Wimmer V, Lehermeier C, Albrecht T, Auinger HJ, Wang Y, Schön CC: Genomewide prediction of traits with different genetic architecture through efficient variable selection. Genetics. 2013, 195: 573587. 10.1534/genetics.113.150078.View ArticlePubMedPubMed CentralGoogle Scholar
 Zhou X, Carbonetto P, Stephens M: Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genet. 2013, 9: e100326410.1371/journal.pgen.1003264.View ArticlePubMedPubMed CentralGoogle Scholar
 Gianola D: Priors in wholegenome regression: the Bayesian alphabet returns. Genetics. 2013, 194: 573596. 10.1534/genetics.113.151753.View ArticlePubMedPubMed CentralGoogle Scholar
 Donoho DL, Tanner J: Sparse nonnegative solution of underdetermined linear equations by linear programming. Proc Natl Acad Sci U S A. 2005, 102: 94469451. 10.1073/pnas.0502269102.View ArticlePubMedPubMed CentralGoogle Scholar
 Candès EJ, Plan Y: Nearideal model selection by L _{1} minimization. Ann Stat. 2009, 37: 21452177. 10.1214/08AOS653.View ArticleGoogle Scholar
 Candès EJ, Plan Y: A probabilistic and RIPless theory of compressed sensing. IEEE Trans Inform Theory. 2011, 57: 72357254.View ArticleGoogle Scholar
 Candès EJ, Romberg J, Tao T: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans Inform Theory. 2006, 52: 489509.View ArticleGoogle Scholar
 Donoho DL, Maleki A, Montanari A: The noisesensitivity phase transition in compressed sensing. IEEE Trans Inform Theory. 2011, 57: 69206941.View ArticleGoogle Scholar
 Donoho DL, Maleki A, Montanari A: Messagepassing algorithms for compressed sensing. Proc Natl Acad Sci U S A. 2009, 106: 1891418919. 10.1073/pnas.0909892106.View ArticlePubMedPubMed CentralGoogle Scholar
 Donoho DL: Highdimensional centrally symmetric polytopes with neighborliness proportional to dimension. Discrete Comput Geom. 2006, 35: 617652. 10.1007/s0045400512200.View ArticleGoogle Scholar
 Donoho DL, Tanner J: Observed universality of phase transitions in highdimensional geometry, with implications for modern data analysis and signal processing. Philos Trans A Math Phys Eng Sci. 2009, 367: 427393. 10.1098/rsta.2009.0152.View ArticlePubMedGoogle Scholar
 Donoho DL, Stodden V: International joint conference on neural networks. Breakdown point of model selection when the number of variables exceeds the number of observations. 2006, 19161921.Google Scholar
 Monajemi H, Jafarpour S, Gavish M, Donoho DL, Stat 330/CME 362 Collaboration: Deterministic matrices matching the compressed sensing phase transition of Gaussian random matrices. Proc Natl Acad Sci U S A. 2013, 110: 11811186. 10.1073/pnas.1219540110.View ArticlePubMedGoogle Scholar
 Vattikuti S, Guo J, Chow CC: Heritability and genetic correlations explained by common SNPs for metabolic syndrome traits. PLoS Genet. 2012, 8 (3): e100263710.1371/journal.pgen.1002637. doi:10.1371/journal.pgen.1002637View ArticlePubMedPubMed CentralGoogle Scholar
 Vattikuti S, Chow CC: Software and supporting material for: “Heritability and genetic correlations explained by common SNPs for metabolic syndrome traits”. GitHub.https://github.com/ShashaankV/MVMLE,
 Lee JJ, Chow CC: Conditions for the validity of SNPbased heritability estimation. Hum Genet. 2014, doi10.1007/s0043901414415Google Scholar
 Johnstone IM: Oracle inequalities and nonparametric function estimation. Documenta Mathematica. 1998, 3: 267278.Google Scholar
 Mailman MD, Feolo M, Jin Y, Kimura M, Tryka K, Bagoutdinov R, Hao L, Kiang A, Paschall J, Phan L, Popova N, Pretel S, Ziyabari L, Lee M, Shao Y, Wang ZY, Sirotkin K, Ward M, Kholodov M, Zbicz K, Beck J, Kimelman M, Shevelev S, Preuss D, Yaschenko E, Graeff A, Ostell J, Sherry ST: The NCBI dbGaP database of genotypes and phenotypes. Nat Genet. 2007, 39 (10): 11816.View ArticlePubMedPubMed CentralGoogle Scholar
 Purcell SM, Neale BM, ToddBrown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, Bakker PIW d, Daly MJ, Sham PC: PLINK: A tool set for wholegenome association and populationbased linkage analyses. Am J Hum Genet. 2007, 81: 559575. 10.1086/519795.View ArticlePubMedPubMed CentralGoogle Scholar
 Shaun P, Christopher C: PLINK 2.https://www.coggenomics.org/plink2.
 Davies G, Tenesa A, Payton A, Yang J, Harris SE, Goddard ME, Liewald D, Ke X, Le Hellard S, Christoforou A, Luciano M, McGhee KA, Lopez LM, Gow AJ, Corley J, Redmond P, Fox HC, Haggarty P, Whalley LJ, McNeill G, Espeseth T, Lundervold AJ, Reinvang I, Pickles A, Steen VM, Ollier W, Porteous DJ, Horan MA, Starr JM, Pendleton N: Genomewide association studies establish that human intelligence is highly heritable and polygenic. Mol Psychiatry. 2011, 16: 9961005. 10.1038/mp.2011.85.View ArticlePubMedPubMed CentralGoogle Scholar
 Chatterjee N, Wheeler B, Sampson J, Hartge P, Chanock SJ, Park JH: Projecting the performance of risk prediction based on polygenic analyses of genomewide association studies. Nat Genet. 2013, 45: 400405. 10.1038/ng.2579.View ArticlePubMedPubMed CentralGoogle Scholar
 Lango Allen H, Estrada K, Lettre G, Berndt SI, Weedon MN, Rivadeneira F, Willer CJ, Jackson AU, Vedantam S, Raychaudhuri S, Ferreira T, Wood AR, Weyant RJ, Segre AV, Speliotes EK, Wheeler E, Soranzo N, Park JH, Yang J, Gudbjartsson D, HeardCosta NL, Randall JC, Qi L, Vernon Smith A, Magi R, Pastinen T, Liang L, Heid IM, Luan J, Thorleifsson G: Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature. 2010, 467: 832838. 10.1038/nature09410.View ArticlePubMedPubMed CentralGoogle Scholar
 Johnson AD, Handsaker RE, Pulit SL, Nizzari MM, O’Donnell CJ, Bakker PIW d: SNAP: A webbased tool for identification and annotation of proxy SNPs using HapMap. Bioinformatics. 2008, 24: 29382939. 10.1093/bioinformatics/btn564.View ArticlePubMedPubMed CentralGoogle Scholar
 Abraham G, Kowalczyk A, Zobel J, Inouye M: Performance and robustness of penalized and unpenalized methods for genetic prediction of complex human disease. Genet Epidemiol. 2013, 37: 184195. 10.1002/gepi.21698.View ArticlePubMedGoogle Scholar
 Donoho DL, Tanner J: Precise undersampling theorems. Proc IEEE. 2010, 98: 913924. 10.1109/JPROC.2010.2045630.View ArticleGoogle Scholar
 Storey J, Tibshirani R: Statistical significance for genomewide studies. Proc Natl Acad Sci. 2003, 100: 94409445. 10.1073/pnas.1530509100.View ArticlePubMedPubMed CentralGoogle Scholar
 Maller JB, McVean G, Byrnes J, Vukcevic D, Palin K, Su Z, Howson JMM, Auton A, Myers S, Morris A, Pirinen M, Brown MA, Burton PR, Caulfield MJ, Compston A, Farrall M, Hall AS, Hattersley AT, Hill AVS, Mathew CG, Pembrey M, Satsangi J, Stratton MR, Worthington J, Craddock N, Hurles M, Ouwehand WH, Parkes M, Rahman N, Duncanson A: Bayesian refinement of association signals for 14 loci in 3 common diseases. Nat Genet. 2012, 44: 12941301. 10.1038/ng.2435.View ArticlePubMedPubMed CentralGoogle Scholar
 Edwards SL, Beesley J, French JD, Dunning AM: Beyond GWASs: illuminating the dark road from association to function. Am J Hum Genet. 2013, 93: 779797. 10.1016/j.ajhg.2013.10.012.View ArticlePubMedPubMed CentralGoogle Scholar
 Hedrick PW: Gametic disequilibrium measures: proceed with caution. Genetics. 1987, 117: 33141.PubMedPubMed CentralGoogle Scholar
 Wray NR, Purcell SM, Visscher PM: Synthetic associations created by rare variants do not explain most GWAS results. PLoS Biol. 2011, 9: e100057910.1371/journal.pbio.1000579.View ArticlePubMedPubMed CentralGoogle Scholar
 Yang J, Ferreira T, Morris AP, Medland SE, Madden PAF, Heath AC, Martin NG, Montgomery GW, Weedon MN, Loos RJ, Frayling TM, McCarthy MI, Hirschhorn JN, Goddard ME, Visscher PM, Genetic Investigation of Anthropometric Traits Consortium, Diabetes Genetics Replication and MetaAnalysis Consortium: Conditional and joint multipleSNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat Genet. 2012, 44: 369375. 10.1038/ng.2213.View ArticlePubMedPubMed CentralGoogle Scholar
 Candès EJ, Romberg JK, Tao T: Stable signal recovery from incomplete and inaccurate measurements. Commun Pure Appl Math. 2006, 59: 12071223. 10.1002/cpa.20124.View ArticleGoogle Scholar
 Wellcome Trust Case Control Consortium: Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447: 661678. 10.1038/nature05911.View ArticleGoogle Scholar
 Turchin MC, Chiang CWK, Palmer CD, Sankararaman S, Reich D, Hirschhorn JN, Genetic Investigation of Anthropometric Traits Consortium: Evidence of widespread selection on standing variation in Europe at heightassociated SNPs. Nat Genet. 2012, 44: 10151019. 10.1038/ng.2368.View ArticlePubMedPubMed CentralGoogle Scholar
 Vila J, Schniter P: Expectationmaximization gaussianmixture approximate message passing. IEEE Trans Signal Process. 2013, 61: 48584672.View ArticleGoogle Scholar
 Friedman J, Hastie T, Höfling H, Tibshirani R: Pathwise coordinate optimization. Ann Appl Stat. 2007, 1: 302332. 10.1214/07AOAS131.View ArticleGoogle Scholar
 Friedman J, Hastie T, Tibshirani R: Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010, 33: 122.View ArticlePubMedPubMed CentralGoogle Scholar
 Vattikuti S, Lee JJ, Chang CC, Hsu SDH, Chow CC: Software and supporting material for: “Applying compressed sensing to genomewide association studies”. GigaScience Database. 2014,http://dx.doi.org/10.5524/100094,Google Scholar
 Vattikuti S, Lee JJ, Chang CC, Hsu SDH, Chow CC: Software and supporting material for: “Applying compressed sensing to genomewide association studies”. GitHub. https://github.com/ShashaankV/CS and https://github.com/ShashaankV/GD
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.