The signed Kolmogorov-Smirnov test: why it should not be used
Guillaume J Filion^{1, 2}
DOI: 10.1186/s13742-015-0048-7
© Filion; licensee BioMed Central. 2015
Received: 2 December 2014
Accepted: 3 February 2015
Published: 27 February 2015
Abstract
The two-sample Kolmogorov-Smirnov (KS) test is often used to decide whether two random samples have the same statistical distribution. A popular modification of the KS test is to use a signed version of the KS statistic to infer whether the values of one sample are statistically larger than the values of the other. The underlying hypotheses of the KS test are intrinsically incompatible with this approach and the test can produce false positives supported by extremely low p-values. This potentially makes the signed KS test a tool of p-hacking, which should be discouraged by replacing it with standard tests such as the t-test and by providing confidence intervals instead of p-values.
Keywords
Kolmogorov-Smirnov test; Statistics; P-value; P-hacking
Background
From its inception, the two-sample Kolmogorov-Smirnov (KS) test was designed as a generic method to test whether two random samples are drawn from the same distribution. The null hypothesis of the KS test is that both distributions are identical, without any further assumption regarding their location and shape, which makes the KS test widely applicable. The statistic of the KS test is a distance between the two empirical distributions, computed as the maximum absolute difference between their cumulative curves. Several studies in the field of genomics (such as [1-5]) have suggested the use of the signed difference between the cumulative curves. According to this view, the sign of the statistic indicates which of the two distributions has the larger values. This procedure does not have a formal name; for clarity, I will refer to it as the “signed KS test” (sKS test).
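The statistic described above can be sketched in a few lines of Python. This is an illustration only, not code from any of the cited studies: the function name `signed_ks_statistic` and the sign convention (first sample's cumulative curve minus the second's) are assumptions made for this example.

```python
from bisect import bisect_right


def signed_ks_statistic(x, y):
    """Return (D, signed_D) for two samples x and y.

    D is the usual two-sample KS statistic, the largest absolute gap
    between the two empirical cumulative curves. signed_D is that same
    gap with its sign kept, computed as F_x(t) - F_y(t) (an assumed
    convention for this sketch).
    """
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    best = 0.0
    # The gap between step functions can only change at observed values.
    for t in xs + ys:
        diff = bisect_right(xs, t) / n - bisect_right(ys, t) / m
        if abs(diff) > abs(best):
            best = diff
    return abs(best), best
```

With this convention, a positive sign means that the cumulative curve of `x` lies above that of `y` at the point of largest gap, which the sKS reasoning reads as `x` having the smaller values.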
However, this argument makes an implicit assumption that does not necessarily hold. Figure 1A shows two curves with the same shape, which means that they can differ only by their location, i.e. by a shift to the left or to the right. The KS test, however, discriminates between distributions that differ in either location or shape.
Figure 1B shows another ideal example of two distributions compared by the sKS test, but this time they differ only in their variance. There are two positions at which the cumulative curves differ the most, which is why two arrows are drawn. More importantly, one arrow points upward, whereas the other points downward, so that the sign of the sKS statistic is undefined. In finite samples, the distributions are never perfectly symmetrical, so one of the two arrows would be longer than the other, and each has a probability of 0.5 of being the longer one. Interestingly, the p-value would be extremely small if the samples are large, but the sign of the sKS statistic would be random.
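The variance-only scenario can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not the author's analysis: the sample size, the standard deviations (1 and 3), the number of replicates and all function names are arbitrary choices, and the p-value uses the standard asymptotic Kolmogorov series rather than an exact computation.

```python
import math
import random
from bisect import bisect_right


def signed_ks(x, y):
    """Signed KS statistic, F_x - F_y at the point of largest absolute gap."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    best = 0.0
    for t in xs + ys:
        diff = bisect_right(xs, t) / n - bisect_right(ys, t) / m
        if abs(diff) > abs(best):
            best = diff
    return best


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sided KS p-value (Kolmogorov series approximation)."""
    lam = abs(d) * math.sqrt(n * m / (n + m))
    if lam < 0.2:  # the series converges slowly here; the p-value is ~1
        return 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                 for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * series))


def simulate(n=1000, replicates=200, sd_narrow=1.0, sd_wide=3.0, seed=1):
    """Compare two zero-mean Gaussian samples that differ only in spread.

    Returns the fraction of replicates with a positive sign and the
    largest p-value observed across replicates.
    """
    rng = random.Random(seed)
    positive = 0
    largest_p = 0.0
    for _ in range(replicates):
        x = [rng.gauss(0.0, sd_narrow) for _ in range(n)]
        y = [rng.gauss(0.0, sd_wide) for _ in range(n)]
        d = signed_ks(x, y)
        positive += d > 0
        largest_p = max(largest_p, ks_pvalue(d, n, n))
    return positive / replicates, largest_p
```

Calling `simulate()` returns a fraction of positive signs that should be close to one half, together with a largest p-value that is still very small, which is the behavior described for Figure 1B: strong significance, arbitrary direction.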
This ideal example never happens in practice. The distributions of biological samples typically differ in both shape and location, so the situation shown in Figure 1B is unrealistic. In a real example, the difference between the shapes of the distributions will boost the significance of the sKS test, yielding low p-values even when the difference in location is modest or nonexistent.
If the distributions have the same shape (as in Figure 2A), then the sKS test is meaningful, but there is still no reason to use it because it is less powerful than the t-test and even the Wilcoxon-Mann-Whitney test. In other words, the t-test and the Wilcoxon-Mann-Whitney test are more likely to detect a shift when it exists (Garrett Jenkinson provides a power analysis in his review of this article, available in the pre-publication history). This issue is due in part to floor and ceiling effects, meaning that the sKS test statistic will be small if the largest difference falls in either tail of the distribution.
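The power comparison can be illustrated with a rough Monte Carlo sketch. This is not the reviewer's power analysis: it assumes Gaussian samples of equal variance, a shift of 0.5 standard deviations and 50 observations per group, and it uses large-sample normal approximations for the t-test and the Wilcoxon-Mann-Whitney test and the asymptotic series for the KS test. All names and parameter values are hypothetical.

```python
import math
import random
from bisect import bisect_left, bisect_right
from statistics import NormalDist, mean, variance

_NORMAL = NormalDist()


def z_to_p(z):
    """Two-sided p-value of a standard normal score."""
    return 2.0 * (1.0 - _NORMAL.cdf(abs(z)))


def t_test_p(x, y):
    """Welch statistic with a normal approximation (assumes large samples)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return z_to_p((mean(x) - mean(y)) / se)


def wmw_p(x, y):
    """Wilcoxon-Mann-Whitney test, normal approximation, no tie correction."""
    ys = sorted(y)
    n, m = len(x), len(y)
    u = sum(bisect_left(ys, v) for v in x)  # pairs with y_j < x_i
    sigma = math.sqrt(n * m * (n + m + 1) / 12.0)
    return z_to_p((u - n * m / 2.0) / sigma)


def ks_p(x, y):
    """Two-sample KS test with the asymptotic Kolmogorov series."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = max(abs(bisect_right(xs, t) / n - bisect_right(ys, t) / m)
            for t in xs + ys)
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:  # the series converges slowly here; the p-value is ~1
        return 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                 for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * series))


def estimate_power(shift=0.5, n=50, replicates=400, alpha=0.05, seed=1):
    """Fraction of replicates in which each test rejects at level alpha."""
    rng = random.Random(seed)
    tests = {"t": t_test_p, "wmw": wmw_p, "ks": ks_p}
    hits = dict.fromkeys(tests, 0)
    for _ in range(replicates):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(shift, 1.0) for _ in range(n)]
        for name, test in tests.items():
            hits[name] += test(x, y) < alpha
    return {name: count / replicates for name, count in hits.items()}
```

Under these assumptions, `estimate_power()` should report a lower rejection rate for the KS test than for the other two tests. The exact figures depend on the chosen shift, sample size and approximations, so they should be read as qualitative support only.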
It is thus surprising that an unconventional approach such as the sKS test would be used in place of an established standard such as the t-test. Among other reasons, it may be part of a flawed practice called “p-hacking”, which is to test the same statistical hypothesis in different ways until a target p-value is obtained. The misconception at the root of p-hacking is that a higher statistical significance entails a larger biological response (Figure 2B is an example of the opposite). Replacing the sKS test with more standard options would be an improvement, but a better method can be used.
When using a statistical test to evaluate the significance of a response, it is important to conclude with a statement regarding the magnitude of the effect. Confidence intervals are a natural check that should help researchers distinguish statistical significance from biological significance. For instance, in the example shown in Figure 2A, the t-test yields a p-value lower than 2.2 × 10^{−16}, which suggests that, in B cells, Spi1 HOMER scores are different between active and inactive enhancers. However, giving (0.36, 0.53) as a 95% confidence interval for this difference is more informative because it is a specific statement about the magnitude and it allows the reader to decide whether it is biologically relevant.
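A confidence interval for a difference in means takes only a few lines to compute. The sketch below is a generic illustration, not the computation behind the interval quoted above: it assumes samples large enough for a normal approximation (a t quantile would be more accurate for small samples), and the function name `mean_difference_ci` is hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def mean_difference_ci(x, y, level=0.95):
    """Approximate confidence interval for mean(y) - mean(x).

    Uses the Welch standard error and a normal quantile, which is a
    large-sample approximation.
    """
    diff = mean(y) - mean(x)
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return diff - z * se, diff + z * se
```

For the toy samples `[1, 2, 3, 4, 5]` and `[2, 3, 4, 5, 6]`, the difference in means is 1 and the standard error is 1, so the interval is roughly 1 ± 1.96. An interval this wide makes it plain that the data say little about the size of the effect, which is the kind of information a p-value alone does not convey.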
Conclusions
At a time when the field of genomics is progressively becoming standardized, it is important to enforce a certain statistical rigor. The sKS test is not consistent and it is less powerful than the t-test and the Wilcoxon-Mann-Whitney test, so there is no reason to use it unless carefully justified. More generally, testing a statistical response should include some information about the magnitude of the effect, for instance in the form of a confidence interval. Such practices would provide valuable information to researchers and discourage p-hacking.
Abbreviations
KS test: Kolmogorov-Smirnov test
sKS: signed Kolmogorov-Smirnov test
Declarations
Acknowledgements
I thank Garrett Jenkinson and Desmond D Campbell for their helpful and constructive comments.
Authors’ Affiliations
References
1. Lara-Astiaso D, Weiner A, Lorenzo-Vivas E, Zaretsky I, Jaitin DA, David E, et al. Chromatin state dynamics during blood formation. Science. 2014;345:943–9.
2. Winter EE, Goodstadt L, Ponting CP. Elevated rates of protein secretion, evolution, and disease among tissue-specific genes. Genome Res. 2004;14:54–61.
3. Al-Shahrour F, Minguez P, Tárraga J, Medina I, Alloza E, Montaner D, et al. FatiGO +: a functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments. Nucleic Acids Res. 2007;35 Web Server Issue:W91–6.
4. Stark KL, Xu B, Bagchi A, Lai WS, Liu H, Hsu R, et al. Altered brain microRNA biogenesis contributes to phenotypic deficits in a 22q11-deletion mouse model. Nat Genet. 2008;40:751–60.
5. Landan G, Cohen NM, Mukamel Z, Bar A, Molchadsky A, Brosh R, et al. Epigenetic polymorphism and the stochastic formation of differentially methylated regions in normal and cancerous tissues. Nat Genet. 2012;44:1207–14.
6. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell. 2010;38:576–89.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.