The signed Kolmogorov-Smirnov test: why it should not be used
© Filion; licensee BioMed Central. 2015
Received: 2 December 2014
Accepted: 3 February 2015
Published: 27 February 2015
The two-sample Kolmogorov-Smirnov (KS) test is often used to decide whether two random samples have the same statistical distribution. A popular modification of the KS test is to use a signed version of the KS statistic to infer whether the values of one sample are statistically larger than the values of the other. The underlying hypotheses of the KS test are intrinsically incompatible with this approach, and the test can produce false positives supported by extremely low p-values. This potentially makes the signed KS test a tool of p-hacking; its use should be discouraged by replacing it with standard tests such as the t-test and by providing confidence intervals instead of p-values.
Keywords: Kolmogorov-Smirnov test, statistics, p-value, p-hacking
From its inception, the two-sample Kolmogorov-Smirnov (KS) test was designed as a generic method to test whether two random samples are drawn from the same distribution. The null hypothesis of the KS test is that both distributions are identical, without any further assumption regarding their location and shape, which makes the KS test widely applicable. The statistic of the KS test is a distance between the two empirical distributions, computed as the maximum absolute difference between their cumulative curves. Several studies in the field of genomics (such as [1-5]) have suggested the use of the signed difference between the cumulative curves. According to this view, the sign of the statistic indicates which of the two distributions has the larger values. This procedure does not have a formal name; for clarity, I will refer to it as the “signed KS test” (sKS test).
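To make the two statistics concrete, here is a minimal sketch in Python of the signed statistic described above: the difference between the two empirical cumulative curves at the point where its absolute value is largest (the function name `signed_ks` and the sign convention are illustrative choices, not taken from the cited studies).

```python
import numpy as np

def signed_ks(x, y):
    """Signed KS statistic: the difference between the empirical CDFs of
    x and y at the point where the absolute difference is largest.
    With this (illustrative) sign convention, a positive value means the
    CDF of x lies above that of y there, i.e. x tends toward smaller values."""
    grid = np.sort(np.concatenate([x, y]))
    cdf_x = np.searchsorted(np.sort(x), grid, side="right") / x.size
    cdf_y = np.searchsorted(np.sort(y), grid, side="right") / y.size
    diff = cdf_x - cdf_y
    return diff[np.argmax(np.abs(diff))]
```

The unsigned KS statistic is simply the absolute value of this quantity; dropping the absolute value is the entire modification at issue.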
However, this argument makes an implicit assumption that does not necessarily hold. Figure 1A shows two curves with the same shape, which means that they can differ only in their location, i.e. by a shift to the left or to the right. The KS test, however, discriminates distributions when they differ in either their location or their shape.
Figure 1B shows another idealized example of two distributions compared by the sKS test, but this time they differ only in their variance. There are two positions at which the cumulative curves differ the most, which is why two arrows are drawn. More importantly, one arrow points upward whereas the other points downward, so the sign of the sKS statistic is undefined. In finite samples, the distributions are never perfectly symmetrical, so one of the two arrows would be longer than the other, each with probability 0.5. If the samples are large, the p-value would be extremely small, yet the sign of the sKS statistic would be random.
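This behavior is easy to reproduce by simulation. The sketch below, under the assumption that both samples are normal with the same mean but different variances (as in Figure 1B), shows that the KS p-value is vanishingly small while the sign of the statistic is essentially a coin flip:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def signed_ks(x, y):
    # CDF difference at the point of maximum absolute gap
    grid = np.sort(np.concatenate([x, y]))
    diff = (np.searchsorted(np.sort(x), grid, side="right") / x.size
            - np.searchsorted(np.sort(y), grid, side="right") / y.size)
    return diff[np.argmax(np.abs(diff))]

signs = []
for _ in range(200):
    x = rng.normal(0.0, 1.0, 5000)   # same location...
    y = rng.normal(0.0, 2.0, 5000)   # ...different variance
    signs.append(np.sign(signed_ks(x, y)))

pvalue = stats.ks_2samp(x, y).pvalue        # last simulated pair
frac_positive = np.mean(np.array(signs) > 0)
print(pvalue)         # vanishingly small: the distributions do differ
print(frac_positive)  # close to 0.5: the sign is essentially random
```

The KS test is doing its job here (the distributions really are different); it is only the sign that carries no information.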
This idealized example never occurs in practice: the distributions of biological samples typically differ in both shape and location, so the situation shown in Figure 1B is unrealistic. In a real example, the difference between the shapes of the distributions will boost the significance of the sKS test, yielding low p-values even when the difference in location is modest or non-existent.
If the distributions have the same shape (as in Figure 2A), then the sKS test is meaningful, but there is still no reason to use it, because it is less powerful than the t-test and even the Wilcoxon-Mann-Whitney test. In other words, the t-test and the Wilcoxon-Mann-Whitney test have a better chance of detecting a shift when it exists (Garrett Jenkinson provides a power analysis in his review of this article, available in the pre-publication history). This issue is due in part to floor and ceiling effects: the sKS statistic is necessarily small when the cumulative curves differ most in either tail of the distribution, where both curves are close to 0 or 1.
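A small simulation illustrates the power gap under one specific assumption, namely that both samples are normal with equal variance and a modest location shift (sample size, shift, and significance level below are arbitrary choices for the sketch):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sim, n, shift, alpha = 1000, 50, 0.5, 0.05
rejections = {"t-test": 0, "Wilcoxon": 0, "KS": 0}

for _ in range(n_sim):
    x = rng.normal(0.0, 1.0, n)
    y = rng.normal(shift, 1.0, n)   # pure location shift, same shape
    rejections["t-test"]   += stats.ttest_ind(x, y).pvalue < alpha
    rejections["Wilcoxon"] += stats.mannwhitneyu(x, y).pvalue < alpha
    rejections["KS"]       += stats.ks_2samp(x, y).pvalue < alpha

for name, hits in rejections.items():
    print(name, hits / n_sim)   # empirical power at level alpha
```

In this setting the t-test and the Wilcoxon-Mann-Whitney test reject the null hypothesis more often than the KS test, i.e. they detect the shift that is actually there in a larger fraction of the simulations.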
It is thus surprising that an unconventional approach such as the sKS test would be used in place of an established standard such as the t-test. Among other reasons, it may be part of a flawed practice called “p-hacking”, which is to test the same statistical hypothesis in different ways until a target p-value is obtained. The misconception at the root of p-hacking is that a higher statistical significance entails a larger biological response (Figure 2B is an example of the opposite). Replacing the sKS test by more standard options would be an improvement, but a better method can be used.
When using a statistical test to evaluate the significance of a response, it is important to conclude with a statement regarding the magnitude of the effect. Confidence intervals are a natural check that should help researchers distinguish statistical significance from biological significance. For instance, in the example shown in Figure 2A, the t-test yields a p-value lower than 2.2 × 10−16, which suggests that, in B cells, Spi1 HOMER scores are different between active and inactive enhancers. However, giving (0.36, 0.53) as a 95% confidence interval for this difference is more informative because it is a specific statement about the magnitude and it allows the reader to decide whether it is biologically relevant.
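The computation behind such a statement can be sketched as follows. The data here are simulated stand-ins (the real Spi1 HOMER scores are not reproduced, and the assumed mean difference of 0.45 is only for illustration); the confidence interval uses the textbook pooled-degrees-of-freedom approximation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical scores standing in for the Spi1 HOMER example;
# the assumed true mean difference of 0.45 is an illustration only.
active   = rng.normal(0.45, 1.0, 400)
inactive = rng.normal(0.00, 1.0, 400)

diff = active.mean() - inactive.mean()
se = np.sqrt(active.var(ddof=1) / active.size
             + inactive.var(ddof=1) / inactive.size)
df = active.size + inactive.size - 2          # pooled-df approximation
half_width = stats.t.ppf(0.975, df) * se      # 95% two-sided
pvalue = stats.ttest_ind(active, inactive).pvalue

print(f"p = {pvalue:.2e}")
print(f"95% CI for the difference: "
      f"({diff - half_width:.2f}, {diff + half_width:.2f})")
```

The p-value and the interval answer different questions: the former says the difference is unlikely to be zero, the latter says how large it plausibly is.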
At a time when the field of genomics is progressively becoming standardized, it is important to enforce a certain statistical rigor. The sKS test is not consistent, and it is less powerful than the t-test and the Wilcoxon-Mann-Whitney test, so there is no reason to use it unless carefully justified. More generally, testing a statistical response should include some information about the magnitude of the effect, for instance in the form of a confidence interval. Such practices would provide valuable information to researchers and discourage p-hacking.
Abbreviations
- KS test: Kolmogorov-Smirnov test
- sKS test: signed Kolmogorov-Smirnov test
I thank Garrett Jenkinson and Desmond D Campbell for their helpful and constructive comments.
1. Lara-Astiaso D, Weiner A, Lorenzo-Vivas E, Zaretsky I, Jaitin DA, David E, et al. Chromatin state dynamics during blood formation. Science. 2014;345:943–9.
2. Winter EE, Goodstadt L, Ponting CP. Elevated rates of protein secretion, evolution, and disease among tissue-specific genes. Genome Res. 2004;14:54–61.
3. Al-Shahrour F, Minguez P, Tárraga J, Medina I, Alloza E, Montaner D, et al. FatiGO +: a functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments. Nucleic Acids Res. 2007;35(Web Server issue):W91–6.
4. Stark KL, Xu B, Bagchi A, Lai WS, Liu H, Hsu R, et al. Altered brain microRNA biogenesis contributes to phenotypic deficits in a 22q11-deletion mouse model. Nat Genet. 2008;40:751–60.
5. Landan G, Cohen NM, Mukamel Z, Bar A, Molchadsky A, Brosh R, et al. Epigenetic polymorphism and the stochastic formation of differentially methylated regions in normal and cancerous tissues. Nat Genet. 2012;44:1207–14.
6. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell. 2010;38:576–89.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.