Reanalyzing Head et al. (2015): No widespread p-hacking after all?

Abstract

Statistical significance seeking (i.e., p-hacking) is a serious problem for the validity of research, especially if it occurs frequently. Head et al. provided evidence for widespread p-hacking throughout the sciences, which would indicate that the validity of science is in doubt. Previous substantive concerns about their selection of p-values indicated that it was too liberal: by selecting all reported p-values, they included results that would not have been interesting to p-hack. Despite this liberal selection, Head et al. found evidence for p-hacking, which raises the question of why p-hacking was detected even though detecting it was unlikely a priori. In this paper I reanalyze the original data and show that Head et al.'s results are an artefact of rounding in the reporting of p-values.

Megan Head and colleagues (Head 2015) provide a large collection of p-values that, in their view, indicates widespread statistical significance seeking (i.e., p-hacking) throughout the sciences. Concerns have been raised that their selection of p-values was too liberal and unlikely to detect p-hacking to begin with (Simonsohn 2015), which raises the question of why Head et al. nonetheless found evidence for p-hacking. The analyses that form the basis of their conclusions rest on the tenet that p-hacked papers show p-value distributions that are left-skewed below .05 (Simonsohn 2014). In this paper I evaluate the selection choices and analytic strategy of the original paper and suggest that Head et al. found widespread p-hacking as an artefact of rounding. Analysis files for this paper are available at https://osf.io/sxafg/.

The p-value distribution of a set of heterogeneous results, as collected by Head et al., should be a mixture of only the uniform p-value distribution under the null hypothesis \(H_0\) and right-skewed p-value distributions under the alternative hypothesis \(H_1\). Questionable p-hacking behaviors affect the p-value distribution. An example is optional stopping, which causes a bump of p-values just below .05 only if the null hypothesis is true (Lakens 2014).

Head et al. correctly argue that an aggregate p-value distribution can show a bump below .05 if optional stopping under the null hypothesis, or other behavior that seeks just-significant results, occurs frequently. Consequently, a bump below .05 (i.e., left skew) is a sufficient condition for concluding that specific forms of p-hacking are present. However, such a bump is not a necessary condition, because other types of p-hacking do not cause one. For example, a researcher might use optional stopping when there is a true effect (Lakens 2014), or conduct multiple analyses but report only the one that yielded the smallest p-value. Therefore, the absence of a bump does not exclude the possibility that p-hacking occurs on a large scale.
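
To make this left-skew signature concrete, the minimal simulation below sketches optional stopping under a true null hypothesis. It is an illustration only, not code from Head et al. or from the reanalysis archive; the sample sizes, number of interim looks, and number of simulation runs are arbitrary assumptions.

```python
# Illustration only (not from Head et al. or the reanalysis archive): simulate
# optional stopping under a true null hypothesis, which piles significant
# p-values up just below .05 (Lakens 2014). All settings are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def optional_stopping_p(n_start=20, n_add=10, extra_looks=4):
    """Final p-value of a one-sample t-test with data added until p < .05."""
    x = rng.normal(0, 1, n_start)              # H0 is true: population mean is 0
    p = stats.ttest_1samp(x, 0).pvalue
    for _ in range(extra_looks):
        if p < .05:                            # stop as soon as p is significant
            break
        x = np.concatenate([x, rng.normal(0, 1, n_add)])
        p = stats.ttest_1samp(x, 0).pvalue
    return p

p_values = np.array([optional_stopping_p() for _ in range(20000)])
significant = p_values[p_values < .05]

# Compare the last two .005-wide bins below .05: under this form of p-hacking
# the bin (.045, .05) is overrepresented relative to (.04, .045), i.e. left skew.
print(np.sum(significant > .045), np.sum((significant > .04) & (significant < .045)))
```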

This paper is structured into three parts: (i) explaining the data analytic strategy of the reanalysis, (ii) reevaluating the evidence for left-skew p-hacking based on the reanalysis, and (iii) discussing the findings in light of the literature.

Reanalytic strategy

Head and colleagues' data analytic strategy focused on comparing the frequencies in the last and penultimate bins below .05 at a binwidth of .005. Based on the tenet that p-hacking introduces a left-skewed p-value distribution (Simonsohn 2014), evidence for p-hacking is present if, in a binomial test, the last bin has a sufficiently higher frequency than the penultimate one. Applying a binomial test to two frequency bins has previously been used in publication bias research and is typically called a Caliper test (Gerber 2010, Kühberger 2014); here it is applied specifically to test for left-skew p-hacking.
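
For readers who want to reproduce this type of test, a minimal sketch of the Caliper test is given below. It is not the original analysis script; it assumes scipy >= 1.7 for `binomtest`, and the counts in the usage example are made up.

```python
# Minimal sketch of the Caliper test as a one-sided binomial test on two bin
# counts. Not the original analysis code; the counts below are made up.
from scipy.stats import binomtest  # requires scipy >= 1.7

def caliper_test(n_last, n_penultimate):
    """Test H0: proportion of p-values in the last bin <= .5 (no left skew)."""
    res = binomtest(n_last, n_last + n_penultimate, p=0.5, alternative="greater")
    return res.proportion_estimate, res.pvalue

# Example with made-up counts: 120 p-values in the last bin, 100 in the penultimate.
prop, p_value = caliper_test(120, 100)
print(prop, p_value)  # evidence for left skew requires prop > .5 and a small p-value
```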

The two panels in Fig 1 depict the selection of p-values in the original and the current paper. The top panel shows the selection made by Head et al. (i.e., \(.04<p< .045\) versus \(.045<p<.05\)), where the right bin has a slightly higher frequency than the left bin. This is the evidence Head et al. found for p-hacking. However, if we widen the range and look at the entire distribution of significant p-values, we see that this is an unrepresentative part of that distribution.

Fig. 1. Histogram of p-values as selected in Head et al. (\(.04 < p < .045\) versus \(.045 < p < .05\); top) and the full \(p\)-value distribution \(\leq.05\) (binwidth = .00125; bottom).

The bottom panel of Fig 1 indicates a reporting tendency at the second decimal for p-values larger than or equal to \(.01\). If no reporting tendencies existed, the distribution would be reasonably smooth, resembling the part of the distribution between \(0\) and \(.01\). Instead, p-value frequencies spike at each second-decimal value. A post-hoc explanation is that reporting p-values to three decimal places has only been prescribed in psychology since 2010 (APA 2010), whereas earlier editions of the publication manual prescribed two-decimal reporting (APA 1983, APA 2001). Because reporting to the second decimal has been common for a long time and visibly has a substantial effect on the distribution, it is important to take this into account when selecting the bins.

Head et al. selected the bins as indicated in the top panel of Fig 1, removing the second decimal. For their tests of p-hacking, they compared the frequencies of the adjacent bins \(.04<p<.045\) and \(.045<p<.05\). The original authors “suspect that many authors do not regard \(p=.05\) as significant” (Head 2015), which is why they eliminated the second decimal from their analyses by using the selection criterion \(<.05\). However, a previous investigation of p-values reported as exactly .05 revealed that in 94.3% of 236 cases the result was interpreted as statistically significant (Nuijten 2015).

This contradicts the premise that most researchers do not interpret \(p=.05\) as significant, which removes the reason for eliminating the second decimal. Consequently, only exactly reported p-values smaller than or equal to .05 were retained for the reanalyses, whereas Head et al. retained only exactly reported p-values smaller than .05. Moreover, because of the reporting tendencies and the inclusion of the second decimal, the analyses need to compare the frequencies in the bins ending at .04 and .05 (e.g., \(.03875<p\leq.04\) versus \(.04875<p\leq.05\) for binwidth .00125). These correspond to the two bins shown at .04 and .05 in the bottom panel of Fig 1.
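
As a sketch of how this selection translates into bin counts, the snippet below keeps only exactly reported p-values at or below .05 and tallies the two .00125-wide bins ending at .04 and .05. The file and column names are assumptions made for illustration; the actual data and scripts are in the OSF archive linked above.

```python
# Sketch of the reanalysis bin selection; file and column names are assumed
# for illustration (the actual data and scripts are at https://osf.io/sxafg/).
import pandas as pd

df = pd.read_csv("p_values.csv")                            # hypothetical input
exact = df[(df["reported_sign"] == "=") & (df["p_value"] <= 0.05)]

# The .00125-wide bins that contain the second-decimal spikes at .04 and .05.
bin_04 = exact[(exact["p_value"] > 0.03875) & (exact["p_value"] <= 0.04)]
bin_05 = exact[(exact["p_value"] > 0.04875) & (exact["p_value"] <= 0.05)]
print(len(bin_04), len(bin_05))
```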

In this paper, binomial proportion tests for left-skew p-hacking were conducted in both the frequentist and the Bayesian framework, where \(H_0: Prop. \leq .5\). The frequentist p-value gives the probability of observing data at least as extreme as those observed if the null hypothesis is true, but it does not quantify the probabilities of the null and alternative hypotheses. A Bayes factor (\(BF\)) quantifies the relative support for the two hypotheses, either as \(BF_{10}\), the alternative hypothesis versus the null hypothesis, or vice versa, \(BF_{01}\). A \(BF\) of 1 indicates that both hypotheses are equally supported by the data. In this specific instance, \(BF_{10}\) is computed, and values \(>1\) can be interpreted, for our purposes, as: the data are more likely under left-skew p-hacking than under no left-skew p-hacking. \(BF_{10}\) values \(<1\) indicate that the data are more likely under no left-skew p-hacking than under left-skew p-hacking. The further removed from \(1\), the stronger the evidence for one of the hypotheses. For the current analyses, the two hypotheses were assumed to be equally likely a priori.
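
The sketch below adds a Bayesian counterpart to the Caliper test sketched earlier. It is not the original analysis script; it assumes a uniform Beta(1, 1) prior on the proportion, split with equal prior mass over \(Prop \leq .5\) and \(Prop > .5\), so that \(BF_{10}\) equals the posterior odds of left skew. Whether this exactly matches the prior used in the original computation is an assumption.

```python
# Sketch of a Bayes factor for the Caliper test (assumption: uniform Beta(1, 1)
# prior on the proportion with equal prior mass on Prop <= .5 and Prop > .5).
from scipy.special import betainc

def caliper_bf10(n_last, n_penultimate):
    """BF10 for left skew: posterior odds of Prop > .5 under a uniform prior."""
    # The posterior for the proportion is Beta(n_last + 1, n_penultimate + 1);
    # betainc gives its cumulative mass below .5, i.e. the support for H0.
    mass_h0 = betainc(n_last + 1, n_penultimate + 1, 0.5)
    return (1 - mass_h0) / mass_h0
```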

Reanalysis results

The reanalysis indicates that no evidence for left-skew p-hacking remains once the reporting tendency at the second decimal is taken into account. Initial sensitivity analyses using the original analysis script strengthened the original results after eliminating the DOI selection and using \(p\leq.05\) instead of \(p<.05\) as the selection criterion. However, as explained above, this result is confounded by not taking the second decimal into account. Reanalyses across all disciplines showed no evidence for left-skew p-hacking, \(Prop.=.417, p>.999, BF_{10}<.001\) for the Results sections and \(Prop.=.358, p>.999, BF_{10}<.001\) for the Abstract sections. These results do not depend on the binwidth of .00125, as Table 1 shows for alternative binwidths. Separated per discipline, no binomial test for left-skew p-hacking is statistically significant in either the Results or Abstract sections (see S1 File). This indicates that the effect originally found by Head and colleagues does not hold when we take into account that reported p-values show a reporting tendency at the second decimal.

Table 1. Results of reanalysis across various binwidths (i.e., .00125, .005, .01).

| | Abstracts | Results |
|---|---|---|
| Binwidth = .00125 | | |
| Bin \(.03875-.04\) | 4597 | 26047 |
| Bin \(.04875-.05\) | 2565 | 18664 |
| \(Prop.\) | 0.358 | 0.417 |
| \(p\) | \(>.999\) | \(>.999\) |
| \(BF_{10}\) | \(<.001\) | \(<.001\) |
| Binwidth = .005 | | |
| Bin \(.035-.04\) | 6641 | 38537 |
| Bin \(.045-.05\) | 4485 | 30406 |
| \(Prop.\) | 0.403 | 0.441 |
| \(p\) | \(>.999\) | \(>.999\) |
| \(BF_{10}\) | \(<.001\) | \(<.001\) |
| Binwidth = .01 | | |
| Bin \(.03-.04\) | 9885 | 58809 |
| Bin \(.04-.05\) | 7250 | 47755 |
| \(Prop.\) | 0.423 | 0.448 |
| \(p\) | \(>.999\) | \(>.999\) |
| \(BF_{10}\) | \(<.001\) | \(<.001\) |
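
As a rough check, plugging the reported Abstracts counts at binwidth .00125 from Table 1 into the kind of computation sketched above reproduces the pattern of the corresponding table entries; exact values may differ slightly from the original computation.

```python
# Abstracts column of Table 1 at binwidth .00125: 2565 vs 4597 p-values.
# Mirrors the caliper_test and caliper_bf10 sketches given earlier.
from scipy.stats import binomtest
from scipy.special import betainc

n_last, n_penultimate = 2565, 4597
n = n_last + n_penultimate
print(n_last / n)                                                  # ~ 0.358
print(binomtest(n_last, n, p=0.5, alternative="greater").pvalue)   # > .999
mass_h0 = betainc(n_last + 1, n_penultimate + 1, 0.5)
print((1 - mass_h0) / mass_h0)                                     # BF10 < .001
```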

Discussion

The current reanalysis thus finds no evidence for widespread left-skew p-hacking. This might seem inconsistent with previous findings, such as the low replication rates in psychology (Open Science Collaboration 2015) or the high self-admission rates of p-hacking (John 2012). However, these findings are not mutually exclusive with the current results, as explained below.

Low replication rates could be caused by widespread p-hacking, but they can also occur under systemically low statistical power (Bakker 2014, Bakker 2012). Previous research has indicated low power levels in, for example, psychology (Cohen 1962, Sedlmeier 1989) and randomized clinical trials (Moher 1994). Some argue that, as a consequence of low power, there is a high prevalence of false positive results (Ioannidis 2005), which would also result in low replication rates.

Additionally, the high self-admission rates of p-hacking (John 2012) pertain to such behaviors occurring at least once. Even if p-hacking is widespread across researchers, this does not imply that it occurs frequently: a researcher might admit to having p-hacked at some point during their career without it having been a frequent practice. Moreover, as noted in the introduction, not all p-hacking behaviors produce left skew in the p-value distribution. The method used to detect p-hacking in this paper is sensitive only to left-skew p-hacking; other types of p-hacking may therefore occur without being detected.

Two minor limitations of the data analysis remain in this reanalysis. First, selecting the bins just below .04 and .05 means comparing non-adjacent bins, which might make the test less sensitive to left-skew p-hacking. In light of this limitation, I also ran the original analysis from Head et al. with the second decimal included, which results in the comparison of \(.04\leq p<.045\) versus \(.045<p\leq.05\). This analysis also yielded no evidence for left-skew p-hacking, \(Prop.=.457, p>.999, BF_{10}<.001\). Second, the selection of only exactly reported p-values might have distorted the p-value distribution due to minor rounding biases. Previous research has indicated that p-values are somewhat more likely to be rounded to .05 than to .04 (Krawczyk 2015). Therefore, selecting only exactly reported p-values might cause an underrepresentation of .05 values, because such p-values are more frequently rounded and reported as \(<.05\) instead of exactly (e.g., \(p=.046\) reported as \(p<.05\)). This limitation also applies to the original paper by Head et al. and is therefore a general, albeit minor, limitation of analyzing p-value distributions.

Conclusion

Based on the results of this reanalysis, it can be concluded that the original evidence for widespread left-skew p-hacking (Head 2015) is not robust to the data analytic choices made. Additionally, absence of evidence for left-skew p-hacking should not be interpreted as evidence of its absence. In other words, even though no evidence for left-skew p-hacking was found, this does not mean that it does not occur at all; it only indicates that it does not occur frequently enough to make the aggregate distribution of significant p-values in science left-skewed, and that the conclusions drawn by Head et al. are therefore not warranted.

Acknowledgments

Joost de Winter, Marcel van Assen, Robbie van Aert, Michèle Nuijten, and Jelte Wicherts provided fruitful discussion or feedback on the ideas presented in this paper. The end result is the author’s sole responsibility.

Supporting Information

S1 File. Full reanalysis results per discipline.

References

  1. Megan L. Head, Luke Holman, Rob Lanfear, Andrew T. Kahn, Michael D. Jennions. The Extent and Consequences of P-Hacking in Science. PLoS Biol 13, e1002106 Public Library of Science (PLoS), 2015. Link

  2. Uri Simonsohn, Joseph P Simmons, Leif D Nelson. Better P-curves: Making P-curve analysis more robust to errors, fraud, and ambitious P-hacking, a Reply to Ulrich and Miller (2015). Journal of Experimental Psychology: General 144, 1146–1152 (2015). Link

  3. Uri Simonsohn, Leif D Nelson, Joseph P Simmons. P-curve: A key to the file-drawer. Journal of Experimental Psychology: General 143, 534–547 (2014). Link

  4. Daniël Lakens. What p-hacking really looks like: A comment on Masicampo and LaLande (2012). The Quarterly Journal of Experimental Psychology 68, 829–832 (2014). Link

  5. A.S. Gerber, N. Malhotra, C.M. Dowling, D. Doherty. Publication bias in two political behavior literatures. American Politics Research 38, 591-613 (2010). Link

  6. A. Kühberger, A. Fritz, T. Scherndl. Publication bias in psychology: A diagnosis based on the correlation between effect size and sample size. PLoS ONE 9, e105825 (2014). Link

  7. APA. Publication manual of the American Psychological Association. American Psychological Association, 2010.

  8. APA. Publication manual of the American Psychological Association. American Psychological Association, 1983.

  9. APA. Publication manual of the American Psychological Association. American Psychological Association, 2001.

  10. M. B. Nuijten, C. H. J. Hartgerink, M. A. L. M. Van Assen, S. Epskamp, J. M. Wicherts. The Prevalence of Statistical Reporting Errors in Psychology (1985-2013). Behavior Research Methods (2015). Link

  11. Open Science Collaboration. Estimating the reproducibility of psychological science. Science 349 (2015). Link

  12. Leslie K John, George Loewenstein, Drazen Prelec. Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological science 23, 524–532 (2012). Link

  13. M Bakker. Flawed intuitions about power in psychological research. 109–120 In Good science, bad science: Questioning research practices in psychological research. (2014).

  14. M. Bakker, A. van Dijk, J. M. Wicherts. The Rules of the Game Called Psychological Science. Perspectives on Psychological Science 7, 543–554 SAGE Publications, 2012. Link

  15. J Cohen. The statistical power of abnormal social psychological research: A review. Journal of Abnormal and Social Psychology 65, 145–153 (1962).

  16. Peter Sedlmeier, Gerd Gigerenzer. Do studies of statistical power have an effect on the power of studies? Psychological Bulletin 105, 309–316 (1989).

  17. D Moher, C S Dulberg, G A Wells. Statistical power, sample size, and their reporting in randomized controlled trials. JAMA: the journal of the American Medical Association 272, 122–124 (1994). Link

  18. John P A Ioannidis. Why most published research findings are false. PLoS medicine 2, e124 Public Library of Science, 2005. Link

  19. Michal Krawczyk. The search for significance: A few peculiarities in the distribution of P values in experimental psychology literature. PloS one 10, e0127872 (2015). Link
