Reanalyzing Head et al. (2015): No widespread p-hacking after all?
Statistical significance seeking (i.e., p-hacking) is a serious problem for the validity of research, especially if it occurs frequently. Head et al. provided evidence for widespread p-hacking throughout the sciences, which would indicate that the validity of science is in doubt. Previous substantive concerns held that their selection of p-values was too liberal: by selecting all reported p-values, they included results that researchers would have had little interest in p-hacking. Despite this liberal selection, Head et al. found evidence for p-hacking, which raises the question of why p-hacking was detected despite it being unlikely a priori. In this paper I reanalyze the original data and indicate that Head et al.'s results are an artefact of rounding in the reporting of p-values.
Megan Head and colleagues (Head 2015) provide a large collection of p-values that, from their perspective, indicates widespread statistical significance seeking (i.e., p-hacking) throughout the sciences. Concerns have been raised about their selection of p-values, which was deemed too liberal and unlikely to find p-hacking to begin with (Simonsohn 2015). This raises the question of why Head et al. found evidence for p-hacking nonetheless. The analyses that form the basis of their conclusions operate on the tenet that p-hacked papers show p-value distributions that are left-skewed below .05 (Simonsohn 2014). In this paper I evaluate the selection choices and analytic strategy of the original paper and suggest that Head et al. found widespread p-hacking as an artefact of rounding. Analysis files for this paper are available at https://osf.io/sxafg/.
The p-value distribution of a set of heterogeneous results, as collected by Head et al., should be a mixture distribution of only the uniform p-value distribution under the null hypothesis \(H_0\) and right-skew p-value distributions under the alternative hypothesis \(H_1\). Questionable behaviors such as p-hacking alter this p-value distribution. An example is optional stopping, which causes a bump of p-values just below .05 only if the null hypothesis is true (Lakens 2014).
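The two components of this mixture can be illustrated with a small simulation. The sketch below is not part of the original analysis files; it assumes two-sided z-tests with a normally distributed test statistic, and the function names, the effect size of 2.5, the number of studies, and the seed are all hypothetical choices made for illustration. It computes, among significant p-values, the share that falls below .025: a value near one half is what a uniform distribution implies, and a larger value indicates right-skew.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used to convert z to p


def simulate_p_values(effect, n_studies, seed):
    """Two-sided z-test p-values for studies whose statistic is N(effect, 1).

    effect = 0 corresponds to H0 being true; effect > 0 to H1 being true.
    (Illustrative assumption: every study tests the same effect.)
    """
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_studies):
        z = rng.gauss(effect, 1.0)
        p_values.append(2.0 * (1.0 - STD_NORMAL.cdf(abs(z))))
    return p_values


def share_below(p_values, cutoff, alpha=0.05):
    """Among significant p-values (p < alpha), the share that is below cutoff."""
    significant = [p for p in p_values if p < alpha]
    return sum(p < cutoff for p in significant) / len(significant)


null_p = simulate_p_values(effect=0.0, n_studies=100_000, seed=1)
alt_p = simulate_p_values(effect=2.5, n_studies=100_000, seed=1)

print("H0: share of significant p-values below .025:", share_below(null_p, 0.025))
print("H1: share of significant p-values below .025:", share_below(alt_p, 0.025))
```

Under these assumptions the share under \(H_0\) should be close to one half, while the share under \(H_1\) should be clearly larger, so neither component on its own produces more p-values just below .05 than further away from it.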
Head et al. correctly argue that an aggregate p-value distribution could show a bump below .05 if optional stopping under the null, or other behaviors seeking just significant results, occurs frequently. Consequently, a bump below .05 (i.e., left-skew) is a sufficient condition for the presence of specific forms of p-hacking. However, this bump below .05 is not a necessary condition, because other types of p-hacking do not cause such a bump. For example, one might use optional stopping when there is a true effect (Lakens 2014), or conduct multiple analyses but report only the one that yielded the smallest p-value. Therefore, the absence of a bump does not rule out that p-hacking occurs at a large scale.
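Optional stopping under the null can be sketched as follows. This is a minimal illustration rather than the procedure used by Lakens (2014) or Head et al.: it assumes a z-test with known standard deviation, a hypothetical researcher who tests after every observation between 10 and 50 and stops at the first p-value below .05, and an arbitrary seed and number of simulated studies. It records the reported p-value of each study, so that the false positive rate and the counts in the two bins just below .05 can be inspected.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used to convert z to p


def two_sided_p(z):
    """Two-sided p-value for a z statistic."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def optional_stopping_p(rng, n_min=10, n_max=50, alpha=0.05):
    """Reported p-value of one study with H0 true and testing after each observation.

    Hypothetical design: data are N(0, 1), testing starts at n_min observations,
    and data collection stops at the first p < alpha or at n_max observations.
    """
    total = 0.0
    p = 1.0
    for n in range(1, n_max + 1):
        total += rng.gauss(0.0, 1.0)
        if n >= n_min:
            p = two_sided_p(total / math.sqrt(n))
            if p < alpha:
                break  # stop collecting data once the result is significant
    return p


rng = random.Random(2015)
reported = [optional_stopping_p(rng) for _ in range(5000)]

false_positive_rate = sum(p < 0.05 for p in reported) / len(reported)
penultimate_bin = sum(0.04 < p < 0.045 for p in reported)
last_bin = sum(0.045 < p < 0.05 for p in reported)

print("false positive rate:", false_positive_rate)
print("count in .04 < p < .045:", penultimate_bin)
print("count in .045 < p < .05:", last_bin)
```

Because every intermediate test is a new opportunity to cross the .05 threshold, the false positive rate in this sketch exceeds the nominal 5%. Comparing the two bin counts shows whether a bump just below .05 appears under these particular settings.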
This paper is structured into three parts: (i) explaining the data analytic strategy of the reanalysis, (ii) reevaluating the evidence for left-skew p-hacking based on the reanalysis, and (iii) discussing the findings in light of the literature.
Head and colleagues' data analytic strategy focused on comparing the frequencies in the two bins closest to .05 (i.e., the last and penultimate bins below .05) at a binwidth of .005. Based on the tenet that p-hacking introduces a left-skew p-distribution (Simonsohn 2014), evidence for p-hacking is present if the last bin has a sufficiently higher frequency than the penultimate one in a binomial test. Applying the binomial test to two frequency bins has previously been used in publication bias research, where it is typically called a Caliper test (Gerber 2010, Kühberger 2014); here it is applied specifically to test for left-skew p-hacking.
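The comparison of two adjacent bins with a binomial test can be sketched as follows. This is an illustration under stated assumptions, not the code of Head et al. or of the analysis files: the function names are hypothetical, the test is taken to be one-sided (more p-values in the last bin than in the penultimate bin) with a null proportion of .5, and the p-values in the usage example are made up.

```python
from math import comb


def count_bins(p_values, lower=0.04, split=0.045, upper=0.05):
    """Count p-values in the penultimate bin (lower, split) and last bin (split, upper).

    Strict inequalities are used, so values exactly on a bin boundary are dropped.
    """
    penultimate = sum(lower < p < split for p in p_values)
    last = sum(split < p < upper for p in p_values)
    return penultimate, last


def caliper_test(penultimate, last):
    """One-sided exact binomial test of the last bin against the penultimate bin.

    Returns P(X >= last) for X ~ Binomial(penultimate + last, 0.5), i.e. the
    probability of at least this many results in the last bin if both bins
    were equally likely.
    """
    n = penultimate + last
    tail = sum(comb(n, k) for k in range(last, n + 1))
    return tail / 2 ** n


# Made-up p-values, for illustration only.
example = [0.041, 0.044, 0.046, 0.047, 0.049, 0.030, 0.050, 0.045]
penultimate, last = count_bins(example)
print(penultimate, last, caliper_test(penultimate, last))
```

A small value returned by `caliper_test` would count as evidence for left-skew in this framework; with equal counts in both bins the test gives no such indication.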
The two panels in Fig 1 describe the selection of p-values in the original paper and in the current paper. The top panel shows the selection made by Head et al. (i.e., \(.04<p<.045\) versus \(.045<p<.05\)), where the right bin shows a slightly higher frequency than the left bin. This is the evidence Head et al. found for p-hacking. However, if we expand the range and look at the entire distribution, we see that this selection is an unrepresentative part of the distribution of significant p-values.