Reanalyzing Head et al. (2015): No widespread p-hacking after all?

Abstract

Statistical significance seeking (i.e., p-hacking) is a serious problem for the validity of research, especially if it occurs frequently. Head et al. (2015) provided evidence for widespread p-hacking throughout the sciences, which would indicate that the validity of science is in doubt. Previous substantive concerns about their selection of p-values indicated that it was too liberal: by selecting all reported p-values, they included results that researchers would have had no incentive to p-hack. Despite this liberal selection of p-values, Head et al. found evidence for p-hacking, which raises the question of why p-hacking was detected even though it was unlikely a priori. In this paper I reanalyze the original data and indicate that the results of Head et al. are an artefact of rounding in the reporting of p-values.

Megan Head and colleagues (Head 2015) provide a large collection of p-values that, from their perspective, indicates widespread statistical significance seeking (i.e., p-hacking) throughout the sciences. Concerns have been raised that their selection of p-values was too liberal and therefore unlikely to detect p-hacking to begin with (Simonsohn 2015), which raises the question of why Head et al. nonetheless found evidence for p-hacking. The analyses that form the basis of their conclusions operate on the tenet that p-hacked papers show p-value distributions that are left-skewed just below .05 (Simonsohn 2014). In this paper I evaluate the selection choices and analytic strategy of the original paper and suggest that the widespread p-hacking found by Head et al. is an artefact of rounding. Analysis files for this paper are available at https://osf.io/sxafg/.
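To make this tenet concrete, the sketch below illustrates one common way to test for left skew just below .05: count the p-values falling in the bins (.04, .045] and (.045, .05) and test with a binomial test whether the upper bin contains more than half of them. This is my own illustration; the bin edges and function name are assumptions and not necessarily the exact procedure of Head et al.

```python
from scipy.stats import binomtest

def left_skew_test(p_values, lower=0.04, mid=0.045, upper=0.05):
    """Binomial test for left skew just below .05.

    Counts p-values in (lower, mid] and (mid, upper) and tests,
    one-sided, whether the upper bin holds more than half of them.
    Illustrative sketch only: the bin edges are an assumption,
    not necessarily the exact procedure used by Head et al.
    """
    low_bin = sum(1 for p in p_values if lower < p <= mid)
    high_bin = sum(1 for p in p_values if mid < p < upper)
    n = low_bin + high_bin
    if n == 0:
        return None  # nothing to test in these bins
    return binomtest(high_bin, n, p=0.5, alternative="greater")

# Hypothetical set of reported p-values
reported = [0.041, 0.043, 0.046, 0.048, 0.049, 0.049, 0.032, 0.012]
result = left_skew_test(reported)
print(result.pvalue if result else "no p-values in the evaluated bins")
```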

In the absence of p-hacking, the p-value distribution of a set of heterogeneous results, as collected by Head et al., should be a mixture of only the uniform p-value distribution under the null hypothesis \(H_0\) and right-skewed p-value distributions under the alternative hypothesis \(H_1\). Questionable research behaviors such as p-hacking alter this p-value distribution. An example is optional stopping, which causes a bump of p-values just below .05, but only if the null hypothesis is true (Lakens 2014).
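As a minimal simulation of this mixture (a sketch with assumed sample sizes and effect size, not the authors' analysis), p-values of two-sided z-tests comparing two group means are uniformly distributed when the true effect is zero and right-skewed when a true effect exists:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def simulate_p_values(n_studies, effect, n_per_group=30):
    """Two-sided p-values from z-tests comparing two group means.

    effect = 0 yields the uniform distribution under H0;
    effect > 0 yields a right-skewed distribution under H1.
    """
    se = np.sqrt(2 / n_per_group)            # standard error of the mean difference
    z = rng.normal(loc=effect / se, size=n_studies)
    return 2 * norm.sf(np.abs(z))

p_null = simulate_p_values(10_000, effect=0.0)   # uniform under H0
p_alt = simulate_p_values(10_000, effect=0.5)    # right-skewed under H1

# Under H1 the lowest bins dominate; under H0 all bins are roughly equal.
bins = np.arange(0, 1.05, 0.05)
print(np.histogram(p_null, bins)[0][:4])
print(np.histogram(p_alt, bins)[0][:4])
```

Any aggregate of unbiasedly reported results is then a weighted combination of these two shapes, with no excess of p-values just below .05.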

Head et al. correctly argue that an aggregate p-value distribution could show a bump below .05 if optional stopping under the null hypothesis, or other behaviors seeking just-significant results, occur frequently. Consequently, a bump below .05 (i.e., left skew) is a sufficient condition for the presence of specific forms of p-hacking. However, this bump below .05 is not a necessary condition, because other types of p-hacking do not cause such a bump. For example, one might use optional stopping when there is a true effect (Lakens 2014) or conduct multiple analyses but only report the one that yielded the smallest p-value. Therefore, if no bump is found, this does not exclude p-hacking occurring on a large scale.
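A rough simulation of this mechanism (assumed sample sizes and stopping rule; a sketch, not the procedure of Lakens 2014 or Head et al.) shows how optional stopping under the null hypothesis can pile significant p-values up just below .05:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def optional_stopping_p(n_start=20, n_max=60, step=5, alpha=0.05):
    """Final p-value of one simulated study under H0 with optional stopping.

    Data are collected in batches; after each batch the two-sided z-test
    p-value is checked and collection stops once p < alpha (or at n_max).
    """
    a = list(rng.normal(size=n_start))
    b = list(rng.normal(size=n_start))
    while True:
        n = len(a)
        z = (np.mean(a) - np.mean(b)) / np.sqrt(2 / n)
        p = 2 * norm.sf(abs(z))
        if p < alpha or n >= n_max:
            return p
        a.extend(rng.normal(size=step))
        b.extend(rng.normal(size=step))

p_values = np.array([optional_stopping_p() for _ in range(5_000)])

# Significant results cluster just below the threshold: typically more
# p-values fall in (.045, .05) than in (.04, .045], i.e., a left-skewed bump.
print(np.sum((p_values > 0.045) & (p_values < 0.05)),
      np.sum((p_values > 0.04) & (p_values <= 0.045)))
```

By contrast, replacing the null data with data containing a true effect, or selecting the smallest of several p-values, shifts reported p-values toward zero rather than toward .05, which is why such forms of p-hacking leave no bump.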

This paper is structured into three parts: (i) explaining the data analytic strategy of the reanalysis, (ii) reevaluating the evidence for left-skew p-hacking based on the reanalysis, and (iii) discussing the findings in light of the literature.