PREPRINT authorea.com/31568
Published to PeerJ Preprints on September 12th, 2016

Abstract

Statistical significance seeking (i.e., p-hacking) is a serious problem for the validity of research, especially if it occurs frequently. Head et al. provided evidence for widespread p-hacking throughout the sciences, which would indicate that the validity of science is in doubt. Substantive concerns raised previously about their selection of p-values indicated that they were too liberal in selecting all reported p-values, which means they included results that there would be little reason to p-hack. Despite this liberal selection, Head et al. found evidence for p-hacking, which raises the question of why p-hacking was detected when it was unlikely a priori. In this paper I reanalyze the original data and indicate that Head et al.'s results are an artefact of rounding in the reporting of p-values.

Megan Head and colleagues (Head 2015) provide a large collection of p-values that, from their perspective, indicates widespread statistical significance seeking (i.e., p-hacking) throughout the sciences. Concerns have been raised about their selection of p-values, which was deemed too liberal and unlikely to detect p-hacking to begin with (Simonsohn 2015). This raises the question of why Head et al. nonetheless found evidence for p-hacking. The analyses that form the basis of their conclusions operate on the tenet that p-hacked papers show p-value distributions that are left-skew below .05 (Simonsohn 2014). In this paper I evaluate the selection choices and analytic strategy of the original paper and suggest that Head et al. found widespread p-hacking as an artefact of rounding. Analysis files for this paper are available at https://osf.io/sxafg/.

The p-value distribution of a set of heterogeneous results, as collected by Head et al., should be a mixture distribution of only the uniform p-value distribution under the null hypothesis $$H_0$$ and right-skew p-value distributions under the alternative hypothesis $$H_1$$. Questionable behaviors that amount to p-hacking alter this distribution. An example is optional stopping, which causes a bump of p-values just below .05 only if the null hypothesis is true (Lakens 2014).
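The two components of this mixture can be illustrated with a small simulation. The sketch below is not part of the original analysis files: it assumes two-sided z-tests with unit variance, and the function names and the effect size of 2 are hypothetical choices made only for illustration.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_p_values(n, effect, rng):
    """Draw n p-values from z-tests whose statistic is N(effect, 1)."""
    return [two_sided_p(rng.gauss(effect, 1.0)) for _ in range(n)]


def bin_counts(p_values, width=0.01, upper=0.05):
    """Count p-values in consecutive bins of the given width below `upper`."""
    n_bins = round(upper / width)
    counts = [0] * n_bins
    for p in p_values:
        if p < upper:
            counts[min(int(p / width), n_bins - 1)] += 1
    return counts


rng = random.Random(1)  # fixed seed so the sketch is reproducible

# H0 true (effect = 0): significant p-values spread evenly over the bins.
null_counts = bin_counts(simulate_p_values(100_000, 0.0, rng))

# H1 true (assumed effect = 2): small p-values dominate (right-skew).
alt_counts = bin_counts(simulate_p_values(100_000, 2.0, rng))

print("H0 bins:", null_counts)
print("H1 bins:", alt_counts)
```

Under these assumptions the null bins come out roughly equal, whereas the bins under the alternative decrease toward .05. Neither component on its own produces a bump just below .05.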

Head et al. correctly argue that an aggregate p-value distribution could show a bump below .05 if optional stopping under the null, or other behaviors seeking just-significant results, occurs frequently. Consequently, a bump below .05 (i.e., left-skew) is a sufficient condition for the presence of specific forms of p-hacking. However, this bump is not a necessary condition, because other types of p-hacking do not cause it. For example, one might use optional stopping when there is a true effect (Lakens 2014), or conduct multiple analyses and report only the one that yielded the smallest p-value. Therefore, the absence of a bump does not rule out that p-hacking occurs at a large scale.
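How optional stopping under the null can produce such a bump can be sketched as follows. This is a minimal illustration, not the simulation of Lakens (2014) or of this paper: it assumes a z-test with known unit variance, a hypothetical rule of testing after every added observation, and arbitrary sample-size limits of 10 and 100.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))


def optional_stopping_p(rng, effect=0.0, n_start=10, n_max=100):
    """Test after each added observation; stop at the first p < .05 or at n_max."""
    total = sum(rng.gauss(effect, 1.0) for _ in range(n_start))
    n = n_start
    while True:
        p = two_sided_p(total / math.sqrt(n))  # z = sample mean * sqrt(n)
        if p < 0.05 or n >= n_max:
            return p
        total += rng.gauss(effect, 1.0)
        n += 1


rng = random.Random(2016)  # fixed seed so the sketch is reproducible
p_values = [optional_stopping_p(rng) for _ in range(5_000)]
significant = [p for p in p_values if p < 0.05]

# Compare the bin furthest from .05 with the bin adjacent to .05.
low = sum(p < 0.01 for p in significant)
high = sum(p >= 0.04 for p in significant)

print("share significant:", len(significant) / len(p_values))
print("count with p < .01:", low)
print("count with .04 <= p < .05:", high)
```

Because sampling stops as soon as the threshold is crossed, most of the additional significant results land just below .05, which is the left-skew that the analysis of Head et al. looks for.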

This paper is structured into three parts: (i) explaining the data analytic strategy of the reanalysis, (ii) reevaluating the evidence for left-skew p-hacking based on the reanalysis, and (iii) discussing the findings in light of the literature.

# Reanalytic strategy

Head and colleagues' data analytic strategy focused on comparing frequencies in the last and penultimate bins below .05 at a binwidth of .005. Based on the tenet that p-hacking introduces a left-skew p-distribution (Simonsohn 2014), evidence for p-hacking is present if the last bin has a sufficiently higher frequency than the penultimate one in a binomial test. Applying the binomial test to two frequency bins has previously been used in publication bias research and is typically called a Caliper test (Gerber 2010, Kühberger 2014); here it is applied specifically to test for left-skew p-hacking.
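A Caliper test of this kind can be sketched as an exact binomial test on the two bin frequencies. The function name and the bin counts below are hypothetical and serve only as illustration, and the one-sided alternative (more p-values in the bin adjacent to .05) is an assumption about how the comparison is set up.

```python
from math import comb


def caliper_test(lower_bin, upper_bin):
    """One-sided exact binomial test on two adjacent bin frequencies.

    Returns P(X >= upper_bin) for X ~ Binomial(lower_bin + upper_bin, 0.5),
    i.e. the probability of at least this many results in the bin adjacent
    to .05 if a result were equally likely to fall in either bin.
    """
    n = lower_bin + upper_bin
    tail = sum(comb(n, k) for k in range(upper_bin, n + 1))
    return tail / 2 ** n


# Invented counts, for illustration only (not taken from Head et al.):
# 40 p-values in the penultimate bin and 60 in the last bin.
p = caliper_test(lower_bin=40, upper_bin=60)
print(f"one-sided Caliper test p-value: {p:.4f}")
```

The test conditions on the total number of p-values in the two bins, so it depends only on how that total is split between them.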

The two panels in Fig 1 describe the selection of p-values in the original paper and in the current paper. The top panel shows the selection made by Head et al. (i.e., $$.04<p<.045$$ versus $$.045<p<.05$$), where the right bin shows a slightly higher frequency than the left bin. This is the evidence Head et al. found for p-hacking. However, if we expand the range and look at the entire distribution, we see that this is an unrepresentative part of the distribution of significant p-values.