\section{Discussion}

The current reanalysis thus finds no evidence for widespread left-skew p-hacking. This might seem inconsistent with previous findings, such as the low replication rates in psychology \cite{Open_Science_Collaboration2015-zs} or the high self-admission rates of p-hacking \cite{John2012-uj}. However, these findings are not mutually exclusive with the current results, as explained below.

Low replication rates could be caused by widespread p-hacking, but they can also occur under systemically low power \cite{Bakker2014-lr,Bakker_2012}. Previous research has indicated low power levels in, for example, psychology \cite{Cohen1962-jc,Sedlmeier1989-yc} and randomized clinical trials \cite{Moher1994-ra}. As a consequence of low power, some argue that the prevalence of false positives is high \cite{Ioannidis2005-am}, which would in turn result in low replication rates.
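The link between low power and false positives can be made concrete with the positive predictive value (PPV) of a significant finding, following \cite{Ioannidis2005-am}; the numbers below are purely illustrative:
\[
\mathrm{PPV}=\frac{(1-\beta)\,\pi}{(1-\beta)\,\pi+\alpha\,(1-\pi)},
\]
where \(1-\beta\) denotes power, \(\alpha\) the significance level, and \(\pi\) the prior probability that a tested effect is real. With, for example, \(\pi=.1\), \(\alpha=.05\), and power \(1-\beta=.35\), \(\mathrm{PPV}\approx.44\): under these assumptions, more than half of all significant findings would be false positives, which would depress replication rates even in the complete absence of p-hacking.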

Additionally, high self-admission rates of p-hacking \cite{John2012-uj} pertain to such behaviors occurring at least once. Even if p-hacking is widespread across researchers, it need not be frequent: a researcher might admit to having p-hacked at some point in their career without having done so often. Moreover, as noted in the introduction, not all p-hacking behaviors produce left skew in the p-value distribution. The method used in this paper is sensitive only to left-skew p-hacking, so other types of p-hacking may occur but go undetected.

In this reanalysis, two minor limitations remain with respect to the data analysis. First, selecting the bins just below .04 and .05 means comparing non-adjacent bins, which might make the test less sensitive to left-skew p-hacking. In light of this limitation, I reran the original analysis from Head et al. including the second decimal, comparing \(.04\leq p<.045\) with \(.045<p\leq.05\). This analysis also yielded no evidence for left-skew p-hacking, \(Prop.=.457\), \(p>.999\), \(BF_{10}<.001\). Second, selecting only exactly reported p-values might have distorted the p-value distribution through minor rounding biases. Previous research indicates that p-values are somewhat more likely to be rounded to .05 than to .04 \cite{Krawczyk2015-uh}. Selecting only exactly reported p-values might therefore underrepresent values at .05, because such p-values (e.g., \(p=.046\)) are more often rounded and reported as \(<.05\) than reported exactly. This limitation also applies to the original paper by Head et al. and is therefore a general, albeit minor, limitation of analyzing p-value distributions.
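The bin-comparison test described above can be sketched as follows. This is a minimal illustration, not the analysis code: the counts are hypothetical (the real analysis uses the p-values mined by Head et al.), and the binomial test is computed directly with exact integer arithmetic rather than a statistics library. Left-skew p-hacking predicts an excess of p-values in the bin just below .05, so the one-sided test asks whether the proportion in that bin exceeds .5.

```python
from math import comb

def binom_sf(k, n):
    """P(X >= k) for X ~ Binomial(n, 1/2), via exact integer arithmetic.

    Summing exact binomial coefficients and dividing by 2**n at the end
    avoids the floating-point underflow of 0.5**n for large n.
    """
    tail = sum(comb(n, i) for i in range(k, n + 1))
    return tail / 2**n

# Hypothetical counts of significant p-values in the two bins:
n_low  = 120   # p-values with .04 <= p < .045
n_high = 101   # p-values with .045 < p <= .05

n = n_low + n_high
prop = n_high / n                 # proportion in the bin just below .05
p_value = binom_sf(n_high, n)     # one-sided test for excess just below .05

print(f"Prop. = {prop:.3f}, p = {p_value:.3f}")
```

With these illustrative counts the proportion falls below .5 and the one-sided p-value is large, i.e., no evidence for left skew; in a real left-skew p-hacking scenario `n_high` would exceed `n_low` and the p-value would be small.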

\section{Conclusion}

Based on the results of this reanalysis, it can be concluded that the original evidence for widespread left-skew p-hacking \cite{Head_2015} is not robust to the data analytic choices made. Additionally, absence of evidence for left-skew p-hacking should not be interpreted as evidence of its total absence. In other words, even though no evidence for left-skew p-hacking was found, this does not mean it does not occur at all; it only indicates that it does not occur frequently enough to make the aggregate distribution of significant p-values in science left-skewed, and that the conclusions drawn by Head et al. are therefore not warranted.