\documentclass{article}
\usepackage[affil-it]{authblk}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
\usepackage{url}
\usepackage{hyperref}
\hypersetup{colorlinks=false,pdfborder={0 0 0}}
\usepackage{placeins}% provides \FloatBarrier, used before the bibliography
\usepackage{etoolbox}
\makeatletter
\patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{%
\errmessage{\noexpand\@combinedblfloats could not be patched}%
}%
\makeatother
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[ngerman,english]{babel}

\begin{document}

\title{Reanalyzing Head et al. (2015): No widespread p-hacking after all?}
\author{C.H.J. Hartgerink}
\affil{Tilburg University}
\date{\today}
\maketitle

Megan Head and colleagues \cite{Head_2015} provide a large collection of p-values that, from their perspective, indicates widespread statistical significance seeking (i.e., p-hacking) throughout the sciences. Concerns have been raised about their selection of p-values, which was deemed too liberal and unlikely to detect p-hacking to begin with \cite{Simonsohn2015-av}; this raises the question of why Head et al. nonetheless found evidence for p-hacking. The analyses that form the basis of their conclusions rest on the tenet that p-hacked papers show p-value distributions that are left-skewed below .05 \cite{Simonsohn2014}. In this paper I evaluate the selection choices and analytic strategy of the original paper and suggest that Head et al. found widespread p-hacking as an artefact of rounding. Analysis files for this paper are available at \href{https://osf.io/sxafg/}{https://osf.io/sxafg/}.

The p-value distribution of a set of heterogeneous results, as collected by Head et al., should be a mixture of only the uniform p-value distribution under the null hypothesis $H_0$ and right-skewed p-value distributions under the alternative hypothesis $H_1$. Questionable p-hacking behaviors alter this p-value distribution. An example is optional stopping, which causes a bump of p-values just below .05 only if the null hypothesis is true \cite{Lakens_2014}; a minimal simulation of this mechanism is sketched at the end of this introduction. Head et al. correctly argue that an aggregate p-value distribution could show a bump below .05 if optional stopping under the null, or other behavior seeking just-significant results, occurs frequently. Consequently, a bump below .05 (i.e., left skew) is a sufficient condition for the presence of specific forms of p-hacking. However, this bump below .05 is not a necessary condition, because other types of p-hacking do not cause such a bump. For example, one might use optional stopping when there is a true effect \cite{Lakens_2014}, or conduct multiple analyses but report only the one that yielded the smallest p-value. Therefore, if no bump is found, this does not rule out p-hacking occurring at a large scale.

This paper is structured into three parts: (i) explaining the data analytic strategy of the reanalysis, (ii) reevaluating the evidence for left-skew p-hacking based on the reanalysis, and (iii) discussing the findings in light of the literature.
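To make this mechanism concrete, the following minimal simulation sketches optional stopping under the null hypothesis. The settings (a two-sided one-sample $z$-test with known unit variance, a peek after every 10 observations up to a maximum of 100, and 20{,}000 simulated studies) are illustrative assumptions for this sketch and are not taken from Head et al. or from the reanalysis below.

\begin{verbatim}
# Minimal simulation sketch: optional stopping under H0 produces a bump of
# p-values just below .05 (cf. Lakens, 2014). All settings are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, batch, n_max, alpha = 20_000, 10, 100, .05

final_p = []
for _ in range(n_sims):
    x = rng.normal(0.0, 1.0, n_max)      # data generated under H0: mu = 0
    for n in range(batch, n_max + 1, batch):
        z = x[:n].mean() * np.sqrt(n)    # z statistic (sigma known to be 1)
        p = 2 * stats.norm.sf(abs(z))    # two-sided p-value
        if p < alpha:                    # stop at the first significant peek
            break
    final_p.append(p)

sig = np.array([p for p in final_p if p < alpha])
lower = np.sum((sig >= .040) & (sig < .045))   # bin .040-.045
upper = np.sum((sig >= .045) & (sig < .050))   # bin .045-.050
print(f"significant: {sig.size} of {n_sims}")
print(f".040-.045: {lower}  .045-.050: {upper}  (bump: upper > lower)")
\end{verbatim}

With these settings, the final p-values of sequentially stopped tests pile up just below .05, so the upper bin exceeds the lower one; this left-skew signature below .05 is exactly what the test described in the next section targets.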
\section*{Reanalytic strategy}

The data analytic strategy of Head and colleagues focused on comparing the frequencies in the last and penultimate bins below .05, at a binwidth of .005. Based on the tenet that p-hacking introduces a left-skewed p-value distribution \cite{Simonsohn2014}, evidence for p-hacking is present if the last bin has a sufficiently higher frequency than the penultimate one in a binomial test. Applying a binomial test to two frequency bins has previously been done in publication bias research, where it is typically called a Caliper test \cite{gerber2010, kuhberger2014}; here it is applied specifically to test for left-skew p-hacking.

The two panels in Fig 1 depict the selection of p-values in the original and the current paper. The top panel shows the selection made by Head et al. (i.e., the bins $.04$--$.045$ and $.045$--$.05$). The bottom panel shows the selection of the current paper: the bins directly below .04 and .05 at a binwidth of .00125 (i.e., $.03875\leq p<.04$ versus $.04875\leq p<.05$). This alternative selection takes into account that reported p-values cluster at the second decimal as a result of rounding (e.g., $p=.041$ reported as $p=.04$), a reporting bias that the original selection does not account for. The two reanalysis bins lie equally far below a second-decimal value and exclude the values .04 and .05 themselves, which makes them comparable under this reporting tendency. Left-skew p-hacking predicts a higher frequency in the bin just below .05 than in the bin just below .04; this prediction is evaluated with a one-tailed binomial Caliper test, where $Prop.$ denotes the observed proportion of p-values falling in the bin just below .05. Because a non-significant binomial test cannot quantify evidence in favor of the absence of left-skew p-hacking, the binomial tests are supplemented with Bayes factors. $BF_{10}>1$ can be interpreted, for our purposes, as: the data are more likely under left-skew p-hacking than under no left-skew p-hacking. $BF_{10}<1$ indicates that the data are more likely under no left-skew p-hacking than under left-skew p-hacking. The further removed from $1$, the stronger the evidence in the direction of either hypothesis; for the current analyses, both hypotheses were assumed to be equally likely a priori. A computational sketch of both tests follows below.
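The following sketch implements the Caliper test and a Bayes factor for two frequency bins, using the Results-section counts at binwidth .00125 from Table 1 as example input. The Bayes factor specification shown here, uniform priors for the binomial proportion on either side of .5 with equal prior odds, is an assumption of this sketch that matches the equal-priors description above; the exact specification in the analysis scripts on OSF may differ in detail.

\begin{verbatim}
# Caliper test and Bayes factor sketch for two frequency bins.
# Counts are taken from Table 1 (Results sections, binwidth .00125).
from scipy import stats
from scipy.special import betainc

k = 18664          # p-values in .04875 <= p < .05 (bin just below .05)
n = k + 26047      # plus p-values in .03875 <= p < .04 (bin just below .04)

# One-tailed binomial (Caliper) test: left-skew p-hacking predicts that
# more than half of the p-values fall in the bin just below .05.
res = stats.binomtest(k, n, p=0.5, alternative="greater")
print(f"Prop. = {k / n:.3f}, p = {res.pvalue:.3f}")

# Bayes factor: H1 puts uniform mass on theta in (.5, 1), H0 on (0, .5).
# The marginal likelihood ratio reduces to a ratio of regularized
# incomplete beta functions (the beta normalizing constants cancel).
mass_below_half = betainc(k + 1, n - k + 1, 0.5)
bf10 = (1 - mass_below_half) / mass_below_half  # may underflow to 0 here
print(f"BF10 = {bf10:.3g}")
\end{verbatim}

With these counts the sketch reproduces $Prop.=.417$ and yields $p>.999$ and a vanishingly small $BF_{10}$, in line with the values reported in the next section.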
\section*{Reanalysis results}

Results of the reanalysis indicate that no evidence for left-skew p-hacking remains once the reporting bias at the second decimal is taken into account. Initial sensitivity analyses using the original analysis script strengthened the original results after eliminating the DOI selection and using $p\leq.05$ instead of $p<.05$ as the selection criterion. However, as explained above, this result is confounded by not taking into account the second decimal. Reanalyses across all disciplines showed no evidence for left-skew p-hacking: $Prop.=.417$, $p>.999$, $BF_{10}<.001$ for the Results sections and $Prop.=.358$, $p>.999$, $BF_{10}<.001$ for the Abstract sections. These results do not depend on the binwidth of .00125; Table 1 shows the results for alternative binwidths. Separated per discipline, no binomial test for left-skew p-hacking is statistically significant in either the Results or Abstract sections (see S1 File). This indicates that the effect originally found by Head and colleagues does not hold when the reporting bias at the second decimal of reported p-values is taken into account.

\begin{table}[htbp]
\centering
\begin{tabular}{llrr}
\toprule
Binwidth & & Abstracts & Results \\
\midrule
.00125 & $.03875$--$.04$ & 4597 & 26047 \\
 & $.04875$--$.05$ & 2565 & 18664 \\
 & $Prop.$ & 0.358 & 0.417 \\
 & $p$ & $>.999$ & $>.999$ \\
 & $BF_{10}$ & $<.001$ & $<.001$ \\
\midrule
.005 & $.035$--$.04$ & 6641 & 38537 \\
 & $.045$--$.05$ & 4485 & 30406 \\
 & $Prop.$ & 0.403 & 0.441 \\
 & $p$ & $>.999$ & $>.999$ \\
 & $BF_{10}$ & $<.001$ & $<.001$ \\
\midrule
.01 & $.03$--$.04$ & 9885 & 58809 \\
 & $.04$--$.05$ & 7250 & 47755 \\
 & $Prop.$ & 0.423 & 0.448 \\
 & $p$ & $>.999$ & $>.999$ \\
 & $BF_{10}$ & $<.001$ & $<.001$ \\
\bottomrule
\end{tabular}
\caption{{Results of the reanalysis across binwidths .00125, .005, and .01. $Prop.$ is the observed proportion of p-values in the bin closest to .05, out of the two compared bins; $p$ is the one-tailed binomial test p-value.}}
\end{table}

\section*{Discussion}

The current reanalysis thus finds no evidence for widespread left-skew p-hacking. This might seem inconsistent with previous findings, such as the low replication rates in psychology \cite{Open_Science_Collaboration2015-zs} or the high self-admission rates of p-hacking \cite{John2012-uj}. However, these results are not necessarily inconsistent, because they are not mutually exclusive, as explained below. Low replication rates could be caused by widespread p-hacking, but can also occur under systemically low power \cite{Bakker2014-lr,Bakker_2012}. Previous research has indicated low power levels in, for example, psychology \cite{Cohen1962-jc,Sedlmeier1989-yc} and randomized clinical trials \cite{Moher1994-ra}. As a consequence of low power, some argue that there is a high prevalence of false positives \cite{Ioannidis2005-am}, which would result in low replication rates. Additionally, high self-admission rates of p-hacking \cite{John2012-uj} pertain to such behaviors occurring at least once. Even if p-hacking is widespread across researchers, this does not imply that it occurs frequently.
In other words, a researcher might admit to having p-hacked at some point during their career, but this does not mean it occurred frequently. Moreover, as noted in the introduction, not all p-hacking behaviors lead to left skew in the p-value distribution. The method used to detect p-hacking in this paper is sensitive only to left-skew p-hacking; it is therefore possible that other types of p-hacking occur but go undetected.

Two minor limitations remain with respect to the data analysis in this reanalysis. First, selecting the bins just below .04 and .05 results in selecting non-adjacent bins, which might make the test less sensitive to left-skew p-hacking. In light of this limitation I ran the original analysis from Head et al., but included the second decimal, which resulted in the comparison of $.04\leq p<.045$ versus $.045\leq p<.05$. This comparison also yielded no evidence for left-skew p-hacking, $p>.999$, $BF_{10}<.001$; both bin selections are sketched in code below.
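For concreteness, the following sketch shows how the two bin selections discussed in this paper, the non-adjacent reanalysis bins and the adjacent bins of this robustness check, can be computed from a vector of exactly reported p-values. The file name \texttt{p\_values.csv} and its column \texttt{p} are hypothetical placeholders; the actual data and analysis files are available from the OSF page referenced in the introduction.

\begin{verbatim}
# Sketch of the two bin selections; the input file is a hypothetical
# placeholder for the mined, exactly reported p-values.
import numpy as np
import pandas as pd

p = pd.read_csv("p_values.csv")["p"].to_numpy()

# Reanalysis selection: non-adjacent bins just below .04 and .05
# (binwidth .00125), excluding the second-decimal values themselves.
lower = np.sum((p >= .03875) & (p < .04))
upper = np.sum((p >= .04875) & (p < .05))

# Robustness check: adjacent bins that include the second decimal .04.
lower_adj = np.sum((p >= .040) & (p < .045))
upper_adj = np.sum((p >= .045) & (p < .050))

print("non-adjacent bins:", lower, upper)
print("adjacent bins:    ", lower_adj, upper_adj)
\end{verbatim}

Either pair of counts can then be entered into the Caliper test and Bayes factor sketch shown earlier.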
Second, the selection of only exactly reported p-values might have distorted the p-value distribution because of small rounding biases. Previous research has indicated that p-values are somewhat more likely to be rounded to .05 than to .04 \cite{Krawczyk2015-uh}. Selecting only exactly reported p-values might therefore underrepresent p-values near .05, because such p-values are relatively often rounded and reported as $p<.05$ instead of exactly (e.g., $p=.046$). This limitation also applies to the original paper by Head et al. and is therefore a general, albeit minor, limitation of analyzing p-value distributions.

\section*{Conclusion}

Based on the results of this reanalysis, I conclude that the original evidence for widespread left-skew p-hacking \cite{Head_2015} is not robust to the data analytic choices made. Additionally, absence of evidence for left-skew p-hacking should not be interpreted as evidence for the absence of left-skew p-hacking altogether. In other words, even though no evidence for left-skew p-hacking was found, this does not mean it does not occur at all; it only indicates that it does not occur so frequently that the aggregate distribution of significant p-values in science becomes left-skewed, and that the conclusions drawn by Head et al. are therefore not warranted.

\section*{Acknowledgments}

Joost de Winter, Marcel van Assen, Robbie van Aert, Michèle Nuijten, and Jelte Wicherts provided fruitful discussion or feedback on the ideas presented in this paper. The end result is the author's sole responsibility.

\section*{Supporting Information}

\href{https://github.com/chartgerink/2015head/raw/master/submission/round\%201\%20review/S1\%20Reanalysis\%20results\%20per\%20discipline.xlsx}{\textbf{S1 File. Full reanalysis results per discipline.}}

\FloatBarrier
\bibliographystyle{plain}
\bibliography{bibliography/converted_to_latex}

\end{document}