Abstract
Meta-analyses often encounter studies with incompletely reported
variance measures (e.g. standard deviation values) or sample sizes, both
needed to conduct weighted meta-analyses. Here, we first present a
systematic literature survey on the frequency and treatment of missing
data in published ecological meta-analyses showing that the majority of
meta-analyses encountered incompletely reported studies. We then
simulated meta-analysis data sets to investigate the performance of 14
options for treating or imputing missing standard deviations (SDs)
and/or sample sizes (SSs). Performance was assessed against the results
of fully informed weighted analyses of the (hypothetically) complete
data sets. We show that the omission of
incompletely reported studies is not a viable solution. Unweighted and
sample size-based variance approximation can yield unbiased grand means
if effect sizes are independent of their corresponding SDs and SSs. The
performance of different imputation methods depends on the structure of
the meta-analysis data set, especially in the case of correlated effect
sizes and standard deviations or sample sizes. In a best-case scenario,
which assumes that SDs and/or SSs are both missing at random and are
unrelated to effect sizes, our simulations show that the imputation of
up to 90% of missing data still yields grand means and confidence
intervals that are similar to those obtained with fully informed
weighted analyses. We conclude
that multiple imputation of missing variance measures and sample sizes
could help overcome the problem of incompletely reported primary
studies, not only in the field of ecological meta-analyses. Still,
caution must be exercised with regard to potential correlations and the
pattern of missingness.
Introduction
Research synthesis aims to combine available evidence on a research
question to reach unbiased conclusions. In meta-analyses, individual
effect sizes from different studies are summarized in order to obtain a
grand mean effect size (hereafter “grand mean”) and its corresponding
confidence interval. Most of the analyses carried out in meta-analysis
and meta-regression depend on inverse-variance weighting, in which
individual effect sizes are weighted by the sampling variance of the
effect size metric in order to accommodate differences in their
precision and to separate within-study sampling error from among-study
variation. Unfortunately, meta-analyses in ecology and many other
disciplines commonly encounter missing and incompletely reported data in
original publications, especially for variance measures. Despite recent
calls towards meta-analytical thinking and comprehensive reporting,
ecological meta-analyses continue to face the issue of unreported
variances, especially when older publications are incorporated in the
synthesis.
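As a minimal illustration of inverse-variance weighting, the following Python sketch computes a fixed-effect weighted grand mean and its confidence interval; the effect sizes and variances are invented for illustration only:

```python
import numpy as np

# Hypothetical effect sizes (e.g., log response ratios) and their
# sampling variances from five primary studies.
effect_sizes = np.array([0.42, 0.10, -0.05, 0.31, 0.22])
variances = np.array([0.04, 0.09, 0.02, 0.16, 0.05])

# Inverse-variance weights: precise studies (small variance) count more.
weights = 1.0 / variances

# Fixed-effect weighted grand mean and its standard error.
grand_mean = np.sum(weights * effect_sizes) / np.sum(weights)
se = np.sqrt(1.0 / np.sum(weights))

# 95% confidence interval (normal approximation).
ci = (grand_mean - 1.96 * se, grand_mean + 1.96 * se)
```

A random-effects analysis would additionally add an among-study variance component to each study's weight, but the weighting principle is the same.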
To obtain an overview of missing data in meta-analyses, and to identify
how meta-analysts have dealt with it, we first carried out a systematic
survey of the ecological literature. In doing so, we focused on the most
common effect sizes (standardized mean difference,
logarithm of the ratio of means, hereafter termed log response ratio,
and correlation coefficient). Meta-analysts have essentially four
options to deal with missing standard deviations (SDs) or sample sizes
(SSs). The first option is to restrict the meta-analysis to only those
effect sizes that were reported with all the necessary information and
thereby exclude all incompletely reported studies. This option
(“complete-case analysis”) is the most commonly applied treatment of
missing data in published ecological meta-analyses (see Fig. 1).
However, at the very least, excluding effect sizes always means losing
potentially valuable data. Moreover, if significant findings have a
higher chance of being reported completely than non-significant results,
complete-case analysis would lead to an estimated grand mean that is
biased towards significance (i.e. reporting bias or “file-drawer
problem”). The second option is to disregard the differences in effect
size precision and thereby assign equal weights to all effect sizes.
This option (“unweighted analysis”) has also been frequently applied
in meta-analyses of log response ratios (see Fig. 1). If no SDs are
available but SSs are reported, a third option is to estimate effect
size weights from the SS information alone (see eqn 1, where
\(n_{c}\) and \(n_{t}\) denote the sample sizes of the control and
treatment groups, respectively). This “sample-size-weighted
analysis” rests on the assumption that effects obtained with larger
sample sizes are more precise than those obtained with fewer
replicates. This weighting scheme has only rarely been applied (see
Fig. 1).
eqn 1 \(\text{var}_{\text{approx}} = \frac{n_{t} + n_{c}}{n_{t}\, n_{c}}\)
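For illustration, eqn 1 can be sketched in Python; `var_approx` is a hypothetical helper, not part of any meta-analysis package:

```python
# Sample-size-based variance approximation (eqn 1), usable when SDs are
# missing but group sample sizes are reported; the effect size weight is
# its inverse.
def var_approx(n_t: int, n_c: int) -> float:
    """Approximate sampling variance from treatment/control sample sizes."""
    return (n_t + n_c) / (n_t * n_c)

# Larger studies get smaller approximate variances, hence larger weights.
v_small = var_approx(5, 5)    # 10 / 25   = 0.4
v_large = var_approx(50, 50)  # 100 / 2500 = 0.04
weight_small, weight_large = 1 / v_small, 1 / v_large
```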
The fourth option is to estimate, i.e. impute, missing values on the
basis of the reported ones. To incorporate the uncertainty of the
estimates, these imputations should be repeated multiple times. When
each of the imputed datasets is analysed separately, the obtained
results can then be averaged (“pooled”) to obtain grand mean estimates
and confidence intervals that incorporate the heterogeneity in the
imputed values.
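The pooling step can be sketched as follows; `pool_estimates` is a hypothetical helper implementing Rubin's rules for combining a point estimate and its variance across imputed data sets (all numbers invented):

```python
import numpy as np

def pool_estimates(means, variances):
    """Pool grand-mean estimates from m imputed data sets (Rubin's rules)."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(means)
    pooled_mean = means.mean()                  # average point estimate
    within = variances.mean()                   # mean within-imputation variance
    between = means.var(ddof=1)                 # variance across imputations
    total_var = within + (1 + 1 / m) * between  # Rubin's total variance
    return pooled_mean, total_var

# Hypothetical grand means and variances from m = 3 imputed data sets.
mean, var = pool_estimates([0.30, 0.34, 0.26], [0.010, 0.012, 0.011])
```

The total variance exceeds the mean within-imputation variance whenever the imputed data sets disagree, which is how the pooled confidence interval reflects imputation uncertainty.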
Various previous studies have suggested that multiple imputation can
yield grand mean estimates that are less biased than those obtained from
complete-case analyses. Multiple imputation of missing data can increase
the number of synthesized effect sizes and thereby the precision of the
grand mean estimate or of subgroup mean effect sizes. Imputed data sets
permit the testing of hypotheses that could not be tested with the
smaller subset of completely reported effect sizes (e.g. on the factors
that account for differences in effect sizes).
Despite those advantages, we speculate that the multiple imputation of
missing SDs and SSs has not yet become widely implemented in ecological
meta-analyses, partly because the necessary methods became available
only recently and partly because, in our experience, it can be
difficult to decide on the best imputation method if one assumes that
the meta-analysis dataset might harbour hidden correlation structures.
Such correlations could comprise relationships between effect sizes and
SDs or SSs. Rubin (1976) defined three distinct processes that can lead
to different observed patterns of missing data. If data (in our study,
SDs and SSs) are omitted purely by chance, the resulting pattern is
termed missing completely at random. If the chance of being omitted
correlates with another covariate (in our study, with effect sizes), the
pattern is called missing at random. If the chance of being omitted
correlates directly with the value of the data itself (in our study,
with the SD and SS values), the pattern is denoted missing not at
random.
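The three mechanisms can be illustrated with a small Python simulation; the distributions and missingness probabilities below are invented solely to show how each mechanism is generated:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
effect = rng.normal(0.3, 0.5, n)  # simulated effect sizes
sd = rng.uniform(0.5, 2.0, n)     # simulated standard deviations

# MCAR: each SD is dropped with the same fixed probability.
mcar = rng.random(n) < 0.3

# MAR: the probability of a missing SD depends on an observed covariate,
# here the effect size (e.g., small effects reported less completely).
mar = rng.random(n) < 1 / (1 + np.exp(4 * effect))

# MNAR: the probability of a missing SD depends on the SD value itself
# (larger SDs are more likely to go unreported).
mnar = rng.random(n) < (sd - 0.5) / 1.5
```

Under MNAR, the observed SDs are systematically smaller than the full set, which is why this mechanism is the hardest to correct by imputation.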
Consequently, our second goal was to evaluate imputation methods for
missing SDs or SSs for the most common effect sizes in ecological
meta-analyses (standardized mean differences, log response ratios and
correlation coefficients). Previous comparisons of imputation methods
considered only a limited number of methods and were conducted on
published data sets. In order
to systematically determine the effects of correlation structures and
patterns of missingness on the performance of different imputation
methods, we here simulated data sets that harboured four different
correlation structures. This allowed us to compare the performance of
the 14 options to treat missing SDs and SSs (cf. Table 1). We assessed the
performance of those 14 options by comparing the resulting grand means
and confidence intervals against the estimates obtained from a fully
informed weighted meta-analysis of the very same data sets. With this
approach, we provide the most complete overview to date of the most
common and easily applied options for treating missing values in
meta-analysis data sets. We aim to show how the treatment, proportion
and correlation structure of missing SDs and SSs can drive grand means
and their confidence intervals to deviate from the results of fully
informed weighted meta-analyses.