We would like to thank the reviewers for their insightful comments. The major points that have been addressed are as follows:
It was not our intention to give the impression that one must scan human calibration phantoms at each site (a very costly undertaking) to properly power a multisite study with nonstandardized parameters. Instead, we have emphasized the statistical model that takes MRI bias into account. The bias that was measured and validated via calibration served to corroborate the scaling assumption of the statistical model. For other researchers planning multisite studies, the statistical model we proposed, together with the biases we reported, should help plan and power a study.
Our measurements have been compared with other harmonization efforts, specifically (Cannon 2014, Jovicich 2013) and (Schnack 2004).
The scanning parameters of our consortium have been better specified.
The independence assumption between the unobserved effect and the scaling factor for a particular site has been addressed. Specifically, we emphasized that our experiment supports this assumption for MS patients. We recommended validating this assumption for other situations by scanning human phantoms, and we provided readers with the equation for the variance without the independence assumption.
## Major Points
The papers cited here add evidence for our hypothesis that different ROIs will have different scaling factors due to regional differences in contrast, as shown in a VBM analysis by Tardif and colleagues. The section has been edited as follows:
“We hypothesized that all differences in regional contrast and geometric distortion result in regional volumes that are consistently and linearly scaled from their true value. For a given region of interest (ROI), two mechanisms impact the final boundary definition: gradient nonlinearities cause distortion and, simultaneously, hardware (including scanner, field strength, and coils) and acquisition parameters modulate tissue contrast, adjusting the whole boundary. Previously, Tardif and colleagues have found that contrast-to-noise ratio and contrast inhomogeneity from various pulse sequences and scanner strengths cause regional biases in VBM (Tardif 2010, Tardif 2009), and therefore we hypothesized that each ROI will scale differently at each site. By imaging 12 subjects in 20 different scanners using varying acquisition schemes, we were able to estimate the scaling factor for each regional volume at each site...”
In (Streitbürger 2014), the researchers found that different pulse sequences and hardware (12-channel vs. 32-channel RF coil) altered the estimate of the coefficient on gray matter density in an age regression. This follows from our hypothesis on the scaling property of regional volumes from MRI. \(D_{U,j}\) is the unobserved effect; in the case of (Streitbürger 2014), this is the true GM density dependence on age that cannot be measured directly. We have proposed that the differences seen in regional volume analyses (and possibly VBM) were due to different scaling biases, \(a_j\), for different pulse sequences. It is true that \(a_j\) and \(D_{Y,j}\) (the observed effect) are correlated, but the assumption of independence of the unobserved effect \(D_{U,j}\) with \(a_j\) needs to be validated, because scaling factors may differ between disease groups. To verify this assumption, we calculated scaling factors on MS patients and healthy controls and could not detect differences between them. Therefore, in the case of healthy controls (aged 24-57) and MS patients, \(a_j\) depends only on scanner hardware and acquisition parameters.
The -3T flag was not used because differences in Freesurfer processing of other datasets had not been observed in our lab. The parameters have been specified as follows:
“A neuroradiologist reviewed all images to screen for major artifacts and pathology. The standard Freesurfer (Fischl 2002) version 5.3.0 cross-sectional pipeline (recon-all) was run on each site’s native T1-weighted protocol, using the RedHat 7 operating system. Both 1.5T and 3T scans were run with the same parameters (without using the -3T flag), meaning that the non-uniformity correction parameters were kept at the default values.”
A few scans initially failed QC due to small amounts of dura classified as brain at the top of the head; these required editing and were mistakenly excluded. These scans are now included and do not change the results (in particular \(CV_a\), which is the most important quantity for power calculations). We have included the results from the revised dataset in the manuscript. For the new QC procedure, eTIV was checked as suggested by the reviewer; this QC check is advantageous because it is the first step of recon-all and is sensitive to misregistration to the Talairach template that could lead to inaccurate segmentations overall. In the end, only 3 scans were rejected for failing QC at the registration step: two from the same subject at sites 4 and 6, and one from a second subject at site 11. Three other scans were not included due to acquisition/data transfer issues: one scan was not acquired due to time constraints, one was left out because part of the head was not acquired (possibly the files were corrupted), and one was erroneously deleted at the site. A column has been added to the acquisition tables to reflect how many scans were actually acquired at each site.
Freesurfer uses an atlas scaling factor (ASF) to calculate the eTIV, so even though this ROI encompasses all others, it is not calculated in the same way as the other ROIs (i.e., by counting the voxels inside the skull). Rather, Freesurfer estimates a linear transformation to an atlas space (Talairach) and calculates the determinant of the transformation matrix, which correlates with ICV (Buckner 2004). The mechanisms behind the biases in eTIV therefore differ from those behind the other regional volumes (where regional variation in CNR and gradient distortions affect the boundary). Because of this, the test-retest reliability of eTIV was included, but eTIV was excluded from the scaling factor calculations and validations.
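The determinant-based mechanism can be illustrated with a toy example; the transform values and the atlas ICV constant below are illustrative assumptions, not FreeSurfer's actual configuration.

```python
import numpy as np

# Toy 3x3 linear part of a subject-to-Talairach affine transform
# (made-up values for illustration). A determinant > 1 means the head
# is scaled UP to reach atlas space, i.e. the subject's intracranial
# volume is smaller than the atlas's.
T = np.array([[1.05, 0.02, 0.00],
              [0.00, 0.98, 0.01],
              [0.01, 0.00, 1.10]])

asf = np.linalg.det(T)       # atlas scaling factor (ASF)
ATLAS_ICV_CM3 = 1948.0       # illustrative atlas ICV constant (cm^3)
etiv = ATLAS_ICV_CM3 / asf   # eTIV is inversely proportional to ASF
print(f"ASF = {asf:.3f}, eTIV ~ {etiv:.1f} cm^3")
```

This is why eTIV biases track registration quality rather than boundary contrast: any error enters through the single determinant, not through voxel-wise boundary placement.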
Caramanos 2010 looked specifically at repositioning in relation to SIENA, which is a longitudinal analysis, but one could assume that repositioning affected the scan-rescan variability of some ROIs in our cross-sectional analysis. We have assumed that the consistency of positioning varies between sites, because a standardized positioning protocol was not defined. A major reason for this choice was to be able to incorporate previously acquired data into future studies. A measure of positioning consistency is reflected in the variability due to the site-by-run interaction term in our ICC calculation, since variability in positioning would contribute to the variability of volumes between runs at particular sites. This variance component was much smaller than the variance due to subject, site, subject-by-site, and unexplained sources, and the specific variance component for run-by-site was reported for each ROI in the supplemental materials. Overall, far more of the variance was explained by the subject, site, and subject-by-site terms than by the site-by-run term.
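The variance-components form of the ICC described above can be sketched numerically; the component values here are made up for illustration, not our estimates.

```python
# Absolute-agreement between-site ICC from variance components
# (illustrative numbers only, not the manuscript's estimates).
var_subject      = 0.80   # between-subject variance
var_site         = 0.05   # systematic site effects
var_subject_site = 0.06   # subject-by-site interaction
var_run_site     = 0.01   # run-by-site (positioning) term -- small, as in our data
var_residual     = 0.08   # unexplained

total = (var_subject + var_site + var_subject_site
         + var_run_site + var_residual)
icc_between_site = var_subject / total
print(f"between-site ICC = {icc_between_site:.3f}")  # -> 0.800
```

With these toy numbers the run-by-site term contributes only 1% of the total variance, mirroring the qualitative point that positioning variability was a minor source of disagreement.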
In the paper, scaling factor estimates were compared for subcortical gray matter volume, white matter volume and total gray matter volume, because these were specifically edited, and because they are the most relevant for future studies on MS and genetics that this consortium is proposing. The largest difference between scaling factors between healthy controls and MS was for white matter volume, and a two sample t-test between the scaling factors yielded a p-value of 0.88. In the paper, the following was added:
“Even though this study calculated scaling factors from healthy controls, we showed that scaling factors derived from an MS population were very similar or identical to those derived from healthy controls. The largest difference in scaling factors between HC versus MS patients was for white matter volume, where \(a_{MS} = .967\) and \(a_{HC} = .975\). A two-sample T test between the scaling factors resulted in a p-value of \(.88\), showing that we could not detect a significant difference in scaling factors between HC and MS. This part of the study was limited in that we only scanned patients at two scanners, while the healthy controls were scanned at 20. However, the similarity between scaling factors for the subcortical gray matter, cortical gray matter and white matter volumes between the MS and HC populations suggests that, given careful editing of volumes on the disease population, the independence assumption holds for MS. For researchers studying other diseases, it may be useful to scan healthy controls and patients before and after an upgrade or sequence change to test the validity of the independence assumption.”
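The HC-versus-MS comparison above is a standard two-sample test on per-site scaling factors; a minimal sketch, with hypothetical placeholder values rather than our measured factors:

```python
import numpy as np
from scipy import stats

# Hypothetical per-site white matter scaling factors (placeholders,
# not the consortium's measured values).
a_ms = np.array([0.962, 0.972])             # MS patients, scanned at 2 sites
a_hc = np.array([0.970, 0.980, 0.960, 0.975])  # healthy controls (subset)

# Welch's two-sample t-test (unequal variances) on the scaling factors.
t_stat, p_val = stats.ttest_ind(a_ms, a_hc, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_val:.3f}")
```

A large p-value here, as in our data, means only that no difference was detected; with two MS sites the test has little power, which is the limitation the text acknowledges.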
The overall goal of this project was not to claim that the method of scanning 12 phantom subjects was cost effective. Rather, the goal was to measure MRI-related biases when systems are not standardized, and then see how one can overcome these biases with proper sample sizes, rather than a costly calibration method or harmonization (for the case of retrospective data). This also allows sites the freedom to upgrade hardware/software or even change sequences during a study. This might be an incentive for sites to contribute data even if they are given little financial support. The phantom calibration aspect has been minimized and our statistical model that accounts for MRI-related biases has been emphasized. The measurements of that bias (which were estimated via calibration) are an important part of this study because they validate the scaling assumption of the statistical model and provide researchers values to plug into the power equation. Our framework provides an alternative method to ADNI harmonization, rather than a strict improvement. The human phantom calibration showed that the overall absolute agreement between sites improves to the same level of ADNI-type harmonization. Our results are compared to other harmonization efforts in the manuscript and in the following response.
Colleagues who work on ADNI data at UCSF verified that the same healthy controls were not scanned at multiple sites, and so the method is not directly comparable. Instead our results were compared to 3 other harmonization efforts. In the first, our between site ICC results pre- and post- calibration were compared to the between-site ICC results of (Cannon 2014). In (Cannon 2014), the sites were harmonized and an ADNI phantom was used to correct gradient distortions. The authors ran FSL’s FIRST for subcortical segmentation, and a cortical pattern matching algorithm for gray/white segmentation. They then calculated the between-site ICC using variance components similar to our study. The table below was added to the manuscript:
ROI | ICC BW | ICC BW Cal | (Cannon 2014) ICC BW |
---|---|---|---|
GMV | .78 | .96 | .854 |
WMV | .96 | .98 | .774 |
Thal | .61 | .73 | .95 |
Hipp | .75 | .84 | .79 |
Amyg | .56 | .74 | .76 |
Caud | .82 | .91 | .92 |
The two ROIs that do not compare well to (Cannon 2014) are the thalamus and the white matter volume (WMV). For the thalamus, it is possible that the FIRST algorithm is more reliable at segmenting this structure, while for white matter volume, Freesurfer is more reliable. Another table, from (Jovicich 2013), was included; there, sites were not strictly harmonized, different control subjects were scanned at each site, and the authors used the same Freesurfer cross-sectional algorithm that we used. Instead of calculating between-site ICCs, they calculated the average within-site ICCs for each ROI. The following table (which is now included in the manuscript) compares our within-site ICCs pre- and post-calibration to the (Jovicich 2013) average within-site ICC values:
ROI | ICC WI | ICC WI Cal | (Jovicich 2013) ICC WI Average |
---|---|---|---|
LV | 1 | 1 | \(.998 \pm 0.002\) |
Thal | .86 | .84 | \(0.765 \pm .183\) |
Hipp | .93 | .93 | \(0.878 \pm .132\) |
Amyg | .89 | .86 | \(0.761 \pm .134\) |
Caud | .97 | .97 | \(0.909 \pm 0.092\) |
Here, we see that the within-site thalamus ICC values fall within the range of (Jovicich 2013), along with the other ROIs. Again, the calibration done here is to show that when the biases are applied to the data, the results are comparable to the harmonized results, and therefore the variability of biases (\(CV_a\)) could be trusted.
Finally, the following comparison to Schnack 2004 was added:
“The data acquisition of our study is similar to that of (Schnack 2004), in which the researchers acquired T1-weighted images from 8 consistent human phantoms across 5 sites with non-standardized protocols. These scanners were all 1.5T except for one 1T scanner. (Schnack 2004) calibrated the intensity histograms of the images before segmentation with a calibration factor estimated based on the absolute agreement of volumes to the reference site (ICC). After applying their calibration method, the ICC of the lateral ventricle was \(\geq 0.96\), which is similar to our pre- and post-calibrated result of \(0.97\). The ICC for the intensity calibrated gray matter volume in (Schnack 2004) was \(\geq 0.84\), compared to our between-site ICCs of \(0.78\) (uncalibrated) and \(0.96\) (calibrated). Our between-site ICCs for white matter volume (\(0.96\) and \(0.98\) for the pre- and post-calibrated volumes, respectively) were much higher than those of the intensity calibrated white matter volume in (Schnack 2004) (\(\geq .78\)). This could be explained by the fact that our cohort of sites is a consortium studying multiple sclerosis, which is a white matter disease, so there may be a bias toward optimizing scan parameters for white matter. Most importantly, the calibration method of (Schnack 2004) requires re-acquisition of a human phantom cohort at each site for each multisite study. Alternatively, multisite studies employing our approach can use the results of our direct-volume calibration (the estimates of \(CV_a\) for each ROI) to estimate sample sizes based on our proposed power equation and bias measurements without acquiring their own human phantom dataset to use in calibration.”
The following confirmation of IRB approval was added to the Methods section:
“Institutional approval was acquired and signed consent was obtained for each subject at each site.”
The number of phantom subjects does not contribute to the power equation in Figure 1. The power equation that was derived does not account for any sort of calibration or scaling. Instead, it requires an estimate of \(CV_a\), the variability of scaling biases between sites. Our consortium was not asked to change protocols, so the true bias variability might be higher than that of a harmonized study; sample sizes calculated from a harmonized study's bias estimates would therefore be smaller than what a nonstandardized study actually needs. For other researchers planning multisite studies, our measurements of \(CV_a\) could be used to get a better power estimate for protocols that are close to the 3D-MPRAGE protocols described here. This is especially useful for the inclusion of retrospective data.
To power an estimate of \(CV_a\), two sample sizes are required: the subject level and the site level. The sample size of 12 is enough to estimate the slope of the calibration line (which is around 1), because the scan-rescan reliability of the ROIs is very high (around 0.9). Ideally, we would have more than 12 subjects to get a better estimate of the intercept, and this was listed as a study limitation in the discussion. 20 sites were sampled because these same sites will be used in a future study on MS genetics and MRI phenotypes, although the inclusion of more sites would also improve the \(CV_a\) estimate.
In the text, it was emphasized that the derived power equation does not require phantom subjects or calibration to be used:
“We emphasize that the use of phantom subjects does not directly contribute to the power equation in Figure 1, as the equation does not account for any sort of calibration or scaling. However, the equation requires an estimate of \(CV_a\), which is the variability of scaling biases between sites. The goal of this study is to provide researchers with estimates of \(CV_a\) from our set of calibration phantoms and our set of non-standardized MRI acquisitions. For a standardized set of scanners, these values may be considered an upper bound.”
Because ROIs display left-right asymmetry, the analysis was re-run so that each scaling factor was estimated for each ROI on the left and right hemispheres separately. In order to compare to previous research (Cannon 2014, Jovicich 2013), the ICCs were averaged across hemispheres in tables 6 and 7. In table 5, the \(CV_a\) values were shown for each hemisphere separately.
The assumption of independence between the unobserved difference between groups and the scaling factor of a site is certainly invalid if people who have more disease have tissue with different properties that affect regional contrast. It would mean that the scaling factor is correlated with the unobserved effect because different patient groups have different scaling factors. In dropping the covariance terms here, power would be overestimated, because the true denominator of the non-centrality parameter would be larger than the one used. To address this concern, we scanned MS patients to see how different their scaling factors were. The scaling factors were very similar and, in one case, identical, with the caveat that the images were QC'd more carefully to check and correct for gross segmentation misclassifications. The full equation for the calculation of variance has been added to the appendix, which includes the covariance terms before they are dropped:
\[\begin{split} var[XY] = \mu_Y^2var[X] + \mu_X^2var[Y] + 2\mu_X\mu_Ycov[X,Y] \\ - (cov[X,Y])^2 + E[(X-\mu_X)^2(Y-\mu_Y)^2] \\ + 2\mu_YE[(X-\mu_X)^2(Y-\mu_Y)] + 2\mu_XE[(X-\mu_X)(Y-\mu_Y)^2] \end{split}\]
Since scaling factor differences were not detected between MS patients and healthy controls, we do not think an approximation of the covariance is within the scope of this paper. Researchers should investigate whether the independence assumption is strongly violated for the particular disease they are studying. The point that this assumption is validated for MS only has been discussed more thoroughly in the discussion section.
The formulas were changed as follows:
\[L_1 = \prod\limits_j\frac{1}{\sqrt{2\pi a_j^2 V_j}}\exp\Big(\frac{-(\beta_{1j}- \mu)^2}{2 a_j^2 V_j}\Big)\]
and
\[\begin{gathered} \frac{\partial}{\partial \mu}\Big(\log(L_1)\Big) = \frac{\partial}{\partial \mu} \Big(\sum\limits_j^J \log\Big(\frac{1}{\sqrt{2 \pi a_j^2 V_j}}\Big) - \sum\limits_j^J \frac{(\beta_{1j} - \mu)^2}{2 a_j^2V_j}\Big) = 0 \implies \hat{\mu} = \frac{\sum\limits_j^J a_j^{-2}V_j^{-1}\beta_{1j}}{\sum\limits_j^J a_j^{-2}V_j^{-1}} \end{gathered}\]
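The closed-form estimator above is an inverse-variance weighted mean with each site's variance inflated by its squared scaling factor; a minimal numeric sketch with made-up per-site values:

```python
import numpy as np

# Hypothetical per-site effect estimates, scaling factors, and variances
# (illustrative values, not the manuscript's data).
beta = np.array([0.21, 0.18, 0.25, 0.20])  # beta_1j at each site j
a    = np.array([0.97, 1.02, 0.99, 1.01])  # scaling factors a_j
V    = np.array([0.04, 0.05, 0.03, 0.06])  # sampling variances V_j

# MLE of mu: weights are a_j^{-2} V_j^{-1}, as in the closed form.
w = 1.0 / (a**2 * V)
mu_hat = np.sum(w * beta) / np.sum(w)
print(f"mu_hat = {mu_hat:.4f}")
```

When all \(a_j = 1\) and the \(V_j\) are equal, the weights cancel and \(\hat{\mu}\) reduces to the simple mean of the \(\beta_{1j}\), which is a useful sanity check.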
The full derivation was added to the Appendix, specifically:
The proof for this is found in Introduction to the Theory of Statistics (1974) by Mood, Graybill, and Boes (Boes 1974), section 2.3, Theorem 3:
Let \(X\) and \(Y\) be two random variables where \(var[XY]\) exists, then
\[\begin{split} var[XY] = \mu_Y^2var[X] + \mu_X^2var[Y] + 2\mu_X\mu_Ycov[X,Y] \\ - (cov[X,Y])^2 + E[(X-\mu_X)^2(Y-\mu_Y)^2] \\ + 2\mu_YE[(X-\mu_X)^2(Y-\mu_Y)] + 2\mu_XE[(X-\mu_X)(Y-\mu_Y)^2] \end{split}\]
which can be obtained by computing \(E[XY]\) and \(E[(XY)^2]\) when \(XY\) is expressed as
\[XY = \mu_X\mu_Y + (X-\mu_X)\mu_Y + (Y-\mu_Y)\mu_X + (X-\mu_X)(Y-\mu_Y)\]
If \(X\) and \(Y\) are independent, then \(E[XY] = \mu_X\mu_Y\), the covariance terms are 0, and
\[E[(X-\mu_X)^2(Y-\mu_Y)^2] = E[(X-\mu_X)^2]E[(Y-\mu_Y)^2] = var[X]var[Y]\]
and
\[\mu_YE[(X-\mu_X)^2(Y-\mu_Y)] = \mu_YE[(X-\mu_X)^2]E[(Y-\mu_Y)] = 0\]
\[\mu_XE[(Y-\mu_Y)^2(X-\mu_X)] = \mu_XE[(Y-\mu_Y)^2]E[(X-\mu_X)] = 0\]
which gives
\[var[XY] = \mu_X^2var[Y] + \mu_Y^2var[X] + var[X]var[Y]\]
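The independence case can be checked numerically; a quick Monte Carlo sketch (simulation parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Independent X and Y with known moments.
mu_x, mu_y = 2.0, 3.0
var_x, var_y = 0.25, 0.49
X = rng.normal(mu_x, np.sqrt(var_x), n)
Y = rng.normal(mu_y, np.sqrt(var_y), n)

# Theoretical variance of the product under independence.
theory = mu_x**2 * var_y + mu_y**2 * var_x + var_x * var_y
empirical = np.var(X * Y)
print(f"theory = {theory:.4f}, empirical = {empirical:.4f}")
```

With a million draws the empirical variance lands within a few percent of the theoretical value, confirming the reduced formula for independent variables.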
More detail on how subjects were scanned was added:
“Two scans were acquired from each subject; between scans, the subject got out of the scanner for a couple of minutes and was then repositioned and rescanned by the scanning technician of that particular site.”
We have added the following, to give specifics on our processing environment:
The standard Freesurfer (Fischl 2002) version 5.3.0 cross-sectional pipeline (recon-all) was run on each site’s native T1-weighted protocol, using the RedHat 7 operating system on IEEE 754 compliant hardware.
During the QC process on the MS patients, we found errors that were very different from the errors on healthy controls, due to the effect of lesions. When calculating a scaling factor, it is critical that the tissue classification be accurate; otherwise, the scaling factor estimate could be incorrect. There may be a segmentation pipeline that handles lesions better, and that pipeline will be optimized once patient data from the consortium are acquired. To clarify this point, and to give more details on the types of errors, the text has been edited as follows:
“The accuracy of our scaling factor estimates depends on the accuracy of tissue segmentation, but the lesions in MS specifically impact white matter and pial surface segmentations. Because of the effect of lesions on Freesurfer’s tissue classification, all images were manually corrected for lesions on the T1-weighted images by a neurologist after editing by Freesurfer’s quality assurance procedure, which included extensive topological white matter corrections, white matter mask edits, and pial edits on images that were not lesion filled. These manual edits altered the white matter surface so that white matter lesions were not misclassified as gray matter or non-brain tissue. The errors in white matter segmentations most typically occurred at the border of white matter and gray matter and around the ventricles. The errors in pial surface segmentations most typically occurred near the eyes (orbitofrontal) and the superior frontal or medial frontal lobes. Images that were still misclassified after thorough edits were removed from the analysis, because segmentations were not accurate enough to produce realistic scaling factor estimates.”
This important point was also made by reviewer 2, and we have clarified the overall goal of this project. We will restate our response here. The overall goal of this project was not to claim that the method of scanning 12 phantom subjects was cost effective. Rather, the goal was to measure MRI-related biases when systems are not standardized, and then see how one can overcome these biases with proper sample sizes, rather than a costly calibration method or harmonization (for the case of retrospective data). This also allows sites the freedom to upgrade hardware/software or even change sequences during a study. This might be an incentive for sites to contribute data even if they are given little financial support. The phantom calibration aspect has been minimized and our statistical model that accounts for MRI-related biases has been emphasized. The measurements of that bias (which were estimated via calibration) are an important part of this study because they validate the scaling assumption of the statistical model and provide researchers values to plug into the power equation. Our framework provides an alternative method to ADNI harmonization, rather than a strict improvement. The human phantom calibration showed that the overall absolute agreement between sites improves to the same level of ADNI-type harmonization. Our results are compared to other harmonization efforts in the manuscript and in the following response.
Our sites have used sequences that are similar to the vendor provided-T1 sequences, and (Jovicich 2013) found that high multicenter reliability can be achieved using these standard vendor sequences with very few calibration efforts. However, many of the sites in our consortium are in the middle of longitudinal studies within their sites, and are hesitant to make even very small protocol changes, despite the result from (Jovicich 2013), which was for the longitudinal processing stream. Our statistical model was for a cross-sectional design, and the evaluation of scaling bias is important to optimize sample sizes for the cross-sectional case.
Does the method here proposed offer improved multi-centric reliability than other studies? The across site reliability measures obtained with the proposed calibration do not appear to be placed in perspective with the vast literature on this topic (for example, but not limited to: Wolz et al. 2014; Roche et al., 2014; Jovicich et al., 2013). In particular, this last study shows inter-site ICC measures on many of the same structures reported here, also obtained using Freesurfer, but with notably higher reliability than the calibrated results reported here:
Structure | Between-site ICC after calibration in this study (Fig. 2) | Jovicich et al., 2013 (Suppl. Table 1) |
---|---|---|
Lateral ventricle | 0.96 | 0.998 |
Thalamus | 0.78 | 0.972 |
Hippocampus | 0.88 | 0.951 |
Amygdala | 0.82 | 0.939 |
Caudate | 0.92 | 0.942 |
Authors should discuss potential reasons for such differences, for example in the context of acquisition variability, calibration methods or segmentation methods (Freesurfer longitudinal versus cross-sectional or other methods).
Our results were compared to the cross-sectional results from (Jovicich 2013), which had notably worse reliability than the longitudinal results. Two new tables were included that compare our results to (Jovicich 2013) and (Cannon 2014), both in the manuscript and in the response to reviewer 2’s similar major concern. The mean within-site ICCs were in the same range as (Jovicich 2013), which makes sense given that we ran the same cross-sectional pipeline and our sequences were close to the standard vendor sequences of (Jovicich 2013). In the response to reviewer 2 and in the discussion, we have also compared our results to the calibration efforts of Schnack et al. 2004, where histogram intensity thresholds were calibrated rather than the volumes themselves.
We have clarified that \(i\) refers to subjects, and we have clarified the choice of weights for the disease contrast vector. The coefficients are subscripted 00, 10, and 20 because the trailing 0 denotes that these are the true, unobserved effect sizes. The text now reads:
"We first defined the true, unobserved model for subject \(i\) at site \(j\) as:
\[U_{i,j} = \beta_{00} + \beta_{10}X_{i,j} + \beta_{20}Z_{i,j} + \epsilon_{i,j}\]
where \(U_{i,j}\) is the unobserved value of the regional brain volume of interest (without any effects from the scanner), and \(\beta_{00}, \beta_{10}\) and \(\beta_{20}\) are the true, unobserved, effect sizes. The covariates are \(Z_{i,j}\), residuals are \(\epsilon_{i,j}\), and the contrast vector, \(X_{i,j}\), is given the weights \(X_{high}, X_{low} = 0.5,-0.5\) so that \(\beta_{10}\) is computed as the average difference between the high and low groups. For this derivation we assume an equal number of subjects observed at each site in the high and low groups with balanced covariates. \(\epsilon\) is normally distributed with mean 0 and standard deviation \(\sigma_0\)."
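The model can be sketched with a toy simulation, assuming the multiplicative site bias \(Y_{i,j} = a_j U_{i,j}\) discussed elsewhere in the manuscript; all numeric values are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_group = 100                        # subjects per group at one site

# True, unobserved effect sizes (made-up values).
beta00, beta10, beta20 = 10.0, 0.4, 0.2
sigma0 = 1.0
a_j = 0.97                               # site scaling factor (assumed)

# Contrast coding: +0.5 for the high group, -0.5 for the low group,
# so beta10 is the high-minus-low group difference.
X = np.concatenate([np.full(n_per_group, 0.5), np.full(n_per_group, -0.5)])
Z = rng.normal(0, 1, 2 * n_per_group)    # a balanced covariate
eps = rng.normal(0, sigma0, 2 * n_per_group)

U = beta00 + beta10 * X + beta20 * Z + eps   # unobserved volumes
Y = a_j * U                                   # observed (scaled) volumes

# OLS on the observed data; the estimated group effect targets a_j * beta10.
D = np.column_stack([np.ones_like(X), X, Z])
coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
print(f"estimated group effect = {coef[1]:.3f} (target {a_j * beta10:.3f})")
```

The sketch makes the scaling assumption concrete: the observed effect at site \(j\) is the unobserved effect multiplied by \(a_j\), which is exactly why \(CV_a\) enters the power calculation.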
Information was added to the acquisition tables (such as whether or not distortion correction was used, the number of channels in the head coil, orientation, FOV, readout direction, operating system version, parallel imaging/acceleration factor, averaging information, acquisition time per volume, and the 3D sequence used), and the tables were split into four. The text was edited as follows:
“Two scans were acquired from each subject; between scans, the subject got out of the scanner for a couple of minutes and was then repositioned and rescanned by the scanning technician of that particular site.”
We have added all the information to tables 1-4, and now the variability within the consortium is more apparent. We have not included the institution name, city and country in these tables because it is not as relevant as the sequence/scanning parameters and test-retest reliability, but the information is provided in the supplemental materials.
The type of Freesurfer pipeline used was clarified in the text:
The standard Freesurfer (Fischl 2002) version 5.3.0 cross-sectional pipeline was run on each site’s native T1-weighted protocol, using the RedHat 7 operating system
There was a typo where \(\alpha\) was written instead of \(a\), and now the text has been edited:
“The OLS was run with the intercept fixed at 0. \(CV_a\) for each metric was calculated as follows:”
\[CV_{a} = \frac{std(\hat{a})}{mean(\hat{a})} \label{eq:cvadef}\]
As stated earlier in the text (equation 4), we assume \(a\) to come from a normal distribution:
\[a_j \sim N(\mu_{a},\sigma_a^2)\]
The \(CV_a\) estimate is calculated from the mean and standard deviation of the sampling distribution of the estimate of \(a\), which is \(\hat{a}\).
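The computation reduces to a zero-intercept OLS slope per site followed by the coefficient of variation across sites; a sketch with synthetic volumes (all values are simulated, not our measurements):

```python
import numpy as np

rng = np.random.default_rng(2)
n_subjects, n_sites = 12, 20

# Synthetic "true" regional volumes and per-site scaled observations.
true_vol = rng.uniform(4.0, 6.0, n_subjects)   # arbitrary units
true_a = rng.normal(1.0, 0.03, n_sites)        # simulated site scaling factors

a_hat = np.empty(n_sites)
for j in range(n_sites):
    observed = true_a[j] * true_vol + rng.normal(0, 0.05, n_subjects)
    # OLS slope with the intercept fixed at 0: a_hat = (x.y) / (x.x)
    a_hat[j] = observed @ true_vol / (true_vol @ true_vol)

cv_a = np.std(a_hat) / np.mean(a_hat)
print(f"CV_a = {cv_a:.4f}")
```

Because the scan-rescan noise is small relative to the spread of true volumes, 12 subjects are enough for each slope, and the recovered \(CV_a\) tracks the simulated 3% spread of the scaling factors.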
In a multisite model, we are sampling effect sizes from our set of sites, then taking an average of those samples and testing whether or not this average effect is significantly different from 0. Because of this, it is important to have enough site-level samples to estimate a mean effect. The plots do not go down to the single-site case because the power curves would not apply there; the model would be different. In the single-site case, one simply needs to power a two-sample t-test given an effect size, number of subjects, and false positive rate. If similar parameters are taken to a single-site case (effect size = 0.2, alpha = 0.002, power = 80%), one would need 1550 subjects, all acquired at one site. However, it takes a very long time to acquire that many subjects at one site, and it is likely that the scanner will go through upgrades or that protocols will change in the meantime. The n cutoff (number of subjects per site for our 20 sites) chosen for this plot was 150 subjects per site, which is the maximum number we would ask our consortium to collect. Ideally this would be even lower, especially if researchers wanted to study very rare diseases. At a certain point, even with zero variability from MRI, there are not enough sites for effect sizes that are so small (which is the case in genetics), and this is why the number of sites does not go below 10 for this particular effect size (\(<10\) sites are not enough for 80% power, even with no bias).
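The single-site figure quoted above can be reproduced with a normal-approximation sample size calculation (the exact number depends on whether a z- or t-approximation is used, hence "~1550"):

```python
from math import ceil
from scipy.stats import norm

effect_size = 0.2   # Cohen's d
alpha = 0.002       # two-sided false positive rate
power = 0.80

# Normal-approximation sample size per group for a two-sample t-test:
# n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2
z_alpha = norm.ppf(1 - alpha / 2)
z_power = norm.ppf(power)
n_per_group = ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

total = 2 * n_per_group
print(f"{n_per_group} per group, {total} total")  # ~1550 subjects in all
```

The stringent alpha of 0.002 (e.g., a multiple-comparisons correction) is what drives the total so high for a small effect.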
Roche et al. 2014 addressed partial volume estimation using a Bayesian MAP formulation that extends the mixel model. Our segmentation results do not account for partial volume effects. In the text we added:
“Another factor not accounted for in our segmentation results was the effect of partial voluming, which adds uncertainty to tissue volume estimates. In (Roche 2014), researchers developed a method to more accurately estimate partial volume effects using only T1-weighted images from the ADNI dataset, which resulted in higher accuracy in classifying Alzheimer’s disease (AD) patients and mild cognitively impaired (MCI) patients versus normal controls (NL).”
Wolz et al. 2014 applied the LEAP algorithm to hippocampal volumes compared between 1.5T and 3T, and found a small bias (1.17% mean signed difference) between field strengths on ADNI data. Jovicich 2009 found that test-retest reproducibility does not change much across platforms and field strengths. In the discussion section we have added:
“Even though we did not standardize the protocols and scanners within this study, the consortium is unbalanced in that there are 16 3T scanners, of which 11 are Siemens. Of the Siemens 3T scanners, there is little variability in TR, TE, and TI; however, there is more variance in the use of parallel imaging, the number of channels in the head coil (12, 20, or 32), and the field of view. We could not detect differences in scan-rescan reliability between field strengths, similar to the findings of (Jovicich 2009). Wolz and colleagues also could not detect differences in scan-rescan reliabilities of the hippocampus volumes estimated by the LEAP algorithm, but they detected a small bias between field strengths, where the hippocampus volumes in the 3T ADNI scanners were 1.17% larger than in the 1.5T scanners (Wolz 2014). A two-sample t-test with unequal variances was run between the scaling factors of the 1.5T versus 3T scanners, and we could not detect differences in any ROI except the left and right amygdala. The 1.5T scaling factors were lower than those of the 3T scanners (0.90 versus 1.02), meaning that the amygdala volume estimates from the 1.5T scanners were larger than those from the 3T scanners. However, this interpretation is limited due to the small sample size of 1.5T scanners in this consortium.”
Wyman et al. 2013: I’m not sure if this is the correct paper you are referring to, but this paper emphasized the use of ADNI’s standardized data sets, which the authors say add (1) greater rigor in reporting, (2) the ability to compare various techniques side-by-side, (3) the ability to evaluate robustness of a given technique, and (4) the ability to replicate methods. In order to enable other researchers to compare our methods, evaluate robustness, and replicate our study, we will provide, in the supplemental materials, the raw MRI volumes produced by FreeSurfer, along with the Python and R code to calculate scaling factors, leave-one-out calibration, and between- and within-site ICC.
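As a minimal sketch of the scaling-factor estimation (the released supplemental code is more complete), one can take each traveling subject's reference volume to be the mean across sites and fit, per site, a least-squares slope through the origin. The site names and volumes below are hypothetical.

```python
def scaling_factors(volumes):
    """volumes[site] is a list of one ROI's volumes, one per traveling subject.

    Returns a per-site scaling factor: the least-squares slope through the
    origin of site volumes against subject reference volumes (mean across
    sites), i.e. beta_j = sum_i V_ij * ref_i / sum_i ref_i**2.
    """
    n_subj = len(next(iter(volumes.values())))
    ref = [sum(v[i] for v in volumes.values()) / len(volumes)
           for i in range(n_subj)]
    denom = sum(r * r for r in ref)
    return {site: sum(v[i] * ref[i] for i in range(n_subj)) / denom
            for site, v in volumes.items()}

# Hypothetical data: siteB systematically reports volumes 0.9x those of siteA
vols = {"siteA": [10.0, 12.0, 11.0],
        "siteB": [9.0, 10.8, 9.9]}
betas = scaling_factors(vols)
```

Because the reference is the cross-site mean, the estimated factors recover the relative bias between sites (here, betas["siteB"] / betas["siteA"] is 0.9).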
Whitwell et al. 2012: Found different rates of hippocampal atrophy in the ADNI cohort than the Mayo Clinic Study of Aging cohort, even though there were no differences in hippocampal volumes between the matched cohorts. This is attributed to sampling of different populations rather than differences in hippocampal volumes due to differences in acquisition parameters. In the text I added:
“The other limitation of this study is that we assumed that subjects across all sites come from the same population, and that stratification occurs solely from systematic errors within each site. In reality, sites may recruit from different populations, and the true disease effect will vary even more. For example, in a comparison between a matched ADNI cohort and a matched Mayo Clinic Study of Aging cohort, researchers found different rates of hippocampal atrophy even though no differences in hippocampal volumes were detected (Whitwell 2012). This could be attributed to sampling from two different populations. This added site-level variability requires a larger site-level sample size; for an example of modeling this, see (Han 2011).”
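To illustrate why site-level variability drives up the required number of sites rather than subjects per site, consider the variance of the grand mean under a one-way random-effects model. The variance components below are illustrative, not estimates from our data (cf. Han 2011 for a full treatment).

```python
def var_of_pooled_mean(sigma2_within, sigma2_site, n_per_site, n_sites):
    """Variance of the grand mean when each of n_sites sites contributes
    n_per_site subjects and sites carry their own random offset:
    Var(mean) = (sigma2_site + sigma2_within / n_per_site) / n_sites."""
    return (sigma2_site + sigma2_within / n_per_site) / n_sites

# With no site-level variance, adding subjects per site keeps helping;
# with site-level variance, sigma2_site sets a floor that only more
# sites can shrink.
v_no_site = var_of_pooled_mean(1.0, 0.0, n_per_site=100, n_sites=5)
v_site = var_of_pooled_mean(1.0, 0.2, n_per_site=100, n_sites=5)
v_more_sites = var_of_pooled_mean(1.0, 0.2, n_per_site=100, n_sites=20)
```

In this toy setting the site-level component inflates the variance of the pooled estimate far more than the within-site term, so power planning must budget for the number of sites.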
Tyrone D. Cannon, Kristin Cadenhead, Barbara Cornblatt, Scott W. Woods, Jean Addington, Elaine Walker, Larry J. Seidman, Diana Perkins, Ming Tsuang, Thomas McGlashan, others. Reliability of neuroanatomical measurements in a multisite longitudinal study of youth at risk for psychosis. Human Brain Mapping 35, 2424–2434 (2014).
Jorge Jovicich, Moira Marizzoni, Roser Sala-Llonch, Beatriz Bosch, David Bartrés-Faz, Jennifer Arnold, Jens Benninghoff, Jens Wiltfang, Luca Roccatagliata, Flavio Nobili, others. Brain morphometry reproducibility in multi-center 3T MRI studies: a comparison of cross-sectional and longitudinal segmentations. Neuroimage 83, 472–484 (2013).
Hugo G. Schnack, Neeltje E.M. van Haren, Hilleke E. Hulshoff Pol, Marco Picchioni, Matthias Weisbrod, Heinrich Sauer, Tyrone Cannon, Matti Huttunen, Robin Murray, René S. Kahn. Reliability of brain volumes from multicenter MRI acquisition: A calibration study. Human Brain Mapping 22, 312–320 (2004).
Christine L. Tardif, D. Louis Collins, G. Bruce Pike. Regional impact of field strength on voxel-based morphometry results. Human Brain Mapping 31, 943–957 (2010).
Christine L. Tardif, D. Louis Collins, G. Bruce Pike. Sensitivity of voxel-based morphometry analysis to choice of imaging protocol at 3 T. Neuroimage 44, 827–838 (2009).
Daniel-Paolo Streitbürger, André Pampel, Gunnar Krueger, Jöran Lepsien, Matthias L. Schroeter, Karsten Mueller, Harald E. Möller. Impact of image acquisition on voxel-based-morphometry investigations of age-related structural brain changes. Neuroimage 87, 170–182 (2014).
B. Fischl, D. H. Salat, E. Busa, M. Albert, M. Dieterich, C. Haselgrove, A. van der Kouwe, R. Killiany, D. Kennedy, S. Klaveness, A. Montillo, N. Makris, B. Rosen, A. M. Dale. Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron 33, 341–355 (2002).
Randy L. Buckner, Denise Head, Jamie Parker, Anthony F. Fotenos, Daniel Marcus, John C. Morris, Abraham Z. Snyder. A unified approach for morphometric and functional data analysis in young, old, and demented adults using automated atlas-based head size normalization: reliability and validation against manual measurement of total intracranial volume. Neuroimage 23, 724–738 (2004).
Alexander M. Mood, Franklin A. Graybill, Duane C. Boes. Introduction to the Theory of Statistics. McGraw-Hill Series in Probability and Statistics (1974).