# Detailed Reviewer Responses

## Abstract

We would like to thank the reviewers for their insightful comments. The major points that have been addressed are as follows:

1. It was not our intention to give the impression that one must scan human calibration phantoms at each site to properly power a multisite study with nonstandardized parameters, which would be very costly. Instead, we have emphasized the statistical model that takes MRI bias into account. The bias that was measured and validated via calibration served to corroborate the scaling assumption of the statistical model. For other researchers planning multisite studies, the proposed statistical model, together with the biases we reported, should help plan and power a study.

2. Our measurements have been compared with other harmonization efforts, specifically (Cannon 2014, Jovicich 2013) and (Schnack 2004).

3. The scanning parameters of our consortium have been better specified.

4. The independence assumption between the unobserved effect and the scaling factor for a particular site has been addressed. Specifically, we emphasized that this assumption appears to hold for MS patients based on our experiment. We recommended validating this assumption for other situations by scanning human phantoms, and we provided readers the equation for the variance without the independence assumption.

# Reviewer 1

## Major Points

• Assumption is made that the major contributions to the boundary variability are the contrast differences and gradient distortions, but it was shown (see Tardif 2010 "Regional Impact of Field Strength on Voxel-Based Morphometry Results" and Tardif 2009 "Sensitivity of voxel-based morphometry analysis to choice of imaging protocol at 3 T"), that different protocols and field strengths significantly affect signal and contrast inhomogeneity patterns, which leads to differences in regional accuracy of tissue classification due to different contrast to noise ratio.

The papers cited here add evidence for our hypothesis that different ROIs will have different scaling factors due to regional differences in contrast, as shown in the VBM analyses by Tardif and colleagues. The section has been edited as follows:

“We hypothesized that all differences in regional contrast and geometric distortion result in regional volumes that are consistently and linearly scaled from their true values. For a given region of interest (ROI), two mechanisms impact the final boundary definition: gradient nonlinearities cause distortion and, simultaneously, hardware (including scanner, field strength, and coils) and acquisition parameters modulate tissue contrast, adjusting the whole boundary. Previously, Tardif and colleagues found that contrast-to-noise ratio and contrast inhomogeneity from various pulse sequences and scanner field strengths cause regional biases in VBM (Tardif 2010, Tardif 2009), and we therefore hypothesized that each ROI would scale differently at each site. By imaging 12 subjects in 20 different scanners using varying acquisition schemes, we were able to estimate the scaling factor for each regional volume at each site...”

• Assumption about independence of MRI-related biases and biological effect being studied (a_j is independent from D_u,j). It was shown that there are significant differences in detecting anatomical changes due to ageing and acquisition parameters (see Streitburger 2014 "Impact of image acquisition on voxel-based-morphometry investigations of age-related structural brain changes" for differences between 12-channel and 32-channel RF coils in MPRAGE and for differences caused by varying resolution).

In (Streitbürger 2014), the researchers found that different pulse sequences and hardware (12-channel vs. 32-channel RF coils) altered the estimate of the coefficient on gray matter density in an age regression. This follows from our hypothesis on the scaling property of regional volumes from MRI. $$D_{U,j}$$ is the unobserved effect: in the case of (Streitbürger 2014), the true dependence of GM density on age, which cannot be measured directly. We have proposed that the differences seen in regional volume analyses (and possibly VBM) are due to different scaling biases, $$a_j$$, for different pulse sequences. It is certainly true that $$a_j$$ and $$D_{Y,j}$$ (the observed effect) are correlated, but the assumption that the unobserved effect ($$D_{U,j}$$) is independent of $$a_j$$ needs to be validated, because scaling factors may differ between disease groups. To verify this assumption, we calculated scaling factors on MS patients and healthy controls and could not detect differences between them. Therefore, in the case of healthy controls (aged 24-57) and MS patients, $$a_j$$ depends only on scanner hardware and acquisition parameters.

## Minor Points

• The authors have performed experiments using data collected at both 3T scanners and 1.5T scanners, and performed data processing using Freesurfer version 5.3.0. This software uses different parameters for non-uniformity correction if data from 3T scanner is processed, and requires use of special flag to change this parameter. It is not clear if this was done in the experiment.

The -3T flag was not used because differences in Freesurfer processing of other datasets were not observed in our lab. The parameters have been specified as follows:

“A neuroradiologist reviewed all images to screen for major artifacts and pathology. The standard Freesurfer (Fischl 2002) version 5.3.0 cross-sectional pipeline (recon-all) was run on each site’s native T1-weighted protocol, using the RedHat 7 operating system. Both 1.5T and 3T scans were run with the same parameters (without the -3T flag), meaning that the non-uniformity correction parameters were kept at their default values.”

• Manual QC removed 1-2 scans from half of the sites; was there any pattern in which sites had problems associated with QC? I.e., was it associated with particular acquisition parameters?

A few scans initially failed QC due to small amounts of dura classified as brain at the top of the head; these required editing and were mistakenly excluded. These scans are now included and do not change the results (in particular $$CV_a$$, which is the most important quantity for power calculations). We have included the results from the revised dataset in the manuscript. For the new QC procedure, eTIV was checked as suggested by the reviewer; this QC check is advantageous because it is the first step of recon-all and is sensitive to misregistration to the Talairach template, which could lead to inaccurate segmentations overall. In the end, only 3 scans were rejected for failing QC at the registration step: two from the same subject at sites 4 and 6, and one from a second subject at site 11. Three other scans were not included due to acquisition or data transfer issues: one scan was not acquired due to time constraints, one was left out because part of the head was not acquired (possibly the files were corrupted), and one was erroneously deleted at the site. A column has been added to the acquisition tables to reflect how many scans were actually acquired at each site.

• Figure 2 shows that total intracranial volume has the lowest agreement between sites even after calibration, followed by Thalamus (which is consistent with findings in Schnack 2010). This is strange, considering that this ROI encompasses all others, so averaging over larger volume will reduce errors. Looking at tables 1&2 one can see also that test-retest variability of TIV was particularly low in three sites (4, 6, 11 ) , perhaps this step of the procedure failed in several scans originating from these sites and they have to be excluded from the analysis?

Freesurfer uses an atlas scaling factor (ASF) to calculate the eTIV, so even though this ROI encompasses all others, it is not calculated in the same way as the other ROIs (i.e., by counting the voxels inside the skull). Rather, Freesurfer estimates a linear transformation to an atlas space (Talairach) and calculates the determinant of the transformation matrix, which correlates with ICV (Buckner 2004). The mechanisms behind the biases in eTIV are therefore different from those behind the other regional volumes, where regional variation in CNR and gradient distortions affect the boundary. Because of this, the test-retest reliability of eTIV was included, but eTIV was excluded from the scaling factor calculations and validations.
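To make the ASF mechanism concrete, here is a minimal numeric sketch; all numbers, including the atlas ICV, are illustrative assumptions, and only the determinant-based idea comes from (Buckner 2004):

```python
import numpy as np

# Hypothetical 3x3 linear part of a native-to-Talairach affine.
# A head 10% larger than the atlas in every dimension must be shrunk
# by 1/1.1 along each axis to register to the atlas.
A = np.diag([1 / 1.1, 1 / 1.1, 1 / 1.1])

asf = np.linalg.det(A)       # atlas scaling factor: volume change of the transform
atlas_icv = 1755.0           # assumed atlas ICV in cm^3 (illustrative value)
etiv = atlas_icv / asf       # eTIV scales inversely with the ASF

# The larger head's eTIV comes out 1.1^3 times the atlas ICV.
assert np.isclose(etiv, atlas_icv * 1.1 ** 3)
```

Because eTIV depends only on this registration, errors in the Talairach transform propagate directly into eTIV, rather than through the tissue-boundary effects that drive the other ROI volumes.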

• The size of the field of view and the read-out direction are known to change distortion patterns in the scans; these parameters should be added to tables 1&2.

This information was added to the scanning tables, which have now been split into 4 tables, and it reveals additional variability between our sites.

• Some scanners have built-in geometric distortion correction, sometimes it is a parameter configurable on the console, it would be interesting to know if it was used in the experiments.
This information was added to the acquisition tables 1-4.

• It is mentioned in the Methods section that the repositioning consistency of each site's scanning procedure was captured; however, there is no metric in the results section focusing explicitly on that (i.e., consistency of the subject position). It would be important to compare it to the consistency of subject positioning in Caramanos 2010, since it will affect the assumption on the ability to correct gradient-distortion-caused variations with a scaling factor derived from a different acquisition.

Caramanos 2010 looked specifically at repositioning in relation to SIENA, a longitudinal analysis, but one could assume that repositioning affected the scan-rescan variability of some ROIs in our cross-sectional analysis. We have assumed that the consistency of positioning varies between sites, because a standardized positioning protocol was not defined; a major reason for this choice was to be able to incorporate previously acquired data into future studies. Some measure of positioning consistency is reflected in the amount of variability due to the site-by-run interaction term in our ICC calculation, since variability in positioning would contribute to the variability of volumes between runs at particular sites. This variance component was much smaller than the variance due to subject, site, subject-by-site, and unexplained sources, and the specific variance component for run-by-site was reported for each ROI in the supplemental materials. Overall, far more of the variance was explained by subject, site, and the subject-by-site interaction than by the site-by-run term.
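The role of the run-by-site term can be sketched with illustrative variance components (the numbers below are made up, not the study's estimates):

```python
# Hypothetical variance components for one ROI. Inconsistent subject
# positioning at a site would surface in the run-by-site term.
var_subject      = 100.0
var_site         = 8.0
var_subject_site = 5.0
var_run_site     = 0.5   # small relative to the other components
var_resid        = 3.0

total = (var_subject + var_site + var_subject_site
         + var_run_site + var_resid)

# Between-site ICC of absolute agreement: subject variance over the total.
icc_between = var_subject / total
print(round(icc_between, 3))  # -> 0.858
```

With components in these (assumed) proportions, shrinking or growing the run-by-site term barely moves the ICC, which is the sense in which positioning variability was a minor contributor.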

• Results of comparison of the calibration between normal and MS subjects in the discussion section show that the agreement is better for some structures and worse for others, it would be important to know if the difference is statistically significant for some structures.

In the paper, scaling factor estimates were compared for subcortical gray matter volume, white matter volume, and total gray matter volume, because these were specifically edited, and because they are the most relevant for the future studies on MS and genetics that this consortium is proposing. The largest difference in scaling factors between healthy controls and MS was for white matter volume, and a two-sample t-test between the scaling factors yielded a p-value of 0.88. The following was added to the paper:

“Even though this study calculated scaling factors from healthy controls, we showed that scaling factors derived from an MS population were very similar or identical to those derived from healthy controls. The largest difference in scaling factors between HC versus MS patients was for white matter volume, where $$a_{MS} = .967$$ and $$a_{HC} = .975$$. A two-sample T test between the scaling factors resulted in a p-value of $$.88$$, showing that we could not detect a significant difference in scaling factors between HC and MS. This part of the study was limited in that we only scanned patients at two scanners, while the healthy controls were scanned at 20. However, the similarity between scaling factors for the subcortical gray matter, cortical gray matter and white matter volumes between the MS and HC populations suggests that, given careful editing of volumes on the disease population, the independence assumption holds for MS. For researchers studying other diseases, it may be useful to scan healthy controls and patients before and after an upgrade or sequence change to test the validity of the independence assumption.”
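The comparison can be sketched as a pooled two-sample t statistic; the per-site values below are hypothetical, with only the group means (about .975 for HC and .967 for MS) taken from the text:

```python
from math import sqrt
from statistics import mean, stdev

def two_sample_t(x, y):
    """Pooled (equal-variance) two-sample t statistic."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))

# Hypothetical white matter scaling factors per site (illustrative only).
a_hc = [0.95, 1.00, 0.975, 0.96, 0.99]   # healthy-control sites
a_ms = [0.95, 0.984]                     # MS was scanned at only two sites

t = two_sample_t(a_hc, a_ms)
# A small |t| here is consistent with failing to detect a group difference.
```

With only two MS sites, the test has little power, which is why the text frames the result as "could not detect a difference" rather than as evidence of equality.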

# Reviewer 2

## Major Points

• If a financial claim is made, it must be substantiated. The costs of having N phantom subjects travel and be scanned at each site of a multi-centric study are certainly prohibitive, and it cannot be assumed that this is necessarily lower than harmonizing protocols and acquisitions, as well as scanning geometric phantoms at each site. The authors cite ADNI as an example, and it is true that ADNI has gone through an extensive and expensive process of harmonizing sequences, building a phantom, and scanning the phantom before each scan. However, those are now sunk costs which, thanks to ADNI, need not be borne by every new study: as a case in point, there is a large body of projects that now use the ADNI protocol as is, or with only slight variations, and therefore need a fraction of the cost to harmonize protocols. Further, ADNI itself has moved away from scanning geometric phantoms before each scan. It is an open question in the literature what the right frequency of phantom scans is to properly sample scanner drift; however, the current rule of thumb falls between weekly and monthly, and for a 15-minute scan, this is not an overly expensive proposition.

The overall goal of this project was not to claim that the method of scanning 12 phantom subjects is cost effective. Rather, the goal was to measure MRI-related biases when systems are not standardized, and then to show how one can overcome these biases with proper sample sizes rather than with a costly calibration or harmonization effort (in the case of retrospective data). This also allows sites the freedom to upgrade hardware or software, or even change sequences, during a study, which might be an incentive for sites to contribute data even if they are given little financial support. The phantom calibration aspect has been minimized and our statistical model that accounts for MRI-related biases has been emphasized. The measurements of that bias (estimated via calibration) are an important part of this study because they validate the scaling assumption of the statistical model and provide researchers values to plug into the power equation. Our framework provides an alternative to ADNI-style harmonization, rather than a strict improvement. The human phantom calibration showed that the overall absolute agreement between sites improves to the same level as ADNI-type harmonization. Our results are compared to other harmonization efforts in the manuscript and in the following response.

• The proposed technique is not directly compared to any other harmonization effort. While head to head results would be hard to get, a random sampling of control subjects in ADNI, for example, at 20 different sites, could have been done; with similar analyses of results; and then the coefficients of variability directly compared. Then, one would have an idea of which approach is, or isn't, superior.

Colleagues who work on ADNI data at UCSF verified that the same healthy controls were not scanned at multiple sites, and so the method is not directly comparable. Instead our results were compared to 3 other harmonization efforts. In the first, our between site ICC results pre- and post- calibration were compared to the between-site ICC results of (Cannon 2014). In (Cannon 2014), the sites were harmonized and an ADNI phantom was used to correct gradient distortions. The authors ran FSL’s FIRST for subcortical segmentation, and a cortical pattern matching algorithm for gray/white segmentation. They then calculated the between-site ICC using variance components similar to our study. The table below was added to the manuscript:

Between-site ICC comparison to the study by (Cannon 2014), where MRI sequences were standardized, subcortical segmentation was performed using FIRST, and cortical segmentation used cortical pattern matching. ICC BW and ICC BW Cal were calculated using our multisite healthy control data, where ICC BW Cal is the between-site ICC of volumes after applying the scaling factor from a leave-one-out calibration. Other than the thalamus (Thal), we found that the between-site ICCs were comparable to (Cannon 2014) for the amygdala (Amyg) and caudate (Caud), and even higher for the hippocampus (Hipp), gray matter volume (GMV), and white matter volume (WMV).

| ROI  | ICC BW | ICC BW Cal | (Cannon 2014) ICC BW |
|------|--------|------------|----------------------|
| GMV  | .78    | .96        | .854                 |
| WMV  | .96    | .98        | .774                 |
| Thal | .61    | .73        | .95                  |
| Hipp | .75    | .84        | .79                  |
| Amyg | .56    | .74        | .76                  |
| Caud | .82    | .91        | .92                  |


The two ROIs that do not compare favorably to (Cannon 2014) are the thalamus and the white matter volume (WMV). For the thalamus, it is possible that the FIRST algorithm is more reliable at segmenting this structure, while for white matter volume, Freesurfer is more reliable. We also included a comparison to (Jovicich 2013), where sites were not strictly harmonized, different control subjects were scanned at each site, and the authors used the same Freesurfer cross-sectional algorithm that we used. Instead of calculating between-site ICCs, they calculated the average within-site ICC for each ROI. The following table (now included in the manuscript) compares our within-site ICCs pre- and post-calibration to the average within-site ICC values of (Jovicich 2013):

Comparing the within-site ICC before and after leave-one-out scaling factor calibration with the cross-sectional Freesurfer results of (Jovicich 2013), where scanners used vendor-provided T1-weighted sequences, and the average within-site ICC is shown. The within-site ICCs of our study fall within the range of (Jovicich 2013), which shows that the sites in this study are as reliable as those in (Jovicich 2013).

| ROI  | ICC WI | ICC WI Cal | (Jovicich 2013) ICC WI Average |
|------|--------|------------|--------------------------------|
| LV   | 1      | 1          | $$0.998 \pm 0.002$$            |
| Thal | .86    | .84        | $$0.765 \pm 0.183$$            |
| Hipp | .93    | .93        | $$0.878 \pm 0.132$$            |
| Amyg | .89    | .86        | $$0.761 \pm 0.134$$            |
| Caud | .97    | .97        | $$0.909 \pm 0.092$$            |

Here, we see that the within-site thalamus ICC values fall within the range of (Jovicich 2013), along with the other ROIs. Again, the purpose of the calibration here is to show that when the estimated biases are applied to the data, the results are comparable to harmonized results, and therefore the variability of biases ($$CV_a$$) can be trusted.

Finally, the following comparison to Schnack 2004 was added:

“The data acquisition of our study is similar to that of (Schnack 2004), in which the researchers acquired T1-weighted images from 8 consistent human phantoms across 5 sites with non-standardized protocols. These scanners were all 1.5T except for one 1T scanner. (Schnack 2004) calibrated the intensity histograms of the images before segmentation with a calibration factor estimated based on the absolute agreement of volumes with the reference site (ICC). After applying their calibration method, the ICC of the lateral ventricle was $$\geq 0.96$$, which is similar to our pre- and post-calibration result of $$0.97$$. The ICC for the intensity-calibrated gray matter volume in (Schnack 2004) was $$\geq 0.84$$, compared to our between-site ICCs of $$0.78$$ (uncalibrated) and $$0.96$$ (calibrated). Our between-site ICCs for white matter volume ($$0.96$$ and $$0.98$$ for the pre- and post-calibrated volumes, respectively) were much higher than those of the intensity-calibrated white matter volume in (Schnack 2004) ($$\geq .78$$). This could be explained by the fact that our cohort of sites is a consortium studying multiple sclerosis, which is a white matter disease, so there may be a bias toward optimizing scan parameters for white matter. Most importantly, the calibration method of (Schnack 2004) requires re-acquisition of a human phantom cohort at each site for each multisite study. Alternatively, multisite studies employing our approach can use the results of our direct-volume calibration (the estimates of $$CV_a$$ for each ROI) to estimate sample sizes based on our proposed power equation and bias measurements, without acquiring their own human phantom dataset to use in calibration.”

• Methods : please provide confirmation of the institutional ethics approval for this project

The following confirmation of IRB approval was added to the Methods section:

“Institutional approval was acquired and signed consent was obtained for each subject at each site.”

• Methods: the justification for the number of phantom subjects and the number of sites needed to ascertain the power is not given. Basically, why 12? Why 20? While this could be gathered from Figure 1, the case for these exact numbers is not explicitly made

The number of phantom subjects does not contribute to the power equation in Figure 1; the derived power equation does not account for any sort of calibration or scaling. Instead, it requires an estimate of $$CV_a$$, the variability of scaling biases between sites. Our consortium was not asked to change protocols, so the true bias variability is likely higher than that of a harmonized study; had we used harmonized data to estimate the bias, the calculated sample sizes would be smaller than what is actually needed for non-standardized acquisitions. For other researchers planning multisite studies, our measurements of $$CV_a$$ could be used to obtain a better power estimate for protocols that are close to the 3D-MPRAGE protocols described here. This is especially useful for the inclusion of retrospective data.

To estimate $$CV_a$$ well, two sample sizes are required: the number of subjects and the number of sites. The sample size of 12 is enough to estimate the slope of the calibration line (which is around 1), because the scan-rescan reliability of the ROIs is very high (around 0.9). Ideally, we would have more than 12 subjects to obtain a better estimate of the intercept, and this was listed as a limitation in the discussion. 20 sites were sampled because these same sites will be used in a future study on MS genetics and MRI phenotypes, although the inclusion of more sites would also improve the $$CV_a$$ estimate.

In the text, it was emphasized that the derived power equation does not require phantom subjects or calibration to be used:

“We emphasize that the use of phantom subjects does not directly contribute to the power equation in Figure 1, as the equation does not account for any sort of calibration or scaling. However, it requires an estimate of $$CV_a$$, the variability of scaling biases between sites. The goal of this study is to provide researchers with estimates of $$CV_a$$ from our set of calibration phantoms and our set of non-standardized MRI acquisitions. For a standardized set of scanners, these values may be considered an upper bound.”
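As a toy Monte Carlo illustration of why $$CV_a$$ matters for power (this is not the manuscript's power equation; the distributional assumptions here are ours):

```python
import random

random.seed(0)

def observed_effect_sd(cv_a, true_effect=1.0, n_sites=100_000):
    """SD of the observed site-level effect a_j * D, assuming scaling
    factors a_j ~ N(1, cv_a^2) independent of the fixed true effect D."""
    draws = [random.gauss(1.0, cv_a) * true_effect for _ in range(n_sites)]
    m = sum(draws) / n_sites
    return (sum((d - m) ** 2 for d in draws) / (n_sites - 1)) ** 0.5

# A larger CV_a spreads the observed effects further between sites,
# so more sites/subjects are needed for the same power.
for cv in (0.01, 0.05, 0.10):
    print(cv, round(observed_effect_sd(cv), 3))
```

Under these assumptions, the between-site spread of the observed effect tracks $$CV_a$$ directly, which is why a non-standardized consortium (higher $$CV_a$$) yields conservative sample sizes for a harmonized one.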

• Methods: As stated in section 3.2, the authors used ROI volumes averaged across hemispheres. It is well known that multiple ROIs display left-right asymmetry. The rationale behind this procedure is not understood, since it adds artificial volume variability while the goal of the manuscript is to provide a way to obtain more precise or "true" volumetric measures across sites. At the least, this procedure needs to be well justified.

Because ROIs display left-right asymmetry, the analysis was re-run so that each scaling factor was estimated for each ROI on the left and right hemispheres separately. In order to compare to previous research (Cannon 2014, Jovicich 2013), the ICCs were averaged across hemispheres in tables 6 and 7. In table 5, the $$CV_a$$ values were shown for each hemisphere separately.
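A leave-one-out scaling-factor calibration can be sketched as follows; the slope-through-the-origin estimator and all numbers are illustrative assumptions, not necessarily the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(1)

true_vols = rng.uniform(4000, 8000, size=12)        # 12 phantom subjects
a_true = np.array([0.95, 1.00, 1.03, 0.98, 1.02])   # per-site scaling biases
# Observed volumes (sites x subjects): linear scaling plus measurement noise.
vols = np.outer(a_true, true_vols) + rng.normal(0, 30, size=(5, 12))

def loo_scaling_factor(vols, j):
    """Slope through the origin of site j's volumes against the
    leave-one-out mean volume across the remaining sites."""
    ref = np.delete(vols, j, axis=0).mean(axis=0)
    return (vols[j] @ ref) / (ref @ ref)

a_hat = np.array([loo_scaling_factor(vols, j) for j in range(vols.shape[0])])
calibrated = vols / a_hat[:, None]   # divide out each site's estimated bias
```

After dividing out the estimated $$a_j$$, the site means agree much more closely across sites, which is the kind of improvement the calibrated between-site ICCs in the tables reflect.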

• Methods: In formula (6), with the assumption of independence of the scaling factor and the unobserved difference between groups (high and low disease group), we lose more information. It is understood that this assumption is necessary to calculate the unconditional variance of the disease effect estimate at each site, but if we remove the independence assumption (we don't drop the covariance) and use an approximation instead (e.g. numeric approximation), we lose less information, although the calculation will be more complex.

The assumption of independence between the unobserved difference between groups and the scaling factor of a site is certainly invalid if people who have more disease have tissue with different properties that affect regional contrast. It would mean that the scaling factor is correlated with the unobserved effect because different patient groups have different scaling factors. In dropping the covariance terms here, power would be overestimated, because the true denominator of the non-centrality parameter would be larger than the one computed under independence. To address this concern, we scanned MS patients to see how different their scaling factors were. The scaling factors were very similar, and in one case identical, with the caveat that the images were QC'd more carefully to check and correct for gross segmentation misclassifications. The full equation for the variance, including the covariance terms before they are dropped, has been added to the appendix:

$\begin{split} var[XY] = \mu_Y^2var[X] + \mu_X^2var[Y] + 2\mu_X\mu_Ycov[X,Y] \\ - (cov[X,Y])^2 + E[(X-\mu_X)^2(Y-\mu_Y)^2] \\ + 2\mu_YE[(X-\mu_X)^2(Y-\mu_Y)] + 2\mu_XE[(X-\mu_X)(Y-\mu_Y)^2] \end{split}$

Since scaling factor differences were not detected between MS patients and healthy controls, we do not think an approximation of the covariance is within the scope of this paper. Researchers should investigate whether this independence assumption is strongly violated for the particular disease they are studying. The point that this assumption has been validated only for MS is discussed more thoroughly in the Discussion section.

## Minor Points

• In appendix, in the likelihood formula (17) and last formula in (18) the product and sum are on "j" and not on "i".

The formulas were changed as follows:

$L_1 = \prod\limits_j\frac{1}{\sqrt{2\pi a_j^2 V_j}}\exp\Big(\frac{-(\beta_{1j}- \mu)^2}{2 a_j^2 V_j}\Big)$

and

$\begin{gathered} \frac{\partial}{\partial \mu}\Big(\log(L_1)\Big) = \frac{\partial}{\partial \mu} \Big(\sum\limits_j^J \frac{1}{\sqrt{2 \pi a_j^2 V_j}} - \sum\limits_j^J \frac{(\beta_{1j} - \mu)^2}{2 a_j^2V_j}\Big) = 0 \implies \hat{\mu} = \frac{\sum\limits_j^J a_j^{-2}V_j^{-1}\beta_{1j}}{\sum\limits_j^J a_j^{-2}V_j^{-1}} \end{gathered}$

• The function log in the second formula in (18) seems to have been forgotten. Although this doesn't have an effect on the maximum likelihood estimator of the mean, it should be displayed. It could also be proven that log(L1) is concave.

The formula has been changed to:

$\begin{gathered} \frac{\partial}{\partial \mu}\Big(\log(L_1)\Big) = \frac{\partial}{\partial \mu} \Big(\sum\limits_j^J \log\Big(\frac{1}{\sqrt{2 \pi a_j^2 V_j}}\Big) - \sum\limits_j^J \frac{(\beta_{1j} - \mu)^2}{2 a_j^2V_j}\Big) = 0 \implies \hat{\mu} = \frac{\sum\limits_j^J a_j^{-2}V_j^{-1}\beta_{1j}}{\sum\limits_j^J a_j^{-2}V_j^{-1}} \end{gathered}$
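The resulting estimator is an inverse-variance weighted mean, with weights $$a_j^{-2}V_j^{-1}$$. A minimal sketch with hypothetical per-site numbers:

```python
# Per-site effect estimates beta_1j, scaling factors a_j, and variances V_j.
# All values are hypothetical, for illustration only.
beta = [0.50, 0.42, 0.61, 0.55]
a    = [0.97, 1.02, 0.99, 1.01]
V    = [0.010, 0.020, 0.015, 0.012]

# Weight each site by the inverse of its effect variance a_j^2 * V_j.
w = [1.0 / (aj ** 2 * Vj) for aj, Vj in zip(a, V)]
mu_hat = sum(wj * bj for wj, bj in zip(w, beta)) / sum(w)
```

Sites with large scaling bias or large sampling variance are down-weighted, so the pooled estimate always lies within the range of the per-site estimates.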

• The calculation of variance in formula (6) should be added in the Appendix because it is not obvious

The full derivation was added to the Appendix, specifically:

The proof for this is found in Introduction to the Theory of Statistics (1974) by Mood, Graybill and Boes (Boes 1974), Section 2.3, Theorem 3:

Let $$X$$ and $$Y$$ be two random variables where $$var[XY]$$ exists, then

$\begin{split} var[XY] = \mu_Y^2var[X] + \mu_X^2var[Y] + 2\mu_X\mu_Ycov[X,Y] \\ - (cov[X,Y])^2 + E[(X-\mu_X)^2(Y-\mu_Y)^2] \\ + 2\mu_YE[(X-\mu_X)^2(Y-\mu_Y)] + 2\mu_XE[(X-\mu_X)(Y-\mu_Y)^2] \end{split}$

which can be obtained by computing $$E[XY]$$ and $$E[(XY)^2]$$ when $$XY$$ is expressed as

$XY = \mu_X\mu_Y + (X-\mu_X)\mu_Y + (Y-\mu_Y)\mu_X + (X-\mu_X)(Y-\mu_Y)$

If $$X$$ and $$Y$$ are independent, then $$E[XY] = \mu_X\mu_Y$$, the covariance terms are 0, and

$E[(X-\mu_X)^2(Y-\mu_Y)^2] = E[(X-\mu_X)^2]E[(Y-\mu_Y)^2] = var[X]var[Y]$

and

$\mu_YE[(X-\mu_X)^2(Y-\mu_Y)] = \mu_YE[(X-\mu_X)^2]E[(Y-\mu_Y)] = 0$

$\mu_XE[(Y-\mu_Y)^2(X-\mu_X)] = \mu_XE[(Y-\mu_Y)^2]E[(X-\mu_X)] = 0$

Which gives

$var[XY] = \mu_X^2var[Y] + \mu_Y^2var[X] + var[X]var[Y]$
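This identity is easy to spot-check numerically; here is a small Monte Carlo sketch with arbitrary independent normals (the means and SDs are our choices):

```python
import random

random.seed(42)

mu_x, sd_x = 1.0, 0.05   # e.g. a scaling factor centered on 1
mu_y, sd_y = 2.0, 0.50   # e.g. an unobserved effect

n = 200_000
xy = [random.gauss(mu_x, sd_x) * random.gauss(mu_y, sd_y) for _ in range(n)]
m = sum(xy) / n
var_mc = sum((v - m) ** 2 for v in xy) / (n - 1)

# Closed form under independence: mu_x^2 var[Y] + mu_y^2 var[X] + var[X] var[Y]
var_formula = mu_x ** 2 * sd_y ** 2 + mu_y ** 2 * sd_x ** 2 + sd_x ** 2 * sd_y ** 2
```

The simulated variance agrees with the closed form to within Monte Carlo error, and the same simulation with correlated draws would recover the extra covariance terms of the full theorem.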

• 3.1 "Two scans were acquired from each subject ": This is not clear. Did all subjects have scan-rescan on each scanner? Please also specify whether the participants were healthy subjects or MS patients.

More detail on how subjects were scanned was added:

“Two scans were acquired from each subject; the subject got out of the scanner between scans for a couple of minutes, and was repositioned and rescanned by the scanning technician of that particular site.”

• 3.2 Please specify the hardware on which Freesurfer was run.

We have added the following, to give specifics on our processing environment:

“The standard Freesurfer (Fischl 2002) version 5.3.0 cross-sectional pipeline (recon-all) was run on each site’s native T1-weighted protocol, using the RedHat 7 operating system on IEEE 754-compliant hardware.”

• 3.3 "Because MS patients have lesions that could affect Freesurfer's tissue classification, all images were manually corrected for lesions on the T1-weighted images by a neurologist" More details should be provided (e.g. which regions were typically modified). Why not run FreeSurfer on the MS subjects as is, because the bias would be the same?

During the QC process on the MS patients, we found errors that were very different from the errors on healthy controls, due to the effect of lesions. When calculating a scaling factor, it is really important that the tissue classification is accurate, otherwise the scaling factor estimate could be incorrect. There may be a segmentation pipeline that handles lesions better, and optimizing that pipeline will happen when patient data from the consortium is acquired. To clarify this point, and to give more details on the types of errors, the text has been edited as follows:

“The accuracy of our scaling factor estimates depends on the accuracy of tissue segmentation, but the lesions in MS specifically impact white matter and pial surface segmentations. Because of the effect of lesions on Freesurfer’s tissue classification, all images were manually corrected for lesions on the T1-weighted images by a neurologist after editing by Freesurfer’s quality assurance procedure, which included extensive topological white matter corrections, white matter mask edits, and pial edits on images that were not lesion filled. These manual edits altered the white matter surface so that white matter lesions were not misclassified as gray matter or non-brain tissue. The errors in white matter segmentations most typically occurred at the border of white matter and gray matter and around the ventricles. The errors in pial surface segmentations most typically occurred near the eyes (orbitofrontal) and the superior frontal or medial frontal lobes. Images that were still misclassified after thorough edits were removed from the analysis, because the segmentations were not accurate enough to produce realistic scaling factor estimates.”

# Reviewer 3

## Major Points

• Does the method here proposed offer a more cost-efficient approach relative to other studies? The claim that this work proposes a more efficient multi-centric calibration approach should be reconsidered in the light of past work. It is true that the ADNI method involved considerable effort to install on different vendors particular sequences, which was possible by having agreements with the various vendors. However, simpler approaches have recently shown that high multi-centric reliability can be also achieved by taking the standard sequences available on various makes and models and limit the calibration to the setting of a few basic key acquisition parameters that optimize tissue contrast (Jovicich et al., 2013). This approach did not require software or hardware adjustments, did not use geometric calibration phantoms and did not use a group of volunteers travelling for scans at multiple sites distributed across countries. I highly doubt that sending 12 people across the USA and Europe to be scanned on 20 MRI sites can be considered more efficient and less costly than sending no person at all, or maybe one to check the local protocol after the setting of a few acquisition parameters. Or maybe the authors meant something else, please clarify.

This important point was also made by reviewer 2, and we have clarified the overall goal of this project. We will restate our response here. The overall goal of this project was not to claim that the method of scanning 12 phantom subjects was cost effective. Rather, the goal was to measure MRI-related biases when systems are not standardized, and then to show how these biases can be overcome with proper sample sizes instead of a costly calibration or harmonization procedure (in the case of retrospective data). This also allows sites the freedom to upgrade hardware/software, or even change sequences, during a study, which might be an incentive for sites to contribute data even when they are given little financial support. The phantom calibration aspect has been minimized, and our statistical model that accounts for MRI-related biases has been emphasized. The measurements of that bias (estimated via calibration) are an important part of this study because they validate the scaling assumption of the statistical model and provide researchers values to plug into the power equation. Our framework is an alternative to ADNI harmonization rather than a strict improvement: the human phantom calibration showed that the overall absolute agreement between sites improves to the same level as ADNI-type harmonization. Our results are compared to other harmonization efforts in the manuscript and in the following response.

Our sites have used sequences that are similar to the vendor-provided T1 sequences, and (Jovicich 2013) found that high multicenter reliability can be achieved using these standard vendor sequences with very few calibration efforts. However, many of the sites in our consortium are in the middle of their own longitudinal studies and are hesitant to make even very small protocol changes, despite the result from (Jovicich 2013), which in any case applied to the longitudinal processing stream. Our statistical model is for a cross-sectional design, and the evaluation of scaling bias is important for optimizing sample sizes in the cross-sectional case.

• Does the method here proposed offer improved multi-centric reliability than other studies? The across site reliability measures obtained with the proposed calibration do not appear to be placed in perspective with the vast literature on this topic (for example, but not limited to: Wolz et al. 2014; Roche et al., 2014; Jovicich et al., 2013). In particular, this last study shows inter-site ICC measures on many of the same structures reported here, also obtained using Freesurfer, but with notably higher reliability than the calibrated results reported here:

| Structure | Between-site ICC after calibration in this study (Fig. 2) | Jovicich et al., 2013 (Suppl. Table 1) |
|---|---|---|
| Lateral ventricle | 0.96 | 0.998 |
| Thalamus | 0.78 | 0.972 |
| Hippocampus | 0.88 | 0.951 |
| Amygdala | 0.82 | 0.939 |
| Caudate | 0.92 | 0.942 |

Authors should discuss potential reasons for such differences, for example in the context of acquisition variability, calibration methods or segmentation methods (Freesurfer longitudinal versus cross-sectional or other methods).

Our results were compared to the cross-sectional results from (Jovicich 2013), which had notably worse reliability than the longitudinal results. Two new tables comparing our results to (Jovicich 2013) and (Cannon 2014) were included, both in the manuscript and in the response to reviewer 2’s similar major concern. The mean within-site ICCs were in the same range as (Jovicich 2013), which makes sense given that we ran the same cross-sectional pipeline and our sequences were close to the standard vendor sequences of (Jovicich 2013). In the response to reviewer 2 and in the discussion we have also compared our results to the calibration efforts of (Schnack 2004), where histogram intensity thresholds were calibrated rather than the volumes themselves.

## Minor Points

• When introducing Equation 1 please state immediately that the index i refers to the subjects, instead of doing it in the following paragraph. Please also clarify why instead of using coefficients 0, 1, 2, you use 00, 10, 20. In the sentence that follows the equation please also clarify the term related to the disease contrast vector, and the choice of its weights.

We have clarified that $$i$$ refers to subjects, and the choice of weights for the disease contrast vector. The coefficients are 00, 10 and 20 because the last 0 denotes that these are the true, unobserved effect sizes. The text now reads:

"We first defined the true, unobserved model for subject $$i$$ at site $$j$$ as:

$U_{i,j} = \beta_{00} + \beta_{10}X_{i,j} + \beta_{20}Z_{i,j} + \epsilon_{i,j}$

Where $$U_{i,j}$$ is the unobserved value of the regional brain volume of interest (without any effects from the scanner), and $$\beta_{00}, \beta_{10}$$ and $$\beta_{20}$$ are the true, unobserved, effect sizes. The covariates are $$Z_{i,j}$$, residuals are $$\epsilon_{i,j}$$, and the contrast vector, $$X_{i,j}$$, is given the weights $$X_{high}, X_{low} = 0.5,-0.5$$ so that $$\beta_{10}$$ is computed as the average difference between the high and low groups. For this derivation we assume an equal number of subjects observed at each site in the high and low groups with balanced covariates. $$\epsilon$$ is normally distributed with mean 0 and standard deviation $$\sigma_0$$."
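As a quick check of this coding, a minimal simulation (ours, not part of the manuscript; all values hypothetical) shows that with weights $$X_{high}, X_{low} = 0.5, -0.5$$ and balanced groups, the fitted $$\hat{\beta}_{10}$$ is exactly the mean high-minus-low difference:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # subjects per group (hypothetical)

# Contrast vector: +0.5 for the "high" group, -0.5 for the "low" group
x = np.concatenate([np.full(n, 0.5), np.full(n, -0.5)])
beta_00, beta_10 = 10.0, 2.0  # hypothetical true intercept and effect
u = beta_00 + beta_10 * x + rng.normal(0.0, 1.0, 2 * n)

# OLS fit of U on [1, X]
design = np.column_stack([np.ones(2 * n), x])
coef, *_ = np.linalg.lstsq(design, u, rcond=None)

# With +/-0.5 coding and equal group sizes, the fitted slope equals the
# difference between the high-group and low-group means
group_diff = u[:n].mean() - u[n:].mean()
```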

• Section 3.1 Acquisition: Clarify what sequences are being considered. Clarify the procedure details followed to do the head repositioning when obtaining the test-retest data (In the same day? Did the subject walk out of the scanner? How much time between acquisitions?).

Information was added to the acquisition tables (whether or not distortion correction was used, number of channels in the head coil, orientation, FOV, readout direction, operating system version, parallel imaging/acceleration factor, averaging information, acquisition time per volume, and 3D sequence used), and the tables were split into four. The text was edited as follows:

“Two scans were acquired from each subject; the subject got out of the scanner between scans for a couple of minutes and was repositioned and rescanned by the scanning technician of that particular site.”

• Tables 1 and 2: 3.1. These tables should appear after Figure 1 but before any of the other Figures
We have moved the tables and figures to the end of the article under the "Tables" and "Figures" headings.

• Tables 1 and 2: 3.2. The logic used to define the order in which the MRI sites are presented is unclear. Please reorder to make clearer the variability of the systems. For example, list first the 1.5T sites ordering them by vendor, and then the 3T sites ordered by vendor.
We have split the tables into 4 tables, where the first table is 1.5T scanners, the second table is 3T GE and Philips, the third is Siemens Skyra and Prisma, and the last is Siemens Trio scanners.

• Tables 1 and 2: 3.3. Add the following information for each site: site information (Institution name, city, country), MRI system model (e.g., Siemens 3T could be Allegra, Trio, Skyra, Verio, Prisma) and operating system version, parallel imaging information and acceleration factor, averaging information, acquisition time per volume, 3D sequence used, 3D slice orientation. Such information will help describe the variability within the consortium.

We have added all the information to tables 1-4, and now the variability within the consortium is more apparent. We have not included the institution name, city and country in these tables because it is not as relevant as the sequence/scanning parameters and test-retest reliability, but the information is provided in the supplemental materials.

• Tables 1 and 2: 3.4. In the table caption clarify that the test-retest reliability is within-site measured as ICC(1,1).

We have edited the captions of tables 1-4 to clarify that the test-retest reliability is the within-site ICC(1,1).
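For reference, ICC(1,1) is the single-measurement ICC from a one-way random-effects ANOVA. A minimal sketch of the computation from a subjects-by-scans matrix (our illustration, not the consortium's actual code):

```python
import numpy as np

def icc_1_1(data):
    """ICC(1,1): one-way random-effects, single-measurement ICC.

    data: array of shape (n_subjects, k_scans), e.g. test-retest volumes.
    """
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    grand_mean = data.mean()
    subj_means = data.mean(axis=1)

    # Mean squares from the one-way ANOVA decomposition
    ms_between = k * ((subj_means - grand_mean) ** 2).sum() / (n - 1)
    ms_within = ((data - subj_means[:, None]) ** 2).sum() / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```

Perfect test-retest agreement gives an ICC of 1; the estimate can be negative when within-subject variance dominates.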

• Section 3.2 Preprocessing: clarify whether the longitudinal or cross-sectional Freesurfer pipelines were used in the test-retest data.

The type of freesurfer pipeline used was clarified in the text:

“The standard Freesurfer (Fischl 2002) version 5.3.0 cross-sectional pipeline was run on each site’s native T1-weighted protocol, using the RedHat 7 operating system.”

• Equation 14: clarify what alpha and a stand for as well as the distribution from which the mean and the standard deviations are computed.

There was a typo where $$\alpha$$ was written instead of $$a$$; the text has been edited:

“The OLS was run with the intercept fixed at 0. $$CV_a$$ for each metric was calculated as follows:”

$CV_{a} = \frac{std(\hat{a})}{mean(\hat{a})} \label{eq:cvadef}$

As stated earlier in the text (equation 4), we assume $$a$$ to come from a normal distribution:

$a_j \sim N(\mu_{a},\sigma_a^2)$

The $$CV_a$$ estimate is calculated from the mean and standard deviation of the sampling distribution of the estimate of $$a$$, which is $$\hat{a}$$.
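As an illustration, the zero-intercept OLS and $$CV_a$$ steps can be sketched as follows (function and variable names are ours, not the manuscript's):

```python
import numpy as np

def fit_scale(reference, observed):
    """Zero-intercept OLS slope a-hat, so that observed ~ a * reference."""
    reference = np.asarray(reference, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return (reference @ observed) / (reference @ reference)

def cv_a(a_hats):
    """Coefficient of variation of the per-site scaling estimates a-hat."""
    a_hats = np.asarray(a_hats, dtype=float)
    return a_hats.std(ddof=1) / a_hats.mean()  # sample std / mean
```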

• Figure 7: It is not clear why this formulation does not describe a situation in which the population can be studied by less than 10 sites, in particular by a single site.

In a multisite model, we are sampling effect sizes from our set of sites, then taking the average of those samples and testing whether this average effect is significantly different from 0. Because of this, it is important to have enough site-level samples to estimate a mean effect. The plots do not go down to the single-site case because these power curves would not apply there; the model would be different. In a single-site case, one simply needs to power a two-sample T-test given an effect size, number of subjects, and false positive rate. With parameters similar to ours (effect size = 0.2, alpha = 0.002, power = 80%), one would need about 1550 subjects, all acquired at one site. However, it would take a very long time to acquire that many subjects at one site, and the scanner would likely go through upgrades, or protocols would change, in the meantime. The n cutoff (number of subjects per site for our 20 sites) chosen for this plot was 150 subjects per site, which is the maximum number we would ask our consortium to collect. Ideally this would be even lower, especially if researchers wanted to study very rare diseases. At a certain point, even with zero variability from MRI, there are not enough sites for effect sizes this small (as is the case in genetics), which is why the number of sites does not go below 10 for this particular effect size ($$<10$$ site-level samples is not enough for 80% power, even with no bias).
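The single-site figure quoted above follows from a standard power calculation; a normal-approximation sketch (the manuscript's exact number may come from a noncentral-t calculation):

```python
from scipy.stats import norm

d, alpha, power = 0.2, 0.002, 0.80  # effect size, false positive rate, power

z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value
z_beta = norm.ppf(power)

# Two-sample t-test with equal group sizes: subjects needed per group
n_per_group = 2 * ((z_alpha + z_beta) / d) ** 2
total = 2 * n_per_group  # roughly 1550 subjects at a single site
```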

• Although the consortium includes 1.5T systems (4), it is dominated by 3T systems (16 out of 20), by one vendor (Siemens: 11 out of 16 at 3T, 2 out of 4 at 1.5T) and there does not seem to be a lot of variability in the acquisition parameters across protocols within vendor and field. Under such circumstances it seems that a proper acquisition calibration could have been fairly easy, and the consortium appears to be less of a good candidate of a consortium with high intrinsic variability on which to validate the proposed calibration method. In particular, it is well known that T1 contrast changes with field strength. Despite the unbalanced nature of the consortium, it would be useful if the authors would present and discuss reliability differences at the two field strengths evaluated, putting results in perspective with previous studies (Roche et al. 2014; Wolz et al. 2014; Jovicich et al. 2009; Wyman et al. 2013; Whitwell et al. 2012).
1. Roche et al. 2014 addressed partial volume estimation using a Bayesian MAP formulation that extends the mixel model. Our segmentation results do not account for partial volume effects. In the text we added:

“Another factor not accounted for in our segmentation results was the effects of partial voluming, which adds uncertainty to tissue volume estimates. In (Roche 2014), researchers developed a method to more accurately estimate partial volume effects using only T1-weighed images from the ADNI dataset, which resulted in higher classification accuracy between Alzheimer’s disease (AD) patients and mild cognitively impaired (MCI) patients from normal controls (NL).”

2. Wolz et al. 2014 applied the LEAP algorithm to hippocampal volumes compared between 1.5T and 3T and found a small bias (1.17% mean signed difference) between field strengths on ADNI data. Jovicich et al. 2009 found that test-retest reproducibility does not change much across platforms and field strengths. In the discussion section we have added:

“Even though we did not standardize the protocols and scanners within this study, the consortium is unbalanced in that there are 16 3T scanners, of which 11 are Siemens. Among the Siemens 3T scanners there is little variability in TR, TE and TI, but more variability in the use of parallel imaging, the number of channels in the head coil (12, 20 or 32), and the field of view. We could not detect differences in scan-rescan reliability between field strengths, similar to the findings of (Jovicich 2009). Wolz and colleagues also could not detect differences in scan-rescan reliabilities of the hippocampus volumes estimated by the LEAP algorithm, but they detected a small bias between field strengths, where the hippocampus volumes from the 3T ADNI scanners were 1.17% larger than those from the 1.5T scanners (Wolz 2014). A two-sample T-test with unequal variances was run between the scaling factors of the 1.5T versus 3T scanners, and we could not detect differences in any ROI except the left and right amygdala, where the 1.5T scaling factors were lower than those of the 3T scanners (0.9 versus 1.02), meaning that the amygdala volume estimates from the 1.5T scanners were larger than those from the 3T. However, this interpretation is limited by the small sample size of 1.5T scanners in this consortium.”

3. Wyman et al. 2013: We are not sure if this is the correct paper being referred to, but it emphasized the use of ADNI’s standardized data sets, which the authors say add (1) greater rigor in reporting, (2) the ability to compare various techniques side-by-side, (3) the ability to evaluate the robustness of a given technique, and (4) the ability to replicate methods. To enable other researchers to compare our methods, evaluate robustness, and replicate our study, we will provide in the supplemental materials the raw MRI volumes produced by Freesurfer, along with the python and R code to calculate scaling factors, leave-one-out calibration, and between-/within-site ICC.

4. Whitwell et al. 2012 found different rates of hippocampal atrophy in the ADNI cohort than in the Mayo Clinic Study of Aging cohort, even though there were no differences in hippocampal volumes between the matched cohorts. This is attributed to sampling of different populations rather than to differences in acquisition parameters. In the text we added:

“The other limitation of this study is that we assumed that subjects across all sites come from the same population, and that stratification occurs solely from systematic errors within each site. In reality, sites may recruit from different populations, and the true disease effect will vary even more. For example, in a comparison between a matched ADNI cohort and a matched Mayo Clinic Study of Aging cohort, researchers found different rates of hippocampal atrophy even though no differences in hippocampal volumes were detected (Whitwell 2012). This could be attributed to sampling from two different populations. This added site-level variability requires a larger site-level sample size; for an example of modeling this, see (Han 2011).”
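The two-sample T-test with unequal variances used for the field-strength comparison in point 2 is Welch's test; a minimal sketch with hypothetical scaling-factor values (not the consortium's data):

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical per-site amygdala scaling factors (illustrative values only)
scale_15t = np.array([0.88, 0.91, 0.89, 0.92])             # 1.5T sites
scale_3t = np.array([1.01, 1.03, 0.99, 1.04, 1.02, 1.00])  # 3T sites

# equal_var=False requests Welch's t-test (unequal variances)
t_stat, p_val = ttest_ind(scale_15t, scale_3t, equal_var=False)
```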

### References

1. Tyrone D. Cannon, Kristin Cadenhead, Barbara Cornblatt, Scott W Woods, Jean Addington, Elaine Walker, Larry J Seidman, Diana Perkins, Ming Tsuang, Thomas McGlashan, others. Reliability of neuroanatomical measurements in a multisite longitudinal study of youth at risk for psychosis. Human Brain Mapping 35, 2424–2434 (2014).

2. Jorge Jovicich, Moira Marizzoni, Roser Sala-Llonch, Beatriz Bosch, David Bartrés-Faz, Jennifer Arnold, Jens Benninghoff, Jens Wiltfang, Luca Roccatagliata, Flavio Nobili, others. Brain morphometry reproducibility in multi-center 3T MRI studies: a comparison of cross-sectional and longitudinal segmentations. Neuroimage 83, 472–484 Elsevier, 2013.

3. Hugo G. Schnack, Neeltje E.M. van Haren, Hilleke E. Hulshoff Pol, Marco Picchioni, Matthias Weisbrod, Heinrich Sauer, Tyrone Cannon, Matti Huttunen, Robin Murray, René S. Kahn. Reliability of brain volumes from multicenter MRI acquisition: A calibration study. Human Brain Mapping 22, 312–320 Wiley-Blackwell, 2004.

4. Christine L Tardif, D Louis Collins, G Bruce Pike. Regional impact of field strength on voxel-based morphometry results. Human brain mapping 31, 943–957 Wiley Online Library, 2010.

5. Christine L Tardif, D Louis Collins, G Bruce Pike. Sensitivity of voxel-based morphometry analysis to choice of imaging protocol at 3 T. Neuroimage 44, 827–838 Elsevier, 2009.

6. Daniel-Paolo Streitbürger, André Pampel, Gunnar Krueger, Jöran Lepsien, Matthias L Schroeter, Karsten Mueller, Harald E Möller. Impact of image acquisition on voxel-based-morphometry investigations of age-related structural brain changes. Neuroimage 87, 170–182 Elsevier, 2014.

7. B. Fischl, D. H. Salat, E. Busa, M. Albert, M. Dieterich, C. Haselgrove, A. van der Kouwe, R. Killiany, D. Kennedy, S. Klaveness, A. Montillo, N. Makris, B. Rosen, A. M. Dale. Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron 33, 341-355 (2002).

8. Randy L Buckner, Denise Head, Jamie Parker, Anthony F Fotenos, Daniel Marcus, John C Morris, Abraham Z Snyder. A unified approach for morphometric and functional data analysis in young, old, and demented adults using automated atlas-based head size normalization: reliability and validation against manual measurement of total intracranial volume. Neuroimage 23, 724–738 Elsevier, 2004.

9. Duane C Boes, FA Graybill, AM Mood. Introduction to the Theory of Statistics. Series in probabili (1974).