This important point was also raised by Reviewer 2, and we have clarified the overall goal of this project. We restate our response here. The goal of this project was not to claim that scanning 12 phantom subjects is cost-effective. Rather, the goal was to measure MRI-related biases when systems are not standardized, and then to examine how these biases can be overcome with appropriate sample sizes instead of a costly calibration method or harmonization (in the case of retrospective data). This approach also gives sites the freedom to upgrade hardware or software, or even to change sequences, during a study, which may be an incentive for sites to contribute data even when they receive little financial support. The phantom calibration aspect has been minimized, and our statistical model, which accounts for MRI-related biases, has been emphasized. The measurements of that bias (which were estimated via calibration) remain an important part of this study because they validate the scaling assumption of the statistical model and provide researchers with values to plug into the power equation. Our framework provides an alternative to ADNI harmonization rather than a strict improvement on it. The human phantom calibration showed that the overall absolute agreement between sites improves to the same level as that achieved by ADNI-type harmonization. Our results are compared with other harmonization efforts in the manuscript and in the following response.

Our sites have used sequences that are similar to the vendor-provided T1 sequences, and \cite{jovicich2013brain} found that high multicenter reliability can be achieved using these standard vendor sequences with very little calibration effort. However, many of the sites in our consortium are in the middle of their own longitudinal studies and are hesitant to make even very small protocol changes, despite the result from \cite{jovicich2013brain}, which was obtained for the longitudinal processing stream. Our statistical model, in contrast, addresses a cross-sectional design, and evaluating the scaling bias is important for optimizing sample sizes in the cross-sectional case.