Authorea

Does the proposed method offer better multi-centric reliability than other studies? The across-site reliability measures obtained with the proposed calibration do not appear to be placed in perspective with the vast literature on this topic (for example, but not limited to: Wolz et al., 2014; Roche et al., 2014; Jovicich et al., 2013). In particular, the last of these studies reports inter-site ICC measures for many of the same structures reported here, also obtained using FreeSurfer, but with notably higher reliability than the calibrated results in this study:

Structure            Between-site ICC after calibration, this study (Fig. 2)    Jovicich et al., 2013 (Suppl. Table 1)
Lateral ventricle    0.96                                                       0.998
Thalamus             0.78                                                       0.972
Hippocampus          0.88                                                       0.951
Amygdala             0.82                                                       0.939
Caudate              0.92                                                       0.942
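
To make the size of the discrepancy explicit, the per-structure gap between the two ICC columns above can be tabulated. The following is only a minimal sketch: the values are copied from the table, and the variable names (`this_study`, `jovicich_2013`, `gaps`) are illustrative and not taken from either paper.

```python
# Between-site ICC values copied from the comparison table above.
this_study = {
    "Lateral ventricle": 0.96,
    "Thalamus": 0.78,
    "Hippocampus": 0.88,
    "Amygdala": 0.82,
    "Caudate": 0.92,
}
jovicich_2013 = {
    "Lateral ventricle": 0.998,
    "Thalamus": 0.972,
    "Hippocampus": 0.951,
    "Amygdala": 0.939,
    "Caudate": 0.942,
}

# Gap = ICC in Jovicich et al., 2013 minus ICC after calibration in this study.
gaps = {
    name: round(jovicich_2013[name] - this_study[name], 3)
    for name in this_study
}

# List structures from largest to smallest gap.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<18} {gap:.3f}")
```

The gap is largest for the thalamus and amygdala and smallest for the caudate and lateral ventricle, which may help narrow down where the discussion of the differences should focus.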

The authors should discuss potential reasons for these differences, for example in terms of acquisition variability, calibration methods, or segmentation methods (FreeSurfer longitudinal versus cross-sectional pipelines, or other methods).