Authorea

        Figure 2 shows that total intracranial volume (TIV) has the lowest agreement between sites even after calibration, followed by the thalamus (which is consistent with the findings in Schnack 2010). The TIV result is surprising: this ROI encompasses all the others, so averaging over a larger volume should reduce errors. Tables 1 and 2 also show that the test-retest variability of TIV was particularly low in three sites (4, 6, and 11). Perhaps this step of the procedure failed in several scans originating from these sites, and those scans have to be excluded from the analysis?