Statistical analysis
Agreement
Agreement was computed using the 'percentage agreement' measure, which provides the percentage of cases in which two or more raters scored identically. To assess inter-rater agreement, two percentage agreement measures were computed: the overall agreement between raters across all categories combined (overall percentage agreement), and the agreement specific to a single category (specific agreement). The purpose of specific agreement is to show objectively whether the clinicians agree more closely when rating cases in some categories than in others (for example, the fully mobile category as opposed to paresis). Intra-rater agreement (i.e., test-retest) was also computed for each consultant over the three sessions using overall percentage agreement.
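For illustration, the sketch below shows one common way of computing overall and category-specific percentage agreement for multiple raters from counts of agreeing rater pairs per case. The data layout, function names and toy ratings are assumptions made for this example; they are not the computation actually used in this study.

```python
# Minimal sketch: overall and category-specific percentage agreement for
# multiple raters, based on agreeing rater pairs per case (illustrative only).
import numpy as np

def overall_percentage_agreement(ratings: np.ndarray) -> float:
    """Proportion of agreeing rater pairs, averaged over all cases.

    ratings: array of shape (n_cases, n_raters) holding category labels.
    """
    n_cases, n_raters = ratings.shape
    categories = np.unique(ratings)
    # n_ik: number of raters assigning category k to case i
    counts = np.stack([(ratings == k).sum(axis=1) for k in categories], axis=1)
    agreeing_pairs = (counts * (counts - 1)).sum()      # sum_i sum_k n_ik(n_ik - 1)
    total_pairs = n_cases * n_raters * (n_raters - 1)   # all ordered rater pairs
    return agreeing_pairs / total_pairs

def specific_agreement(ratings: np.ndarray, category) -> float:
    """Agreement restricted to one category: among rater pairs in which one
    rater chose `category`, the proportion in which the other rater agreed."""
    n_cases, n_raters = ratings.shape
    n_k = (ratings == category).sum(axis=1)             # per-case count for the category
    return (n_k * (n_k - 1)).sum() / (n_k * (n_raters - 1)).sum()

# Toy example: 4 cases rated by 3 raters on a 3-point mobility scale.
ratings = np.array([[1, 1, 1],
                    [2, 2, 3],
                    [1, 2, 1],
                    [3, 3, 3]])
print(overall_percentage_agreement(ratings))   # overall agreement across categories
print(specific_agreement(ratings, 1))          # agreement specific to category 1
```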
Reliability
Inter- and intra-rater reliability was calculated using the generalised Fleiss's kappa4,6,7 to allow comparison with similar studies reported in the literature. The kappa statistic takes a maximum value of 1, with 0 indicating that raters agree only at the level expected by chance. Values above 0 may be interpreted as representing poor (below 0.40), fair to good (0.40 to 0.75) or excellent (above 0.75) agreement beyond chance. The rating scale was treated as an ordinal scale and an ordinal weighting scheme was used in the computation of Fleiss's kappa4,6.
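As a sketch of what an ordinal-weighted, multi-rater (generalised Fleiss-type) kappa computation can look like, the example below uses linear ordinal weights, w_kl = 1 - |k - l| / (K - 1), which is one common weighting scheme for ordinal scales. The exact weighting scheme and software used in the cited references4,6 may differ, and in practice a dedicated package (such as Gwet's irrCAC for R) would typically be used rather than this hand-rolled version.

```python
# Minimal sketch of a weighted, multi-rater (generalised) Fleiss-type kappa.
# Linear ordinal weights are assumed here; the cited method may differ.
import numpy as np

def weighted_fleiss_kappa(ratings: np.ndarray, n_categories: int) -> float:
    """Ordinal-weighted generalisation of Fleiss's kappa.

    ratings: (n_cases, n_raters) array of integer category labels in 0..K-1.
    """
    n_cases, n_raters = ratings.shape
    K = n_categories
    # Linear ordinal weights: full credit for exact agreement, partial credit
    # for disagreements that are close on the ordinal scale.
    k_idx = np.arange(K)
    weights = 1.0 - np.abs(k_idx[:, None] - k_idx[None, :]) / (K - 1)

    # r_ik: number of raters assigning category k to case i
    r = np.stack([(ratings == k).sum(axis=1) for k in range(K)], axis=1)
    # r*_ik: weighted count crediting categories close to k on the scale
    r_star = r @ weights

    # Weighted observed agreement, averaged over cases.
    p_obs = ((r * (r_star - 1)).sum(axis=1) / (n_raters * (n_raters - 1))).mean()
    # Weighted chance agreement from the marginal category proportions.
    pi = r.sum(axis=0) / (n_cases * n_raters)
    p_chance = pi @ weights @ pi
    return (p_obs - p_chance) / (1 - p_chance)

# Toy example: 5 cases, 6 raters, 3-point ordinal scale (0 = fully mobile, ...).
ratings = np.array([[0, 0, 0, 0, 1, 0],
                    [2, 2, 2, 1, 2, 2],
                    [1, 1, 0, 1, 1, 2],
                    [0, 0, 0, 0, 0, 0],
                    [2, 1, 2, 2, 1, 1]])
print(weighted_fleiss_kappa(ratings, n_categories=3))
```

With identity weights (1 on the diagonal, 0 elsewhere) this reduces to the standard unweighted Fleiss's kappa, which is why it is described as a generalisation.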
For the intra-rater study, we had three sessions (i.e., replicates) per sample, which is appropriate8,9 given that moderately high (>0.60) reliability was expected based on the trend in the literature1,3,4. Because inter-rater reliability was expected to be lower (values as low as 0.33 have been reported3), six raters were considered appropriate10.