Statistical analysis
Agreement
Agreement was computed using the ‘percentage agreement’ measure, which gives the percentage of cases in which two or more raters scored identically. To assess inter-rater agreement, two percentage agreement measures were computed: the overall agreement between raters across all categories combined (overall percentage agreement) and the agreement specific to a single category (specific agreement). The purpose of ‘specific agreement’ is to demonstrate objectively whether the clinicians agree more closely when rating cases in some categories than in others (e.g., the fully mobile category as opposed to paresis).
Intra-rater agreement (i.e., test-retest) was also computed for each
consultant over the three sessions using overall percentage agreement.
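As a concrete illustration, the sketch below computes overall percentage agreement (operationalised here as the average pairwise agreement across cases, one common convention) and category-specific agreement from a cases-by-raters matrix. The data, category coding and function names are illustrative assumptions rather than the exact procedure used in the study; intra-rater (test-retest) agreement can be obtained by passing each consultant's cases-by-sessions matrix to the same overall-agreement function.

```python
from itertools import combinations
from statistics import mean

def overall_percentage_agreement(ratings):
    """Average pairwise agreement across all cases, as a percentage.

    ratings: list of cases, each a list with one categorical score per rater
    (or, for test-retest, one score per session for a single rater).
    """
    per_case = [mean(a == b for a, b in combinations(case, 2)) for case in ratings]
    return 100.0 * mean(per_case)

def specific_agreement(ratings, category):
    """Specific agreement for one category: twice the number of rater pairs in
    which both chose `category`, divided by the total number of times the
    category was chosen within those pairs (one common definition)."""
    both, uses = 0, 0
    for case in ratings:
        for a, b in combinations(case, 2):
            both += (a == category) and (b == category)
            uses += (a == category) + (b == category)
    return 100.0 * 2 * both / uses if uses else float("nan")

# Hypothetical example: 4 cases, 3 raters, categories coded as
# 0 = fully mobile, 1 = paresis, 2 = paralysis.
scores = [
    [0, 0, 0],
    [1, 0, 1],
    [2, 2, 2],
    [0, 1, 1],
]
print(overall_percentage_agreement(scores))    # overall agreement (%)
print(specific_agreement(scores, category=0))  # agreement for 'fully mobile'
```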
Reliability
Inter- and intra-rater reliability were calculated using the generalised Fleiss’s kappa 4,6,7 to allow comparison with similar studies reported in the literature. The kappa statistic ranges from 0 to 1, where 0 indicates that agreement is no better than chance and 1 indicates perfect agreement. Values above 0 may be interpreted as representing poor (below 0.40), fair to good (0.40 to 0.75) or excellent (above 0.75) agreement beyond chance. The rating scale was treated as an ordinal scale, and an ordinal weighting scheme was used in the computation of Fleiss’s kappa 4,6.
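For orientation, the sketch below implements one possible ordinal-weighted, multi-rater kappa in the spirit of the generalised Fleiss’s kappa described above: observed agreement is the mean pairwise agreement weight and expected agreement is derived from the pooled category proportions. The linear weighting scheme, category coding and example data are assumptions for illustration; the cited references define the exact estimator used in the analysis.

```python
import numpy as np

def weighted_fleiss_kappa(ratings, n_categories, scheme="linear"):
    """Ordinal-weighted multi-rater kappa (one possible formulation).

    ratings: (n_cases, n_raters) array of ordinal scores coded 0..n_categories-1.
    """
    ratings = np.asarray(ratings)
    n_raters = ratings.shape[1]
    c = n_categories

    # Agreement weights: 1 on the diagonal, decreasing with ordinal distance.
    j, k = np.meshgrid(np.arange(c), np.arange(c), indexing="ij")
    if scheme == "linear":
        w = 1.0 - np.abs(j - k) / (c - 1)
    else:  # quadratic
        w = 1.0 - ((j - k) / (c - 1)) ** 2

    # Observed weighted agreement: mean weight over all rater pairs and cases.
    pair_weights = [
        w[case[a], case[b]]
        for case in ratings
        for a in range(n_raters)
        for b in range(a + 1, n_raters)
    ]
    p_o = np.mean(pair_weights)

    # Expected weighted agreement from the pooled category proportions.
    p = np.bincount(ratings.ravel(), minlength=c) / ratings.size
    p_e = p @ w @ p

    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical example: 5 cases, 6 raters, 3 ordinal categories
# (0 = fully mobile, 1 = paresis, 2 = paralysis).
scores = np.array([
    [0, 0, 0, 0, 1, 0],
    [1, 1, 2, 1, 1, 1],
    [2, 2, 2, 2, 2, 1],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 1, 2, 2, 1],
])
kappa = weighted_fleiss_kappa(scores, n_categories=3)
# Interpretation bands from the text: <0.40 poor, 0.40-0.75 fair to good, >0.75 excellent.
print(round(float(kappa), 3))
```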
For the intra-rater study, we had three sessions (i.e., replicates) per sample, which is appropriate 8,9 since moderately high (>0.60) reliability was expected based on the trend in the literature 1,3,4. Since reliability was expected to be lower in the inter-rater study (as low as 0.33 3), six raters are appropriate 10.