Comparison between raters – reliability measures
Consistency between the consultants in discriminating vocal cord motion
was assessed. Discrimination using the 5-category scale was less
reliable (κ = 0.52) than with the 3-category scale (κ = 0.68), with both
values falling in the fair to good grouping of reliability measures [11].
Liu et al, when assessing paediatric patients, reported a reliability of
κ = 0.49 for 3 categories [4]. Assuming that nasendoscopy is more
challenging in the paediatric population and that Liu et al also did not
use audio, our results seem comparable. Madden et al reported higher
inter-rater reliability of 95%, but they used a binary scale, i.e.,
purposeful vocal fold motion or no purposeful vocal fold motion, and
their video data included audio. Nevertheless, Rosow et al, who also
included audio and employed a binary scale, reported the reliability of
identifying the presence or absence of volitional adduction as only
κ = 0.3352. However, their assessment was based on stroboscopy, making
it difficult to draw any firm comparisons.