Comparison between raters – reliability measures
The consistency with which the consultants discriminated vocal cord motion was assessed. Discrimination using the 5-category scale was less reliable (κ = 0.52) than with the 3-category scale (κ = 0.68), with both values falling in the fair to good grouping of reliability measures11. Liu et al, assessing paediatric patients, reported a reliability of κ = 0.49 for 3 categories4. Given that nasendoscopy is likely to be more challenging in the paediatric population, and that they too did not use audio, our results seem comparable. Madden et al reported a higher inter-rater reliability of 95%, but they used a binary scale (purposeful vocal fold motion or no purposeful vocal fold motion) and their video data included audio. Nevertheless, Rosow et al, who also included audio and employed a binary scale, reported the reliability of identifying the presence or absence of volitional adduction as only κ=0.3352. However, their assessment was based on stroboscopy, making it difficult to draw any firm comparisons.
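
Some caution is needed when setting a percentage figure beside κ values, because κ corrects for agreement expected by chance whereas raw percentage agreement does not. The following is a minimal sketch of that difference, assuming a two-rater Cohen's κ (the variant used in the studies above is not stated here) and using entirely hypothetical binary ratings that are not taken from any of the cited data sets. The function names `percent_agreement` and `cohen_kappa` are illustrative only.

```python
from collections import Counter


def percent_agreement(a, b):
    """Proportion of items on which the two raters gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Two-rater Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)  # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of 20 clips on a binary scale (illustration only).
rater_1 = ["motion"] * 18 + ["none"] * 2
rater_2 = ["motion"] * 17 + ["none", "motion", "none"]

agreement = percent_agreement(rater_1, rater_2)
kappa = cohen_kappa(rater_1, rater_2)
print(f"percentage agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

In this invented example the raters agree on 90% of clips, yet κ is only about 0.44, because one category dominates and most of the agreement would be expected by chance. If a reported percentage is a raw agreement rate, it may therefore not be directly comparable with the κ values quoted above.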