Assessing Joint Engagement Between Children With Autism Spectrum
Disorder and Their Parents During Home Intervention Sessions From the
Expressive Language Perspective
Abstract
The World Health Organization (WHO) has instituted the Caregiver Skills
Training (CST) program to assist families of children diagnosed with
Autism Spectrum Disorder (ASD). The Joint Engagement Rating Inventory (JERI)
protocol evaluates participants’ engagement levels within the CST
initiative. Traditionally, JERI assessments rely on retrospective video
analysis by qualified professionals, which incurs substantial labor
costs. This study aims to improve the efficiency of evaluating the
Expressive Language Level and Use (EXLA) criterion within JERI while
remaining consistent with expert human scoring. To this
end, we introduce a multimodal behavioral signal-processing framework
that analyzes both child and caregiver behaviors and offers scoring
recommendations as an alternative to assessment by medical professionals.
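Read at a high level, the framework is a three-stage pipeline, sketched
below with hypothetical placeholder functions; none of these names
belong to the actual implementation.

```python
# Illustrative three-stage skeleton of the framework; every function is
# a hypothetical placeholder, not the actual implementation.
def segment_recording(audio, video):
    """Stage 1: speaker-tagged speech segments (child vs. caregiver)."""
    return []

def extract_features(segments):
    """Stage 2: fused textual + audio + video features for one session."""
    return []

def predict_exla(features):
    """Stage 3: linear-regression estimate of the session's EXLA score."""
    return 0.0

def score_session(audio, video) -> float:
    return predict_exla(extract_features(segment_recording(audio, video)))
```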
Initially, raw audio and video signals are segmented into short
intervals via voice activity detection, speaker diarization, and speaker
age classification; this both eliminates non-speech content and tags
each segment with its speaker.
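As a minimal sketch of this segmentation stage, assuming 16 kHz, 16-bit
mono PCM audio and the open-source webrtcvad package, the snippet below
groups VAD-positive frames into speech spans. The diarization and
age-classification models that then label each span as child or
caregiver are omitted, so this illustrates the idea rather than the
exact implementation.

```python
import webrtcvad  # pip install webrtcvad

def speech_spans(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30):
    """Yield (start_s, end_s) spans of detected speech in 16-bit mono PCM."""
    vad = webrtcvad.Vad(3)             # aggressiveness 0-3; 3 is strictest
    step = int(sample_rate * frame_ms / 1000) * 2  # bytes per VAD frame
    start = None
    for i in range(0, len(pcm) - step + 1, step):
        t = i / 2 / sample_rate        # frame start time in seconds
        if vad.is_speech(pcm[i:i + step], sample_rate):
            if start is None:
                start = t              # speech span begins
        elif start is not None:
            yield start, t             # speech span ended at this frame
            start = None
    if start is not None:
        yield start, len(pcm) / 2 / sample_rate
```

Each resulting span would then be passed to the diarization and
age-classification models to decide whether the child or the caregiver
is speaking.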
Subsequently, we extract a set of audio-visual features comprising our
proposed interpretable hand-crafted textual features, end-to-end audio
embeddings, and end-to-end video embeddings. Finally, these features are
fused at the feature level to train a linear regression model that
predicts the EXLA scores; a minimal sketch of this final stage follows
the results below. Our framework has been
evaluated on the largest in-the-wild database currently available under
the CST program. Experimental results indicate that the proposed system
achieves a Pearson correlation coefficient of 0.713 against expert
ratings, demonstrating performance comparable to that of human experts.
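To make the fusion and scoring stage concrete, the sketch below
concatenates the three feature views per session, fits a linear
regression model, and evaluates agreement via the Pearson correlation
coefficient. The feature dimensionalities and the 1-7 score range are
illustrative assumptions, and the data are synthetic, so the printed
value demonstrates usage only.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 120                                      # synthetic sessions
text_feats  = rng.normal(size=(n, 8))        # hand-crafted textual features
audio_embed = rng.normal(size=(n, 128))      # end-to-end audio embeddings
video_embed = rng.normal(size=(n, 256))      # end-to-end video embeddings
exla = rng.uniform(1.0, 7.0, size=n)         # assumed 1-7 rating scale

# Feature-level fusion: concatenate all views into one vector per session.
X = np.concatenate([text_feats, audio_embed, video_embed], axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, exla, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)
r, _ = pearsonr(y_te, model.predict(X_te))   # agreement with expert ratings
print(f"Pearson r = {r:.3f}")
```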
This approach not only provides immediate feedback to CST participants
but also improves the allocation of scarce professional resources in
less developed regions.