Deep Emotion Recognition using Facial, Speech and Textual Cues: A Survey
Abstract
With the rapid growth of social media and human-computer interaction, perceiving
people's emotional states in videos has become essential for serving them well.
In recent years, a large number of studies have tackled emotion recognition
based on the three most common modalities in videos: face, speech, and text.
Given the lack of review papers concentrating on these three modalities, this
paper surveys studies of emotion recognition using facial, speech, and textual
cues based on deep learning techniques. We first introduce widely accepted
emotion models to clarify how emotion is defined. We then review the state of
the art in unimodal emotion recognition, covering facial expression
recognition, speech emotion recognition, and textual emotion recognition.
For multimodal emotion recognition, we summarize feature-level and
decision-level fusion methods in detail. In addition, we describe the relevant
benchmark datasets, define the evaluation metrics, and report the performance
of state-of-the-art methods from recent years, so that readers can readily
track current research progress. Finally, we discuss potential research
challenges and opportunities as a reference for researchers seeking to advance
work on emotion recognition.