Abstract
Speech emotion recognition plays an important role in many applications, but the task is challenging due to factors such as background noise and speaker-dependent speech characteristics. The well-known speech emotion recognition system ACRNN uses a CNN to extract local features of speech signals and an attention mechanism to focus on emotionally salient segments. However, it cannot capture long-term global information, and because it uses only a single attention module, it cannot jointly attend to information from different representation subspaces at different positions. To address these drawbacks of ACRNN, this letter proposes CoRNN, which replaces the CNN and attention modules with a Conformer. Experimental results on the IEMOCAP dataset show that the proposed CoRNN achieves an unweighted average recall of 65.53%, an improvement of 0.79% over ACRNN.