Towards Discriminative Representation Learning for Speech Emotion Recognition

Towards Discriminative Representation Learning for Speech Emotion Recognition

Runnan Li, Zhiyong Wu, Jia Jia, Yaohua Bu, Sheng Zhao, Helen Meng

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence
Main track. Pages 5060-5066. https://doi.org/10.24963/ijcai.2019/703

In intelligent speech interaction, automatic speech emotion recognition (SER) plays an important role in understanding user intention. While sentimental speech has different speaker characteristics but similar acoustic attributes, one vital challenge in SER is how to learn robust and discriminative representations for emotion inferring. In this paper, inspired by human emotion perception, we propose a novel representation learning component (RLC) for SER system, which is constructed with Multi-head Self-attention and Global Context-aware Attention Long Short-Term Memory Recurrent Neutral Network (GCA-LSTM). With the ability of Multi-head Self-attention mechanism in modeling the element-wise correlative dependencies, RLC can exploit the common patterns of sentimental speech features to enhance emotion-salient information importing in representation learning. By employing GCA-LSTM, RLC can selectively focus on emotion-salient factors with the consideration of entire utterance context, and gradually produce discriminative representation for emotion inferring. Experiments on public emotional benchmark database IEMOCAP and a tremendous realistic interaction database demonstrate the outperformance of the proposed SER framework, with 6.6% to 26.7% relative improvement on unweighted accuracy compared to state-of-the-art techniques.
Keywords:
Natural Language Processing: Speech
Natural Language Processing: Sentiment Analysis and Text Mining