Direct modelling of speech emotion from raw speech

Latif, Siddique and Rana, Rajib and Khalifa, Sara and Jurdak, Raja and Epps, Julien (2019) Direct modelling of speech emotion from raw speech. In: 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language (INTERSPEECH 2019), 15-19 Sept 2019, Graz, Austria.


Speech emotion recognition is a challenging task and heavily depends on hand-engineered acoustic features, which are typically crafted to echo human perception of speech signals. However, a filter bank that is designed from perceptual evidence is not always guaranteed to be the best in a statistical modelling framework where the end goal is, for example, emotion classification. This has fuelled the emerging trend of learning representations from raw speech, especially using deep neural networks. In particular, the combination of Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks has gained great traction: LSTMs for their intrinsic ability to learn the contextual information crucial for emotion recognition, and CNNs for their ability to overcome the scalability problem of regular neural networks. In this paper, we show that there are still opportunities to improve the performance of emotion recognition from raw speech by exploiting the properties of CNNs in modelling contextual information. We propose the use of parallel convolutional layers to harness multiple temporal resolutions in the feature extraction block, which is jointly trained with the LSTM-based classification network for the emotion recognition task. Our results suggest that the proposed model can reach the performance of a CNN trained with hand-engineered features on both the IEMOCAP and MSP-IMPROV datasets.
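
The idea of parallel convolutional layers operating at multiple temporal resolutions can be illustrated with a minimal sketch. This is not the authors' implementation: the kernel widths, the averaging kernels (stand-ins for learned weights), and the function names `conv1d_same` and `parallel_conv_features` are all assumptions made for illustration, and the jointly trained LSTM classifier described in the abstract is omitted.

```python
# Hypothetical sketch: parallel 1-D convolutions over a raw waveform at
# several kernel widths (temporal resolutions), stacked per time step.
# Kernel widths and averaging kernels are illustrative assumptions,
# not values taken from the paper.

def conv1d_same(signal, kernel):
    """Zero-padded 1-D cross-correlation; output length equals input length."""
    k = len(kernel)
    pad_left = (k - 1) // 2
    pad_right = k - 1 - pad_left
    padded = [0.0] * pad_left + list(signal) + [0.0] * pad_right
    return [
        sum(padded[t + i] * kernel[i] for i in range(k))
        for t in range(len(signal))
    ]


def parallel_conv_features(signal, kernel_widths=(3, 5, 9)):
    """Run one convolutional branch per kernel width and stack the outputs."""
    branches = []
    for width in kernel_widths:
        kernel = [1.0 / width] * width  # placeholder for learned weights
        branches.append(conv1d_same(signal, kernel))
    # Transpose: one feature vector per time step, one entry per branch.
    return [list(step) for step in zip(*branches)]


waveform = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
features = parallel_conv_features(waveform)
```

Each element of `features` is a per-time-step vector with one value per branch, which is the kind of sequence a recurrent classifier such as an LSTM could consume. In a trained model the kernels would be learned rather than fixed.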

Statistics for USQ ePrint 37149
Item Type: Conference or Workshop Item (Commonwealth Reporting Category E) (Paper)
Refereed: Yes
Item Status: Live Archive
Additional Information: Copyright © 2019 ISCA.
Faculty/School / Institute/Centre: Current - Faculty of Health, Engineering and Sciences - No Department (1 Jul 2013 -)
Date Deposited: 25 Mar 2020 07:29
Last Modified: 08 Jun 2021 00:30
Uncontrolled Keywords: speech emotion, raw speech, convolutional neural networks, long short term memory
Fields of Research (2008): 08 Information and Computing Sciences > 0801 Artificial Intelligence and Image Processing > 080109 Pattern Recognition and Data Mining
09 Engineering > 0906 Electrical and Electronic Engineering > 090609 Signal Processing
Fields of Research (2020): 46 INFORMATION AND COMPUTING SCIENCES > 4602 Artificial intelligence > 460212 Speech recognition
Identification Number or DOI: doi:10.21437/Interspeech.2019-3252
