Deep architecture enhancing robustness to noise, adversarial attacks, and cross-corpus setting for speech emotion recognition

Latif, Siddique and Rana, Rajib ORCID: https://orcid.org/0000-0002-0506-2409 and Khalifa, Sara and Jurdak, Raja and Schuller, Bjorn W. (2020) Deep architecture enhancing robustness to noise, adversarial attacks, and cross-corpus setting for speech emotion recognition. In: 21st Annual Conference of the International Speech Communication Association: Cognitive Intelligence for Speech Processing (INTERSPEECH 2020), 25–29 Oct 2020, Shanghai, China.

Text (Published Version): 3190.pdf (Restricted)


Abstract

Speech emotion recognition (SER) systems can achieve high accuracy when the training and test data are identically distributed, but this assumption is frequently violated in practice, and the performance of SER systems plummets under unforeseen data shifts. The design of robust models for accurate SER is challenging, which limits the use of SER in practical applications. In this paper, we propose a deeper neural network architecture wherein we fuse Dense Convolutional Network (DenseNet), Long Short-Term Memory (LSTM), and Highway Network to learn powerful discriminative features which are robust to noise. We also propose data augmentation with our network architecture to further improve the robustness. We comprehensively evaluate the architecture coupled with data augmentation against (1) noise, (2) adversarial attacks, and (3) cross-corpus settings. Our evaluations on the widely used IEMOCAP and MSP-IMPROV datasets show promising results when compared with existing studies and state-of-the-art models.
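
The abstract does not say which data augmentation is used, but the record's keywords list "mixup". Assuming the standard mixup formulation (a convex combination of two training examples and of their one-hot labels, weighted by a value drawn from a Beta distribution), a minimal sketch could look like the following. The function name `mixup_pair`, the `alpha` value, and the toy feature vectors are illustrative assumptions, not details taken from the paper.

```python
import random


def mixup_pair(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two (features, one-hot label) examples, standard mixup style.

    alpha is the Beta distribution parameter; 0.2 is an assumed value,
    not one reported in this record.
    """
    # Mixing weight lam is drawn from Beta(alpha, alpha) and lies in [0, 1].
    lam = rng.betavariate(alpha, alpha)
    # The same weight is applied to the features and to the labels.
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam


# Toy example: two 3-dimensional feature vectors with two emotion classes.
rng = random.Random(0)  # seeded generator so the run is repeatable
x_mix, y_mix, lam = mixup_pair(
    [0.1, 0.4, 0.9], [1.0, 0.0],
    [0.7, 0.2, 0.3], [0.0, 1.0],
    rng=rng,
)
```

Because the labels are blended with the same weight as the features, the mixed label still sums to one and can be used directly as a soft target in a cross-entropy loss.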


Item Type: Conference or Workshop Item (Commonwealth Reporting Category E) (Paper)
Refereed: Yes
Item Status: Live Archive
Additional Information: Copyright © 2020 ISCA.
Faculty/School / Institute/Centre: Current - Faculty of Health, Engineering and Sciences - School of Sciences (6 Sep 2019 -)
Faculty/School / Institute/Centre: Current - Institute for Resilient Regions
Date Deposited: 17 Feb 2021 04:54
Last Modified: 08 Jun 2021 00:30
Uncontrolled Keywords: speech emotion, mixup, data augmentation, convolutional neural networks, DenseNet, highway network.
Fields of Research (2008): 08 Information and Computing Sciences > 0801 Artificial Intelligence and Image Processing > 080105 Expert Systems
08 Information and Computing Sciences > 0801 Artificial Intelligence and Image Processing > 080109 Pattern Recognition and Data Mining
08 Information and Computing Sciences > 0801 Artificial Intelligence and Image Processing > 080107 Natural Language Processing
Fields of Research (2020): 46 INFORMATION AND COMPUTING SCIENCES > 4602 Artificial intelligence > 460212 Speech recognition
46 INFORMATION AND COMPUTING SCIENCES > 4602 Artificial intelligence > 460206 Knowledge representation and reasoning
46 INFORMATION AND COMPUTING SCIENCES > 4602 Artificial intelligence > 460208 Natural language processing
Socio-Economic Objectives (2008): C Society > 92 Health > 9202 Health and Support Services > 920203 Diagnostic Methods
C Society > 92 Health > 9202 Health and Support Services > 920209 Mental Health Services
C Society > 92 Health > 9202 Health and Support Services > 920202 Carer Health
Identification Number or DOI: doi:10.21437/Interspeech.2020-3190
URI: http://eprints.usq.edu.au/id/eprint/41411
