Controlling Prosody in End-to-End TTS: A Case Study on Contrastive Focus Generation

Latif, Siddique and Kim, Inyoung and Calapodescu, Ioan and Besacier, Laurent (2021) Controlling Prosody in End-to-End TTS: A Case Study on Contrastive Focus Generation. In: 25th Conference on Computational Natural Language Learning (CoNLL 2021), 10 Nov - 11 Nov 2021, Punta Cana, Dominican Republic.

Text (Published Version)
2021.conll-1.42.pdf
Available under License Creative Commons Attribution 4.0.

Text (Proceedings Front matter)
2021.conll-Proceedings Front matter.pdf
Available under License Creative Commons Attribution 4.0.


Abstract

While End-to-End Text-to-Speech (TTS) has made significant progress over the past few years, these systems still lack intuitive user controls over prosody. For instance, generating speech with fine-grained prosody control (prosodic prominence, contextually appropriate emotions) is still an open challenge. In this paper, we investigate whether we can control prosody directly from the input text, in order to encode information related to contrastive focus, which emphasizes a specific word that is contrary to the presuppositions of the interlocutor. We build and share a specific dataset for this purpose and show that it allows training a TTS system where this fine-grained prosodic feature can be correctly conveyed using control tokens. Our evaluation compares synthetic and natural utterances and shows that the prosodic patterns of contrastive focus (variations of F0, intensity and duration) can be learnt accurately. Such a milestone is important to allow, for example, smart speakers to be programmatically controlled in terms of output prosody.
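
To make the idea of controlling prosody "directly from the input text" concrete, the following is a minimal sketch of how a focused word might be marked with control tokens before the text is passed to a TTS front end. The token strings `<focus>` and `</focus>`, the function name `mark_contrastive_focus`, and the whitespace tokenization are all assumptions made for illustration; the abstract does not specify the token inventory or preprocessing the authors used.

```python
from typing import List

# Hypothetical control tokens; the abstract does not state the actual ones.
FOCUS_OPEN = "<focus>"
FOCUS_CLOSE = "</focus>"


def mark_contrastive_focus(sentence: str, focus_index: int) -> str:
    """Wrap the word at focus_index (0-based) in focus control tokens.

    Words are split on whitespace, which is a simplifying assumption.
    """
    words: List[str] = sentence.split()
    if not 0 <= focus_index < len(words):
        raise IndexError(
            f"focus_index {focus_index} out of range for {len(words)} words"
        )
    words[focus_index] = f"{FOCUS_OPEN} {words[focus_index]} {FOCUS_CLOSE}"
    return " ".join(words)


if __name__ == "__main__":
    # Contrastive focus on "red", e.g. correcting "Mary bought the blue car?"
    print(mark_contrastive_focus("Mary bought the red car", 3))
```

In such a setup, the same sentence with a different `focus_index` would yield a different training or inference input, which is what lets a model associate the tokens with the prosodic realization of focus.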


Item Type: Conference or Workshop Item (Commonwealth Reporting Category E) (Paper)
Refereed: Yes
Item Status: Live Archive
Faculty/School / Institute/Centre: Historic - Faculty of Health, Engineering and Sciences - School of Sciences (6 Sep 2019 - 31 Dec 2021)
Date Deposited: 08 Mar 2022 06:43
Last Modified: 16 Mar 2022 04:14
Uncontrolled Keywords: End-to-End TTS, fine-grained prosody control, contrastive focus, interrogative/assertive sentences
Fields of Research (2008): 08 Information and Computing Sciences > 0801 Artificial Intelligence and Image Processing > 080109 Pattern Recognition and Data Mining
08 Information and Computing Sciences > 0801 Artificial Intelligence and Image Processing > 080107 Natural Language Processing
Fields of Research (2020): 46 INFORMATION AND COMPUTING SCIENCES > 4602 Artificial intelligence > 460211 Speech production
46 INFORMATION AND COMPUTING SCIENCES > 4602 Artificial intelligence > 460208 Natural language processing
46 INFORMATION AND COMPUTING SCIENCES > 4611 Machine learning > 461104 Neural networks
46 INFORMATION AND COMPUTING SCIENCES > 4611 Machine learning > 461103 Deep learning
Identification Number or DOI: doi:10.18653/v1/2021.conll-1.42
URI: http://eprints.usq.edu.au/id/eprint/45595
