Empathic Machines: Using Intermediate Features as Levers to Emulate Emotions in Text-To-Speech Systems

We present a method to control the emotional prosody of Text-to-Speech (TTS) systems by using phoneme-level intermediate features (pitch, energy, and duration) as levers. As a key idea, we propose Differential Scaling (DS) to disentangle features relating to affective prosody from those arising due to acoustic conditions and speaker identity. With thorough experimental studies, we show that the proposed method improves over the prior art in accurately emulating the desired emotions while retaining the naturalness of speech. We extend the traditional evaluation on individual sentences to a more complete evaluation of HCI systems. We present a novel experimental setup by replacing an actor with a TTS system in offline and live conversations. The emotion to be rendered is either predicted or manually assigned. The results show that the proposed method is strongly preferred over the state-of-the-art TTS system and adds the much-coveted "human touch" to machine dialogue.
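
To make the key idea concrete, here is a minimal sketch of differential scaling over the three phoneme-level levers: the speaker's neutral baseline carries identity and acoustic conditions, and only the deviation from it is scaled per emotion. The function names and the per-emotion scale values below are illustrative assumptions, not the paper's learned parameters.

```python
import numpy as np

# Hypothetical per-emotion scaling factors for (pitch, energy, duration);
# the actual scales in the paper are derived from data.
EMOTION_SCALES = {
    "neutral": (1.0, 1.0, 1.0),
    "angry":   (1.3, 1.4, 0.9),
    "sad":     (0.8, 0.7, 1.2),
    "happy":   (1.2, 1.2, 0.95),
}

def differential_scale(feature, baseline, scale):
    """Scale only the deviation of a phoneme-level feature from the
    speaker's neutral baseline, so speaker identity and recording
    conditions (carried by the baseline) stay untouched."""
    return baseline + scale * (feature - baseline)

def emotize(pitch, energy, duration, baselines, emotion):
    """Apply emotion-dependent differential scaling to the three
    phoneme-level prosody levers."""
    s_p, s_e, s_d = EMOTION_SCALES[emotion]
    return (
        differential_scale(pitch, baselines["pitch"], s_p),
        differential_scale(energy, baselines["energy"], s_e),
        differential_scale(duration, baselines["duration"], s_d),
    )

# Toy example: five phonemes with arbitrary feature values.
pitch = np.array([180.0, 210.0, 190.0, 240.0, 170.0])   # Hz
energy = np.array([0.6, 0.8, 0.7, 0.9, 0.5])            # normalized
duration = np.array([80.0, 60.0, 90.0, 70.0, 110.0])    # ms
baselines = {"pitch": 190.0, "energy": 0.7, "duration": 85.0}
print(emotize(pitch, energy, duration, baselines, "angry"))
```

Scaling the deviation rather than the raw feature is what keeps the baseline voice intact while exaggerating or damping the affective movement around it.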


Note: The samples may take a little time to load; please wait for them to load.


1. Architectures: FastSpeech2, Fcl-Taco2, and the proposed method with the Emotional Variation Adaptor (EVA) and Differential Scaling (DS)

[Architecture diagrams of the compared models]
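
For orientation, a minimal PyTorch sketch of an emotion-conditioned variance adaptor in the spirit of EVA, predicting the phoneme-level pitch, energy, and duration levers from encoder states and an emotion label. The layer sizes, the emotion embedding, and the module layout are our assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class EmotionalVariationAdaptor(nn.Module):
    """Sketch of an emotion-conditioned variance adaptor: predicts
    phoneme-level pitch, energy, and duration from encoder states,
    conditioned on an emotion embedding. Dimensions are illustrative."""

    def __init__(self, hidden=256, n_emotions=5, emo_dim=64):
        super().__init__()
        self.emotion_emb = nn.Embedding(n_emotions, emo_dim)
        self.proj = nn.Linear(hidden + emo_dim, hidden)
        # One small regressor per prosody lever.
        self.pitch_head = nn.Linear(hidden, 1)
        self.energy_head = nn.Linear(hidden, 1)
        self.duration_head = nn.Linear(hidden, 1)

    def forward(self, encoder_out, emotion_id):
        # encoder_out: (batch, n_phonemes, hidden); emotion_id: (batch,)
        emo = self.emotion_emb(emotion_id)
        emo = emo.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
        h = torch.relu(self.proj(torch.cat([encoder_out, emo], dim=-1)))
        return (self.pitch_head(h).squeeze(-1),
                self.energy_head(h).squeeze(-1),
                self.duration_head(h).squeeze(-1))

# Example: batch of 2 utterances, 40 phonemes each, emotion ids in 0..4.
enc = torch.randn(2, 40, 256)
emo = torch.tensor([1, 3])
pitch, energy, duration = EmotionalVariationAdaptor()(enc, emo)
```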

2. Demo of TTS with various models

The audio samples shown here are generated with various models: each column corresponds to a model and each row to a single text. We compare with the following prior art: FastSpeech2; Fcl-Taco2; Cai et al. (TTS1), "Emotion controllable speech synthesis using emotion-unlabeled dataset with the assistance of cross-domain speech emotion recognition"; and Sivaprasad et al. (TTS2), "Emotional prosody control for speech generation". These samples are intended only for evaluating the naturalness of the synthesized speech. For the emotive models, the samples are shown with a randomly picked emotion.
Samples | FastSpeech2 | Fcl-Taco2 | FastSpeech2π | TTS1 | TTS2 | Fcl-Taco2 + DS (our model) | FastSpeech2π + DS (our model)
"A friendly waiter taught me a few words of Italian."
"I visited museums and sat in public gardens."
"Shattered windows and the sound of drums."
"He thought it was time to present the present."

3. Dialogues with and without emotion.

Some famous movie dialogues generated with FastSpeech2π (without emotion) and FastSpeech2π + DS (with emotion).
Dialogue 1: Without emotion | With emotion
Dialogue 2: Without emotion | With emotion
Dialogue 3: Without emotion | With emotion
Dialogue 4: Without emotion | With emotion

4. Emotion control in Arousal (Y-axis), Valence (X-axis) space with constant Dominance.

1. Remember what she said in my last letter.
[Interactive grid of samples: Valence from -3 to +3 on the X-axis, Arousal from -3 to +3 on the Y-axis, Dominance held constant]
2. There is no place like home.
[Interactive grid of samples: Valence from -3 to +3 on the X-axis, Arousal from -3 to +3 on the Y-axis, Dominance held constant]
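
The grids above can be described programmatically. Below is a toy sketch that enumerates the same 7×7 valence-arousal sweep at constant dominance; the avd_to_scales mapping is a hypothetical linear rule purely for illustration, since the paper's actual mapping from emotion coordinates to prosody is learned from data.

```python
def avd_to_scales(valence, arousal, dominance):
    """Toy linear mapping from the demo's AVD coordinates to
    (pitch, energy, duration) scaling factors. Illustrative only;
    the real mapping is learned, not hand-set like this."""
    pitch_scale = 1.0 + 0.10 * arousal
    energy_scale = 1.0 + 0.10 * arousal + 0.05 * dominance
    duration_scale = 1.0 - 0.05 * valence  # e.g. low valence -> slower speech
    return pitch_scale, energy_scale, duration_scale

# Enumerate the 7x7 grid shown above, with dominance held constant at 0.
for arousal in range(3, -4, -1):   # Y-axis, top row first
    for valence in range(-3, 4):   # X-axis, left to right
        p, e, d = avd_to_scales(valence, arousal, dominance=0)
        print(f"V={valence:+d} A={arousal:+d} -> pitch x{p:.2f}, "
              f"energy x{e:.2f}, duration x{d:.2f}")
```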

5. Demo of emotion control in various models

We compare the emotion control of our models with that of the existing models.
Emotions | TTS1 | TTS2 | Fcl-Taco2 + DS | FastSpeech2π + DS
Angry
Fear (no samples available for TTS1 and TTS2)
Sad
Happy
Neutral

6. Demo of the theatre experiment.

We present theatre conversations in which one of the actors (female) is replaced with a TTS system, as described in Section 5.2 of the paper. We use FastSpeech2π as our baseline. We compare conversations generated with FastSpeech2π, with our model using emotions predicted by an Emotion Recognition in Conversations (ERC) model, and with our model using emotions hand-picked by a senior theatre director. A sketch of this pipeline follows the table below.
Dialogues | FastSpeech2π | Our model + predicted emotions | Our model + hand-picked emotions
Death of a Salesman
Hell
Speed-the-Plow
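
For completeness, a sketch of the pipeline behind the second and third columns. Here erc_predict and synthesize are hypothetical stand-ins for the ERC model and the emotional TTS system; neither is a released API, and the stub bodies exist only to make the sketch runnable.

```python
def erc_predict(history):
    """Hypothetical stand-in for the Emotion Recognition in
    Conversations (ERC) model; returns a fixed label here."""
    return "neutral"

def synthesize(line, emotion):
    """Hypothetical stand-in for the emotional TTS system
    (FastSpeech2π + DS)."""
    return f"<audio: {line!r} rendered as {emotion}>"

def render_conversation(turns, director_emotions=None):
    """Replace the female actor's turns with TTS output. Emotions come
    either from the ERC model run on the dialogue history so far, or
    from hand-picked labels supplied by the theatre director."""
    history, clips = [], []
    for i, (speaker, line) in enumerate(turns):
        history.append((speaker, line))
        if speaker != "actor_female":
            continue  # the human actor's turns are kept as performed
        if director_emotions is not None:
            emotion = director_emotions[i]   # hand-picked label
        else:
            emotion = erc_predict(history)   # predicted from context
        clips.append(synthesize(line, emotion))
    return clips

# Example with a two-turn exchange.
turns = [("actor_male", "Why did you do it?"),
         ("actor_female", "Because I had no choice.")]
print(render_conversation(turns))                            # predicted
print(render_conversation(turns, director_emotions={1: "sad"}))  # hand-picked
```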