K Sai Teja received his Master of Science in Computer Science and Engineering (CSE). His research was supervised by Dr. Vineet Gandhi. Here's a summary of his work, "Towards building controllable Text to Speech systems."
Text-to-speech systems convert any given text into speech, and they play a vital role in making human-computer interaction (HCI) possible. As humans, we don't rely on text (language) alone to communicate; we also use voice, gestures, expressions and other cues to communicate efficiently. In natural language processing, vocabulary and grammar tend to take centre stage, but those elements of speech tell only half the story. The affective prosody of speech provides broader context, gives meaning to words, and keeps listeners engaged. Current HCI systems largely communicate through text, so they lack much of the prosodic information that is crucial in a conversation. For HCI systems to communicate in speech, text-to-speech systems should be able to synthesise speech that is both expressive and controllable.
However, existing text-to-speech systems learn the average variation of the dataset they are trained on, and therefore synthesise samples in a neutral way without much prosodic variation. To this end, we develop a text-to-speech system that can synthesise speech in a given emotion, where the emotion is represented as a tuple of Arousal, Valence and Dominance (AVD) values.
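As an illustration of why a continuous AVD representation allows finer control than discrete labels, the sketch below treats an emotion as a point in a 3D space and interpolates between such points. This is not taken from the thesis; the coordinates and names are purely illustrative.

```python
# Minimal sketch: an emotion as a continuous Arousal-Valence-Dominance (AVD)
# point rather than a discrete class label. All coordinates are illustrative.
from typing import NamedTuple

class AVD(NamedTuple):
    arousal: float    # calm (0.0) -> excited (1.0)
    valence: float    # negative (0.0) -> positive (1.0)
    dominance: float  # submissive (0.0) -> dominant (1.0)

# Hypothetical anchor points for a few discrete emotions; a continuous
# space lets us move between (and beyond) them.
NEUTRAL = AVD(arousal=0.5, valence=0.5, dominance=0.5)
HAPPY   = AVD(arousal=0.8, valence=0.9, dominance=0.6)
SAD     = AVD(arousal=0.2, valence=0.2, dominance=0.3)

def blend(a: AVD, b: AVD, t: float) -> AVD:
    """Linear interpolation, e.g. 'mildly happy' = blend(NEUTRAL, HAPPY, 0.4)."""
    return AVD(*(x + t * (y - x) for x, y in zip(a, b)))
```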
Text-to-speech systems involve many complexities. Training such a system requires clean, noiseless data, and collecting such data is difficult; if the data is noisy, unwanted artefacts appear in the synthesised samples. Training an emotion-based text-to-speech model is considerably harder still, because obtaining emotion-annotated data for the desired speaker is costly and the annotations are highly subjective. Current emotion-based systems can synthesise emotion only with some limitations: (1) emotion controllability comes at the cost of a loss in quality, (2) they support only discrete emotions, which lack finer control, and (3) they cannot be generalised to new speakers without annotated emotion data.
We propose a system that overcomes the above problems by leveraging the large, widely available corpora of noisy speech annotated with emotions. Even though the data is noisy, our technique trains an emotion-based text-to-speech system that can synthesise the desired emotion without any loss of quality in the output. We present a method to control the emotional prosody of text-to-speech (TTS) systems by using phoneme-level intermediate variances/features (pitch, energy, and duration) as levers.
We learn how these variances change with respect to emotion, and we bring finer control to the synthesised speech by using AVD values, which represent emotions in a 3D space. Our proposed method also does not require emotion-annotated data for the target speaker: once trained on the emotion-annotated data, it can be applied to any system that predicts these variances as an intermediate step.
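The variance-lever idea can be sketched roughly as follows. This is only an illustrative sketch, assuming a FastSpeech2-style backbone that exposes phoneme-level pitch, energy and duration predictions; the module and function names are hypothetical and do not come from the thesis.

```python
# Sketch (assumptions, not the thesis implementation): a small module predicts
# offsets to pitch, energy and log-duration from an AVD tuple, and those
# offsets are added to the phoneme-level variance predictions of the backbone.
import torch
import torch.nn as nn

class EmotionVarianceShift(nn.Module):
    """Maps an (arousal, valence, dominance) vector to variance offsets."""
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 3),  # offsets for pitch, energy, log-duration
        )

    def forward(self, avd: torch.Tensor) -> torch.Tensor:
        # avd: (batch, 3) -> offsets: (batch, 3)
        return self.net(avd)

def apply_emotion(pitch, energy, log_dur, avd, shifter: EmotionVarianceShift):
    """Shift the backbone's phoneme-level variance predictions.

    pitch, energy, log_dur: (batch, n_phonemes) predictions from the TTS model.
    avd: (batch, 3) target emotion; offsets broadcast over all phonemes.
    """
    d_pitch, d_energy, d_dur = shifter(avd).unsqueeze(1).unbind(dim=-1)
    return pitch + d_pitch, energy + d_energy, log_dur + d_dur
```

In a setup like this, the shift module would be trained on the noisy emotion-annotated corpus and then plugged into any TTS system whose pipeline includes a variance-prediction step, which is what makes the approach portable to new speakers without emotion labels.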
Through thorough experimental studies, we show that the proposed method improves over the prior art in accurately emulating the desired emotions while retaining the naturalness of speech. We also extend the traditional evaluation on isolated sentences towards a more complete evaluation of HCI systems.
We present a novel experimental setup by replacing an actor with a TTS system in offline and live conversations. The emotion to be rendered is either predicted or manually assigned. The results show that the proposed method is strongly preferred over the state-of-the-art TTS system and adds the much coveted “human touch” in machine dialogue.
May 2023