
Anandaswarup Vadapalli 

Anandaswarup Vadapalli, supervised by Dr. Kishore Prahallad, received his doctorate in Computer Science and Engineering (CSE). Here’s a summary of his research work on Phrasing in Text-to-Speech Synthesis:

Spoken utterances have an inherent structure, in the sense that some words group naturally together while others have a notable break or disjuncture between them. This can be described in terms of prosodic phrasing, meaning that a spoken utterance has a prosodic phrase structure, similar to how a written utterance has a syntactic phrase structure.

Phrase breaks in natural speech are important; they are physiologically essential, help emphasize content, and improve the intelligibility of speech. The process of inserting phrase breaks in an utterance is called phrasing. In the context of speech synthesis, phrasing is a crucial step. It breaks long utterances into meaningful units of information and improves the intelligibility of the synthesized speech.

Traditional methods of phrase break prediction have used discrete linguistic resources such as part-of-speech (POS) sequence information to model these breaks. These methods cannot be used for languages where the necessary linguistic resources are not readily available, which has led to unsupervised methods of inducing word representations that can be used as surrogates for POS tags in phrase break prediction. However, these methods are not suitable in the context of Indian languages, which are agglutinative in nature, resulting in large vocabulary sizes. This thesis presents the use of word-terminal syllables (the last syllable of each word) to model phrase breaks. We demonstrate the correlation between these terminal syllables and acoustic breaks found in the speech signal. We use these terminal syllables to build models for phrase break prediction from text in six Indian languages and demonstrate, by means of objective and subjective measures, that these models perform as well as traditional models that use POS sequence information.
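As a rough illustration of the idea (not the thesis implementation), the sketch below represents each word by its final syllable. The naive vowel-group syllabifier and the romanized example words are assumptions made purely for illustration; the thesis works with Indian-language scripts and a proper syllabification scheme.

```python
# Minimal sketch: word-terminal syllables as surrogate features for phrase
# break prediction. The syllabifier here is a crude vowel-group splitter for
# romanized text and is only meant to illustrate the feature extraction step.

import re

def naive_syllables(word):
    """Split a romanized word into rough syllables (consonant onset + vowel group)."""
    syls = re.findall(r"[^aeiou]*[aeiou]+[^aeiou]*?(?=[^aeiou]*[aeiou]|$)", word.lower())
    return syls if syls else [word.lower()]

def terminal_syllable(word):
    """Return the last syllable of a word, used in place of a POS tag."""
    return naive_syllables(word)[-1]

words = ["raama", "vanamu", "ku", "velli", "naadu"]
features = [terminal_syllable(w) for w in words]
print(features)   # each word is reduced to its final syllable
```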

The discrete linguistic representations (such as POS tags, induced POS tags, and word-terminal syllables) used by traditional methods of phrase break prediction require a hard classification of words into a set of predefined discrete classes, which raises issues when there is ambiguity in the linguistic representation of a word. Moreover, such representations do not capture the co-occurrence statistics of words, i.e., they do not take into account the distributional behaviour of words. Both these issues can be addressed by the use of continuous vector representations of words, also known as word embeddings. This thesis presents a neural network architecture that induces task-specific word embeddings for phrase break prediction. We show that our proposed architecture combines feature induction and phrase break prediction in a single framework and thus avoids the two-stage process required when using features derived from Latent Semantic Analysis (LSA). We train our model on audiobook data and show that these task-specific word features are better word representations than those derived using LSA for phrase break prediction.
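A minimal sketch of such a joint model is given below, assuming a PyTorch implementation with an illustrative window-based architecture and dimensions (these are not the exact choices made in the thesis). The key point is that the embedding table is trained together with the break classifier, so the word representations that emerge are specific to the phrase break prediction task.

```python
# Sketch of a network that learns word embeddings jointly with phrase break
# prediction, so feature induction and classification happen in one framework.

import torch
import torch.nn as nn

class BreakPredictor(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, window=5, hidden=128):
        super().__init__()
        # The embedding table is a trainable parameter: after training, its
        # rows are the task-specific word representations.
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(window * emb_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 2),   # break / no-break after the centre word
        )

    def forward(self, word_ids):           # word_ids: (batch, window)
        e = self.emb(word_ids)              # (batch, window, emb_dim)
        return self.mlp(e.flatten(1))       # (batch, 2) logits

model = BreakPredictor(vocab_size=10000)
logits = model(torch.randint(0, 10000, (8, 5)))   # dummy batch of 5-word windows
print(logits.shape)                                # torch.Size([8, 2])
```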

With the advent of deep learning, there have been attempts to apply deep neural networks (DNNs) to phrase break prediction. While DNNs are able to effectively capture dependencies across features, they lack the ability to capture long-term relations that are spread over time. Recurrent neural networks (RNNs), on the other hand, are able to capture long-term temporal relations and are thus better suited for tasks where sequences have to be modeled. This thesis presents the use of RNNs for phrase break prediction, and shows by means of experimental results that they perform better than DNNs.
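The sketch below shows one possible recurrent formulation, assuming PyTorch and a bidirectional LSTM tagger with illustrative dimensions. Unlike a fixed-window DNN, the recurrent layer carries context across the entire utterance when deciding whether each word is followed by a break.

```python
# Sketch of an RNN-based sequence tagger for phrase breaks: every word in the
# utterance receives a break / no-break label, with context shared across the
# whole sequence through the recurrent states.

import torch
import torch.nn as nn

class RNNBreakTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 2)   # per-word break / no-break logits

    def forward(self, word_ids):              # (batch, seq_len)
        h, _ = self.rnn(self.emb(word_ids))   # (batch, seq_len, 2 * hidden)
        return self.out(h)                    # (batch, seq_len, 2)

tagger = RNNBreakTagger(vocab_size=10000)
print(tagger(torch.randint(0, 10000, (4, 20))).shape)   # torch.Size([4, 20, 2])
```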

Traditional techniques for learning word embeddings learn static or context-free word embeddings, meaning that for each word in the vocabulary, one fixed embedding is learnt for all possible contexts in which that word appears in the text corpus. While static embeddings have proved to be effective in NLP applications, they do not take into account the fact that words have different meanings in different contexts. To address this issue, context-sensitive word representations, also called dynamic or contextual word embeddings, were developed. The most popular technique for learning dynamic word embeddings is BERT (Bidirectional Encoder Representations from Transformers). This thesis presents work on phrase break prediction using bidirectional encoder representations learnt by fine-tuning a pretrained BERT model, with an additional token classification layer, on phrase break prediction. We show that representations from this model outperform task-specific static word embeddings learnt using a BLSTM token classification model trained from scratch.
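A sketch of this setup using the Hugging Face transformers library is shown below. The bert-base-uncased checkpoint, the two-label scheme, and the example sentence are assumptions made for illustration; in practice the model is first fine-tuned on phrase-break-labelled text, and subword predictions are mapped back to word-level break decisions.

```python
# Sketch: a pretrained BERT with a token classification head, the standard way
# to frame phrase break prediction as per-token labelling with contextual
# embeddings. The classification head is randomly initialised until fine-tuned.

import torch
from transformers import BertTokenizerFast, BertForTokenClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)    # labels: break / no-break after token

text = "the old man sat by the river and watched the boats"
enc = tokenizer(text, return_tensors="pt")
logits = model(**enc).logits              # (1, num_subword_tokens, 2)
breaks = logits.argmax(-1)                # predicted label per subword token
print(breaks)
```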

Finally, this thesis presents work which attempts to answer the following questions: (a) Is there any utility in incorporating an explicit external phrasing model in an end-to-end TTS system? and (b) How do we evaluate the effectiveness of a phrasing model in an end-to-end TTS system? We perform subjective listening tests and show that incorporating explicit phrasing models in an end-to-end TTS system results in better listening comprehension, especially in the context of children’s story synthesis.

March 2025