Vishnu Vidyadhara Raju received his doctorate in Electronics and Communication Engineering. His research work was supervised by Dr. Anil Kumar Vuppala. Here’s a summary of Vishnu’s thesis, Towards Building a Robust Telugu ASR System for Emotional Speech, as explained by him:
The performance of an automatic speech recognition (ASR) system degrades when there is a mismatch between its training and operating environments. Expressive (emotional) speech is one such mismatch, as the majority of ASR systems are trained on neutral speech. The emotional state of the speaker changes the characteristics of the speech and affects the ASR system in practical scenarios. The goal of this thesis is to improve the performance of ASR systems under these emotional conditions. The key challenge in addressing this research problem is the lack of resources: the existing emotional databases are limited in size and in the number of speakers.
The main focus of this thesis is to create the infrastructure required to study this challenging problem for the low-resource Telugu language and to present different exploratory studies that evaluate the accuracy of Telugu ASR systems. The thesis investigates several techniques, applied at various stages of the recognition process, that are suitable for building an emotionally robust ASR system.
In the first study, prosody modification is employed at the pre-processing level of the speech recognizer. Model-based and feature-space adaptation approaches are also analyzed as ways to improve ASR systems. These emotion adaptation strategies were studied using various deep neural network (DNN) architectures and were shown to be effective in comparison with baseline Gaussian mixture models (GMMs). All the experiments are conducted using the IIT Kharagpur simulated emotion speech corpus (IITKGP-SESC) and the IIIT-Hyderabad Telugu naturalistic emotional speech corpus (IIIT-H TNESC).
Some major conclusions from the work are:
- Prosody parameters such as pitch, duration and energy play a vital role in the analysis of emotional speech. In the first study, prosody modification is used at the pre-processing level to convert emotional speech to neutral speech. Feeding this prosody-modified speech to the recognizer improved the performance of the ASR system.
- In the second study, prosody modification is used to create training data by converting the given neutral speech to emotional speech. Different ASR models were built from the generated emotional speech along with the existing neutral speech. Prosody modification alone was not enough to improve the performance of the ASR system. Hence, an emotion recognition block was also routed to the ASR system.
- Prosody-modified speech yielded lower performance than directly adapted emotional speech. Hence, the third study explores the feasibility of extending the emotion adaptation algorithms of GMM-HMM acoustic models to DNN-based models, using model-based and feature-space adaptation approaches. The best performance is observed for TDNN-based acoustic models, which use utterance-level decisions as their objective function instead of standard frame-level decisions.
- Feature-space adaptation strategies performed better than model-space adaptation techniques. The auxiliary features appended to the conventional MFCCs contain emotion-specific information, which helps the ASR system handle emotional speech better. Model-based adaptation could have performed well had sufficient emotional data been provided for adaptation. fMLLR-based adaptation was more effective than MAP adaptation in handling the emotion-specific information when building the ASR systems.
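
The difference between the two adaptation families in the last conclusion can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes the textbook forms of the two techniques: fMLLR applies an affine transform x' = Ax + b to every feature frame, while MAP adaptation pulls each Gaussian mean towards the adaptation data with a weight that grows with the amount of data. The function names, the relevance factor `tau`, and the auxiliary feature values are hypothetical. In a real system, A and b would be estimated by maximum likelihood against the acoustic model rather than supplied by hand.

```python
from typing import List, Sequence


def apply_fmllr(frame: Sequence[float],
                A: Sequence[Sequence[float]],
                b: Sequence[float]) -> List[float]:
    """Feature-space adaptation: return A @ frame + b for one feature frame.

    A and b are taken as given here (assumption); estimating them is the
    actual fMLLR training step and is not shown.
    """
    return [sum(a * x for a, x in zip(row, frame)) + bias
            for row, bias in zip(A, b)]


def append_auxiliary(mfcc: Sequence[float],
                     aux: Sequence[float]) -> List[float]:
    """Append auxiliary features to an MFCC frame.

    The thesis summary does not say which auxiliary features were used,
    so `aux` is a placeholder.
    """
    return list(mfcc) + list(aux)


def map_adapt_mean(prior_mean: Sequence[float],
                   frames: Sequence[Sequence[float]],
                   posteriors: Sequence[float],
                   tau: float = 10.0) -> List[float]:
    """Model-space adaptation: MAP update of a single Gaussian mean.

    new_mean = (tau * prior_mean + sum_t gamma_t * x_t) / (tau + sum_t gamma_t)
    where gamma_t is the posterior of this Gaussian for frame t and tau is
    a hypothetical relevance factor.
    """
    occupancy = sum(posteriors)
    dim = len(prior_mean)
    weighted = [sum(g * f[d] for g, f in zip(posteriors, frames))
                for d in range(dim)]
    return [(tau * prior_mean[d] + weighted[d]) / (tau + occupancy)
            for d in range(dim)]


if __name__ == "__main__":
    # With only two adaptation frames the MAP mean barely leaves the prior;
    # with two hundred it moves most of the way to the data.
    few = map_adapt_mean([0.0], [[1.0]] * 2, [1.0] * 2)
    many = map_adapt_mean([0.0], [[1.0]] * 200, [1.0] * 200)
    print(few, many)
```

Under these assumptions the MAP update depends directly on how many emotional frames are available, which is consistent with the remark that model-based adaptation could have performed well given sufficient emotional data. The fMLLR transform is a single small set of parameters shared by all frames.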