November 2022
Aditya Yadavalli received his Master of Science – Dual Degree in Computational Linguistics (CL). His research work was supervised by Dr. Anil Kumar Vuppala. Here’s a summary of his research work on Leveraging Language Relatedness to Improve Language Models in Speech Recognition Systems:
Automatic Speech Recognition (ASR) refers to the task of transcribing a given audio sample into its corresponding words. In recent years, interest in improving ASR systems has grown tremendously as the commercial viability of devices that use them, such as Google Assistant, Alexa, and Siri, has grown exponentially. As is true for most ML-based systems, building robust ASR systems requires large amounts of annotated data. However, publicly available resources for building ASR systems for Indian languages are scarce. To counter this scarcity, we explore how language relatedness can improve the performance of Language Models (LMs) and thereby the ASR systems that use them.

We begin by investigating whether pooling the datasets of closely related Indian languages such as Marathi and Hindi and building a hybrid HMM-based multilingual ASR system can improve the quality of transcriptions, as measured by Character Error Rate (CER) and Word Error Rate (WER). Having found little evidence that such models outperform baseline monolingual ASR systems, we look to increase the lexical overlap between the two languages by using subword-based language models instead of the default choice, word-based language models. We investigate the performance of such models under various conditions and show that they outperform their word-based counterparts by up to 26.35% in CER.

With the increased interest in End-to-End ASR systems, we turn our focus to the role LMs play in them. Previous research has found that the Acoustic Models (AMs) of an ASR system are susceptible to dialect variations within a language, which adversely affects the ASR. To counter this, researchers have proposed building dialect-specific AMs while keeping the LM constant across all dialects. In this thesis, we start by studying how dialect-mismatched LMs hurt the performance of the ASR, considering three regional Telugu dialects: Telangana, Rayalaseema, and Coastal Andhra. We show that dialect variations, which surface in the form of a different lexicon, grammar, and occasionally semantics, can significantly degrade the performance of the LM under mismatched conditions. This degradation carries over to the ASR even when a dialect-specific AM is used, so much so that the mismatched LM actively hinders the ASR, i.e., the system performs worse than one with no LM at all.

Next, we look to remove the need for building dialect-specific models by proposing a multi-dialect ASR system that outperforms the dialect-specific ones. We find that simply pooling the data and training a multi-dialect ASR benefits the low-resource dialect (Rayalaseema) the most. Subsequently, we incorporate dialect-specific information by adding a Dialect ID into the LM of an End-to-End ASR. We train these End-to-End ASR systems under a Multi-Task Learning (MTL) framework, where the primary task is to transcribe the audio and the secondary task is to predict the dialect. Such a model outperforms naive multi-dialect ASRs by up to 8.24% in relative WER.
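The error-rate figures quoted above are standard edit-distance metrics. As a rough illustration (not code from the thesis), here is how WER, CER, and a relative improvement such as the 8.24% figure are typically computed:

```python
# Minimal sketch of WER/CER via Levenshtein (edit) distance, and of the
# "relative improvement" comparison used when quoting numbers like 8.24%.
# This is an illustration of the metrics, not code from the thesis.

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (rolling-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(dp[j] + 1,        # deletion
                      dp[j - 1] + 1,    # insertion
                      prev + (r != h))  # substitution (free if tokens match)
            prev, dp[j] = dp[j], cur
    return dp[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: same computation over characters."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def relative_improvement(baseline, new):
    """Relative gain of a new system over a baseline, e.g. 0.0824 -> 8.24%."""
    return (baseline - new) / baseline
```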
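The subword idea can also be made concrete with a small sketch. Assuming a BPE model trained with the sentencepiece library on pooled Hindi and Marathi text (the file names, vocabulary size, and choice of toolkit are illustrative assumptions, not necessarily what the thesis used), one can compare the vocabulary overlap of the two languages at the word level and at the subword level:

```python
# Illustrative sketch: train one BPE model on pooled Hindi + Marathi text and
# compare lexical overlap at the word level vs. the subword level.
# Paths, vocab size, and toolkit are placeholders/assumptions.
import sentencepiece as spm

# Train a single BPE model on the pooled corpus of both languages.
spm.SentencePieceTrainer.train(
    input="hindi_plus_marathi.txt",   # pooled training text (placeholder path)
    model_prefix="pooled_bpe",
    vocab_size=4000,
    model_type="bpe",
)
sp = spm.SentencePieceProcessor(model_file="pooled_bpe.model")

def vocab(path, tokenize):
    """Set of unique tokens in a text file under a given tokenization."""
    with open(path, encoding="utf-8") as f:
        return {tok for line in f for tok in tokenize(line.strip())}

def overlap(a, b):
    """Jaccard overlap between two vocabularies."""
    return len(a & b) / len(a | b)

word_overlap = overlap(vocab("hindi.txt", str.split),
                       vocab("marathi.txt", str.split))
subword_overlap = overlap(vocab("hindi.txt", lambda s: sp.encode(s, out_type=str)),
                          vocab("marathi.txt", lambda s: sp.encode(s, out_type=str)))

# Shared subword units (roots, suffixes) typically make subword_overlap much
# larger than word_overlap, which is what lets a pooled LM help both languages.
print(word_overlap, subword_overlap)
```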
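Finally, the Multi-Task Learning setup can be sketched as a weighted combination of a primary transcription loss and a secondary dialect-classification loss. The PyTorch code below is a simplified illustration only: it attaches the dialect head to a shared encoder and uses CTC as the transcription loss, whereas the thesis specifically injects the Dialect ID into the LM of the End-to-End ASR; the architecture, loss weight, and dimensions are assumptions rather than the thesis configuration.

```python
# Simplified MTL sketch: primary task = transcription (CTC loss here, one
# common End-to-End ASR choice), secondary task = dialect prediction
# (Telangana, Rayalaseema, Coastal Andhra). All sizes/weights are illustrative.
import torch
import torch.nn as nn

class MultiDialectASR(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab_size=100, num_dialects=3):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.char_head = nn.Linear(hidden, vocab_size)        # primary: transcription
        self.dialect_head = nn.Linear(hidden, num_dialects)   # secondary: dialect ID

    def forward(self, feats):
        enc, _ = self.encoder(feats)                          # (batch, time, hidden)
        char_logits = self.char_head(enc)                     # per-frame token scores
        dialect_logits = self.dialect_head(enc.mean(dim=1))   # utterance-level scores
        return char_logits, dialect_logits

ctc_loss = nn.CTCLoss(blank=0)
dialect_loss = nn.CrossEntropyLoss()
alpha = 0.3  # weight of the secondary (dialect) task; illustrative value

def mtl_loss(model, feats, feat_lens, targets, target_lens, dialect_ids):
    char_logits, dialect_logits = model(feats)
    log_probs = char_logits.log_softmax(-1).transpose(0, 1)  # (time, batch, vocab)
    primary = ctc_loss(log_probs, targets, feat_lens, target_lens)
    secondary = dialect_loss(dialect_logits, dialect_ids)
    return primary + alpha * secondary
```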