
M Sai Ganesh – Automatic Speech Recognition

Mirishkar Sai Ganesh, supervised by Dr. Anil Kumar V, received his doctorate in Electronics and Communication Engineering (ECE). Here’s a summary of his research work on “Towards Building an Automatic Speech Recognition System in Indian Context Using Deep Learning”:

Automatic Speech Recognition (ASR) systems are increasingly prevalent in our daily lives, powering commercial applications such as Siri, Alexa, and Google Assistant. However, these systems have focused largely on English, leaving a considerable portion of non-English speakers underserved. This is particularly evident in India, a linguistically diverse country where many languages are classified as low-resource for ASR because annotated speech data is scarce. This thesis aims to bridge that gap by enhancing ASR systems for Indian languages using deep learning methodologies.

India is a land of linguistic diversity: approximately 2,000 languages are spoken across the country, of which 23 are officially recognized, and only a few of these have any ASR capability. This is because building an ASR system requires thousands of hours of annotated speech data, a large volume of text, and a lexicon spanning the vocabulary of the language. Achieving a comprehensive presence in the diverse Indian market therefore demands multilingual ASR systems, and ASR for Indian languages must commonly be built in low-resource settings. The linguistic landscape is further complicated by the high prevalence of bilingualism in the Indian population, which leads to frequent code-switching and lexical borrowing between languages; operating ASR systems that can handle code-switching in the Indian context is a considerable challenge. This predicament spurred our research: constructing a large corpus for one language and leveraging its phonetic space across other language families in both monolingual and multilingual ASR scenarios.

The thesis adopts a crowd-sourcing strategy to collect an extensive speech corpus for Telugu. Using this approach, around 2,000 hours of Telugu speech were collected, capturing regional variation through three modes (spontaneous, conversational, and read speech) under various background conditions. This corpus served as the foundation for developing and evaluating neural network architectures tailored to the characteristics of Indian languages. We also explored self-supervised learning to understand and enhance learned representations, fine-tuning them for different language families and data sizes. This approach yielded insights into the shared phonetic space among Indian languages and enabled a multilingual ASR system built on a joint acoustic model. Together, these studies mark a significant stride towards overcoming the challenges of multilingualism in the Indian context, setting a path for more inclusive and effective ASR systems.

The findings presented in this thesis not only contribute towards building efficient and accurate ASR systems for low-resource Indian languages but also underscore the power of deep learning approaches in language technology. It is our hope that this work will motivate and aid further research in this direction, promoting linguistic diversity and broadening access to information and communication technologies for speakers of low-resource languages.
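As an illustration of the self-supervised fine-tuning described above, here is a minimal sketch using the Hugging Face transformers library. The checkpoint (facebook/wav2vec2-xls-r-300m), the toy Telugu grapheme vocabulary, and the sample data are illustrative assumptions, not the exact setup used in the thesis:

```python
# Minimal sketch: CTC fine-tuning of a multilingual self-supervised
# speech model for Telugu. Checkpoint, vocabulary, and sample data are
# illustrative stand-ins, not the thesis's exact configuration.
import json
import torch
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    Wav2Vec2ForCTC,
)

# Toy Telugu grapheme inventory; in practice the label set would be
# derived from the transcripts of the crowd-sourced corpus.
vocab = {"[PAD]": 0, "[UNK]": 1, "|": 2, "న": 3, "మ": 4, "స": 5,
         "్": 6, "క": 7, "ా": 8, "ర": 9, "ం": 10}
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1,
                                             sampling_rate=16_000,
                                             padding_value=0.0,
                                             do_normalize=True,
                                             return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor,
                              tokenizer=tokenizer)

# Pretrained multilingual encoder; a fresh CTC head is initialized on top.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    vocab_size=len(tokenizer),
    pad_token_id=tokenizer.pad_token_id,
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()  # keep the low-level CNN features fixed

# One illustrative training step on a (speech, transcript) pair.
audio = torch.randn(16_000).numpy()  # stand-in for 1 s of 16 kHz speech
transcript = "నమస్కారం"              # stand-in Telugu transcript
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
labels = tokenizer(transcript, return_tensors="pt").input_ids
loss = model(inputs.input_values, labels=labels).loss
loss.backward()  # followed by an optimizer step in a real training loop
```

Freezing the convolutional feature encoder while adapting the transformer layers is the standard recipe for fine-tuning wav2vec 2.0 models on limited data, since the low-level acoustic features transfer well across languages.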

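Evaluation throughout the thesis is reported in word error rate (WER). As a quick reference, here is a self-contained sketch of the metric: the Levenshtein distance between the reference and hypothesis word sequences, normalized by the reference length.

```python
# Word error rate (WER): minimum number of word substitutions,
# deletions, and insertions needed to turn the hypothesis into the
# reference, divided by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one deletion over four reference words gives WER 0.25.
print(wer("speech recognition for telugu", "speech recognition telugu"))
```

In practice, toolkit implementations such as jiwer or Kaldi’s compute-wer are used, but the underlying arithmetic is exactly this edit distance.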
Keywords: Automatic Speech Recognition, Indian Languages, Crowd-Sourced, Low-Resource, Spontaneous Speech, Conversational Speech, Read Speech, Word Error Rate, Self-Supervised Learning, Common Phone Set, Common Label Set, Joint Acoustic Model, Time-Delay Neural Network, Transformer, Conformer, Wav2Vec.

November 2023