Rashmi Kethireddy, supervised by Dr. Suryakanth V G, received her doctorate in Computer Science and Engineering (CSE). Here’s a summary of her research work on Dialect classification and its application to dialectal speech recognition:
The major goal of this thesis is to study dialectal variations and to improve speech recognition performance using embeddings derived from an improved dialect classification system. Initial studies focused on improving the dialect classification system for three major dialects of English (Australian, British, and American).
To improve the dialect classification system, and based on an analysis of dialectal variations, advanced signal processing approaches were proposed for use with the traditional i-vector approach. Dialectal variations exist both within individual frames and across frames. Features that provide high spectral resolution help capture subtle differences between dialects at the frame level. So, this thesis proposed zero-time windowing (ZTW) and single frequency filtering (SFF) based features, which provide high spectral resolution without compromising temporal resolution. Along with frame-level spectral resolution, longer temporal context also contributes to dialect classification. So, approaches that enhance the temporal context of the proposed features (ZTW and SFF), such as delta and double-delta coefficients (∆+∆∆) and shifted delta coefficients (SDCs), were investigated. The dialect classification system gave promising performance with the proposed features when temporal context was provided by ∆+∆∆ and SDCs. Further, signal processing approaches that provide long temporal summarization (across frames), such as frequency domain linear prediction (FDLP), were proposed for dialect classification. Experiments with FDLP-based features showed that this long temporal summarization is advantageous for discriminating dialects. Overall, both the approaches that provide high spectral resolution (ZTW and SFF) and the approach that provides long temporal summarization (FDLP) gave promising performance in dialect classification compared to the commonly used short-time Fourier transform (STFT) based features.
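As an illustration of how ∆+∆∆ and SDCs add temporal context to frame-level features, here is a minimal NumPy sketch. It is not the thesis implementation: the function names `delta` and `shifted_delta`, the regression width of 2, and the SDC settings d=1, P=3, k=7 (a configuration often used in language recognition) are all assumptions, and the random matrix merely stands in for ZTW- or SFF-based features.

```python
import numpy as np


def delta(features, width=2):
    """Regression-based delta coefficients for a (frames, dims) array.

    Edges are handled by repeating the first and last frame.
    The width of 2 is an assumed value, not taken from the thesis.
    """
    num_frames = features.shape[0]
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros(features.shape, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + num_frames]
        behind = padded[width - n : width - n + num_frames]
        out += n * (ahead - behind)
    return out / denom


def shifted_delta(features, d=1, p=3, k=7):
    """Shifted delta coefficients: k delta blocks, each shifted by p frames.

    Block i at frame t is c[t + i*p + d] - c[t + i*p - d].
    d, p and k are assumed values, not taken from the thesis.
    """
    num_frames = features.shape[0]
    pad_after = (k - 1) * p + d
    padded = np.pad(features, ((d, pad_after), (0, 0)), mode="edge")
    blocks = []
    for i in range(k):
        start = d + i * p  # position of frame t + i*p for t = 0
        ahead = padded[start + d : start + d + num_frames]
        behind = padded[start - d : start - d + num_frames]
        blocks.append(ahead - behind)
    return np.concatenate(blocks, axis=1)


# Placeholder for 13-dimensional frame-level features (100 frames).
static = np.random.default_rng(0).standard_normal((100, 13))

# Static + delta + double-delta: 3 x 13 = 39 dimensions per frame.
with_deltas = np.hstack([static, delta(static), delta(delta(static))])

# SDCs: 7 shifted blocks x 13 = 91 dimensions per frame.
sdc = shifted_delta(static)

print(with_deltas.shape)  # (100, 39)
print(sdc.shape)          # (100, 91)
```

With these assumed settings, each ∆ value draws on 5 neighbouring frames, while each SDC vector spans 21 frames, which is why SDCs are said to provide a longer temporal context.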
Further, because deep neural networks have shown promising performance in classification tasks and can model longer temporal context, architectures ranging from simple (CNN) to advanced (TCN, TDNN, and ECAPA-TDNN) that provide different temporal contexts were investigated. The advanced neural network architectures improved the performance of dialect classification. Further, when the best options from both stages were evaluated together, ECAPA-TDNN performed better with the proposed SFF features in classifying dialects.
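To make "different temporal contexts" concrete, the following sketch computes how many input frames one output frame can see in a stack of stride-1 dilated 1-D convolutions, the building block of TDNN-style networks. The layer list is a commonly cited x-vector-style TDNN layout used here only as an assumed example; it is not the configuration reported in the thesis.

```python
def receptive_field(layers):
    """Input frames seen by one output frame of stacked stride-1 1-D convolutions.

    `layers` is a list of (kernel_size, dilation) pairs.
    """
    context = 1
    for kernel_size, dilation in layers:
        context += (kernel_size - 1) * dilation
    return context


# Assumed example: x-vector-style frame-level TDNN layers (kernel_size, dilation).
XVECTOR_STYLE_TDNN = [(5, 1), (3, 2), (3, 3), (1, 1), (1, 1)]

print(receptive_field(XVECTOR_STYLE_TDNN))  # 15
```

Larger kernels or dilations widen this window, which is one way architectures can differ in the temporal context they offer.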
These observations with major dialects of English will be extended to accents/dialects of Indian English to develop automatic identification of the native language (or L1) of speakers.
In most dialectal speech recognition studies, Indian English is treated as a single dialect, even though its speakers have different native languages. However, Indian English varies with the speaker's native language. The literature on dialectal ASR shows that including dialect embeddings improves the performance of dialectal ASR systems.
So, this thesis proposed to investigate embeddings derived from the L1 identification system together with the Indian English ASR system, in order to learn the foreign dialectal variations based on L1. These L1 embeddings will be derived from an improved dialect classification system (developed based on observations with major dialects of English) and will be included along with the Indian English ASR system to improve its performance.
February 2024