November 2022
Kirandevraj R received his Master of Science – Dual Degree in Computer Science and Engineering (CSE). His research work was supervised by Prof. C V Jawahar. Here’s a summary of his research work on Open-Vocabulary Audio Keyword Spotting with Low Resource Language Adaptation:
Query-by-example Spoken Term Detection is the problem of retrieving, from an audio archive, the documents that contain a spoken query provided by a user. It is a zero-shot task: no query-specific training or lexical information is required to represent the spoken query, so it enables keyword search on speech without requiring a full speech recognition system. State-of-the-art solutions typically rely on Dynamic Time Warping (DTW) based template matching over phone posterior features estimated by Deep Neural Networks (DNNs).
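To make the DTW baseline concrete, here is a minimal pure-Python sketch of template matching between a spoken query and a document, each represented as a list of per-frame phone posterior vectors. The function names and the choice of cosine frame distance are illustrative assumptions, not details from the thesis:

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity between two posterior frames
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def dtw_cost(query, doc):
    """Length-normalized DTW alignment cost between a query and a
    document, both given as lists of posterior vectors (one per frame).
    Lower cost means a better match."""
    n, m = len(query), len(doc)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(query[i - 1], doc[j - 1])
            # Classic DTW recursion: extend the cheapest of the three
            # predecessor alignments (insertion, deletion, match).
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m] / (n + m)
```

A retrieval system would score every archive document against the query this way and return the lowest-cost hits.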
In this thesis, we explore the use of an Automatic Speech Recognition (ASR) system for keyword spotting. We demonstrate that the intermediate representation of an ASR model can be used for open-vocabulary keyword spotting. Building on this, we show the effectiveness of the Connectionist Temporal Classification (CTC) loss for learning word embeddings for keyword spotting, and propose a novel method that combines the CTC loss with the traditional triplet loss, evaluated on the TIMIT English audio dataset. This method achieves an Average Precision (AP) of 0.843 over 344 words unseen by the model during training on TIMIT. In contrast, the Multi-View recurrent method, which learns jointly on text and acoustic embeddings, achieves only 0.218 on out-of-vocabulary words.
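The two training objectives can be sketched in pure Python as a toy illustration; this is not the thesis implementation, and the function names and margin are assumptions. The first function is the standard CTC forward-algorithm negative log-likelihood for a single utterance; the second is a Euclidean triplet loss over word embeddings. In training, the two would be summed (possibly with a weighting factor) into one objective:

```python
import math

def ctc_neg_log_likelihood(log_probs, labels, blank=0):
    """CTC loss for one sequence. log_probs: T x V per-frame
    log-probabilities; labels: target label indices (no blanks)."""
    # Extended label sequence with blanks: b l1 b l2 b ... b
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S, T = len(ext), len(log_probs)
    NEG = -float("inf")
    alpha = [NEG] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, T):
        new = [NEG] * S
        for s in range(S):
            # Allowed transitions: stay, advance by one, or skip a
            # blank between two different labels.
            cands = [alpha[s]]
            if s > 0:
                cands.append(alpha[s - 1])
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])
            best = max(cands)
            if best == NEG:
                continue
            total = sum(math.exp(c - best) for c in cands if c != NEG)
            new[s] = log_probs[t][ext[s]] + best + math.log(total)
        alpha = new
    # Valid paths end on the last label or the trailing blank.
    finals = [alpha[S - 1]] + ([alpha[S - 2]] if S > 1 else [])
    best = max(finals)
    ll = best + math.log(sum(math.exp(f - best) for f in finals))
    return -ll

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Pull same-word embeddings together, push different words apart."""
    d = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return max(0.0, margin + d(anchor, positive) - d(anchor, negative))
```

The CTC term ties the embedding network to the word's phonetic content, while the triplet term shapes the embedding space so that nearest-neighbour search retrieves matching keywords.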
We propose a novel method to generalize our approach to the low-resource languages Tamil, Vallader, and Hausa. We use transliteration to convert Tamil script to the English alphabet, so that Tamil words are written with English letters that approximate their pronunciation. The model is then trained with the CTC and triplet loss functions to predict the transliterated text for input Tamil audio. We show that this method helps transfer the knowledge learned from a high-resource language (English) to a low-resource language (Tamil).
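To illustrate the transliteration step, here is a toy greedy longest-match transliterator. The character table below is a tiny hypothetical subset chosen for illustration; it is not the mapping scheme used in the thesis:

```python
# Hypothetical grapheme-to-Latin table (illustrative subset only).
# Two-character keys pair a consonant with the pulli (virama) mark,
# which suppresses the inherent vowel.
TAMIL_TO_LATIN = {
    "வ": "va", "ண": "na", "க": "ka", "க்": "k",
    "ம": "ma", "ம்": "m", "த": "tha", "த்": "th",
}

def transliterate(word, table=TAMIL_TO_LATIN, max_key_len=2):
    """Greedy longest-match transliteration of a Tamil word."""
    out, i = [], 0
    while i < len(word):
        for k in range(max_key_len, 0, -1):  # prefer longer graphemes
            piece = word[i:i + k]
            if piece in table:
                out.append(table[piece])
                i += k
                break
        else:
            out.append(word[i])  # pass through unmapped characters
            i += 1
    return "".join(out)
```

For example, `transliterate("வணக்கம்")` yields `"vanakkam"`, a Latin-script target that an English-pretrained CTC model can predict directly.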
We further reduce the model size so that it can run in small-footprint scenarios such as mobile phones. To this end, we explore several knowledge distillation loss functions: MSE, KL divergence, and cosine embedding loss. We observe that a small-footprint ASR representation is competitive with knowledge distillation methods for small-footprint keyword spotting.
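The three distillation objectives compare student and teacher outputs in different ways. A minimal pure-Python sketch of the standard definitions follows; the function names and the temperature value are assumptions, not the thesis's exact configuration:

```python
import math

def mse_loss(student, teacher):
    # Mean squared error between student and teacher feature vectors.
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

def softmax(logits, T=1.0):
    # Temperature-softened softmax (higher T -> softer distribution).
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_div_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) over temperature-softened posteriors.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def cosine_embedding_loss(student, teacher):
    # 1 - cosine similarity: matches the direction of the embeddings
    # while ignoring their magnitudes.
    dot = sum(s * t for s, t in zip(student, teacher))
    ns = math.sqrt(sum(s * s for s in student))
    nt = math.sqrt(sum(t * t for t in teacher))
    return 1.0 - dot / (ns * nt)
```

MSE and cosine embedding losses operate on intermediate representations, while the KL term matches the teacher's output distribution.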
This methodology makes use of existing ASR networks trained on massive datasets and converts them into open-vocabulary keyword spotting systems that also generalize to low-resource languages.