Chayan Kochar, supervised by Prof. Dipti Mishra Sharma, received his Master of Science – Dual Degree in Computational Linguistics (CL). Here is a summary of his research work, "Towards Disfluency Identification in Indian Languages":
In the natural course of spoken language, individuals often pause to think and correct themselves during speech production. These interruptions and corrections are commonly referred to as disfluencies. NLP models, usually trained on written text, can struggle with such data, i.e. audio transcripts of conversational speech. When performing downstream NLP tasks such as Machine Translation, we must be aware of the presence of such disfluencies in the training data. These elements can be systematically removed, or handled as required, to enhance data quality, ultimately improving the output for a given task.

Disfluencies have been the subject of research for quite some time. However, most of that work covers English, Chinese and European languages; little has been done in this direction for Indian languages. In this thesis, we present a comprehensive study of disfluencies, focusing on the following Indian languages: Hindi, Bengali, Marathi, Telugu, Kannada and Tamil. We carry out some preliminary tasks that help us understand the intricacies of disfluencies in detail, showcasing how they can affect Machine Translation, Question Answering, etc. We initially studied this on English data, gradually expanding our research focus towards Indian languages.

The unavailability of labelled data was the first challenge we encountered while working on disfluency in Indian languages. To address this, we developed guidelines for annotating disfluencies in Indian languages. These annotation guidelines aim to establish a clear and uniform procedure for generating labelled data, thus improving its quality. The guidelines were used to prepare manually annotated disfluency data. However, producing large amounts of such data manually is time-consuming and labour-intensive. To address this labelled-data scarcity problem, we worked on an algorithm to synthetically generate disfluencies.
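To make the idea of synthetic disfluency generation concrete, here is a minimal sketch in Python. It is purely illustrative and is not the thesis's actual algorithm: the function name, the romanized Hindi filler words, and the probabilities are all assumptions introduced for this example. It injects two common disfluency types, filled pauses and word repetitions, into a fluent token sequence, emitting a parallel label sequence (1 = injected disfluency) that can serve as training data for a classifier.

```python
import random

# Hypothetical romanized Hindi filler words, chosen only for illustration.
FILLERS_HI = ["matlab", "woh", "accha"]

def inject_disfluencies(tokens, fillers, p_filler=0.1, p_repeat=0.1, seed=0):
    """Illustrative sketch (not the thesis algorithm): inject filled pauses
    and word repetitions into a fluent token list.

    Returns (noisy_tokens, labels) where labels[i] == 1 marks a token
    that was injected as a disfluency, and 0 marks an original token.
    """
    rng = random.Random(seed)  # fixed seed keeps the corpus reproducible
    out, labels = [], []
    for tok in tokens:
        if rng.random() < p_filler:        # insert a filled pause before the word
            out.append(rng.choice(fillers))
            labels.append(1)
        out.append(tok)                    # the original fluent token
        labels.append(0)
        if rng.random() < p_repeat:        # repeat the word (repetition disfluency)
            out.append(tok)
            labels.append(1)
    return out, labels
```

Stripping every token labelled 1 recovers the original fluent sentence exactly, which is what makes such synthetic pairs usable as supervision for a disfluency-removal or identification model.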
This algorithm aims to facilitate more effective model training for the identification of disfluencies in real-world conversations, thereby contributing to the advancement of disfluency research in Indian languages. It also attempts to account for the code-mixed nature of the Indian language–English setting. We combine the manually annotated data with the synthetically generated data to conduct multiple disfluency classification experiments. We trained models for disfluency identification in all of the above-mentioned languages, with models trained on each language separately as well as a variety of multilingual models.
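Disfluency identification is commonly framed as token-level sequence labelling, and such classification experiments are typically scored with token-level precision, recall and F1 over the disfluent class. The following is a hedged sketch of that scoring, under assumptions of this example rather than the thesis's exact evaluation setup: binary labels per token, with 1 marking a disfluent token.

```python
def disfluency_prf(gold, pred):
    """Token-level precision/recall/F1 for the disfluent class (label 1).

    Illustrative sketch only; the thesis's exact evaluation protocol
    is not reproduced here.
    """
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Scoring only the disfluent class avoids inflating results on conversational transcripts, where the overwhelming majority of tokens are fluent.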
September 2024