
Kodali Prashant

Kodali Prashant, supervised by Dr. Manish Shrivastava, received his doctorate in Computer Science and Engineering (CSE). Here’s a summary of his research work, “Computationally Code-Mixing Bahut Challenging Hai: Improving Analysis & Generation of Code-Mixed Text”:

As text processing systems become increasingly central to human-computer interactions—such as voice assistants, chatbots, and search engines—it is vital that these technologies support natural, flexible, and inclusive modes of communication. Language usage is far from uniform and varies significantly across regions, cultures, and communities. A particularly widespread phenomenon in multilingual societies is code-mixing, where speakers fluidly incorporate linguistic units from two or more languages within a single utterance or conversational context. This form of language usage is especially common in countries like India, where multilingualism is the norm. However, contemporary NLP systems often struggle with such input because code-mixed data is under-represented in conventional sources of text. There is therefore an urgent need for robust NLP pipelines that can handle code-mixed input effectively and equitably. This thesis addresses key challenges in processing code-mixed text by proposing comprehensive solutions for its analysis, benchmarking, and generation. The research presented in this thesis advances the field through several core contributions.

Syntactic Code-Mixing Metric (SyMCoM): We propose SyMCoM, a novel metric designed to quantify syntactic code-mixing by analyzing the source language associated with each part-of-speech (PoS) tag in a sentence. This metric offers a linguistically grounded alternative to existing language-ID-based metrics, enabling deeper insights into the structure of code-mixed text and facilitating more nuanced analysis of corpora and systems.
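As a rough illustration of the idea behind such a metric, the following sketch scores how skewed each PoS category is toward one language, then aggregates over the sentence. The input format, tag set, and language labels here are illustrative assumptions, not the thesis's exact specification:

```python
from collections import Counter

def symcom_scores(tokens):
    """tokens: list of (word, pos_tag, lang) triples, lang in {"en", "hi"}.
    Returns a per-PoS score in [-1, 1] and an aggregate sentence score,
    following the general form of a syntactic code-mixing metric
    (a simplified sketch; details differ from the thesis's definition)."""
    per_pos = Counter()       # tokens per PoS tag
    per_pos_lang = Counter()  # (PoS, lang) counts
    for _, pos, lang in tokens:
        per_pos[pos] += 1
        per_pos_lang[(pos, lang)] += 1

    pos_scores = {}
    for pos, total in per_pos.items():
        c_en = per_pos_lang[(pos, "en")]
        c_hi = per_pos_lang[(pos, "hi")]
        # +1.0: all tokens of this PoS are English; -1.0: all are Hindi
        pos_scores[pos] = (c_en - c_hi) / (c_en + c_hi)

    n = len(tokens)
    # Sentence score: PoS scores weighted by tag frequency; a value near 1
    # means each syntactic category is drawn from a single language.
    sentence = sum(abs(s) * per_pos[p] / n for p, s in pos_scores.items())
    return pos_scores, sentence

# "main office ja raha hoon" -- an English noun inside a Hindi sentence
sent = [("main", "PRON", "hi"), ("office", "NOUN", "en"),
        ("ja", "VERB", "hi"), ("raha", "VERB", "hi"),
        ("hoon", "VERB", "hi")]
pos_scores, sentence_score = symcom_scores(sent)
```

Unlike token-level language-ID ratios, a per-PoS breakdown like this reveals *which* syntactic categories carry the mixing (here, only the NOUN slot is English).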

Acceptability of Code-Mixed Text: We introduce CLINE, a large-scale dataset containing English–Hindi code-mixed sentences annotated with human judgments on their acceptability. Through empirical analysis, we show that existing code-mixing metrics often fail to distinguish between acceptable and unacceptable code-mixing. Further, we demonstrate that multilingual language models like XLM-RoBERTa and Llama, when fine-tuned appropriately, can effectively learn to model these human acceptability judgments, offering a path forward for more human-aligned code-mixing evaluation.

A Cross-Lingual Task-Oriented Dialogue Dataset for Hindi and English–Hindi: We develop and release Hindi and English–Hindi versions of a multi-domain, task-oriented dialogue dataset. This dataset supports both natural language understanding (NLU) and generation (NLG) tasks and provides a valuable benchmark for evaluating multilingual and code-mixed capabilities of large language models in realistic, task-based scenarios.

Model Merging Strategies for Code-Mixed Scenarios: To improve performance on code-mixed tasks, we explore novel strategies for adapting pre-trained models, and evaluate the possibility of combining monolingual and code-mixed data through model merging. Our results indicate that merging models is an effective way to adapt pre-trained models to code-mixed tasks, frequently outperforming the traditional approach of continued pre-training followed by fine-tuning.
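The simplest form of model merging is weight interpolation: averaging the parameters of models that share an architecture (e.g., a monolingual and a code-mixed checkpoint). The sketch below illustrates that idea with plain floats standing in for tensors; the parameter names and weighting scheme are hypothetical, not the specific merging strategies evaluated in the thesis:

```python
def merge_models(state_dicts, weights=None):
    """Linearly interpolate parameters across models with identical
    architectures -- weight averaging, the simplest merging strategy.
    Values are plain floats for illustration; in practice these would
    be tensors loaded from checkpoints."""
    if weights is None:
        # Default: uniform average over all models
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged

# Hypothetical parameters from a monolingual and a code-mixed model
mono = {"layer.weight": 0.8, "layer.bias": 0.2}
cm   = {"layer.weight": 0.4, "layer.bias": 0.6}
merged = merge_models([mono, cm])                 # uniform average
skewed = merge_models([mono, cm], [0.75, 0.25])  # favor the monolingual model
```

Because merging needs only the checkpoints themselves, it sidesteps the cost of continued pre-training on scarce code-mixed corpora, which is part of its appeal in this setting.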

CodeMixToolkit: Finally, we present CodeMixToolkit, a modular and extensible framework that standardizes the pipeline for working with code-mixed data. The toolkit offers utilities for accessing and preprocessing data, model training, and evaluation, and supports multiple NLP tasks, with a focus on English–Hindi but extensibility to other language pairs. This resource aims to accelerate research and development in the area by standardizing code-mixing pipelines, lowering entry barriers, and promoting reproducibility.

We conclude by discussing the limitations of our approaches, including challenges related to data availability, generalization across domains, and evaluation frameworks. We also outline future directions for advancing code-mixed NLP, such as improved representation learning, transfer learning techniques, and deeper linguistic analysis. This thesis presents a comprehensive framework for processing code-mixed language, enabling the development of NLP systems that can handle code-mixed text and are better suited to the needs of multilingual users.

September 2025