Prashant Kodali, working with Prof. Manish Shrivastava and Prof. Ponnurangam Kumaraguru, was awarded the Microsoft Research India (MRI) Ph.D. Award 2024 for his research on code-mixing of Indian languages. Prashant is one of only 10 Ph.D. students nationwide to receive this award.
The team’s research focuses on enhancing the performance of Language Models in code-mixed settings, specifically for Hindi-English code-mixing. Code-mixing, a phenomenon where bilingual speakers switch between languages within a conversation, is prevalent in multilingual societies like India. It poses unique challenges for natural language processing models due to its informal nature and syntactic complexity.
Their primary research interest lies in the computational analysis and generation of code-mixed text by leveraging synthetic code-mixed data resources. By using the notion of “naturalness” or “acceptability” as a quality control measure, their work spans data resource creation, analysis tools, and modelling for English-Hindi code-mixing. Some key findings of the research include:
- Human Acceptability Judgements for Code-Mixed Text: The team constructed a dataset named Cline, containing human acceptability judgements for English-Hindi code-mixed text. This dataset, the largest of its kind with 16,642 sentences, shows that popular code-mixing metrics correlate poorly with human judgements, underscoring the importance of “naturalness” in curating data resources. The work also demonstrates that models such as XLM-RoBERTa and Bernice outperform IndicBERT, and even ChatGPT in certain configurations, indicating significant potential for improving code-mixed tasks.
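To make concrete what a “popular code-mixing metric” looks like, the sketch below implements one widely used example, the Code-Mixing Index (CMI) of Gambäck and Das, which scores an utterance purely by how evenly its tokens are split across languages. The tag names (`"en"`, `"hi"`, `"univ"`) and the function itself are illustrative, not the team’s code; note that such a surface-level count is exactly the kind of signal Cline found to correlate poorly with human acceptability.

```python
from collections import Counter

def cmi(tags):
    """Code-Mixing Index for one utterance (illustrative sketch).

    `tags` is a list of per-token language tags, e.g. "en" or "hi",
    with "univ" for language-independent tokens (names, punctuation).
    Returns 0 for monolingual text; higher values mean more mixing,
    approaching 100 as tokens split evenly across languages.
    """
    lang_tags = [t for t in tags if t != "univ"]
    if not lang_tags:
        return 0.0
    # Share of tokens belonging to the dominant language.
    dominant = Counter(lang_tags).most_common(1)[0][1]
    return 100.0 * (1.0 - dominant / len(lang_tags))
```

A fully Hindi sentence scores 0, while “main office ja raha hoon today” (tags roughly `["hi", "en", "hi", "hi", "hi", "en"]`) scores around 33, regardless of whether the mixing reads as natural to a bilingual speaker.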
- Syntactic Variety in Code-Mixed Text: In their earlier research, they proposed SyMCoM, an indicator of syntactic variety in code-mixed text, based on a syntactic analysis of English-Hindi datasets. This metric effectively measures syntactic variety, aiding in the comparison and evaluation of code-mixed corpora and improving the training of more robust NLP models.
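A simplified sketch of the idea behind a SyMCoM-style score, under the assumption that each token carries a part-of-speech tag and a language label: for each syntactic unit (POS tag), measure how lopsided its language distribution is, then average across units weighted by frequency. The data layout and function name here are hypothetical; the actual SyMCoM formulation in the paper may differ in detail.

```python
from collections import Counter, defaultdict

def symcom_sentence(tokens):
    """Sentence-level syntactic-variety score (illustrative sketch).

    `tokens` is a list of (pos_tag, lang) pairs with lang in {"L1", "L2"}.
    For each POS tag, |c_L1 - c_L2| / (c_L1 + c_L2) is 1 when that unit
    comes entirely from one language and 0 when split evenly; the
    sentence score is the frequency-weighted average over POS tags.
    """
    counts = defaultdict(Counter)
    for pos, lang in tokens:
        counts[pos][lang] += 1
    total = len(tokens)
    score = 0.0
    for unit in counts.values():
        unit_total = unit["L1"] + unit["L2"]
        unit_score = abs(unit["L1"] - unit["L2"]) / unit_total
        score += unit_score * (unit_total / total)
    return score
```

Such a score lets corpora be compared on a syntactic axis: two datasets with identical token-level mixing ratios can still differ sharply in which syntactic units each language contributes.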
- Multilingual Benchmark for Task-Oriented Dialogue Systems: As part of a multi-university collaborative effort, Prashant contributed to the development of X-RiSAWOZ, a multilingual benchmark dataset for task-oriented dialogue systems. This work involved creating a toolset for accelerating post-editing and establishing strong baselines for training dialogue agents in zero- and few-shot settings.
The team’s research contributions, including datasets like Cline and X-RiSAWOZ along with novel metrics and methodologies, provide valuable resources for the research community. Ongoing work on parameter-efficient fine-tuning methods and rigorous evaluation frameworks will help identify and address the weaknesses of current models, paving the way for more inclusive and capable NLP systems.
By enhancing LLMs for non-English languages, particularly for non-standard manifestations such as code-mixing and romanization of native scripts, Prashant seeks to contribute to the development of robust multilingual models that perform well across diverse linguistic contexts.
June 2024