[month] [year]

Aditya Srivastava – Code-Mixed NLP

Aditya Srivastava received his MS Dual Degree in Computational Linguistics (CL). His research was supervised by Prof. Dipti M Sharma. Here is a summary of his work on neural approaches for code-mixed NLP in low-resource conditions.

Code-mixing is a phenomenon of natural language in which multilingual speakers mix two or more of the languages they speak within the same utterance. This mixing is not random; it is governed by systematic rules and serves highly effective communication. Code-mixing is very common in exchanges between multilinguals in day-to-day conversation, mass media, pop culture and on social media networks.
NLP systems have historically struggled to process and understand code-mixed language because of its non-standardized nature in the informal contexts where it most often occurs, such as casual speech and social media. This lack of standardization produces extremely high variation in how code-mixing is employed, making rule-based NLP systems impractical to design and build. The solution has been to turn to statistical machine learning systems, which can learn automatically from data. While a large trove of code-mixed data is available on the internet, it is extremely noisy and needs preprocessing and cleaning before it can serve as input to a machine learning pipeline. Because of the painstaking data normalization involved, resources for code-mixed language are scarce and often of poor quality.
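To give a sense of the kind of cleaning involved, here is a minimal sketch of normalizing a noisy code-mixed tweet before it enters a learning pipeline. It is not taken from the thesis; the specific rules shown (stripping URLs, mentions and hashtags, squeezing character elongations) are illustrative assumptions about typical social-media preprocessing.

```python
import re

def normalize_tweet(text: str) -> str:
    """Illustrative cleanup of a noisy code-mixed tweet before tokenization."""
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)         # drop user mentions and hashtags
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)   # squeeze elongations: "soooo" -> "soo"
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    return text.lower()

print(normalize_tweet("Yaar this movie was soooo acchi!! http://t.co/xyz @friend"))
# -> "yaar this movie was soo acchi!!"
```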

In this thesis we detail our attempts at improving results on code-mixed NLP in scenarios where the amount of data is the limiting factor. First, we develop a neural approach to sentiment analysis for Hindi-English code-mixing. We describe an architecture for the hierarchical analysis of code-mixed texts at the word and the sentence level, and use it for sentiment classification. Second, we introduce a novel dataset intended for training or fine-tuning code-mixed language models, containing Hindi-English code-mixed tweets with parallel translations into both pure Hindi and pure English. We then describe how this data can be used to leverage the Hindi and English pretraining of multilingual models, and establish a baseline for the code-mixed translation task.
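As a rough illustration of what a hierarchical word-and-sentence-level model can look like, the sketch below (written in PyTorch) builds word representations with a character-level encoder and then encodes the sequence of word vectors for sentiment classification. This is not the thesis architecture; the layer sizes, the bidirectional LSTM encoders and the character-level word encoding are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class HierarchicalSentimentClassifier(nn.Module):
    """Word-level then sentence-level encoding of a code-mixed sentence."""

    def __init__(self, n_chars: int, char_dim: int = 32, word_dim: int = 64,
                 sent_dim: int = 128, n_classes: int = 3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # word level: the characters of each word -> one word vector
        self.word_enc = nn.LSTM(char_dim, word_dim, batch_first=True, bidirectional=True)
        # sentence level: the sequence of word vectors -> one sentence vector
        self.sent_enc = nn.LSTM(2 * word_dim, sent_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * sent_dim, n_classes)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, n_words, n_chars)
        b, w, c = char_ids.shape
        chars = self.char_emb(char_ids.view(b * w, c))        # (b*w, c, char_dim)
        _, (h_word, _) = self.word_enc(chars)                 # (2, b*w, word_dim)
        words = h_word.transpose(0, 1).reshape(b, w, -1)      # (b, w, 2*word_dim)
        _, (h_sent, _) = self.sent_enc(words)                 # (2, b, sent_dim)
        sent = h_sent.transpose(0, 1).reshape(b, -1)          # (b, 2*sent_dim)
        return self.classifier(sent)                          # sentiment logits

model = HierarchicalSentimentClassifier(n_chars=100)
logits = model(torch.randint(1, 100, (4, 12, 10)))  # 4 sentences, 12 words, 10 chars each
```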
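A translation baseline along these lines could be obtained by fine-tuning an off-the-shelf multilingual sequence-to-sequence model on the parallel pairs. The sketch below uses mBART-50 from Hugging Face transformers purely as an illustration; the checkpoint name, language codes, the example sentence pair and the training step shown are assumptions, not the setup reported in the thesis.

```python
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

# Illustrative checkpoint; the thesis setup may differ.
name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(name, src_lang="hi_IN", tgt_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(name)

# One hypothetical parallel pair: code-mixed source, monolingual English target.
src = "yeh movie kaafi boring thi yaar"
tgt = "this movie was quite boring, man"

batch = tokenizer(src, text_target=tgt, return_tensors="pt")
loss = model(**batch).loss   # cross-entropy against the English reference
loss.backward()              # one fine-tuning step (optimizer omitted for brevity)
```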