Kusampudi Siva Subrahamanyam Varma received his MS Dual Degree in Computer Science and Engineering (CSE). His research work was supervised by Dr. Radhika Mamidi. Here’s a summary of his research work, "Towards Building Pre-processing Tools for Analysing Code-Mixed Social Media Text":
Society has been evolving rapidly with the digital revolution. As Internet adoption has grown over the last decade, multilingualism has proliferated, because the Internet provides unrestricted access to resources. In a multilingual society, people often mix multiple languages in informal settings, producing Code-Mixed (CM) text. The volume of CM text has been increasing over time, and analyzing this data may help us derive potentially valuable conclusions. Since data is seen as the new oil, many public and private institutions have been moving towards digital governance, focusing on data-driven decision-making. This trend is reflected in the sharp rise in the number of companies, both new and existing, working to extract valuable information from various sources.
However, the analysis of low-resourced CM languages remains far from solved. While there have been significant advancements in Natural Language Processing with Machine Learning, processing low-resourced CM text presents new challenges because of its noisy nature and the lack of reliable data. Thus, CM text requires better datasets and robust pre-processing pipelines.
This work analyzes a new approach to building datasets and proposes a pre-processing pipeline for low-resourced CM text. We have chosen Telugu, a low-resourced Dravidian language spoken in the southern part of India, as the primary matrix language and English as the secondary embedded language for this study. In this thesis, we first present the largest CM dataset for Code-Mixed Telugu-English Text (CMTET), annotated with a new, highly efficient chatbot-based annotation tool. We then propose a pre-processing pipeline to help model CM data using existing NLP models. The pipeline starts with Language IDentification (LID) on raw text, followed by a novel data normalization technique that corrects spelling and transliteration errors.
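To make the two pipeline stages concrete, here is a minimal Python sketch of word-level LID followed by spelling normalization. It is not the method from the thesis: the tiny lexicons, the similarity cutoff, the tag names (`te`, `en`, `univ`, `unk`), and the function names are all hypothetical. Plain string similarity from the standard library stands in for the trained models and the novel normalization technique a real system would use. It also assumes Romanized Telugu, as commonly seen on social media.

```python
from difflib import SequenceMatcher

# Hypothetical toy lexicons of canonical spellings (Romanized Telugu and English).
# A real system would learn these from annotated data such as CMTET.
TELUGU_LEXICON = ["bagundi", "chala", "kani", "ledu", "nenu"]
ENGLISH_LEXICON = ["good", "movie", "story", "super", "the", "was"]
LEXICONS = {"te": TELUGU_LEXICON, "en": ENGLISH_LEXICON}


def best_match(word, lexicon):
    """Return (closest lexicon entry, similarity ratio in [0, 1])."""
    scored = [(SequenceMatcher(None, word, entry).ratio(), entry) for entry in lexicon]
    score, entry = max(scored)
    return entry, score


def identify_language(token, cutoff=0.7):
    """Tag a token as 'te', 'en', 'univ' (non-alphabetic), or 'unk' (no close match)."""
    word = token.lower()
    if not word.isalpha():
        return "univ"
    _, te_score = best_match(word, TELUGU_LEXICON)
    _, en_score = best_match(word, ENGLISH_LEXICON)
    if max(te_score, en_score) < cutoff:
        return "unk"
    return "te" if te_score >= en_score else "en"


def normalize(token, lang):
    """Map a noisy spelling to the closest canonical form in its language's lexicon."""
    if lang not in LEXICONS:
        return token  # leave punctuation, numbers, and unknown words untouched
    entry, _ = best_match(token.lower(), LEXICONS[lang])
    return entry


def preprocess(text):
    """Run LID, then normalization, on whitespace-separated tokens."""
    result = []
    for token in text.split():
        lang = identify_language(token)
        result.append((token, lang, normalize(token, lang)))
    return result


if __name__ == "__main__":
    for token, lang, normalized in preprocess("movie chaala bagundhi !"):
        print(token, lang, normalized)
```

Running LID before normalization lets each noisy token be corrected against the lexicon of its own language only, which keeps transliteration variants of a Telugu word from being mapped to a similar-looking English word.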
To analyze the performance of the proposed pipeline, we carried out Sentiment Analysis on the CM dataset and report a 2.53% increase in accuracy when the pipeline is included.