[month] [year]

Devansh Gautam – Sequence classification

Devansh Gautam received his MS Dual Degree in Computer Science and Engineering (CSE). His research work was supervised by Dr. Manish Shrivastava. Here’s a summary of his research work on the translation of code-mixed text and its application in sequence classification:

Code-mixing refers to the mixing of two or more languages, where words from different languages are interleaved within the same conversation. It is popular in multilingual societies around the world and is commonly used in social media texts. With the increasing popularity of social media platforms, the usage of code-mixed languages has grown. English-Hindi code-mixed language, colloquially called Hinglish, is very commonly used in India because of the large number of bilingual speakers who use English in their professional lives and Hindi in their personal lives. Traditional natural language processing (NLP) systems, which are usually trained on monolingual corpora, do not perform well on code-mixed texts.
Building NLP systems for code-mixed languages could help build interactive systems for the large number of people using code-mixed languages. It would also enable us to build systems that can process the large amount of code-mixed data generated on social media and make the content accessible to people and machines alike. For example, building translation systems for code-mixed languages would allow people to communicate with each other comfortably. Other tasks, such as sentiment analysis or toxic content detection, can enable high-level analysis of code-mixed texts.
In this thesis, we propose a simple approach for machine translation of Hinglish (English-Hindi code-mixed) texts to English and explore how transfer learning from the Hindi-to-English translation task can improve the performance of our translation system. We also propose an approach for translating English data to Hinglish. We provide parallel Hindi translations along with the English sentences as input to our translation system and analyse the improvement in the performance of our system. We also propose BLEUnormalized, a modified version of the BLEU metric, to evaluate Hinglish outputs, which often have informal transliterations and varying degrees of code-mixing.
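
To illustrate why informal transliterations make plain BLEU hard to apply to Hinglish, here is a minimal Python sketch. It is not the thesis's definition of BLEUnormalized, which this summary does not spell out. It only assumes that spelling variants can be collapsed before n-gram matching. The `CANONICAL` map, its entries, and the example sentences are hypothetical.

```python
from collections import Counter

# Hypothetical spelling-variant map: each romanized variant is mapped to one
# canonical form. These entries are illustrative, not taken from the thesis.
CANONICAL = {"nahi": "nahin", "nhi": "nahin", "h": "hai"}


def normalize(tokens):
    """Lowercase tokens and collapse known transliteration variants."""
    return [CANONICAL.get(t.lower(), t.lower()) for t in tokens]


def unigram_precision(hypothesis, reference):
    """Clipped unigram precision, the 1-gram building block of BLEU."""
    hyp_counts = Counter(hypothesis)
    ref_counts = Counter(reference)
    matched = sum(min(count, ref_counts[tok]) for tok, count in hyp_counts.items())
    return matched / max(len(hypothesis), 1)


hyp = "mujhe ye movie pasand nhi h".split()
ref = "mujhe ye movie pasand nahi hai".split()

raw = unigram_precision(hyp, ref)  # "nhi" and "h" do not match the reference
normalized = unigram_precision(normalize(hyp), normalize(ref))  # variants collapse
```

In this toy case the two sentences differ only in spelling, yet the raw score treats two of the six tokens as errors, while the normalized score does not.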
Further, we propose a method for sequence-level classification tasks on Hinglish data using our Hinglish-to-English translation system. First, we translate the Hinglish data to English, and then we use transfer learning from state-of-the-art English models to process the translated data. We evaluate the effectiveness of our approach on the tasks of Sentiment Analysis and Natural Language Inference of Hinglish texts. We use various pre-trained models that have been fine-tuned for similar English-only tasks and have shown state-of-the-art performance on them. We further fine-tune these models on the translated datasets. To the best of our knowledge, we achieve state-of-the-art performance in both tasks.
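
The two-step structure described above (translate first, then classify the translation) can be sketched as follows. This is only a structural illustration in Python: `classify_via_translation`, `toy_translate`, and `toy_sentiment` are hypothetical names, and the toy functions stand in for the trained translation system and the fine-tuned English models, which are not reproduced here.

```python
from typing import Callable, List


def classify_via_translation(
    texts: List[str],
    translate: Callable[[str], str],
    classify: Callable[[str], str],
) -> List[str]:
    """Translate each Hinglish text to English, then classify the translation."""
    return [classify(translate(text)) for text in texts]


# Toy stand-ins so the sketch runs without any trained model.
def toy_translate(text: str) -> str:
    """Word-by-word lookup; a real system would be a trained translation model."""
    lexicon = {"bahut": "very", "accha": "good", "bura": "bad", "tha": "was"}
    return " ".join(lexicon.get(word, word) for word in text.lower().split())


def toy_sentiment(text: str) -> str:
    """Keyword rule; a real system would be a fine-tuned English classifier."""
    words = text.split()
    if "good" in words:
        return "positive"
    if "bad" in words:
        return "negative"
    return "neutral"


labels = classify_via_translation(
    ["movie bahut accha tha", "service bura tha"], toy_translate, toy_sentiment
)
```

Keeping the translator and the classifier as separate callables mirrors the idea that any English model can be plugged in once the data has been translated.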
Finally, we explore Natural Language Inference in a different setting, called tabular entailment, where the task is to verify a statement given a table as evidence. We propose a method, based on TAPAS, for classifying whether the table supports the statement. We evaluate how transfer learning and standardizing tables to have a single header row can improve TAPAS’ performance. We also propose a method for predicting which cells in the table provide evidence for or against the statement. We again use TAPAS and evaluate how different fine-tuning strategies can improve its performance.
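
As an illustration of what standardizing a table to a single header row might involve, the Python sketch below merges several header rows column by column. The exact procedure used in the thesis is not described in this summary. The function name `flatten_header`, the `header_rows` parameter, and the example table are assumptions made for this sketch.

```python
from typing import List


def flatten_header(table: List[List[str]], header_rows: int) -> List[List[str]]:
    """Merge the first `header_rows` rows into one header row; keep body rows."""
    if header_rows < 1 or header_rows > len(table):
        raise ValueError("header_rows must be between 1 and the number of rows")
    headers = table[:header_rows]
    merged = []
    for column in zip(*headers):
        parts = []
        for cell in column:
            cell = cell.strip()
            # Skip empty cells and repeats caused by spanning header cells.
            if cell and (not parts or parts[-1] != cell):
                parts.append(cell)
        merged.append(" ".join(parts))
    return [merged] + table[header_rows:]


table = [
    ["Team", "Score", "Score"],
    ["", "Home", "Away"],
    ["A", "3", "1"],
]
flat = flatten_header(table, header_rows=2)
```

Here the two header rows collapse into `Team`, `Score Home`, and `Score Away`, so a model that expects exactly one header row sees every column name in full.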