Kartikey Pant received his MS Dual Degree in Computational Linguistics (CL). His research was supervised by Dr. Manish Shrivastava and Dr. Radhika Mamidi. Here's a summary of Kartikey Pant's MS thesis, Towards Enhancing Natural Language Representation for Downstream Tasks, as explained by him:
Computers do not genuinely understand natural language, nor can they converse with humans on arbitrary topics, and this gap remains a significant barrier between us and machines. Researchers in natural language processing and understanding are driven by the impetus to break this barrier. Consequently, natural language processing techniques have moved from methodologies involving complex sets of hand-written rules to machine learning and neural network-based methods.

Recent work has extensively used sequence-to-sequence models, back-translation-based methods, contextualized word representations, and transformer-based models to enhance the ability of machines to understand natural language. While sequence-to-sequence models enabled computers to generate arbitrary output sequences after seeing the entire input, transformers scaled this ability up to learning from immensely large amounts of text. Contextualized word representations let machines use pre-training and fine-tuning to produce better representations more efficiently. Back-translation methodologies, on the other hand, allow machines to train text generation models without a parallel dataset, exploiting the natural alignment between the representations already present in the two domains of the generation space.

Yet no machine has reached the level of understanding and conversational fluency needed to pass the Turing test. This motivates enhancing contemporary methodologies built around natural language representations to improve the ability of machines to understand natural language. Hence, in this thesis, we attempt to address some of the commonly faced challenges in natural language processing through enhanced language representations.

Firstly, we investigate the task of subjective bias detection in Wikipedia text by adapting optimized methodologies based on contextualized word representations, exploiting pre-training and fine-tuning (a minimal sketch of this recipe appears after this summary). We outperform the state of the art by a margin of 5.6 F1 points.

Secondly, we explore the text classification task of predicting Disclosure and Supportiveness labels on Twitter and Reddit datasets. Our ensemble methodology based on contextualized word representations (also sketched below) outperformed all other methods and was adjudged the Best System in the CL-Aff Shared Task held at AAAI 2019.

Thirdly, we explore the new task of fine-grained classification of tobacco-related tweets. We release the SmokEng dataset, along with a comprehensive data annotation schema, and show that our model based on contextualized word representations outperforms the previous state of the art by a margin of 1.8 F1 points.

Finally, we accommodate attribute control in the task of semi-supervised sentiment transfer through a sentiment-based loss (a conceptual sketch of such a combined objective follows below). We release SentiInc, a simple framework for encoding sentiment-specific information in the target sentence while preserving the content information already present in the text. Experiments on the widely-used Yelp dataset show that SentiInc outperforms previous state-of-the-art methods by a margin of 11% in G-Score.
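To make the pre-train-then-fine-tune recipe behind the first contribution concrete, here is a minimal sketch using the Hugging Face transformers library to fine-tune a BERT-style encoder as a binary subjective-vs-neutral sentence classifier. The checkpoint name, the toy sentences, and the hyperparameters are illustrative assumptions, not the thesis's actual configuration.

```python
# Minimal sketch: fine-tuning a pretrained transformer encoder as a
# binary subjective-vs-neutral classifier. The checkpoint, toy data,
# and hyperparameters are illustrative, not the thesis's setup.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # any BERT-style encoder works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy training pairs: 1 = subjective phrasing, 0 = neutral phrasing.
sentences = [
    "The senator gave a brilliant, inspiring speech.",
    "The senator spoke for twenty minutes.",
]
labels = torch.tensor([1, 0])

batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few gradient steps on the toy batch
    out = model(**batch, labels=labels)  # cross-entropy computed internally
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Inference: predicted label per sentence.
model.eval()
with torch.no_grad():
    preds = model(**batch).logits.argmax(dim=-1)
print(preds.tolist())
```

In practice, fine-tuning would run over a full labeled dataset with batching and validation; the loop above only shows the shape of the recipe.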
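The ensemble methodology from the second contribution can be illustrated in the same spirit. The sketch below performs simple soft voting: it averages class probabilities from several independently loaded classifiers. The two checkpoints are placeholders standing in for fine-tuned ensemble members; the thesis's actual system may combine its members differently.

```python
# Sketch: soft-voting ensemble over independently fine-tuned classifiers.
# The checkpoints are placeholders; untrained heads are used here only to
# keep the example self-contained, so outputs are not meaningful.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoints = ["bert-base-uncased", "roberta-base"]  # stand-ins for fine-tuned members

def ensemble_predict(texts):
    summed = None
    for ckpt in checkpoints:
        tok = AutoTokenizer.from_pretrained(ckpt)
        mdl = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)
        mdl.eval()
        batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            probs = mdl(**batch).logits.softmax(dim=-1)
        # Accumulate per-class probabilities across ensemble members.
        summed = probs if summed is None else summed + probs
    return (summed / len(checkpoints)).argmax(dim=-1)

print(ensemble_predict(["I finally told my family everything."]).tolist())
```

Averaging probabilities (soft voting) is a common choice because it uses each member's confidence rather than only its hard label; this is a note about the general technique, not a claim about the thesis's exact system.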
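Finally, SentiInc's central idea of a sentiment-based loss can be read as a combined objective: a content term from the back-translation reconstruction plus a sentiment term from a classifier applied to the generated sentence. The fragment below is a conceptual sketch under that reading, not the released SentiInc code; every name, including the weight lambda_senti, is hypothetical.

```python
# Conceptual sketch of a sentiment-augmented training objective for
# semi-supervised sentiment transfer. All names are hypothetical; this
# is not the released SentiInc implementation.
import torch
import torch.nn.functional as F

def transfer_loss(recon_logits, source_ids, senti_logits, target_sentiment,
                  lambda_senti=1.0):
    """Content term + sentiment term.

    recon_logits:     (batch, seq_len, vocab) decoder logits from the
                      back-translation pass reconstructing the source.
    source_ids:       (batch, seq_len) reference token ids (content signal).
    senti_logits:     (batch, num_sentiments) scores from a sentiment
                      classifier applied to the generated sentence.
    target_sentiment: (batch,) desired sentiment labels.
    """
    # Back-translation content objective: reconstruct the source tokens.
    recon = F.cross_entropy(recon_logits.transpose(1, 2), source_ids)
    # Sentiment objective: push generations toward the target sentiment.
    senti = F.cross_entropy(senti_logits, target_sentiment)
    return recon + lambda_senti * senti

# Toy shapes only, to show the call signature.
loss = transfer_loss(torch.randn(2, 5, 100), torch.randint(0, 100, (2, 5)),
                     torch.randn(2, 2), torch.tensor([1, 0]))
```

On the reported metric: in the sentiment-transfer literature this work sits in, G-Score typically denotes the geometric mean of sentiment-transfer accuracy and content-preservation BLEU; the thesis itself should be consulted for its exact definition.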