Sahil B – Discourse Parsing and Connective

August 2022

Sahil Bakshi received his MS Dual Degree in Computational Linguistics (CL). His research work was supervised by Prof. Dipti M Sharma. Here’s a summary of his research work, Towards Discourse Parsing and Connective Identification in Hindi:

Discourse parsing is a sub-field of natural language processing that involves understanding the structure and information flow of a given text and modeling its coherence. It forms the basis of several natural language processing tasks, including, but not limited to, question-answering, text summarization, and sentiment analysis. Two of the fundamental tasks in discourse parsing are discourse unit segmentation and connective identification. Discourse unit segmentation refers to identifying the elementary units that combine to form a coherent text. Connectives signal the presence of explicit discourse relations in text, and connective identification is the task of identifying these discourse connectives.
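To make the connective identification task concrete, here is a minimal sketch in Python. It is not the method used in the thesis: it simply matches tokens against a tiny hypothetical English lexicon (`CONNECTIVES`, an assumption made for illustration) and returns candidate spans. A lookup of this kind only proposes candidates, because many connective words also have non-discourse uses, which is part of why learned models are used for the task.

```python
# Hypothetical mini-lexicon for illustration only; a real system would rely on
# a curated connective inventory and a model that disambiguates each candidate.
CONNECTIVES = [
    ("because",),
    ("however",),
    ("although",),
    ("but",),
    ("as", "a", "result"),
]


def find_candidate_connectives(tokens):
    """Return (start, end) token spans, end exclusive, that match the lexicon.

    Longer patterns are tried first so that multi-word connectives are
    matched as a single span.
    """
    lowered = [t.lower() for t in tokens]
    patterns = sorted(CONNECTIVES, key=len, reverse=True)
    spans = []
    i = 0
    while i < len(lowered):
        for pattern in patterns:
            if tuple(lowered[i:i + len(pattern)]) == pattern:
                spans.append((i, i + len(pattern)))
                i += len(pattern)
                break
        else:
            # No pattern starts at this token; move on.
            i += 1
    return spans


tokens = "He stayed home because it rained . As a result , the match was cancelled .".split()
candidates = find_candidate_connectives(tokens)  # [(3, 4), (7, 10)]
```

Here the two candidate spans cover "because" and "As a result".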

Language has always played a significant role in human interaction and the evolution of society. With the increasing amount of text data being generated every day on social media platforms such as Facebook, Twitter, WhatsApp, and Reddit, helping machines understand and analyse this data is fundamental to improving the performance of systems for downstream NLP tasks. In this thesis, we explore the sub-field of shallow discourse parsing, compare approaches towards segmentation and connective identification, and build a dataset and connective identification system for Hindi data.

First, we look at approaches towards shallow discourse parsing, which aims to identify the individual discourse relations present in a text. Given a text, this involves identifying the span of the explicit discourse connective, labelling the two text spans that act as the arguments of the connective, and predicting the sense of the discourse relation. We compare and analyse several approaches for these tasks. We then look at an approach towards shallow discourse parsing in Hindi and analyse the tasks of identifying explicit discourse connectives and their arguments.
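The output of the steps described above can be pictured as one record per explicit relation. The sketch below is an illustrative data structure in Python, not a format taken from the thesis. The class name `ExplicitRelation`, its field names, and the sense label `Contingency.Cause` (written in the style of PDTB sense inventories) are assumptions, and each argument is simplified to a single contiguous token span.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # token offsets, end exclusive


@dataclass(frozen=True)
class ExplicitRelation:
    """One explicit discourse relation (hypothetical schema for illustration)."""

    connective: Span  # span of the explicit connective
    arg1: Span        # first argument of the connective
    arg2: Span        # second argument of the connective
    sense: str        # predicted sense label

    def render(self, tokens: List[str]) -> Dict[str, str]:
        """Map each span back to its surface text for inspection."""
        def text(span: Span) -> str:
            return " ".join(tokens[span[0]:span[1]])

        return {
            "connective": text(self.connective),
            "arg1": text(self.arg1),
            "arg2": text(self.arg2),
            "sense": self.sense,
        }


tokens = "He stayed home because it rained".split()
relation = ExplicitRelation(
    connective=(3, 4),
    arg1=(0, 3),
    arg2=(4, 6),
    sense="Contingency.Cause",  # assumed label, for illustration only
)
rendered = relation.render(tokens)
```

In this example the connective is "because", the first argument is "He stayed home", and the second argument is "it rained".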

Further, we work on building a multilingual model for discourse unit segmentation and connective identification. Early approaches towards segmentation and connective detection relied on rule-based systems that used POS tags and other syntactic information to identify discourse segments. Recently, transformer-based neural systems have shown promising results in this domain. We establish a baseline using a bidirectional LSTM model. We then look at transformer-based neural systems and train our model on 16 datasets encompassing 11 languages and 3 discourse annotation frameworks. This model gives state-of-the-art performance on the English dataset. We then present a curated dataset and model for connective identification in Hindi. We experiment with different Indian-language-specific models and compare and analyse their performance.
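Neural systems for connective identification are commonly framed as token-level sequence labelling. The summary does not state the labelling scheme used, so the sketch below assumes a simple BIO scheme ("B" begins a connective, "I" continues it, "O" is outside) and shows how per-token predictions could be converted back into connective spans. The function name `bio_to_spans` is hypothetical.

```python
def bio_to_spans(labels):
    """Convert a BIO label sequence into (start, end) spans, end exclusive.

    Assumes the labels are exactly "B", "I", or "O". An "I" that does not
    follow a "B" or "I" is tolerated and treated as the start of a span.
    """
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == "B":
            if start is not None:
                spans.append((start, i))  # close the previous span
            start = i
        elif label == "I":
            if start is None:
                start = i
        else:  # "O"
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(labels)))  # span runs to the end
    return spans


predicted = ["O", "O", "O", "B", "O", "O", "O", "B", "I", "I", "O"]
spans = bio_to_spans(predicted)  # [(3, 4), (7, 10)]
```

Decoding to spans in this way makes it possible to compare predicted connectives against gold annotations at the span level rather than token by token.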