December 2022
Payal Khullar received her doctorate in Computational Linguistics (CL). Her research work was supervised by Dr. Manish Shrivastava. Here’s a summary of her research work on Finding the Heads of Headless NPs, and Why:
Ellipsis is deemed important for downstream Natural Language Processing (NLP) tasks that handle data containing it, such as Information Retrieval (IR), event extraction, Machine Translation (MT), dialogue systems, and summarization. The most straightforward way to confirm this is to add ellipsis resolution as an additional step to existing NLP pipelines and measure the change in their performance. However, previous computational work on ellipsis resolution is not only limited, but has also focused mainly on one type of ellipsis, namely Verb Phrase Ellipsis (VPE), and on a few closely related phenomena such as sluicing and gapping. Computational support for other types of ellipsis is sparser still. Another major challenge facing the field is the limited use of the existing theoretical understanding of this phenomenon in linguistics, which, as I find, has made the task harder than it really is.
This thesis presents the first computational treatment of another major form of ellipsis, namely noun ellipsis (also referred to as noun phrase ellipsis or nominal ellipsis in some linguistics textbooks), drawing on the linguistic features of the phenomenon. I present several procedures for the detection and resolution of noun ellipsis in English, as well as the motivation behind and the impact of doing so. The computational experiments draw heavily on the syntactic analysis of noun ellipsis in traditional and more recent theoretical linguistic studies. I begin with a rule-based approach, in which I exploit existing syntactic analyses of ellipsis in general, and of noun ellipsis in particular, to identify cues for its detection and resolution. This results in novel syntactic rules, which are later represented and optimized as hand-crafted features in a supervised Machine Learning framework.
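As a rough illustration of what a syntactic cue for noun ellipsis can look like, the sketch below flags a numeral or quantifier that is not followed by a nominal head and proposes the nearest preceding common noun as its antecedent. This is a toy heuristic written for this summary, not the rule set developed in the thesis: the function name `find_noun_ellipsis`, the word list, and the choice of Penn Treebank-style tags as input are all assumptions, and the rule will over-generate on constructions such as partitives ("some of the books").

```python
# Toy heuristic for noun ellipsis; NOT the rules from the thesis.
# Input is a sentence that has already been POS-tagged with
# Penn Treebank-style tags (an assumption made for this sketch).

# Hypothetical licensor cues: cardinal numbers plus a few quantifiers
# and demonstratives that can stand in for a full noun phrase.
LICENSOR_TAGS = {"CD"}
LICENSOR_WORDS = {"some", "many", "few", "several", "both", "each", "these", "those"}

# If one of these tags follows the licensor, a nominal head is probably present.
NOMINAL_CONTINUATION = {"NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "JJS", "CD"}

# Only common nouns are considered as antecedent candidates.
COMMON_NOUN_TAGS = {"NN", "NNS"}


def find_noun_ellipsis(tagged):
    """Return (index, licensor, antecedent) for each suspected ellipsis site."""
    results = []
    for i, (word, tag) in enumerate(tagged):
        is_licensor = tag in LICENSOR_TAGS or word.lower() in LICENSOR_WORDS
        if not is_licensor:
            continue
        next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else None
        if next_tag in NOMINAL_CONTINUATION:
            continue  # an overt noun (or a modifier leading to one) follows
        # Resolution step: nearest preceding common noun, or None if absent.
        antecedent = None
        for prev_word, prev_tag in reversed(tagged[:i]):
            if prev_tag in COMMON_NOUN_TAGS:
                antecedent = prev_word
                break
        results.append((i, word, antecedent))
    return results


example = [
    ("I", "PRP"), ("bought", "VBD"), ("two", "CD"), ("books", "NNS"),
    ("and", "CC"), ("Mary", "NNP"), ("bought", "VBD"), ("three", "CD"),
    (".", "."),
]
print(find_noun_ellipsis(example))  # [(7, 'three', 'books')]
```

In the example, "two" is followed by an overt noun and is skipped, while "three" is flagged and linked to "books". A cue of this kind can also be encoded as a binary feature for a supervised classifier.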
One of the challenges of using the linguistic theory of ellipsis is the discrepancy in how previous work defines the phenomenon. Traditionally, many linguists have described noun ellipsis in English as exhibiting two strategies, namely true lexical elision and one-anaphora. However, more recent linguistic analysis shows that these are completely different phenomena. This claim finds support in a linguistic analysis of the word one that takes into account its morphological, syntactic and semantic properties, and it is backed by statistical insights into its occurrence and distribution from data-driven studies carried out on several popular English corpora.
In this thesis, I use this linguistic analysis to modify the system for the detection and resolution of noun ellipsis presented earlier, and successfully handle the detection and resolution of one-anaphora in English with high accuracy. The investigation and the computational experiments that lead to it highlight the importance of linguistic analysis in NLP research. Apart from presenting end-to-end pipelines for noun ellipsis and noun anaphora resolution that significantly outperform state-of-the-art models, another major contribution of this work comprises novel findings on the impact of noun ellipsis (along with verbal and clausal ellipsis), as well as of the anaphoric instances of the word one, on Machine Translation (MT). I also test the efficacy of a simple resolution procedure as a preprocessing step for improving the performance of English-Hindi Neural Machine Translation (NMT). This is a simple syntactic procedure that can be extended to other language pairs and can easily be incorporated into an existing model without any modification to the model itself. Finally, there are some byproducts of this work, including two large hand-annotated corpora for the analysis of one-anaphora and noun ellipsis in English, one small curated dataset containing several instances of noun ellipsis, and a parallel test set to aid research in English-Hindi MT.