Anmol Goel, supervised by Prof. Ponnurangam Kumaraguru, received his Master of Science in Computer Science and Engineering (CSE). Here's a summary of his research work on "Beyond the surface: A computational exploration of linguistic ambiguity":
The issue of ambiguity in natural language poses a significant challenge to computational linguistics and natural language processing. Ambiguity arises when words or phrases can have multiple meanings depending on the context in which they are used. Addressing this challenge is crucial for building more accurate and effective language models that better reflect the complexity of human communication. In this thesis, we investigate two specific forms of linguistic ambiguity: polysemy, the multiplicity of meanings of a single word, and colloquial tautology, the use of seemingly uninformative and ambiguous phrases in conversation. These phenomena are widely known manifestations of linguistic ambiguity at the lexical and pragmatic levels, respectively.

The first part of the thesis addresses this challenge by proposing a new method for quantifying the degree of polysemy of a word, that is, the number of distinct meanings it can have. The proposed approach is a novel, unsupervised framework that computes polysemy scores for words in multiple languages while infusing syntactic knowledge in the form of dependency structures. The framework adopts a graph-based approach, computing the discrete Ollivier-Ricci curvature on a graph of contextual nearest neighbours (a simplified sketch of this construction appears below). Its effectiveness is demonstrated by significant correlations between the resulting scores and expert human-annotated language resources such as WordNet. The framework is evaluated on curated datasets that control for different sense distributions of words in three typologically diverse languages: English, French, and Spanish. By leveraging contextual language models and syntactic structures, the framework empirically supports the widely held theoretical-linguistic notion that syntax is intricately linked to ambiguity and polysemy. This work was presented as a full paper at EMNLP 2022.

The second part of the thesis explores how language models handle colloquial tautologies, a type of redundancy common in conversational speech. Colloquial tautologies pose an additional challenge for language processing: they repeat words or phrases that appear redundant yet convey a specific meaning in a given context (for instance, "boys will be boys"). We first present a dataset of colloquial tautologies and evaluate several state-of-the-art language models on it using perplexity scores (a simplified scoring sketch appears below). We conduct probing experiments while controlling for the noun type, context, and form of the tautologies. The results reveal that BERT and GPT-2 perform better with modal forms and human nouns, which aligns with prior literature and human intuition. We hope this work bolsters further research on ambiguity in language models.

Our contributions have important implications for the development of more accurate and reliable natural language processing systems. In conclusion, this thesis highlights the shortcomings of existing research on linguistic ambiguity and proposes solutions to overcome them. The incorporation of syntax and geometry into polysemy quantification is a novel contribution that demonstrates the effectiveness of syntactically motivated methods, while the study of colloquial tautologies sheds light on the ability of pretrained language models to handle tautological constructions and on the factors that influence it.
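The following is a minimal, illustrative sketch of the graph-based idea described above, not the thesis's actual implementation: contextual embeddings of a word's occurrences are connected into a k-nearest-neighbour graph, the discrete Ollivier-Ricci curvature is computed for each edge, and the curvatures are aggregated into a polysemy proxy. The choice of k, the mixing parameter alpha, the cosine metric, and the sign convention of the final score are all illustrative assumptions.

```python
# Sketch of a graph-based polysemy proxy: k-NN graph over the contextual
# embeddings of one word's occurrences, discrete Ollivier-Ricci curvature per
# edge, aggregated into a single score. Parameter choices (k, alpha, cosine
# metric, sign of the score) are illustrative assumptions, not the thesis's.
import numpy as np
import networkx as nx
import ot                                    # POT: Python Optimal Transport
from sklearn.neighbors import NearestNeighbors


def knn_graph(embeddings: np.ndarray, k: int = 5) -> nx.Graph:
    """Undirected k-nearest-neighbour graph over contextual embeddings."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)
    _, idx = nn.kneighbors(embeddings)
    G = nx.Graph()
    G.add_nodes_from(range(len(embeddings)))
    for i, neighbours in enumerate(idx):
        G.add_edges_from((i, j) for j in neighbours if j != i)
    return G


def ollivier_ricci(G: nx.Graph, x, y, alpha: float = 0.5) -> float:
    """Discrete Ollivier-Ricci curvature of edge (x, y) in an unweighted graph."""
    def measure(node):
        nbrs = list(G.neighbors(node))
        support = [node] + nbrs
        probs = np.array([alpha] + [(1.0 - alpha) / len(nbrs)] * len(nbrs))
        return support, probs

    sx, px = measure(x)
    sy, py = measure(y)
    # Ground cost: shortest-path distances between the two measure supports.
    cost = np.array([[nx.shortest_path_length(G, u, v) for v in sy] for u in sx],
                    dtype=float)
    w1 = ot.emd2(px, py, cost)               # 1-Wasserstein distance
    return 1.0 - w1 / nx.shortest_path_length(G, x, y)


def polysemy_score(embeddings: np.ndarray, k: int = 5) -> float:
    """Bridge-like edges between sense clusters tend to have negative curvature,
    so a lower average curvature is read here as a higher polysemy score."""
    G = knn_graph(embeddings, k)
    curvatures = [ollivier_ricci(G, u, v) for u, v in G.edges()]
    return -float(np.mean(curvatures))
```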
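Below is a minimal sketch, assuming the Hugging Face transformers library, of how perplexity under a causal language model such as GPT-2 can be computed for a tautological sentence; the example sentences are illustrative and are not drawn from the dataset described above (scoring a masked model like BERT would instead require a pseudo-perplexity variant).

```python
# Sketch of perplexity scoring for tautologies with GPT-2 via the Hugging Face
# transformers library; the sentences below are illustrative examples only.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()


@torch.no_grad()
def perplexity(sentence: str) -> float:
    """Exponentiated mean token cross-entropy of the sentence under GPT-2."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss       # shifted next-token prediction loss
    return torch.exp(loss).item()


# A lower score means the model finds the construction less surprising.
print(perplexity("Boys will be boys."))          # modal form, human noun
print(perplexity("A promise is a promise."))     # equative form, non-human noun
```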
July 2023