
Aravapalli Akhilesh

Aravapalli Akhilesh, supervised by Dr. Radhika Mamidi, received his Master of Science – Dual Degree in Computational Linguistics (CLD). Here's a summary of his research work on "Unlocking Linguistic Insights and Knowledge Accessibility: Advancing NLP for Low-Resource Indian Languages through Model Probing and Content Generation":

The rapid advancement of Transformer-based models has transformed natural language processing (NLP), yet their application to low-resource Indic languages remains underexplored, limiting equitable access to digital knowledge. This thesis investigates the encoding capabilities and robustness of multilingual Transformer models for Indic languages, culminating in practical NLP applications that enhance knowledge accessibility. By addressing linguistic diversity and digital inclusion, the work speaks to the urgent need to bridge information gaps for millions of Indic language speakers, particularly in India, where regional languages such as Telugu face significant resource constraints.

The work begins by introducing IndicSentEval, a novel dataset of approximately 47,000 sentences across six Indic languages: Hindi, Telugu, Marathi, Kannada, Urdu, and Malayalam. This dataset enables a comprehensive probing analysis of nine multilingual Transformer models (seven universal, two Indic-specific) across eight linguistic properties spanning surface, syntactic, and semantic features. The findings reveal that while the models encode these properties consistently for English, performance on the Indic languages varies, with Indic-specific models outperforming universal ones. Extending this analysis, the study evaluates model robustness against 13 perturbations, such as word dropping and word shuffling, and finds that universal models exhibit greater resilience, particularly under noun and verb perturbations. These insights, detailed in Chapters 3 and 4, highlight model-specific strengths and weaknesses and inform the design of robust NLP systems. The development of IndicSentEval and the resulting insights into model behavior also underscore the critical role of digitized linguistic data in enabling effective downstream NLP tasks for low-resource languages.

Building on these findings, the thesis develops a practical application to enhance Telugu Wikipedia, addressing the knowledge access gap for 95.7 million Telugu speakers. Leveraging NLP techniques such as translation, transliteration, and template generation, the study generates 8,929 high-quality movie-domain articles that adhere to Wikipedia's standards for word count, images, and infoboxes. This contribution, detailed in Chapter 5, demonstrates the real-world impact of robust NLP systems and fosters digital inclusion for low-resource language communities.
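To give a rough feel for the probing setup described above, the sketch below trains a simple logistic-regression probe on frozen multilingual sentence embeddings to predict a surface-level property (here, a coarse sentence-length bin). This is an illustrative assumption of how such probing is commonly done, not the thesis's actual code: the model name, toy sentences, and labels are placeholders, whereas IndicSentEval would supply real sentences and gold property labels.

```python
# Hypothetical probing sketch: train a linear probe on frozen sentence
# embeddings to predict a surface-level property (a coarse length bin).
# Model name, data, and labels are placeholders, not the thesis's setup.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

MODEL_NAME = "bert-base-multilingual-cased"  # any multilingual encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(sentences):
    """Mean-pool the last hidden layer to get one vector per sentence."""
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc).last_hidden_state        # (batch, seq, hidden)
    mask = enc["attention_mask"].unsqueeze(-1)      # ignore padding tokens
    return ((out * mask).sum(1) / mask.sum(1)).numpy()

# Toy sentences with a "length bin" label (short = 0, long = 1).
train_sents = ["నేను పుస్తకం చదువుతున్నాను",
               "मैं आज बाजार जा रहा हूँ क्योंकि घर में सब्ज़ियाँ खत्म हो गई हैं"]
train_labels = [0, 1]
test_sents = ["అతను పాట పాడుతున్నాడు",
              "वह हर रविवार को अपने दोस्तों के साथ क्रिकेट खेलने जाता है"]
test_labels = [0, 1]

probe = LogisticRegression(max_iter=1000).fit(embed(train_sents), train_labels)
print("probe accuracy:", accuracy_score(test_labels, probe.predict(embed(test_sents))))
```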
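The robustness analysis re-runs such probes on perturbed inputs. Below is a minimal sketch of two illustrative perturbations, word dropping and word shuffling; the thesis evaluates 13 perturbations, which may be defined differently, so these functions are assumptions for illustration only.

```python
# Minimal sketch of two input perturbations for robustness probing:
# random word dropping and word-order shuffling (illustrative only).
import random

def drop_words(sentence: str, p: float = 0.2, seed: int = 0) -> str:
    """Remove each whitespace-separated token with probability p."""
    rng = random.Random(seed)
    kept = [w for w in sentence.split() if rng.random() > p]
    return " ".join(kept) if kept else sentence  # never return an empty string

def shuffle_words(sentence: str, seed: int = 0) -> str:
    """Randomly permute the word order of the sentence."""
    rng = random.Random(seed)
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)

original = "నేను రేపు సినిమా చూడటానికి వెళ్తున్నాను"
print(drop_words(original))
print(shuffle_words(original))
# A probe trained on clean embeddings can then be re-evaluated on embeddings
# of these perturbed sentences to measure how much each property degrades.
```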
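Finally, to illustrate the template-generation step behind the Telugu Wikipedia articles, here is a hypothetical sketch that fills a short Telugu prose template and a wikitext infobox from structured movie metadata. The field names, template wording, and example record are assumptions for illustration and are not the pipeline actually used in the thesis.

```python
# Hypothetical sketch of template-based article generation: fill a Telugu
# prose template and a wikitext infobox from structured movie metadata.
# Field names, template wording, and the example record are assumptions.
movie = {
    "title_te": "ఉదాహరణ చిత్రం",      # translated/transliterated title
    "year": 2020,
    "director_te": "రవి కుమార్",
    "language_te": "తెలుగు",
}

INTRO_TEMPLATE = (
    "{title_te} అనేది {year} సంవత్సరంలో విడుదలైన {language_te} చలనచిత్రం. "
    "ఈ చిత్రానికి {director_te} దర్శకత్వం వహించారు."
)

INFOBOX_TEMPLATE = """{{{{Infobox film
| name     = {title_te}
| director = {director_te}
| language = {language_te}
| released = {year}
}}}}"""

def build_article(record: dict) -> str:
    """Render the infobox plus an introductory sentence for one movie record."""
    return INFOBOX_TEMPLATE.format(**record) + "\n\n" + INTRO_TEMPLATE.format(**record)

print(build_article(movie))
```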

July 2025