S Aggarwal – Exploiting Indian Languages

November 2022

Salil Aggarwal received his Master of Science – Dual Degree in Computer Science and Engineering (CSE). His research was supervised by Dr. Radhika Mamidi. Here’s a summary of his research work on Exploiting Indian Languages’ Similarity for Different NLP Applications:

The Internet is now a familiar part of everyday life; children and adults of all ages use it in some form. Communication has undergone a major shift, from carrier pigeons to the telephone to voice-over-IP. The Internet connects smaller businesses to larger ones and makes communication faster, more reliable, and more accessible. Every second, the world creates an enormous amount of data, the majority of it unstructured. Making sense of this vast amount of raw, unstructured data has become pivotal for innovation and data-driven decision-making. But what gives coherence to this raw data? The answer is Natural Language Processing (NLP). In simple terms, NLP refers to the capability of computers to interpret human language, whether spoken or written. NLP is a branch of Artificial Intelligence that enables systems to understand, process, and interpret human languages; it involves teaching computers to make sense of the natural language humans use and to derive meaning from it. Many algorithms train machines to understand natural language. Most are based on machine learning, in which a program is trained on a large amount of language data so that it can process the language with reasonable accuracy.

This work focuses on improving two NLP applications for Indian languages, viz. Machine Translation and Text Classification. India is a multilingual country with thousands of languages and regional dialects spoken by its citizens. Some have been lost, while others remain in use with significant numbers of speakers. Most Indian languages are related to each other through shared ancestry or prolonged contact, and because of this relatedness they share many linguistic and structural features. It is therefore essential to exploit this language relatedness to build efficient NLP systems.
We make use of the relatedness between Indian languages and attempt to quantify their similarity. In computing the similarity score, we examine different factors (script, string metric, size and type of parallel corpus, etc.) that can affect the similarity value. We also exploit language relatedness to train efficient multilingual NMT systems (one-to-many and many-to-one) and multilingual text classification systems for languages of the Indian subcontinent. Finally, we contribute to resource creation and an annotation strategy for low-resource domains in Indian languages, further motivating researchers to curate new datasets for fields with little or no annotated data.
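To give a flavour of what quantifying language similarity with a string metric over a parallel corpus can look like, here is a minimal sketch. It scores word pairs from a Hindi–Marathi lexicon (both written in Devanagari, so no script conversion is needed) with a normalized edit distance and averages the scores. The word pairs, the choice of Levenshtein distance, and the simple averaging are illustrative assumptions, not the thesis's actual data or metric.

```python
# Illustrative sketch: lexical similarity between two related languages
# via normalized edit distance over a (hypothetical) parallel word list.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(w1: str, w2: str) -> float:
    """1 - normalized edit distance: 1.0 for identical words, 0.0 for disjoint."""
    if not w1 and not w2:
        return 1.0
    return 1.0 - levenshtein(w1, w2) / max(len(w1), len(w2))

# Hypothetical Hindi–Marathi word pairs (shared Devanagari script).
parallel_lexicon = [
    ("पानी", "पाणी"),     # "water": near-cognates, high similarity
    ("किताब", "पुस्तक"),   # "book": unrelated word choices, low similarity
    ("घर", "घर"),         # "house": identical, similarity 1.0
]

score = sum(similarity(a, b) for a, b in parallel_lexicon) / len(parallel_lexicon)
print(f"corpus-level similarity: {score:.2f}")
```

For language pairs written in different scripts (e.g. Hindi and Bengali), the words would first be transliterated to a common representation before applying the string metric, which is one reason script is listed among the factors affecting the similarity value.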