R Priyanka – Telugu Systems & Resources

Ravva Priyanka received her Master of Science in Computer Science and Engineering (CSE). Her research work was supervised by Dr. Manish Srivastava. Here’s a summary of her research work on systems and resources for Telugu question answering and summarization:

Natural language processing (NLP) bridges computers and humans by letting them interact in natural language. NLP has a wide variety of applications, such as machine translation, text summarization, question answering, and sentiment analysis. These applications have had a large impact on society through use cases such as chatbots, voice assistants (Alexa, Siri), and recommendation systems (YouTube, Hotstar). Building these NLP applications requires a large amount of processed text data and computational resources, so most of them are limited to a few high-resource languages like English. Notably, in India, where each state has its own language, only 10% of people communicate in English.

India is a multilingual country with more than 1500 languages, of which 22 are recognized as official languages. The majority of these languages are derived from two language families: Dravidian and Indo-Aryan. The south Indian languages originated from the Dravidian family and the north Indian languages from the Indo-Aryan family. Most Indian languages are considered low-resource languages in the NLP community, which means they have limited or no processed text data. This motivates us to create language-specific NLP resources and systems for Indian languages. In this thesis, we focus mainly on Telugu, a south Indian language with more than 80 million native speakers, so NLP systems for Telugu would benefit a large community. In this work, we created question-answering (QA) and text-summarization resources and systems for Telugu.

The main aim of a QA system is to provide an accurate and concise answer to a question asked by a human in natural language. We created a question classification dataset of 1037 samples and described the ambiguities and challenges involved in creating it. We built an end-to-end pipeline for the QA system and named it AVADHAN. We performed a comparative analysis of three different classifiers for the Telugu Question Classification (QC) module. QC helps reduce the search space when extracting the answer to a given query.
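The way question classification narrows the answer search space can be illustrated with a minimal Python sketch. The class labels, the `Candidate` type, and the `narrow_candidates` helper are hypothetical names chosen for illustration; they are not the classes or components of AVADHAN.

```python
from dataclasses import dataclass

# Hypothetical mapping from a predicted question class to the answer type
# it expects. These labels are illustrative, not the thesis label set.
EXPECTED_TYPE = {
    "PERSON": "person",
    "LOCATION": "place",
    "DATE": "date",
}


@dataclass
class Candidate:
    text: str
    entity_type: str  # assumed to come from some upstream tagger


def narrow_candidates(question_class, candidates):
    """Keep only candidates whose type matches the predicted question class."""
    expected = EXPECTED_TYPE.get(question_class)
    if expected is None:
        # Unknown class: do not filter, fall back to the full candidate list.
        return list(candidates)
    return [c for c in candidates if c.entity_type == expected]


candidates = [
    Candidate("Hyderabad", "place"),
    Candidate("1956", "date"),
    Candidate("Potti Sriramulu", "person"),
]

# A question classified as LOCATION only needs to consider place candidates.
narrowed = narrow_candidates("LOCATION", candidates)
print([c.text for c in narrowed])  # → ['Hyderabad']
```

In this sketch the answer extractor would score one candidate instead of three; with a real document collection the reduction would be correspondingly larger.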

Text summarization produces a short and precise summary of a given document of arbitrary length. We proposed a pipeline that crowd-sources summarization data and then aggressively filters the content with automatic and partial expert evaluation. With this pipeline we created TeSum, a high-quality, human-generated abstractive summarization corpus for Telugu. The corpus consists of 20329 high-quality article-summary pairs and is, to our knowledge, the first large, high-quality abstractive summarization corpus in Telugu. We also assessed the quality of existing summarization datasets and reported quality statistics for each.
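As a rough illustration of what an automatic filtering step for article-summary pairs might look like, the Python sketch below keeps a pair only if the summary is short relative to the article and contains some words not copied from it. Both checks, the thresholds, and the function names are assumptions made for this example; the actual TeSum filtering criteria are not described here.

```python
def novel_token_ratio(article, summary):
    """Fraction of summary tokens that do not appear in the article."""
    article_tokens = set(article.split())
    summary_tokens = summary.split()
    if not summary_tokens:
        return 0.0
    novel = sum(1 for token in summary_tokens if token not in article_tokens)
    return novel / len(summary_tokens)


def keep_pair(article, summary, max_length_ratio=0.5, min_novelty=0.2):
    """Return True if the pair passes both illustrative checks.

    max_length_ratio and min_novelty are hypothetical thresholds, and
    whitespace tokenization is a simplification.
    """
    article_len = len(article.split())
    summary_len = len(summary.split())
    if article_len == 0 or summary_len == 0:
        return False
    # Check 1: the summary must be much shorter than the article.
    if summary_len / article_len > max_length_ratio:
        return False
    # Check 2: the summary must not be a pure copy of article words.
    return novel_token_ratio(article, summary) >= min_novelty


article = "a b c d e f g h i j"
print(keep_pair(article, "a b x"))            # short and partly novel
print(keep_pair(article, "a b c"))            # short but fully copied
print(keep_pair(article, "a b c d e f x y"))  # too long for this article
```

Pairs rejected by automatic checks of this kind could then be dropped or routed to expert review, which is the role the partial expert evaluation plays in the pipeline described above.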

To perform automatic summarization on the TeSum corpus, we implemented a sequence-to-sequence Recurrent Neural Network (RNN) model with an attention mechanism and a pointer-generator model with a coverage mechanism. Further, we used an intra-attention mechanism with reinforcement learning (RL). We also explored a document-level encoder based on Bidirectional Encoder Representations from Transformers (BERT), which can be used for both extractive and abstractive summarization, and fine-tuned the multilingual text-to-text transfer transformer (mT5) on the TeSum corpus. For all these baseline models, we reported ROUGE scores for Telugu abstractive summarization on TeSum.
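For readers unfamiliar with ROUGE, the following Python sketch computes ROUGE-1 (unigram overlap) precision, recall, and F1 between a reference summary and a generated one. It assumes plain whitespace tokenization, which is a simplification; it is not the evaluation script used in the thesis, and Telugu text may call for language-specific tokenization.

```python
from collections import Counter


def rouge_1(reference, candidate):
    """ROUGE-1 precision, recall, and F1 using whitespace tokens."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Clipped overlap: each token counts at most as often as it
    # appears in both the reference and the candidate.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_1("the cat sat on the mat", "the cat on the mat")
print(scores)
```

In this example all five candidate tokens appear in the six-token reference, so precision is 1.0, recall is 5/6, and F1 is 10/11.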

  • We addressed the challenges involved in creating resources and systems for Telugu, a low-resource language.
  • The QA system ‘AVADHAN’ can be applied to other languages by swapping in a language-specific question classification module, and can be extended further into a multilingual question answering system.
  • We proposed guidelines for the creation and evaluation of summarization datasets, which are useful for producing high-quality abstractive summaries and can be applied to other languages.
  • The quality assessment process can be applied to scraped datasets to retain only high-quality pairs that exhibit the properties of abstractive summaries.
  • We provided baseline numbers for Telugu abstractive summarization with deep neural network methods commonly used for summarization, such as Pointer-Generator, MLE + Reinforcement Learning (RL), BERTSUM, and the multilingual text-to-text transfer transformer (mT5).
February 2023