[month] [year]

Mounika M – Text Classification for Telugu

Mounika Marreddy received her doctorate in Computer Science and Engineering (CSE). Her research was supervised by Dr. Radhika Mamidi. Here is a summary of her thesis, "Text Classification for Telugu: Datasets, Embeddings and Models for Downstream NLP Tasks":

Language understanding is crucial for the many text classification tasks that underpin Natural Language Processing (NLP) applications. Over the past decade, machine learning and deep learning algorithms have evolved alongside efficient feature representations to deliver better results, and NLP applications have become increasingly powerful, domain-specific, and language-specific. For resource-rich languages like English, NLP applications achieve the desired results thanks to the availability of large corpora, varied annotated datasets, efficient feature representations, and mature tools. Lacking large corpora and annotated datasets, many resource-poor Indian languages struggle to reap the benefits of deep feature representations. Moreover, adapting existing language models trained on large English corpora to Indian languages is often limited by data availability and by differences in morphology, syntax, and semantics. Most work on Indian languages has been done from a machine translation perspective. One option is to re-create datasets in low-resource languages by translating them from English, but for Indian languages like Telugu, translation can change the meaning and lose crucial information because of the structural differences, morphological complexity, and semantic differences between the two languages. The main objective of this thesis is therefore to mitigate the low-resource problem for Telugu.

To accelerate NLP research in Telugu, we present several contributions:

(1) A large Telugu raw corpus of 80,15,588 sentences (16,37,408 sentences from Telugu Wikipedia and 63,78,180 sentences crawled from different Telugu websites).
(2) TEL-NLP, an annotated Telugu dataset covering four NLP tasks: 16,234 samples each for sentiment analysis (SA), hate-speech detection (HS), and sarcasm detection (SAR), and 9,675 samples for emotion identification (EI).
(3) The first word and sentence embeddings for the Telugu corpus generated with graph-based models: DeepWalk-Te, Node2Vec-Te, and Graph AutoEncoders (GAE).
(4) A multi-task learning model (MT-Text GCN) that reconstructs word-sentence graphs on the TEL-NLP data while performing multi-task text classification with the learned graph embeddings.
(5) An extended version of the annotated dataset (35,142 sentences per task) for the SA, EI, HS, and SAR text classification tasks.
(6) Sentiment, emotion, and hate-speech lexicons for improving the efficiency of machine learning models.
(7) Distributed word and sentence embeddings pre-trained from scratch: Word2Vec-Te, GloVe-Te, FastText-Te, MetaEmbeddings-Te, and Skip-Thought-Te (see the sketch after this list).
(8) Contextual language models for Telugu, namely ELMo-Te, BERT-Te, RoBERTa-Te, ALBERT-Te, Electra-Te, and DistilBERT-Te, trained on the 80,15,588 Telugu sentences. We show that these representations significantly improve the performance of the four text classification tasks and present benchmark results for Telugu.
(9) A new annotated dataset of 112,657 Telugu clickbait and non-clickbait headlines, a key resource for building automated clickbait detection systems in Telugu.
(10) A benchmark system for detecting clickbait headlines written in Telugu, investigating a wide range of features from traditional to state-of-the-art representations.
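As a rough illustration of contribution (7), below is a minimal sketch of how Word2Vec-Te-style embeddings could be trained on a one-sentence-per-line Telugu corpus with gensim. The file name and hyperparameters are illustrative assumptions, not the thesis's actual configuration.

    # Minimal sketch: training Word2Vec-style Telugu embeddings with gensim.
    # Assumption: telugu_corpus.txt holds one whitespace-tokenized sentence
    # per line; all hyperparameters here are illustrative.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    sentences = LineSentence("telugu_corpus.txt")  # streams the file lazily

    model = Word2Vec(
        sentences=sentences,
        vector_size=300,  # embedding dimensionality
        window=5,         # context window size
        min_count=5,      # drop very rare tokens
        sg=1,             # skip-gram rather than CBOW
        workers=4,
    )

    model.save("word2vec_te.model")
    print(model.wv.most_similar("తెలుగు", topn=5))  # neighbours of "Telugu"

The same corpus iterator could be reused with gensim's FastText class (which takes near-identical arguments) to train FastText-Te, whose subword vectors are well suited to Telugu's rich morphology.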
We argue that our pre-trained embeddings are competitive with or better than the existing multilingual pre-trained models mBERT, XLM-R, and IndicBERT. Moreover, fine-tuning the pre-trained models yields higher performance than linear probing on five NLP tasks. We also experiment with our pre-trained models on other NLP tasks available in Telugu (Named Entity Recognition, Article Genre Classification, Sentiment Analysis, and Summarization) and find that our Telugu pre-trained language models (BERT-Te and RoBERTa-Te) outperform the state-of-the-art systems on all but the sentiment task.

We hope that the availability of these resources for different NLP tasks will accelerate Telugu NLP research, with the potential to impact more than 85 million speakers. This thesis aims to bridge the gap by creating resources for different NLP tasks in Telugu. These resources can be extended to other Indian languages that are culturally and linguistically close to Telugu, since translation between such languages need not lose information such as verb forms, cultural terms, and vibhaktis. This is the first work to apply neural methods to Telugu, a language that has lacked good tools such as named-entity recognizers, parsers, and embeddings, and the first attempt in this direction to provide strong Telugu models by exploring different methods with the available resources. It can also help the Telugu NLP community evaluate advances over more diverse tasks and applications. We open-source our corpus, five annotated datasets (SA, EI, HS, SAR, and clickbait), lexicons, pre-trained embeddings, and code here [1]. The pre-trained Transformer models for Telugu are available here [2].

  [1] https://github.com/mounikamarreddy/NLP-for-Telugu-Language.git
  [2] https://huggingface.co/ltrctelugu
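For readers who want to try the released models, the following is a minimal sketch of loading a Telugu Transformer from the Hugging Face organization linked above and switching between linear probing (frozen encoder) and full fine-tuning. The model identifier is a placeholder assumption; the actual repository names under ltrctelugu may differ.

    # Minimal sketch: linear probing vs. full fine-tuning of a pre-trained
    # Telugu Transformer. "ltrctelugu/bert-te" is a placeholder identifier;
    # see https://huggingface.co/ltrctelugu for the actual model names.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_id = "ltrctelugu/bert-te"  # placeholder, see note above
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

    LINEAR_PROBE = True  # True: train only the classifier head
    if LINEAR_PROBE:
        for param in model.base_model.parameters():  # freeze the encoder
            param.requires_grad = False

    inputs = tokenizer("ఈ సినిమా చాలా బాగుంది", return_tensors="pt")  # "this movie is very good"
    print(model(**inputs).logits)  # logits are meaningful only after the head is trained

This mirrors the comparison reported above: with the encoder frozen, only the classification head is trained (linear probing), while full fine-tuning updates every weight, which the abstract reports gives higher performance on the five tasks.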