December 2022
Suma Reddy Duggenpudi received her Master of Science in Computational Linguistics (CL). Her research work was supervised by Dr. Radhika Mamidi. Here's a summary of her research work, Towards Building a Dialogue System by enhancing Named Entity Recognition in Telugu using Deep Learning and Transformers:
The latest digital revolution has caused an explosion of data. With data growing across online platforms, it is crucial to process and distill it into valuable information. Extensive research is being carried out toward this goal, and Natural Language Understanding (NLU) plays a vital role in it. NLU is considered a core component of many applications like Dialogue Systems, Question Answering, Text Classification, Automatic Text Generation, etc. However, as of today, much of the research in this direction is limited to English. Our main aim is to extend it to resource-poor languages like Telugu. Hence, we propose a Dialogue System for Telugu that internally performs Question Classification and Named Entity Recognition (NER). NER is one of the essential parts of NLU: it helps us extract the essential phrases present in a sentence. Once we identify these phrases, we can classify them into entity types and then use this information for decision-making. Our attempt takes us one step closer to bridging the gap between user and computer interaction in Telugu by proposing a sustainable method for creating domain-specific dialogue systems with limited data. We also improve the availability of methods and tools that facilitate Natural Language Understanding and NLU research in Telugu by building state-of-the-art Named Entity Recognition models for Telugu and providing sufficient annotated datasets.
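To make the NER step concrete, here is a small, hypothetical illustration (not code or data from the thesis) of grouping IOB-style token tags into entity spans; the example sentence, tag set, and the `extract_entities` helper are all placeholders.

```python
# Hypothetical illustration of NER output in IOB format (not from the thesis).
# Each token is paired with a tag: B-/I- mark the beginning/inside of an
# entity span, O marks tokens outside any entity.
sentence = ["Dr.", "Rao", "works", "at", "Apollo", "Hospital", "in", "Hyderabad"]
tags     = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]

def extract_entities(tokens, labels):
    """Group IOB-tagged tokens into (entity_text, entity_type) pairs."""
    entities, current, etype = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [tok], lab[2:]
        elif lab.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities

print(extract_entities(sentence, tags))
# [('Dr. Rao', 'PER'), ('Apollo Hospital', 'ORG'), ('Hyderabad', 'LOC')]
```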
As discussed above, we first propose a dialogue system for the Hospital domain in Telugu. The main aim of a dialogue system is to communicate with the user in spoken or written form, and such systems have been gaining popularity over the past decade because users find them useful and helpful. Our dialogue system for the Hospital domain handles various hospital- and doctor-related queries. More broadly, the idea is to present an approach for modeling a dialogue system in a resource-poor language by combining linguistic and domain knowledge. Focusing on the question-answering aspect of the dialogue system, we identified Question Classification and Query Processing as its most vital parts. Our method combines deep learning techniques for question classification and rule-based computational analysis for query processing. Human evaluation of the system was performed, as there is no automated evaluation tool for dialogue systems in Telugu. The results show that our system achieves a high overall rating and captures context with high accuracy.
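As a rough sketch of how such a pipeline could be wired together, the snippet below pairs a question classifier with rule-based query processing over a toy knowledge base. This is a hypothetical outline, not the thesis implementation: the question classes, the `KNOWLEDGE_BASE` structure, and the keyword-based `classify_question` stub (standing in for a trained neural classifier) are all assumptions.

```python
# Hypothetical sketch of a question-classification + rule-based query-processing
# pipeline for a hospital-domain dialogue system. All names and data are
# illustrative placeholders, not the thesis's actual components.

# Toy domain knowledge base: doctor -> specialty and available timings.
KNOWLEDGE_BASE = {
    "Dr. Rao":  {"specialty": "cardiology", "timings": "10:00-13:00"},
    "Dr. Devi": {"specialty": "pediatrics", "timings": "14:00-17:00"},
}

def classify_question(question: str) -> str:
    """Stand-in for a deep-learning question classifier; in practice a neural
    model would predict the question class."""
    q = question.lower()
    if "timing" in q or "when" in q:
        return "ASK_TIMINGS"
    if "specialist" in q or "doctor for" in q:
        return "ASK_DOCTOR_BY_SPECIALTY"
    return "UNKNOWN"

def process_query(question: str, q_class: str) -> str:
    """Rule-based query processing: map the predicted class plus recognized
    entities onto a lookup in the domain knowledge base."""
    q = question.lower()
    if q_class == "ASK_TIMINGS":
        for doctor, info in KNOWLEDGE_BASE.items():
            if doctor.lower() in q:
                return f"{doctor} is available {info['timings']}."
    if q_class == "ASK_DOCTOR_BY_SPECIALTY":
        for doctor, info in KNOWLEDGE_BASE.items():
            if info["specialty"] in q:
                return f"You can consult {doctor} ({info['specialty']})."
    return "Sorry, I could not understand the question."

question = "When are Dr. Rao's timings?"
print(process_query(question, classify_question(question)))
# -> Dr. Rao is available 10:00-13:00.
```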
On the other hand, Named Entity Recognition (NER), which facilitates applications like dialogue systems, is the task of identifying entities in a sentence. It is a well-researched problem in English thanks to the availability of resources, and transformer models, specifically masked-language models (MLMs), have shown remarkable performance on NER in recent times. However, with data and applications growing in all languages, there is a need for NER in other languages too, yet it remains under-explored in Indian languages due to the lack of resources and tools. Our contributions include (i) three annotated NER datasets for Telugu in multiple domains: a Newswire Dataset (ND), a Medical Dataset (MD), and a Combined Dataset (CD); (ii) a comparison of fine-tuned Telugu pretrained transformer models (BERT-Te, RoBERTa-Te, and ELECTRA-Te) with baseline models (CRF, LSTM-CRF, and BiLSTM-CRF); and (iii) a further investigation of the performance of the Telugu pretrained transformer models against the multilingual models mBERT [20], XLM-R [15], and IndicBERT [45]. We find that the pretrained Telugu language models (BERT-Te and RoBERTa-Te) outperform the existing pretrained multilingual and baseline models on NER. On a large dataset (CD) of 38,363 sentences, BERT-Te achieves an F1-score of 0.80 (entity-level) and 0.75 (token-level). Furthermore, these pretrained Telugu models show state-of-the-art performance on various existing Telugu NER datasets. We open-source our datasets, pretrained models, and code.
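For readers curious about what fine-tuning a pretrained transformer for NER typically involves, the sketch below uses the Hugging Face transformers token-classification API. It is an outline under stated assumptions, not the thesis code or its released checkpoints: the model identifier (a multilingual placeholder rather than BERT-Te), the label set, and the two-sentence toy dataset are all illustrative stand-ins.

```python
# Minimal sketch of fine-tuning a transformer for Telugu NER with Hugging Face
# transformers. The model name, labels, and data are placeholders (assumptions),
# not the thesis's released artifacts.
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer)

MODEL_NAME = "bert-base-multilingual-cased"  # placeholder; a Telugu checkpoint would go here
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
label2id = {l: i for i, l in enumerate(LABELS)}

# Toy IOB-annotated example (placeholder, not one of the thesis datasets).
sentences = [["రాము", "హైదరాబాద్", "లో", "ఉన్నాడు"]]
tag_seqs  = [["B-PER", "B-LOC", "O", "O"]]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

class NERDataset(Dataset):
    """Tokenizes pre-split words and aligns word-level tags to subword tokens."""
    def __init__(self, words, tags):
        self.enc = tokenizer(words, is_split_into_words=True,
                             truncation=True, padding=True)
        self.labels = []
        for i, tag in enumerate(tags):
            word_ids = self.enc.word_ids(batch_index=i)
            # Special/padding tokens get -100 so they are ignored by the loss.
            self.labels.append([-100 if w is None else label2id[tag[w]]
                                for w in word_ids])

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

train_ds = NERDataset(sentences, tag_seqs)

model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)), label2id=label2id)

args = TrainingArguments(output_dir="telugu-ner", learning_rate=2e-5,
                         per_device_train_batch_size=16, num_train_epochs=3)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```

The only NER-specific step here is label alignment: because the tokenizer splits words into subwords, each word-level tag is copied onto its subword pieces, while special and padding tokens receive -100 so the loss ignores them.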