[month] [year]

Koushik Reddy Sane – Multilingual Societies

Koushik Reddy Sane received his MS Dual Degree in Computational Linguistics (CL). His research was supervised by Dr. Radhika Mamidi. Here’s a summary of his research on Applications and Resources for Understanding People in Multilingual Societies:

Research in Natural Language Processing (NLP) is expanding to more languages and, over time, reaching into every domain of life. The variety of text that can be processed grows with every advancement. Online social media platforms have become a rich source of textual data for several NLP tasks. One such platform is Twitter, where tweets have become a vital resource for understanding people. Analyzing them yields relevant information such as sentiments and emotions. On the other hand, the massive growth in communication on these sites by users across the world has its ill effects. Many individuals take undue advantage of these platforms to post aggressive and hateful content about other individuals and groups. The volume of such content has grown beyond what any research group or organization can filter manually. Hence, the availability of automatic identification systems and annotated datasets in this domain is of utmost importance.
For many years, a phenomenon called “code-mixing” has attracted considerable research interest from sociolinguists. According to Wikipedia, Telugu is the fifteenth most-spoken language in the world. On social media platforms, Telugu is mostly written in Roman script, and code-mixed Telugu-English content produced by people across the globe is growing online.
This thesis focuses on detecting aggression and hate speech in code-mixed Telugu-English online posts, which are two different text classification problems. Currently, there are no publicly available datasets in Telugu-English for these tasks. To create data resources for them, this work presents two corpora of 3677 and 3361 tweets, annotated for aggression and hate speech, respectively. All annotations were done manually by two annotators, which makes the datasets a valuable resource for the research community.
In this thesis, we describe the creation and annotation process, the data, and the possible uses of the datasets. We have also built a baseline classification system to identify hate speech and aggression using the created corpora. The annotations have been used for training and testing the classification models. This work gave us an opportunity to work on code-mixed Telugu-English data, which is a relatively unexplored field.