November 2022
Sumukh S, working with Dr. Manish Shrivastava, presented his research work "Kanglish alli names! Named Entity Recognition for Kannada-English Code-Mixed Social Media Data" at the 8th Workshop on Noisy User-generated Text (W-NUT), co-located with the 29th International Conference on Computational Linguistics (COLING-2022), held in Gyeongju, Republic of Korea, from 12 – 17 October. Sumukh presented his work virtually, both as an oral presentation and in a poster session for Q&A. This work opens up downstream applications such as information retrieval, question answering, and machine translation for Kannada-English code-mixed data.
Research work as explained by the authors:
Code-mixing (CM) is a frequently observed phenomenon on social media platforms in multilingual societies such as India. While the increase in code-mixed content on these platforms provides a good amount of data for studying various aspects of code-mixing, the lack of automated text analysis tools makes such studies difficult. To overcome this, tools such as language identifiers and part-of-speech (POS) taggers for analysing code-mixed data have been developed. One such tool is Named Entity Recognition (NER), an important Natural Language Processing (NLP) task, which is not only a subtask of Information Extraction but is also needed for downstream NLP tasks such as semantic role labeling. While entity extraction from social media data is generally difficult due to its informal nature, code-mixed data further complicates the problem because it is informal, unstructured and often incomplete. In this work, we present the first ever corpus for Kannada-English code-mixed social media data with the corresponding named entity tags for NER. We provide strong baselines with machine learning classification models such as CRF, Bi-LSTM, and Bi-LSTM-CRF on our corpus with word, character, and lexical features.
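To make the idea of word, character, and lexical features concrete, here is a minimal sketch of how per-token features for a sequence tagger might be built. It is only an illustration: the function names, the specific features, and the example sentence are assumptions made for this post and are not taken from the paper or its corpus.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build a feature dictionary for tokens[i].

    The feature set is hypothetical; it only illustrates the three
    feature families (word, character, lexical) named in the abstract.
    """
    word = tokens[i]
    return {
        # Word-level features: the token itself and its neighbours.
        "word.lower": word.lower(),
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        # Character-level features: short prefixes and suffixes.
        "prefix2": word[:2].lower(),
        "suffix2": word[-2:].lower(),
        "suffix3": word[-3:].lower(),
        # Lexical (orthographic) features common in social media text.
        "is_capitalised": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "is_mention": word.startswith("@"),
        "is_hashtag": word.startswith("#"),
    }


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Feature dictionaries for every token in one sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Invented code-mixed example, not drawn from the corpus.
tokens = ["Nanu", "Bengaluru", "alli", "iddini", "@friend"]
features = sentence_features(tokens)
```

Feature dictionaries of this kind are the sort of input a CRF-style tagger typically consumes, with one list of dictionaries per sentence paired with one list of entity tags.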
Full Paper: https://aclanthology.org/2022.wnut-1.17/
The W-NUT workshop focuses on Natural Language Processing applied to noisy user-generated text, such as that found in social media, online reviews, crowdsourced data, web forums, clinical records and language learner essays. The workshop accepted 25 papers (short and long). See the program schedule for further details.
Workshop page: http://noisy-text.github.io/2022/
COLING, the International Conference on Computational Linguistics, is one of the premier conferences for natural language processing and computational linguistics.
First established in 1965, the biennial COLING conference is held in diverse parts of the globe and attracts participants from both top-ranked research centers and emerging countries. Today, the most important developments in this area are taking place not only in universities and academic research institutes but also in industrial research departments, including tech startups. COLING provides opportunities for all these communities to showcase their exciting discoveries.
Conference page: https://coling2022.org/