ACL 2022

Faculty and students published the following papers at the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022), held in Dublin, Ireland, from 22 to 27 May.

SyMCoM – Syntactic Measure of Code Mixing: A Study of English-Hindi Code-Mixing – Prashant Kodali; Anmol Goel; Monojit Choudhury, Microsoft Research, India; Prof. Manish Shrivastava; Prof. Ponnurangam Kumaraguru

Research work as explained by the authors: Code mixing is the linguistic phenomenon where bilingual speakers switch between two or more languages in conversation. Recent work on code mixing in computational settings has leveraged code-mixed social media text to train NLP models. To capture the variety of code mixing within and across corpora, Language ID (LID) tag-based measures (CMI) have been proposed. The syntactic variety and patterns of code mixing, and their relationship to the performance of computational models, remain underexplored. In this work, we investigate a collection of English(en)-Hindi(hi) code-mixed datasets through a syntactic lens and propose SyMCoM, an indicator of syntactic variety in code-mixed text with intuitive theoretical bounds. We train a state-of-the-art (SoTA) en-hi PoS tagger, with an accuracy of 93.4%, to reliably compute PoS tags on a corpus. We then demonstrate the utility of SyMCoM by applying it to various syntactic categories across a collection of datasets, and compare the datasets using the measure.
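
To make the idea of a bounded, syntax-aware measure concrete, here is a minimal Python sketch. It assumes a normalized-difference form: for one PoS category, the count of tokens in one language minus the count in the other, divided by their sum. This form, the function name `category_language_skew`, and the toy sentence with its tags are illustrative assumptions and are not taken from the abstract; the exact definition of SyMCoM is given in the paper.

```python
from collections import Counter


def category_language_skew(tagged_tokens, category, lang_a="en", lang_b="hi"):
    """Normalized difference of per-language counts for one PoS category.

    tagged_tokens: iterable of (token, language_id, pos_tag) triples.
    Returns a value in [-1, 1], or None if the category never occurs
    in either language. (Assumed illustrative form, not the paper's
    definition.)
    """
    counts = Counter(lang for _, lang, pos in tagged_tokens if pos == category)
    a, b = counts[lang_a], counts[lang_b]
    if a + b == 0:
        return None
    return (a - b) / (a + b)


# Hypothetical code-mixed sentence with hand-assigned LID and PoS tags.
sentence = [
    ("mujhe", "hi", "PRON"),
    ("yeh", "hi", "DET"),
    ("movie", "en", "NOUN"),
    ("bahut", "hi", "ADV"),
    ("pasand", "hi", "NOUN"),
    ("hai", "hi", "VERB"),
]

noun_skew = category_language_skew(sentence, "NOUN")  # one en noun, one hi noun
verb_skew = category_language_skew(sentence, "VERB")  # only a hi verb
adj_skew = category_language_skew(sentence, "ADJ")    # category absent
```

Under this assumed form, a value of +1 or -1 means a category is drawn entirely from one language, and 0 means both languages contribute equally, which is one way a measure can have intuitive bounds.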

Full paper: https://cdn.iiit.ac.in/cdn/precog.iiit.ac.in/pubs/SyMCoM_ACL_2022.pdf

HLDC: Hindi Legal Documents Corpus – Arnav Kapoor, Anmol Goel, T H Arjun, Amul Agrawal, Prof. Ponnurangam Kumaraguru. The other authors of this paper are Mudit Dhawan, IIIT Delhi; Akshala Bhatnagar, IIIT Delhi; Vibhu Agrawal, IIIT Delhi; Arnab Bhattacharya, IIT Kanpur and Ashutosh Modi, IIT Kanpur

Research work as explained by the authors: Many populous countries, including India, are burdened with a considerable backlog of legal cases. Developing automated systems that can process legal documents and augment legal practitioners could mitigate this backlog. However, there is a dearth of the high-quality corpora needed to develop such data-driven systems. The problem is even more pronounced for low-resource languages such as Hindi. In this resource paper, we introduce the Hindi Legal Documents Corpus (HLDC), a corpus of more than 900K legal documents in Hindi. The documents are cleaned and structured to enable the development of downstream applications. Further, as a use case for the corpus, we introduce the task of Bail Prediction. We experiment with a battery of models and propose a multi-task learning (MTL) based model for this task. MTL models use summarization as an auxiliary task along with bail prediction as the main task. Results on different models indicate the need for further research in this area.

Full Paper: https://precog.iiit.ac.in/pubs/HLDC_ACL_2022.pdf

TeluguNER: Leveraging Multi-Domain Named Entity Recognition with Deep Transformers (under the category: Student Research Workshop) – Dr. Radhika Mamidi, Suma Reddy, Mounika Marreddy and Subba Reddy Oota, Research Assistant.

Research work as explained by the authors: Named Entity Recognition (NER) is a successful and well-researched problem in English due to the availability of resources. Transformer models, specifically masked-language models (MLM), have shown remarkable performance in NER in recent times. With growing data on different online platforms, there is a need for NER in other languages too. NER remains underexplored in Indian languages due to the lack of resources and tools.

Our contributions in this paper include (i) three annotated NER datasets for the Telugu language in multiple domains: Generic Dataset (GD), Medical Dataset (MD), and Combined Dataset (CD); (ii) a comparison of the fine-tuned Telugu pretrained transformer models (BERT-Te, RoBERTa-Te, and ELECTRA-Te) with other baseline models (CRF, LSTM-CRF, and BiLSTM-CRF); and (iii) a further investigation of the performance of Telugu pretrained transformer models against the multilingual models mBERT (Devlin et al., 2018), XLM-R (Conneau et al., 2020), and IndicBERT (Kakwani et al., 2020).

We find that pretrained Telugu language models (BERT-Te and RoBERTa) outperform the existing pretrained multilingual and baseline models in NER. On a large dataset (CD) of 38,363 sentences, BERT-Te achieves a high F1-score of 0.80 (entity-level) and 0.75 (token-level). Further, these pretrained Telugu models have shown state-of-the-art performance on various existing Telugu NER datasets.
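
The distinction between entity-level and token-level F1 can be illustrated with a short Python sketch. It assumes BIO-style tags, exact span-and-type matching for the entity-level score, and exact tag matching on non-"O" tokens for the token-level score. These conventions, the helper names, and the toy tag sequences are assumptions for illustration; the scoring setup actually used in the paper is not described here and may differ.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def f1(true_pos, n_pred, n_gold):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if true_pos == 0:
        return 0.0
    precision, recall = true_pos / n_pred, true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


def entity_f1(gold, pred):
    """An entity counts as correct only if its span and type both match."""
    g, p = extract_spans(gold), extract_spans(pred)
    return f1(len(g & p), len(p), len(g))


def token_f1(gold, pred):
    """Each non-'O' token is scored on its own tag."""
    true_pos = sum(1 for g, p in zip(gold, pred) if g != "O" and g == p)
    n_pred = sum(1 for p in pred if p != "O")
    n_gold = sum(1 for g in gold if g != "O")
    return f1(true_pos, n_pred, n_gold)


# Hypothetical example: the prediction truncates the LOC entity by one token.
gold = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]

entity_score = entity_f1(gold, pred)  # only the PER span matches exactly
token_score = token_f1(gold, pred)    # four of five gold entity tokens match
```

In this toy case the truncated entity counts as a full miss at the entity level but only a one-token miss at the token level, which is why the two scores are reported separately.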

We open-source our dataset, pretrained models, and code: https://github.com/mors-ner/anonymous_telner

The Association for Computational Linguistics (ACL) is the premier international scientific and professional society for people working on computational problems involving human language, a field often referred to as either computational linguistics or natural language processing (NLP). The association was founded in 1962, originally named the Association for Machine Translation and Computational Linguistics (AMTCL), and became the ACL in 1968. Activities of the ACL include the holding of an annual meeting each summer and the sponsoring of the journal Computational Linguistics, published by MIT Press; this conference and journal are the leading publications of the field. For more information, see: https://www.aclweb.org/.

Computational linguistics is the scientific study of language from a computational perspective. Computational linguists are interested in providing computational models of various kinds of linguistic phenomena. These models may be “knowledge-based” (“hand-crafted”) or “data-driven” (“statistical” or “empirical”). Work in computational linguistics is in some cases motivated from a scientific perspective, in that one is trying to provide a computational explanation for a particular linguistic or psycholinguistic phenomenon; in other cases the motivation may be more purely technological, in that one wants to provide a working component of a speech or natural language system. Indeed, the work of computational linguists is incorporated into many working systems today, including speech recognition systems, text-to-speech synthesizers, automated voice response systems, web search engines, text editors, and language instruction materials, to name just a few.

Website: https://www.2022.aclweb.org/