IIITH Researchers Develop first-of-its-kind Hinglish Code-Mixed Data NLP tool

Researchers from the Language Technologies Research Centre (LTRC), IIITH have made a first-of-its-kind attempt at semantic role labelling of Hindi-English code-mixed tweets. The work was presented at the Linguistic Annotation Workshop during ACL 2019 in Italy.

Chalo jaldi karo, or we’ll miss the beginning of the movie. There is nothing out of the ordinary in this sentence. It is simply a mix of English and Hindi, or Hinglish as it is more popularly known. This form of code-mixing, or “the embedding of linguistic units such as phrases, words and morphemes of one language into an utterance of another language”, happens all the time in multilingual countries such as ours. And if such mixing is rampant in speech, can its usage in social media be far behind?

Speaking of the research in this area under way at the Language Technologies Research Centre (LTRC) at IIITH, Prof. Dipti Misra Sharma says, “We speak in code-mixed language and if we want to develop computational models for processing natural language, we also need to handle it. The modern world of social media has lots of that (code-mixed language). So if we want to do any parsing or analysis there, the first thing we have to do is handle the code-mixed data. That was our initial interest and motivation and that’s why our lab started working on it sometime back.”

Understanding Hinglish

Prof. Dipti Misra and her student Riya Pal have been conducting research on semantic role labelling, particularly of Hinglish tweets. Explaining her work, Riya begins by stating that code-mixed data in particular was fascinating for her. “When I’m talking in general, I’m not really following a single language or following formal grammar,” she says. Riya, who is a final-year dual degree student, says that semantic role labelling is used wherever text must be understood accurately before information is extracted from it. “Take the example of a chatbot. It needs to understand data first before extracting information from it. Let’s say there is a sentence like, I go to school. For a machine to understand this, it extracts the action “go”, “who” is going and “where”. Thus, there are labels in that sentence.”

Riya went on to create a dataset for Hinglish data in which she labelled about 1500 tweets manually. The next step will be to create an automated tool that uses machine learning techniques to do the same labelling, and to improve the accuracy of that model. “It’s the first attempt so far at creating an NLP tool for Hinglish data,” she says. Her dataset included colloquialisms such as short-hand spellings of words and even typos. For example: Lalu Yadav claimed that Yadav quota ke hisab se Umesh Yadav ko ye wkt mil jana chahiye tha. Here “wkt” could stand for “wicket”, a reading suggested by the context of Umesh Yadav being an Indian cricketer. Or it could stand for “waqt”, meaning “time” in Hindi, since this is an English-Hindi code-mixed tweet. Overcoming such ambiguities was an obvious challenge.
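The “I go to school” example can be sketched in a few lines of Python. This is only an illustration of what a semantic role annotation might look like, not the format of the LTRC dataset: the class name `RoleSpan`, the helper `roles_as_dict` and the informal labels “who”, “action” and “where” are assumptions borrowed from Riya’s wording, and the labels are written by hand rather than predicted by a model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RoleSpan:
    """One labelled piece of a sentence (hypothetical structure)."""
    role: str  # informal label taken from the article: "who", "action", "where"
    text: str  # the words of the sentence that carry that role


def roles_as_dict(spans: List[RoleSpan]) -> Dict[str, str]:
    """Collect the labelled spans into a role -> text lookup."""
    return {span.role: span.text for span in spans}


# Hand-written annotation for "I go to school"; no model is involved here.
annotation = [
    RoleSpan(role="who", text="I"),
    RoleSpan(role="action", text="go"),
    RoleSpan(role="where", text="to school"),
]

labels = roles_as_dict(annotation)
print(labels["action"], "|", labels["who"], "|", labels["where"])
```

A chatbot could then look up `labels["where"]` instead of re-reading the raw sentence. A token such as “wkt” is harder because the annotator first has to decide which word it stands for before any role can be assigned.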

Linguistic Annotation Workshop

This research, titled “A Dataset for Semantic Role Labelling for Hindi-English Code-Mixed Tweets”, was presented at the 13th Linguistic Annotation Workshop, held alongside the prestigious annual conference of the Association for Computational Linguistics (ACL) in Florence, Italy this month. “I encourage and send students to these conferences because for them interaction is very important. They get to meet people who are actively working in the field and get direct feedback from them on their work. That is a very important experience for anyone as a researcher,” says Prof. Misra. Echoing her mentor’s views, Riya says that as the first international conference she attended, it definitely gave her a boost of confidence. “I interacted with people across the world who actually recognised previous work done by my seniors. It obviously felt good to get that sort of validation. I got a lot of feedback as well which will get incorporated into our future work,” she affirms.

Future: Code-Mixing Three Languages

Code-mixing as a field of Theoretical Linguistics has existed since the ’60s. “It exists in societies that are bilingual or multilingual. For computational modelling, it was not studied say, 10 years ago or so, but now people are getting interested in it,” remarks Prof. Misra. Students at the LTRC are currently working on English-Telugu and English-Hindi, with a recent addition of English-Bengali code-mixing. “We have three language pairs handled over here. If there’s a student who is interested in and knows another Indian language, then yes, definitely we’ll encourage it. So far when we’ve talked about code-mixing, we’ve only talked about mixing two languages. But in code-mixing, even three languages are mixed. Here when I hear people speaking on the streets, there’s Hindi, Telugu and English. And they mix it very naturally and commonly. So what we would like to do as the next step is maybe go in that direction – to be able to handle more than two languages,” signs off Prof. Misra.



Sarita Chebbi is a minimalist runner, practising yogi and baker of all things whole-wheat, and sugar-free. Currently re-learning her ABC’s…the one that goes: A for algorithm, B for Bayesian, C for convolutional (neural network)….
