[month] [year]

Akshay Goindani – Low-resource languages

Akshay Goindani received his MS Dual Degree in Computer Science and Engineering (CSE). His research work was supervised by Dr. Manish Shrivastava. Here’s a summary of his research on neural machine translation for low-resource languages:

Machine translation is the task of generating a sentence in a target language T from an input sentence in a source language S, such that the generated and input sentences have the same meaning. Various methods have been proposed in the literature to solve this task. Deep neural network based translation models (Neural Machine Translation, NMT) have been shown to achieve state-of-the-art performance for multiple language pairs. However, neural methods struggle to perform well for low-resource languages, which are understudied, suffer from data scarcity, and lack efficient language processing tools. Since neural methods require a large amount of training data to be effective, simpler statistical methods often outperform them in low-resource settings.
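For background, the data hunger of neural methods follows from how NMT models are trained: a standard textbook formulation (not specific to this thesis) factorizes the target sentence token by token and maximizes the log-likelihood over a parallel corpus D, as sketched below.

```latex
% Standard NMT training objective (textbook formulation, not thesis-specific).
\[
  p_\theta(y \mid x) \;=\; \prod_{t=1}^{|y|} p_\theta\!\left(y_t \mid y_{<t},\, x\right),
  \qquad
  \theta^{\star} \;=\; \arg\max_{\theta} \sum_{(x,\,y) \in D} \log p_\theta(y \mid x)
\]
```

With few parallel pairs in D, estimating the many parameters of such a model reliably becomes difficult, which is where low-resource languages suffer most.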

Transformer-based NMT models have achieved state-of-the-art performance for various languages. The multi-head attention mechanism, which runs several attention heads in parallel, is a key contributor to the Transformer's strong performance across applications (e.g., NMT, text classification). However, Transformer models do not perform well in low-resource conditions. In this thesis, we propose a novel Dynamic Head Importance Computation Mechanism (DHICM), which enhances the performance of the Transformer model, especially in low-resource conditions. In multi-head attention, different heads attend to different parts of the input. The limitation is that multiple heads might attend to the same part of the input, making some heads redundant and leaving model capacity under-utilized. One approach to avoid this is to prune the least important heads based on an importance score. We instead focus on computing the importance of each head dynamically with respect to the input. Our insight is to add an additional attention layer on top of multi-head attention and use the outputs of the multi-head attention, together with the input, to compute an importance score for each head. We also add an extra loss term that prevents the model from assigning the same score to all heads, which helps identify the more important heads and improves performance. We analyzed the performance of DHICM for NMT on different languages. Experiments on different datasets show that DHICM outperforms the traditional Transformer-based approach by a large margin, especially when less training data is available.
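To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of how such a head-importance layer could be wired onto multi-head attention. The module name, the mean-pooled query, and the entropy penalty are illustrative assumptions, not the exact DHICM formulation from the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicHeadImportance(nn.Module):
    """Hypothetical sketch: score each attention head from the layer input and
    the per-head outputs, reweight the heads by those scores, and return an
    auxiliary loss that discourages giving every head the same score."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        # A query from the layer input and a key from each head's output;
        # their dot product gives one score per head.
        self.query_proj = nn.Linear(d_model, self.head_dim)
        self.key_proj = nn.Linear(self.head_dim, self.head_dim)

    def forward(self, layer_input, head_outputs):
        # layer_input:  (batch, seq_len, d_model)
        # head_outputs: (batch, num_heads, seq_len, head_dim)
        q = self.query_proj(layer_input).mean(dim=1)           # (batch, head_dim)
        k = self.key_proj(head_outputs.mean(dim=2))            # (batch, num_heads, head_dim)
        scores = torch.einsum("bd,bhd->bh", q, k) / self.head_dim ** 0.5
        importance = F.softmax(scores, dim=-1)                 # (batch, num_heads)

        # Reweight each head before the usual concatenation / output projection.
        weighted = head_outputs * importance[:, :, None, None]

        # Auxiliary loss: entropy of the importance distribution. Adding it to
        # the training loss pushes the scores away from uniform, so the model
        # must separate important heads from redundant ones.
        entropy = -(importance * importance.clamp_min(1e-9).log()).sum(dim=-1).mean()
        return weighted, importance, entropy
```

In such a setup, the reweighted heads would feed into the standard output projection, and the entropy term would be added to the translation loss with a small weight; the thesis's actual loss and scoring functions may differ.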
Code-mixing is the phenomenon of mixing two or more languages, prevalent in multilingual communities. In recent years, with the rise of social media platforms, there has been a surge in the use of code-mixed text.

Code-mixed texts are informal in nature and do not necessarily follow pre-defined syntactic structures. Due to this informality, traditional NLP systems built for monolingual languages do not perform well on code-mixed inputs. Moreover, good-quality code-mixed data is scarce, so it is challenging to train effective NLP systems for code-mixed languages. In this thesis, we focus on two major tasks – translation and generation of code-mixed languages. Specifically, we focus on the translation and generation of Hindi-English code-mixed (Hinglish) sentences. Translating code-mixed sentences into monolingual sentences would help bridge the communication gap between different communities, while generating natural-looking code-mixed data would help build more effective NLP systems for tasks such as sentiment classification and question answering.

Machine translation systems capable of handling code-mixed inputs have been proposed mainly for creating synthetic code-mixed data. High-performing code-mixed MT systems have utility in downstream tasks – conversational agents, leveraging monolingual NLU pipelines, among others. In text and in speech, code-mixed utterances are often intermingled with monolingual sentences. Previous approaches have proposed models that are fine-tuned for code-mixed settings, and their performance on monolingual utterances is not reported or analysed. The direction of translation, i.e., code-mixed sentence to its monolingual counterpart or vice versa, also impacts the utility of the models. In this thesis, we propose two models: a) a many-to-one system, where the source can be any of English, Hindi, or Hinglish and the target is English; and b) a bidirectional system, where both source and target can be any of English, Hindi, or Hinglish. We demonstrate that monolingual parallel corpora are highly useful in the many-to-one and bidirectional settings, and to this end we analyse the zero-shot and few-shot capabilities of models trained using monolingual corpora. We show that the model trained in the many-to-one setting performs well for both code-mixed and monolingual inputs, and that the proposed bidirectional model generates more natural-looking code-mixed sentences than existing code-mixed datasets. The quality and construction of code-mixed MT datasets are of crucial importance due to the inherent variety in code-mixing, and they also lend credibility to the reported results. In this thesis, we characterize existing English-Hindi code-mixed MT datasets, and our qualitative error analysis profiles mistranslated samples, providing insights for constructing high-quality code-mixed MT datasets.
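A common way to realize many-to-one and bidirectional setups with a single model is to prepend a target-language tag to every source sentence. The sketch below illustrates such preprocessing under that assumption; the tag tokens and example sentences are invented here and are not taken from the thesis.

```python
# Hypothetical preprocessing for a single multilingual MT model covering
# English, Hindi, and Hinglish. Tags and examples are illustrative only.

TARGET_TAGS = {"en": "<2en>", "hi": "<2hi>", "hi-en": "<2hien>"}

def make_example(src_text: str, tgt_text: str, tgt_lang: str) -> tuple[str, str]:
    """Prepend a target-language tag so one model can translate in several
    directions (many-to-one or fully bidirectional)."""
    return f"{TARGET_TAGS[tgt_lang]} {src_text}", tgt_text

# Many-to-one: any of English / Hindi / Hinglish on the source side,
# English on the target side.
many_to_one = [
    make_example("mujhe yeh movie bahut pasand aayi", "I liked this movie a lot", "en"),
    make_example("मुझे यह फिल्म बहुत पसंद आई", "I liked this movie a lot", "en"),
]

# Bidirectional: the same model can also be asked to produce Hinglish,
# which is one way synthetic code-mixed data could be generated.
bidirectional = many_to_one + [
    make_example("I liked this movie a lot", "mujhe yeh movie bahut pasand aayi", "hi-en"),
]

for src, tgt in bidirectional:
    print(src, "=>", tgt)
```

Under this kind of setup, monolingual parallel corpora (English–Hindi) can be mixed freely with code-mixed pairs, which is what makes the zero-shot and few-shot analyses described above possible.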
