Akshat Chhajer, supervised by Dipti Mishra Sharma, received his Master of Science – Dual Degree in Computational Linguistics (LCD). Here’s a summary of his research work on DomainTrans – Machine Translation for Domain Specific Data in Indian Languages:
Machine Translation (MT), the automated process of translating text or speech between languages using computational linguistic algorithms, has drastically transformed global communication. By breaking down language barriers, MT has enabled seamless collaboration across different social, cultural, and geographical backgrounds. Although several decades of research have produced state-of-the-art systems like Google Translate, these systems often falter when translating domain-specific content, such as specialized academic texts or technical manuals. Trained primarily on large, general-purpose datasets, they struggle to accurately capture domain-specific vocabulary and linguistic nuances. For example, terms like "chromosome" in a microbiology textbook are often poorly translated by these systems.

In this thesis, we propose DomainTrans, an MT model designed to improve translation accuracy for domain-specific texts while maintaining performance on general content. The core idea is to draw a bijection between multi-domain translation and multilingual translation: by treating different domain datasets as analogous to distinct languages, we can leverage multilingual translation techniques to build a more versatile and adaptable MT model.

The thesis also studies several related problems. The first is domain classification, identifying the domain to which a given text belongs as a precursor to translation, with particular emphasis on fine-grained domain classification. Next, we examine the effect of data availability when fine-tuning models for machine translation. Finally, we explore knowledge distillation, the process of condensing the knowledge learned by a large model, or a set of large models, into a much smaller, easier-to-deploy model while maintaining comparable accuracy.
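The domains-as-languages analogy can be made concrete with a small sketch: multilingual NMT systems commonly prepend a language token to each source sentence, and the same mechanism can mark domains instead. The tag format, domain names, and helper below are illustrative assumptions for exposition, not the thesis's actual preprocessing code.

```python
# Minimal sketch of the "domains as languages" idea:
# just as multilingual NMT prepends a target-language token (e.g. "<2hi>"),
# a multi-domain model can prepend a domain token to each source sentence,
# letting one model route between domains the way it routes between languages.
# The domain names and tag strings here are hypothetical examples.

DOMAIN_TAGS = {
    "microbiology": "<dom_bio>",
    "legal": "<dom_legal>",
    "general": "<dom_gen>",
}

def tag_source(sentence: str, domain: str) -> str:
    """Prepend the domain token so the tagged sentence can be fed to a
    shared encoder, analogous to a target-language token in multilingual MT."""
    return f"{DOMAIN_TAGS[domain]} {sentence}"

# Example: tagging a source sentence from a microbiology text before translation.
src = tag_source("The chromosome carries genetic information.", "microbiology")
print(src)  # -> "<dom_bio> The chromosome carries genetic information."
```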
December 2024