Tanmai Khanna received his MS Dual Degree in Computational Linguistics (CL). His research work was supervised by Prof. Dipti M Sharma. Here’s a summary of Tanmai Khanna’s thesis, Rule-based pre-processing of idioms and non-compositional constructions to simplify them and improve black-box machine translation.
Machine Translation is a sub-field of Computational Linguistics that deals with systems that automatically translate text or an utterance from one language into another. The field has made huge strides, from rule-based translators to the now state-of-the-art neural machine translators. However, translators in the general domain are far from achieving human parity. At this stage, even for state-of-the-art translators, a human needs to check the output and post-edit it. The aim, therefore, is to continuously improve the machine translator so that the post-editing effort is reduced. Guided by this aim and a will to make translators more complete, I identify one area where even state-of-the-art machine translators perform poorly: translating idiomatic and non-compositional constructions. I establish that these constructions are infrequent in the data that translators are trained and tested on, which could be the primary reason for the inadequacy of their translations. To improve the translation adequacy of non-compositional constructions, I propose a rule-based pre-processor that detects these constructions in the input sentence and simplifies them into more compositional constructions, which are far more likely to translate adequately.
I start by compiling a list of constructions, which range from fully lexical and rigid constructions to constructions with slots constrained by parts of speech, all the way to fully syntactic constructions. Using examples of sentences that have these constructions, I evaluate five English–Hindi NMT systems and report that their performance is thoroughly inadequate when translating non-compositional constructions. To understand the capabilities that a pre-processor would need to solve this issue, I conduct an analysis of these constructions and identify the features that a rule-based pre-processor needs in order to detect and simplify them. This theoretical analysis of rules is followed by a description of the actual pre-processor I created as part of this project and its rule formalism. The pre-processor is then systematically evaluated for English–Hindi translation. I report a high accuracy of construction detection in English based on manually written rules, as well as a significant improvement in the quality of translation in Hindi after the non-compositional constructions in the English input text are pre-processed into more compositional constructions. I conclude by discussing the results of this evaluation, the errors made by the pre-processor, some limitations of this solution, and possible future work to extend and improve the system.
The pre-processor is open-source and its code is available at https://github.com/khannatanmai/rule-based-preprocessing-mt.
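To make the idea of detecting and simplifying constructions concrete, here is a minimal sketch in Python. It is not the thesis's rule formalism or the code in the repository above. The `Rule` class, the `simplify` function, and the three example rules are all hypothetical, and a closed list of pronouns stands in for a real part-of-speech constraint on the slot.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """One hypothetical rewrite rule: a detection pattern and its simpler paraphrase."""
    name: str
    pattern: re.Pattern
    replacement: str


# Illustrative rules only; they are assumptions, not taken from the thesis.
RULES = [
    # Fully lexical, rigid constructions.
    Rule("kick_the_bucket", re.compile(r"\bkicked the bucket\b", re.IGNORECASE), "died"),
    Rule("piece_of_cake", re.compile(r"\ba piece of cake\b", re.IGNORECASE), "very easy"),
    # Construction with one slot. A closed pronoun list stands in for a
    # part-of-speech constraint that a tagger would normally supply.
    Rule(
        "give_X_a_hand",
        re.compile(r"\bgive (me|him|her|us|them|you) a hand\b", re.IGNORECASE),
        r"help \1",
    ),
]


def simplify(sentence: str) -> tuple[str, list[str]]:
    """Apply every rule in order; return the rewritten sentence and the names of rules that fired."""
    fired = []
    for rule in RULES:
        sentence, count = rule.pattern.subn(rule.replacement, sentence)
        if count:
            fired.append(rule.name)
    return sentence, fired


if __name__ == "__main__":
    for text in ["Can you give me a hand with this?", "The exam was a piece of cake."]:
        print(simplify(text))
```

The simplified sentence would then be passed unchanged to any black-box translator, which is why this kind of pre-processing needs no access to the translation system itself. Surface patterns like these cannot handle inflected or fully syntactic constructions, which call for richer linguistic features than plain regular expressions.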