Nishant Prabhu received his MS degree in Computer Science and Engineering (CSE). His research work was supervised by Prof. Vasudeva Varma.
Here’s a summary of Nishant’s MS thesis, Text Simplification: From Daedalian to Simple, as explained by him:
The English word ‘daedalian’, which means ‘ingenious, intricate, and confusing’, originates from the Greek mythological figure Daedalus, who is said to have built a labyrinth so complex that he himself could not escape its confines. You would be forgiven for not knowing that: ‘daedalian’ is not a commonly used word, and synonyms like ‘complex’ and ‘confusing’ are favored in day-to-day speech. But, as the title of this thesis suggests, rare words are still found in written text, and this creates a need to simplify such text to make it accessible to the layperson. In this thesis, we study the problem of text simplification, the challenges it presents, and the different ways these challenges can be overcome. The thesis touches upon various topics across the vast expanse of text simplification research, from generic improvements to the encoder-decoder models used for controlled Natural Language Generation (NLG) tasks (including text simplification), to domain-specific text simplification challenges and the specialized solutions they require.

Recent advances in deep learning and NLG research have enabled sophisticated text simplification models. The central idea behind most approaches to text simplification, and indeed most other controlled text generation tasks, is the encoder-decoder model. We therefore believe that controlled text generation tasks like text simplification stand to gain from improved encoder-decoder models. Encoder-decoder models have been vastly improved by the introduction of the attention mechanism and architectures like the transformer, but there is scope to improve them further by explicitly improving the encoder outputs. We do this by training the model with a co-attention based loss in addition to the standard cross-entropy loss.
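The general idea of adding an auxiliary co-attention term to the usual cross-entropy objective could be sketched as follows. This is a minimal NumPy illustration under assumptions, not the thesis's exact formulation: the bidirectional affinity computation, the reconstruction-style penalty, and the weighting factor `lam` are all illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(logits, targets):
    # logits: (T, V) decoder scores; targets: (T,) gold token ids.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def co_attention_loss(src_states, tgt_states):
    # src_states: (S, d) encoder outputs; tgt_states: (T, d) target-side states.
    # Affinity between every (source, target) state pair.
    affinity = src_states @ tgt_states.T               # (S, T)
    # Attend in both directions: source->target and target->source.
    attended_src = softmax(affinity, axis=1) @ tgt_states    # (S, d)
    attended_tgt = softmax(affinity.T, axis=1) @ src_states  # (T, d)
    # Penalize states that the co-attention cannot reconstruct from
    # the other side (an illustrative choice of penalty).
    return (((src_states - attended_src) ** 2).mean()
            + ((tgt_states - attended_tgt) ** 2).mean())

def total_loss(logits, targets, src_states, tgt_states, lam=0.1):
    # Standard cross-entropy plus the weighted auxiliary term.
    return cross_entropy(logits, targets) + lam * co_attention_loss(src_states, tgt_states)
```

In a real model the same idea would be implemented as a differentiable term in the training graph, so the gradients of the auxiliary loss flow back into the encoder.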
We evaluate this proposal on multiple tasks and show that it not only benefits text simplification but also improves performance on adjacent tasks like style transfer and sentence compression.

Another problem with current text simplification research is the lack of datasets that are representative of real-world text simplification needs. We enumerate the problems with existing text simplification datasets and propose a new dataset that draws from diverse sources, has a richer vocabulary, and is more representative of real-world text simplification tasks. We present various approaches to simplifying text from our proposed dataset and recommend methods that make the generated text more readable and semantically closer to the source.

We also study domain-specific text simplification, with a particular focus on the medical domain. We examine the unique challenges of simplifying medical text and propose an unsupervised approach to medical text simplification, which goes a long way towards overcoming the lack of parallel data in the medical domain.

We evaluate our proposed approaches using standard metrics like BLEU and SARI, along with other metrics that have been used to evaluate text simplification models. But these metrics are not perfect and do not always correlate with human judgment. This is especially true when evaluating domain-specific text simplification models, since that usually requires background knowledge of the domain. Therefore, in addition to metric-based evaluation, our domain-specific text simplification models also undergo human evaluation to ensure that their outputs correlate with human judgment.
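To give a flavor of what SARI measures, here is a heavily simplified, unigram-only sketch of a SARI-like score. The official SARI averages n-grams up to length 4 and supports multiple references; this toy version exists only to show the key intuition: the metric rewards correct additions, keeps, and deletions relative to the source, so an output that merely copies the source scores poorly.

```python
from collections import Counter

def f1(precision, recall):
    # Harmonic mean of precision and recall, 0 when both are 0.
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def sari_unigram(source, output, reference):
    """Toy unigram SARI-like score against a single reference."""
    src, out, ref = (Counter(s.lower().split()) for s in (source, output, reference))

    # KEEP: source tokens that the system correctly retained.
    keep_out, keep_ref = src & out, src & ref
    good_keep = keep_out & keep_ref
    keep_p = sum(good_keep.values()) / max(sum(keep_out.values()), 1)
    keep_r = sum(good_keep.values()) / max(sum(keep_ref.values()), 1)

    # ADD: new tokens, not present in the source, that the reference also adds.
    add_out, add_ref = out - src, ref - src
    good_add = add_out & add_ref
    add_p = sum(good_add.values()) / max(sum(add_out.values()), 1)
    add_r = sum(good_add.values()) / max(sum(add_ref.values()), 1)

    # DELETE: source tokens the system dropped (precision only, as in SARI).
    del_out, del_ref = src - out, src - ref
    good_del = del_out & del_ref
    del_p = sum(good_del.values()) / max(sum(del_out.values()), 1)

    return (f1(keep_p, keep_r) + f1(add_p, add_r) + del_p) / 3
```

For example, with source "the daedalian route confused us" and reference "the complex route confused us", an output equal to the reference scores 1.0, while an output that just copies the source scores much lower, because it earns no credit for additions or deletions. This is exactly why SARI is preferred over pure overlap metrics like BLEU for simplification.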
To summarize, this thesis explores the area of text simplification, suggests improvements to the underlying architectures used to simplify text, proposes a better dataset for text simplification research, and presents an unsupervised approach to domain-specific medical text simplification.