T Abhishek – Natural Language Generation

December 2022

Tushar Abhishek received his Master of Science in Computer Science and Engineering (CSE). His research was supervised by Prof. Vasudeva Varma. Here's a summary of his work on the Importance of Facts for Natural Language Generation:

Natural Language Generation is the task of producing understandable human text from a variety of input sources. With the advent of pretrained language models (PLMs), the text generation capabilities of current systems have reached unprecedented heights. PLMs have become the new normal and naturally serve as the backbone architecture for numerous tasks. It has been observed that these models learn many of the intricacies involved in language understanding and generation during the pre-training step. Because they are pre-trained over large corpora, they also pick up world knowledge from text, some of which is absorbed into the model parameters. However, when fine-tuned on certain knowledge-intensive tasks (such as text coherence assessment, data-to-text generation, summarization, and translation), these models fail to effectively utilize the knowledge stored in their parameters. In this thesis, we tackle this problem by incorporating external facts to improve results on two downstream tasks: multilingual fact-to-text generation and text coherence modeling.
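To make the claim about parametric world knowledge concrete, the following minimal sketch probes a pretrained language model with a fill-in-the-blank factual query in the style of LAMA probing. The use of the Hugging Face transformers library, the choice of bert-base-uncased, and the prompt are illustrative assumptions, not details from the thesis.

```python
from transformers import pipeline

# Probe a pretrained masked language model: can it complete a factual
# statement using only the knowledge absorbed during pre-training?
fill = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill("The capital of France is [MASK].", top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
```

High-probability completions such as "paris" show that some world knowledge does sit in the parameters; the thesis argues that fine-tuned models nonetheless fail to exploit it on knowledge-intensive tasks, which motivates supplying external facts explicitly.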

We examine the close association between facts and improved text generation by focusing directly on the fact-to-text generation task. Fact-to-text generation is a variant of data-to-text generation in which the structured input consists of knowledge graph triples. A data-to-text generation system consumes structured input such as tables, databases, knowledge bases, and time-series data, and produces human-readable text summaries. The first part of the thesis addresses multilingual fact-to-text generation, where facts are used to generate sentences in multiple languages. Fact-to-text generation requires a dataset of knowledge graph triples that are well aligned with semantically equivalent textual information. Manual creation of such a high-quality fact-to-text dataset requires human supervision and is quite challenging to scale. Unsupervised alignment has recently emerged as an active area of research to overcome the lack of labeled data and the difficulty of domain adaptation. However, little work has been done for low-resource languages, which present two significant challenges: (1) the unavailability of paired triples and native-language text with the same content distribution, and (2) the limited Natural Language Processing resources available for these languages. Hence, we rigorously investigate the cross-lingual fact-to-text problem of aligning English structured data with sentences in multiple low-resource languages and develop a new dataset called XAlign, consisting of 0.45M pairs across seven low-resource languages. We propose two different methods of cross-lingual fact-to-text alignment: (a) non-parametric approaches and (b) parametric approaches. Additionally, we establish strong baseline results by adapting popular natural language generation methods to the cross-lingual fact-to-text task.
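A common way to feed knowledge graph triples to a pretrained sequence-to-sequence model is to linearize them into a flat string. The sketch below shows this pattern with a multilingual model from Hugging Face transformers; the model name, the <S>/<P>/<O> linearization format, and the example triples are assumptions for illustration, and the model would need to be fine-tuned on an aligned dataset such as XAlign before it produces fluent target-language text.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/mt5-small"  # illustrative multilingual seq2seq backbone
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# English knowledge-graph triples (subject, predicate, object) for one entity.
triples = [
    ("Sachin Tendulkar", "birthPlace", "Mumbai"),
    ("Sachin Tendulkar", "occupation", "cricketer"),
]

# Linearize the triples into a single input string with sentinel markers.
linearized = " ".join(f"<S> {s} <P> {p} <O> {o}" for s, p, o in triples)
prompt = f"generate Hindi: {linearized}"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same triple-versus-sentence view also underlies the alignment problem itself: each candidate low-resource-language sentence has to be scored against a set of English triples, either with non-parametric similarity measures or with a trained (parametric) cross-lingual model.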

An essential requirement for any system that generates text is the coherence of its output. In the second part of the thesis, we address the detection of text coherence. A large body of previous work has leveraged entity-based methods, syntactic patterns, discourse relations, and traditional deep learning architectures for text coherence assessment. However, these approaches do not consider the factual information present in the documents. Transitions of facts associated with entities across sentences could help better capture the essence of textual coherence. We hypothesize that coherence assessment is a cognitively complex task that requires deeper fact-aware models and can benefit from other related tasks. To demonstrate this, we develop a novel deep learning model that fuses document-level information with factual information. We further enhance the model's efficacy by training it simultaneously with Natural Language Inference tasks in a multi-task learning setting, taking advantage of inductive transfer between the two tasks. Our experiments with popular benchmark datasets across multiple domains demonstrate that the proposed model consistently outperforms existing methods on synthetic coherence evaluation tasks as well as on two real-world tasks that involve predicting varying degrees of coherence.
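Synthetic coherence evaluation of this kind typically follows the standard sentence-order discrimination setup, where the original document counts as coherent and random permutations of its sentences count as incoherent. The sketch below builds such evaluation pairs; the function name and the example document are assumptions for illustration, and the fact-aware scoring model itself is not shown.

```python
import random

def make_coherence_pairs(sentences, num_permutations=3, seed=0):
    """Return the original order (positive) and shuffled copies (negatives)
    for the standard sentence-order discrimination task."""
    rng = random.Random(seed)
    positive = list(sentences)
    negatives = []
    if len(positive) < 2:            # a single sentence cannot be reordered
        return positive, negatives
    while len(negatives) < num_permutations:
        shuffled = positive[:]
        rng.shuffle(shuffled)
        if shuffled != positive:     # keep only genuinely reordered copies
            negatives.append(shuffled)
    return positive, negatives

document = [
    "Sachin Tendulkar was born in Mumbai.",
    "He made his international debut in 1989.",
    "He retired from international cricket in 2013.",
]
positive, negatives = make_coherence_pairs(document)
print(positive)
print(negatives[0])
```

A coherence model is then expected to score each original document above its shuffled counterparts; the real-world tasks mentioned above differ in that coherence varies by degree rather than by permutation.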