Aashna Jena – Data Recasting

Aashna Jena received her Master of Science – Dual Degree in Computational Linguistics (CL). Her research was supervised by Dr. Manish Shrivastava. Here’s a summary of her research on Data Recasting for Natural Language Inference on Tables:

Given a premise, the aim of Natural Language Inference (NLI) is to classify a hypothesis as Entailed, Refuted, or Neutral. To do so, a model must acquire the ability to reason over the premise. While entailment tasks have been studied extensively with unstructured text as the premise, there is an increasing demand for learning to reason over semi-structured and structured data formats such as tables, knowledge graphs, databases, and combinations thereof. Structured data differs from unstructured text in the way it captures information and relationships: not just through language, but also through position and structure. Tables in particular capture the connections between cells, each of which represents a distinct entity. Tabular data is organised so that items of the same kind are grouped together in rows, columns, or both. Consequently, rankings, trends, unique items, and aggregate values can be read off a table directly. These kinds of reasoning are specific to structured data formats, which makes inference on tables a difficult task that requires separate effort from inference on plain text.

Teaching a model complex reasoning requires challenging tabular inference data for supervision. Prior research in this area has predominantly relied on two data generation methods.

The first method is human annotation. This results in data that is inventive, fluent, and linguistically diverse. However, human annotation is costly and time-consuming, making it difficult to scale. The second method is synthetic generation, where the data is produced using a defined set of rules or a context-free grammar. This approach scales easily in terms of both time and cost, but it lacks originality. Its results are predictable and adhere to predetermined patterns and a fixed vocabulary.
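To make the trade-off of synthetic generation concrete, here is a minimal sketch of a rule-based generator. It is an illustration only and not taken from the thesis: the function name `generate_synthetic_nli`, the single sentence template, and the toy table are all assumptions. It produces Entailed and Refuted hypotheses only; Neutral is left out for brevity.

```python
def generate_synthetic_nli(table, key_column, value_column):
    """Pair every row's key with every value in the column using one fixed template.

    Hypothetical helper for illustration; `table` is a list of dicts.
    """
    instances = []
    for row in table:
        for other in table:
            # The template is fixed, so every hypothesis has the same shape.
            hypothesis = f"The {value_column} of {row[key_column]} is {other[value_column]}."
            # A hypothesis is Entailed only when the value matches the row's own cell.
            label = "Entailed" if other[value_column] == row[value_column] else "Refuted"
            instances.append({"hypothesis": hypothesis, "label": label})
    return instances


# Toy table invented for this example.
toy_table = [
    {"country": "France", "capital": "Paris"},
    {"country": "Japan", "capital": "Tokyo"},
    {"country": "Kenya", "capital": "Nairobi"},
]

for instance in generate_synthetic_nli(toy_table, "country", "capital"):
    print(instance["label"], "|", instance["hypothesis"])
```

The generator costs nothing to run on larger tables, but every sentence it emits follows the same pattern and uses only words already in the table or the template, which is the lack of originality described above.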

This research presents a framework for semi-automatically “recasting” existing tabular data in order to mitigate the drawbacks of both of the aforementioned data generation techniques. Existing data is perturbed, modified, and augmented through recasting to conform to the specifications of a given target task, which is Tabular Inference in this case. This framework is used to construct tabular NLI instances from five datasets that were originally designed for tasks such as table-to-text generation, tabular question answering, and tabular semantic parsing.
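As a rough illustration of what recasting can look like, the sketch below takes a human-written sentence that describes a table (as in a table-to-text dataset), keeps it as an Entailed hypothesis, and perturbs it into a Refuted one by swapping in a different value from the same column. This is a simplified assumption about one possible perturbation, not the framework's actual procedure; the function name `recast_sentence`, its parameters, and the toy data are hypothetical.

```python
import random


def recast_sentence(table, row_index, column, sentence, seed=0):
    """Turn one (table, sentence) pair into an Entailed and a Refuted NLI instance.

    Hypothetical helper for illustration; `table` is a list of dicts with string cells.
    """
    true_value = table[row_index][column]
    if true_value not in sentence:
        raise ValueError("The sentence does not mention the selected cell value.")

    # Other values from the same column are plausible but incorrect replacements.
    alternatives = sorted({row[column] for row in table} - {true_value})
    if not alternatives:
        raise ValueError("The column has no alternative value to swap in.")

    false_value = random.Random(seed).choice(alternatives)
    return [
        {"premise": table, "hypothesis": sentence, "label": "Entailed"},
        {
            "premise": table,
            "hypothesis": sentence.replace(true_value, false_value),
            "label": "Refuted",
        },
    ]


# Toy source data invented for this example.
toy_table = [
    {"country": "France", "capital": "Paris"},
    {"country": "Japan", "capital": "Tokyo"},
    {"country": "Kenya", "capital": "Nairobi"},
]
source_sentence = "Tokyo has been the capital of Japan for well over a century."

for instance in recast_sentence(toy_table, 1, "capital", source_sentence):
    print(instance["label"], "|", instance["hypothesis"])
```

Because the wording comes from the original human-written sentence rather than a template, the recast hypotheses keep its fluency, while the perturbation step itself is automatic.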

To demonstrate the utility and quality of these datasets, this thesis explains how recasted data may be used both as evaluation benchmarks and as augmentation data to improve performance on tabular NLI tasks such as TabFact. In addition, this work evaluates the efficacy of models trained on recasted data in the zero-shot setting and examines performance trends across different types of recasted datasets. The thesis concludes with a discussion of the limitations of this line of work and potential future directions.

May 2023
