Agrawal Yash Chandrakant received his MS in Computer Science and Engineering (CSE). His research work was supervised by Prof. Vasudeva Varma. Here’s a summary of his research work on Addressing Domain Specific Challenges in Extractive Summarization and Question Generation:
The phrase "data is the new oil," coined by Clive Humby in 2006, has proven true in the research community, where state-of-the-art results have been progressively achieved by sophisticated models trained on newly available datasets. Current state-of-the-art models such as Transformer-based BERT require large amounts of training data but deliver higher benchmark results on several tasks. This works well when the problem at hand already has a defined dataset. It becomes challenging when we address a problem for which no dataset exists; such problems mostly lie in a specific domain. In this thesis, we explore two such domain-specific problems and propose approaches that work even when little or no data exists for them.
The first problem we look at is automatic summarization of financial annual reports. Financial reports filed by companies discuss compliance, risks, and future plans, such as goals and new projects, which directly impact their stock price. Quick consumption of such information is critical for financial analysts and investors making stock buy/sell decisions and equity evaluations. Hence, we study the problem of extractive summarization of 10-K reports. Recently, Transformer-based summarization models have become very popular; however, the lack of in-domain labeled summarization data is a major roadblock to training such finance-specific summarization models. We also show that zero-shot inference on such pre-trained models is not effective either. We address this challenge by modeling 10-K report summarization in a goal-directed setting, where we leverage labeled goal-related data for the stock buy/sell classification goal. We further improve results with a multi-task learning method that adds an industry classification auxiliary task, and we show how idiosyncratic factors within a company can be incorporated into the summaries using the proposed framework. Intrinsic evaluation, as well as extrinsic evaluation on the stock buy/sell classification and portfolio construction tasks, shows that our proposed method significantly outperforms strong baselines.
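As a rough illustration of the multi-task setup described above, the sketch below combines a goal-task loss (stock buy/sell) with an auxiliary industry-classification loss computed over a shared representation. All dimensions, weights, class counts, and the mixing coefficient `alpha` are hypothetical stand-ins; this is not the thesis's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: toy report embeddings -> small shared encoder.
d_in, d_hid = 8, 4
W_shared = rng.normal(size=(d_in, d_hid))  # shared encoder weights
W_goal = rng.normal(size=(d_hid, 2))       # head 1: stock buy/sell (2 classes)
W_aux = rng.normal(size=(d_hid, 5))        # head 2: industry classification (5 classes)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the correct class.
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

def multitask_loss(x, y_goal, y_aux, alpha=0.5):
    """Joint loss: goal-task loss plus alpha-weighted auxiliary-task loss."""
    h = np.tanh(x @ W_shared)  # shared representation used by both heads
    loss_goal = cross_entropy(softmax(h @ W_goal), y_goal)
    loss_aux = cross_entropy(softmax(h @ W_aux), y_aux)
    return loss_goal + alpha * loss_aux

x = rng.normal(size=(3, d_in))  # 3 toy report embeddings
print(multitask_loss(x, np.array([0, 1, 1]), np.array([2, 0, 4])))
```

Because both heads backpropagate through the shared encoder, the auxiliary industry signal can regularize the representation even when goal-task labels are scarce.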
The second problem we look at is automatic question generation from technical text. Asking questions about a subject has long been an integral part of testing candidates for subject knowledge and understanding.
These questions generally require higher-order reasoning and thinking, and their answers or reasoning can span paragraphs or even documents. Generating such questions has mostly been ignored in current research. We propose fully automated, answer-unaware question generation for technical text, used to evaluate a candidate's understanding. It follows an unsupervised learning paradigm and does not need labeled training data. It constructs a concept graph (CG) by automatically extracting concepts and inter-concept relations from the input text, then exploits the graph to generate questions; the concept graph specifically addresses the issue of inter-sentence question generation. We evaluate question acceptability using the well-established criteria laid down by Heilman et al. [23] and show that the proposed approach consistently outperforms a classic linguistics-based method as well as state-of-the-art deep learning and Transformer-based question generation systems. We also propose a ranking mechanism for questions that further increases the number of acceptable questions.
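A minimal, purely illustrative sketch of the concept-graph idea: hand-written concept pairs stand in for the automatically extracted concepts and relations, a fixed template fills in each graph edge, and a simple degree-based score stands in for the thesis's ranking mechanism. None of the concept names, templates, or scoring details come from the thesis.

```python
from collections import defaultdict

# Hand-written concept pairs standing in for automatically extracted ones.
pairs = [
    ("TCP", "reliable delivery"),
    ("TCP", "sequence numbers"),
    ("sequence numbers", "reordering"),
]

graph = defaultdict(set)
for a, b in pairs:
    graph[a].add(b)
    graph[b].add(a)  # undirected concept graph

def generate_questions(graph):
    # A question per edge: because edges can link concepts mentioned in
    # different sentences, this naturally yields inter-sentence questions.
    qs = []
    for a, neighbours in graph.items():
        for b in neighbours:
            if a < b:  # emit each edge exactly once
                qs.append((a, b, f"How is '{a}' related to '{b}'?"))
    return qs

def rank_questions(graph, qs):
    # Toy ranking: questions touching high-degree (central) concepts first.
    return sorted(qs, key=lambda t: -(len(graph[t[0]]) + len(graph[t[1]])))

for a, b, q in rank_questions(graph, generate_questions(graph)):
    print(q)
```

The ranking step mirrors the intuition that questions anchored on well-connected concepts are more likely to be acceptable, though the actual criteria used in the thesis differ.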
To summarize, we make use of existing data that is related to the problem but does not address it exactly, approaching each problem in a novel way while still leveraging the capabilities of state-of-the-art, data-hungry models. We also make use of linguistic cues and other pre-trained models to address the data scarcity issue.