[month] [year]

Litton J Kurishinkel – CSE

Litton J Kurishinkel received his doctorate in Computer Science and Engineering (CSE). His research work was supervised by Prof. Vasudeva Varma. Here’s a summary of Litton J Kurishinkel’s thesis, Leveraging Syntactic Information for Coherent and Comprehensible Summarization.

Text summarization is a natural language processing problem that has been investigated for half a century. In the era of information explosion, the community has intensified its search for more sophisticated methods of automated text summarization. Past attempts have framed both extractive and abstractive techniques for multi-document summarization. Extractive techniques select a subset of sentences that approximates a summary of the input corpus of documents, while abstractive techniques construct a semantic representation and are expected to generate the summary in their own learnt writing style.

Extractive techniques create an intermediate representation of the target text, capturing its key textual features. Possible intermediate representations include topic signatures, word frequency counts, latent-space approaches based on matrix factorization, and Bayesian approaches. These representations are then used to assign scores to individual linguistic units within the text, and the subset of units that maximizes the total score is selected as the summary. The scoring function is generally composed of components that quantify topical coverage and topical diversity, and accuracy is reported in terms of a measure called the ROUGE score.
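
To make the coverage-plus-diversity objective concrete, here is a minimal sketch of score-based extractive selection in the style of maximal marginal relevance: each candidate sentence is rewarded for similarity to the document centroid (coverage) and penalized for similarity to sentences already chosen (diversity). The TF-IDF representation, the trade-off weight and the greedy search are illustrative assumptions, not a particular published system.

```python
# A minimal sketch of coverage-plus-diversity extractive selection in
# the style of maximal marginal relevance (MMR). The TF-IDF centroid,
# the trade-off weight and the greedy search are illustrative
# assumptions, not a particular published system.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def extract_summary(sentences, budget=3, trade_off=0.7):
    vectors = TfidfVectorizer().fit_transform(sentences)
    centroid = np.asarray(vectors.mean(axis=0))   # crude topic signal
    coverage = cosine_similarity(vectors, centroid).ravel()
    selected = []
    while len(selected) < min(budget, len(sentences)):
        best, best_score = None, float("-inf")
        for i in range(len(sentences)):
            if i in selected:
                continue
            # redundancy: similarity to the closest already-chosen sentence
            redundancy = max(
                (cosine_similarity(vectors[i], vectors[j])[0, 0]
                 for j in selected),
                default=0.0,
            )
            score = trade_off * coverage[i] - (1 - trade_off) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return [sentences[i] for i in sorted(selected)]
```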

Comparatively little past work exists on abstractive multi-document summarization. Most of it utilises sub-syntactic structures that are directly extracted from the input documents to generate summary sentences. Sub-syntactic structures such as phrases are re-organized into summary sentences by methods designed to ensure topical coverage, topical diversity and grammaticality. These methods also incorporate means to ensure factual accuracy, so that the sentences generated by the abstractive summarization system are factually correct with respect to the original corpus.

Despite all the attempts to improve summarization along easily quantifiable dimensions such as topical coverage and diversity, a summary must also improve along qualitative dimensions such as comprehensibility and coherence to match a well-crafted human summary. Coherence reflects the presence of inter-sentence structural relationships and topical continuity. Comprehensibility denotes the degree to which a sentence remains understandable when an extractive summarization process removes it from its context in the source document.

We observe that almost all previous work in extractive summarization treats sentences as isolated entities. Linguistically, however, a sentence is not uttered in isolation but within the context of a given discourse. Sentences use various discourse connectives that bind them to one another for a coherent reading. We call such a set of structurally related sentences a Locally Coherent Discourse Unit (LDU). A major portion of this thesis describes our techniques for identifying local discourse units and leveraging them to improve the coherence and comprehensibility of extractive and abstractive summaries.

We introduce rule-based and statistical methods for identifying LDUs. Rule-based approaches define rules over the syntactic trees of sentences to identify structural connections between neighbouring sentences in an input document. Statistical approaches, on the other hand, extract predefined heuristic features from sentences and learn sequence-labelling models from an annotated dataset. The model computes, for each sentence in the source document, the probability of its being contextually dependent or independent. Our extractive summarization approaches use these probability values for comprehensible summarization by incorporating them into discrete optimization techniques.
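
As an illustration of the statistical route, the sketch below trains a classifier that estimates the probability that a sentence is contextually dependent, using a few hand-crafted surface features. The feature set, the plain logistic regression model and the label convention are assumptions made for exposition, standing in for the heuristic features and sequence-labelling models of the thesis.

```python
# A hedged sketch of context-dependence scoring for sentences. The
# features (sentence-initial connectives and pronouns, very short
# sentences) and the logistic regression classifier are illustrative
# stand-ins for the thesis's heuristic features and sequence labeller.
from sklearn.linear_model import LogisticRegression

CONNECTIVES = {"however", "moreover", "therefore", "furthermore", "but"}
PRONOUNS = {"he", "she", "it", "they", "this", "these", "those"}

def heuristic_features(sentence):
    words = sentence.split()
    first = words[0].lower().strip(",") if words else ""
    return [
        float(first in CONNECTIVES),  # opens with a discourse connective
        float(first in PRONOUNS),     # opens with an unresolved referent
        float(len(words) < 8),        # very short, likely elliptical
    ]

def train(annotated):
    """annotated: list of (sentence, label) pairs, label 1 = dependent."""
    X = [heuristic_features(s) for s, _ in annotated]
    y = [label for _, label in annotated]
    return LogisticRegression().fit(X, y)

def dependence_probability(model, sentence):
    # probability that the sentence needs its context to be understood
    return model.predict_proba([heuristic_features(sentence)])[0][1]
```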

Existing works on abstractive multi-document summarization utilise phrase structures directly extracted from the input documents to generate summary sentences. These methods can suffer from a lack of consistency and coherence when merging phrases into sentences. We investigate coherence in abstractive document summarization by generating locally coherent discourse units (LDUs) from facts. In particular, given a set of documents, we first extract facts from sentences and then select the salient facts that should constitute the summary content using Integer Linear Programming. These facts are fused into locally coherent discourse units using an encoder-decoder architecture with neural attention, which aggregates the available information and adds discourse connections between sentences. The thesis also includes techniques for coherent summarization that do not use LDU information from the source corpus.
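
The fact-selection step can be made concrete with a small Integer Linear Program. The sketch below, written against the PuLP library, maximizes total fact salience under a summary length budget; the salience scores, the length measure and the single budget constraint are illustrative assumptions, since the thesis's exact objective and constraints are not spelled out here.

```python
# A minimal ILP sketch for salient fact selection using PuLP: pick a
# binary indicator per fact, maximize total salience, and respect a
# length budget. Salience scores and the single budget constraint are
# illustrative assumptions.
import pulp

def select_facts(facts, salience, lengths, budget):
    prob = pulp.LpProblem("fact_selection", pulp.LpMaximize)
    x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(len(facts))]
    # objective: total salience of the selected facts
    prob += pulp.lpSum(salience[i] * x[i] for i in range(len(facts)))
    # constraint: the selected facts must fit the summary length budget
    prob += pulp.lpSum(lengths[i] * x[i] for i in range(len(facts))) <= budget
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [facts[i] for i in range(len(facts)) if x[i].value() == 1]
```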

One of our works on multi-document abstractive summarization introduces a novel approach to improving intra-sentential coherence through partial dependency tree extraction, recombination and linearization. The method entrusts the summarizer with generating its own topically coherent sequential structures from scratch for effective communication. Extracted partial syntactic trees are re-combined using a neural syntactic tree generative model and re-linearized to ensure coherent reading and factual accuracy. A major component of this approach is a partial tree extraction algorithm.
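
To give a feel for what partial tree extraction involves, here is a hedged sketch using spaCy: tokens scoring above a relevance threshold are kept, and the kept set is closed under dependency heads so the extracted fragment remains a single connected partial tree. The relevance function is a caller-supplied placeholder; the thesis's actual extraction algorithm is more involved.

```python
# A hedged sketch of partial dependency tree extraction with spaCy.
# The relevance scorer is a caller-supplied placeholder for the
# thesis's content scoring; closing the kept set under heads keeps
# the fragment connected up to the sentence root.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model

def extract_partial_tree(sentence, relevance, threshold=0.5):
    doc = nlp(sentence)
    keep = {tok.i for tok in doc if relevance(tok) >= threshold}
    for i in list(keep):
        tok = doc[i]
        # walk upward until we hit a kept token or the sentence root
        while tok.head.i not in keep and tok.head.i != tok.i:
            keep.add(tok.head.i)
            tok = tok.head
    return [doc[i] for i in sorted(keep)]

# e.g. keep content words:
# extract_partial_tree("The cat sat on the mat.",
#                      lambda t: 1.0 if t.pos_ in {"NOUN", "VERB"} else 0.0)
```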

Informative content extraction from a sentence has traditionally been tackled as syntactic tree pruning, where rules and statistical features are defined for pruning less relevant words. Recent years have witnessed the rise of neural models that do not leverage syntax trees, instead learning sentence representations automatically and pruning words from those representations. We investigate syntax-tree-based noise pruning methods for neural sentence compression. Our method identifies the most informative regions of a syntactic dependency tree by computing a context-based relevance score and extracting the maximum density subtree.
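
As one concrete reading of maximum density subtree extraction, the sketch below runs a tree knapsack dynamic program over a dependency tree: for every node and subtree size it records the best total relevance of a connected subtree, then returns the subtree root and size with the highest relevance per node. The minimum-size constraint is an assumption added for illustration (without it, the best average is trivially the single highest-scoring node); the per-node scores are assumed to come from a context-based scorer, and the thesis's exact formulation may differ.

```python
# A hedged sketch of maximum density subtree extraction over a
# dependency tree with per-node relevance scores. A tree knapsack DP
# records, for each node and size, the best total score of a connected
# subtree; density is total score divided by size. The min_size
# constraint is an illustrative assumption.

def max_density_subtree(children, scores, root, min_size=3):
    # best[v] maps size s -> max total score of a connected subtree
    # rooted at v containing exactly s nodes (v included)
    best = {}

    def visit(v):
        best[v] = {1: scores[v]}
        for c in children.get(v, []):
            visit(c)
            merged = dict(best[v])  # option: skip child c entirely
            for s1, w1 in best[v].items():
                for s2, w2 in best[c].items():
                    s, w = s1 + s2, w1 + w2
                    if w > merged.get(s, float("-inf")):
                        merged[s] = w
            best[v] = merged

    visit(root)
    # pick the densest subtree of any root with at least min_size nodes;
    # returns (density, top_node, size)
    return max(
        (w / s, v, s)
        for v, table in best.items()
        for s, w in table.items()
        if s >= min_size
    )
```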

Approximating the human method of summary writing through automated means demands sophisticated techniques for text understanding, inference making and text generation. Human-written summaries encompass the most relevant information in the source text and provide a coherent and comprehensible reading. The methods explained in this thesis aim to improve system-generated summaries along the dimensions of coherence and comprehensibility.