[month] [year]

Malireddy Chanakya – Dual Degree CL

Malireddy Chanakya received his MS in Computational Linguistics (CL). His research work was supervised by Dr. Manish Shrivastava. Here’s a summary of Malireddy Chanakya’s MS thesis, Unsupervised Extractive Sentence Compression, as explained by him:

The internet has heralded an age of information explosion. A simple search query yields thousands of results, each of which links to a lengthy, verbose document. There is an immediate need for faster and more efficient ways to consume this ever-growing ocean of information. The purpose of summarization is to generate a shorter representation of the input that captures its gist and preserves its intent. In this work, we explore extractive summarization techniques, which involve identifying and extracting the most informative parts of a text.

Most extractive summarization systems have been developed for the domains of newswire articles and scientific texts. These systems usually operate by ranking all the source sentences according to some heuristic or metric and then selecting the top-ranked sentences as the summary. They work well in domains where the discourse revolves around a central theme and information is reinforced by reiteration across several sentences. However, documents such as fictional narratives do not revolve around a single topic. They describe a sequence of events and often contain dialogue. They contain little repetitive information, and each sentence contributes to developing the plot further. Hence, selecting a subset of sentences does not accurately capture the story.
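As a point of reference, the rank-and-select pipeline can be sketched in a few lines of Python. The word-frequency scoring below is only one illustrative heuristic standing in for the various metrics such systems use; it is not a method from the thesis, and the example document is invented.

```python
# Minimal sketch of a rank-and-select extractive summarizer, scoring each
# sentence by the average corpus frequency of its words (an illustrative
# heuristic, not the thesis's method).
from collections import Counter

def extractive_summary(sentences, k=2):
    """Rank sentences by average word frequency; return the top k in
    their original document order."""
    freq = Counter(w.lower() for s in sentences for w in s.split())

    def score(sentence):
        tokens = sentence.lower().split()
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Rank all source sentences by the heuristic and keep the top k ...
    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                 reverse=True)[:k]
    # ... then restore document order so the summary reads naturally.
    return [sentences[i] for i in sorted(top)]

doc = [
    "The committee met on Monday to discuss the budget.",
    "The budget discussion focused on research funding.",
    "Coffee was served in the lobby.",
]
print(extractive_summary(doc, k=2))
```

Because the third sentence shares no vocabulary with the others, it scores lowest and is dropped, which is exactly the reiteration effect the paragraph above describes.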

In the first part of this thesis, we discuss telegraphic summarization. Just like a telegram, a telegraphic summary does not contain complete sentences. Instead, shorter phrases are extracted across multiple sentences to capture the crux of the document. This summarization technique is better suited to the domain of fiction. We describe a set of guidelines for creating such summaries, which we use to annotate a gold corpus of 200 English short stories.
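The toy example below illustrates the telegraphic style: short phrases, not complete sentences, pulled from across the story. The two-sentence story and the selected fragments are invented for illustration and are not drawn from the annotated corpus.

```python
# A telegraphic summary keeps phrase-level fragments from multiple
# sentences rather than selecting whole sentences (hypothetical example).
story = [
    "Arjun crept down the staircase, holding his breath.",
    "At the bottom, he found the front door wide open.",
]
fragments = ["crept down the staircase", "found the front door wide open"]
print("; ".join(fragments))
```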

In the latter part of this thesis, we model the task of telegraphic summarization as the well-known sentence compression task. Sentence compression is the task of shortening a sentence while retaining its meaning. Most methods proposed for this task rely on labeled or paired corpora (containing pairs of verbose and compressed sentences), which are often expensive to collect.
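Concretely, a deletion-based compression can be viewed as a binary keep/drop mask over the tokens of a sentence. The sentence and mask below are hypothetical, chosen only to show the input/output format of the task.

```python
# Deletion-based compression: keep token i iff mask[i] == 1; the kept
# tokens, in order, form the compressed sentence (hypothetical example).
def apply_mask(tokens, mask):
    return [t for t, keep in zip(tokens, mask) if keep == 1]

tokens = "She quietly picked up the old letter and began to cry".split()
mask   = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1]
print(" ".join(apply_mask(tokens, mask)))
# -> "She picked up letter began cry"
```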

To overcome this limitation, we present a novel unsupervised deep learning framework (SCAR) for deletion-based sentence compression. SCAR is primarily composed of two encoder-decoder pairs: a compressor and a reconstructor. The compressor generates a sequence of zeroes and ones, representing token deletion decisions, which is used to mask the input sentence. The reconstructor tries to regenerate the input from the masked sentence. The model is trained completely on unlabeled data and does not require additional inputs such as explicit syntactic information or an optimal compression length. SCAR’s merit lies in the novel Linkage loss function, which correlates the compressor’s decisions with their effect on reconstruction, guiding it to drop inferable tokens. SCAR achieves higher ROUGE scores than the existing state-of-the-art methods and baselines on benchmark datasets. We also conduct a user study to demonstrate the application of our model as a text highlighting system. Using our model to underscore salient information facilitates speed-reading and reduces the time required to skim a document.
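To make the compressor/reconstructor idea concrete, here is a heavily simplified PyTorch sketch. The layer sizes, the straight-through binarization of the mask, and the plain sparsity penalty standing in for the Linkage loss are all assumptions made for illustration; this is not SCAR’s actual implementation.

```python
# Simplified compressor/reconstructor sketch: the compressor emits per-token
# keep/drop decisions, the reconstructor regenerates the sentence from the
# masked input, and a sparsity term (a stand-in for SCAR's Linkage loss)
# pushes the model to drop inferable tokens.
import torch
import torch.nn as nn

class Compressor(nn.Module):
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.enc = nn.GRU(dim, dim, batch_first=True)
        self.keep = nn.Linear(dim, 1)

    def forward(self, ids):
        h, _ = self.enc(self.emb(ids))               # (B, T, dim)
        p = torch.sigmoid(self.keep(h)).squeeze(-1)  # keep probability per token
        hard = (p > 0.5).float()                     # hard 0/1 deletion decisions
        # Straight-through estimator: hard mask forward, soft gradient backward.
        return hard + p - p.detach(), p

class Reconstructor(nn.Module):
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.dec = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, ids, mask):
        x = self.emb(ids) * mask.unsqueeze(-1)       # deleted tokens zeroed out
        h, _ = self.dec(x)
        return self.out(h)                           # logits over the vocabulary

vocab = 1000
comp, recon = Compressor(vocab), Reconstructor(vocab)
opt = torch.optim.Adam(list(comp.parameters()) + list(recon.parameters()))
ids = torch.randint(0, vocab, (8, 12))               # a toy batch of token ids

mask, p = comp(ids)
logits = recon(ids, mask)
rec_loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), ids.reshape(-1))
sparsity = p.mean()                                  # encourage dropping tokens
loss = rec_loss + 0.1 * sparsity
opt.zero_grad()
loss.backward()
opt.step()
print(f"reconstruction loss {rec_loss.item():.3f}, keep rate {p.mean().item():.2f}")
```

The tension between the two terms mirrors the setup described above: the reconstruction loss rewards keeping informative tokens, while the sparsity term rewards deleting tokens the reconstructor can infer on its own.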