[month] [year]

Vaishali Pal – MS CSE

Vaishali Pal received her MS in Computer Science and Engineering. Her research work was supervised by Dr. Manish Shrivastava. Here’s a summary of Vaishali Pal’s MS thesis, Natural Answer Generation for Spoken Question Answering Systems, as explained by her:

Natural Answer Generation (NAG) is the task of generating full-length sentential answers to fact-based questions. This capability is desirable for conversational and task-oriented dialogue systems that aim to respond to the user in natural language. Moreover, multi-modal interfaces that accept speech input in addition to producing natural responses make such systems easier to interact with. NAG has been widely studied for knowledge-base (KB) question answering, where the system generates natural answers to fact-based questions using the structured information in the KB. Although generating natural answers to textual questions has been explored in recent years, there is little research on natural sentence generation over spoken content. To address this gap, we propose the task of full-length natural answer generation from spoken questions and approach it iteratively through sequential sub-tasks.

First, we study a pointer-generator based system that generates a full-length natural answer from a textual question and a factoid answer. The pointer-generator effectively copies facts from the input, addressing the challenges posed by out-of-vocabulary words and named entities present in the datasets. To train and evaluate the system, we build a dataset of 315,000 samples of question, factoid answer and full-length answer triples. We test the system on cross-domain samples extracted from a knowledge-base dataset (Freebase) and a machine comprehension dataset (NewsQA), achieving a BLEU score of 74.05 and a ROUGE-L score of 86.25.

As a second sub-task, we study the effect of spoken input on dialogue systems. Spoken dialogue systems typically use one or several (top-N) ASR hypotheses to infer the semantic meaning and track the state of the dialogue, either with an end-to-end system or with a Spoken Language Understanding (SLU) module coupled to a dialogue state tracker (DST). However, ASR graphs such as confusion networks (confnets) provide a compact representation of a richer hypothesis space than a top-N ASR list. In this sub-task, we study the benefits of using confusion networks with a state-of-the-art neural DST. We encode the 2-dimensional confnet into a 1-dimensional sequence of embeddings using an attentional confusion network encoder that can be used with any DST system (a minimal sketch of such an encoder appears below). Plugged into the state-of-the-art Global-Locally Self-Attentive Dialogue State Tracker (GLAD) model for DST, our confnet encoder obtains significant improvements in both accuracy and inference time compared to using top-N ASR hypotheses.

Subsequently, we address the next sub-task of generating a full-length natural answer from a spoken question. To the best of our knowledge, this is the first attempt at generating full-length natural answers from spoken content. We represent the spoken sequence compactly as a confusion network extracted from a pre-trained automatic speech recognizer and develop a pointer-generator system over the confusion network, using the pre-assigned ASR scores and global attention to copy words from the confusion network and the textual factoid answer. We release a large-scale dataset of 259,788 samples of spoken questions, their factoid answers and the corresponding full-length textual answers, called the ConfNet2Seq dataset.
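To make the attentional confusion network encoder concrete, here is a minimal, hypothetical PyTorch sketch. The module name, tensor layout and the way ASR scores are combined with word embeddings are assumptions for illustration, not the thesis implementation: each time slot of the confnet holds several ASR alternatives, and attention over those alternatives collapses the 2-D structure into one embedding per slot.

```python
import torch
import torch.nn as nn

class ConfnetEncoder(nn.Module):
    """Illustrative sketch: flatten a 2-D confusion network
    (time slots x ASR alternatives) into a 1-D sequence of embeddings,
    one vector per slot, usable by any downstream DST or seq2seq model.

    Assumed tensor layout (not from the thesis):
      tokens : (batch, slots, alts) int64 token ids, 0 = padding
      scores : (batch, slots, alts) ASR posterior of each alternative
    """

    def __init__(self, vocab_size: int, emb_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Attention scorer sees the word embedding plus its ASR score.
        self.att = nn.Linear(emb_dim + 1, 1)

    def forward(self, tokens: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        emb = self.embed(tokens)                              # (B, T, A, D)
        feats = torch.cat([emb, scores.unsqueeze(-1)], dim=-1)
        logits = self.att(feats).squeeze(-1)                  # (B, T, A)
        # Mask padded alternatives (assumes every slot has >= 1 real token).
        logits = logits.masked_fill(tokens.eq(0), float("-inf"))
        alpha = torch.softmax(logits, dim=-1)                 # attention over alternatives
        return (alpha.unsqueeze(-1) * emb).sum(dim=2)         # (B, T, D)


# Toy usage: 1 utterance, 3 slots, up to 2 alternatives per slot.
enc = ConfnetEncoder(vocab_size=100)
tokens = torch.tensor([[[5, 7], [9, 0], [12, 13]]])
scores = torch.tensor([[[0.8, 0.2], [1.0, 0.0], [0.6, 0.4]]])
print(enc(tokens, scores).shape)  # torch.Size([1, 3, 128])
```

The resulting (batch, slots, embedding) sequence can then be fed to a sequence model such as the utterance encoder inside GLAD in place of an ordinary word-embedding sequence.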
Following our proposed approach, we achieve performance comparable to using the best ASR hypothesis. However, using ASR scores to copy words from the confusion network has the limitation that only the best ASR hypothesis is ever chosen to be copied. The final sub-task addresses this issue with a novel hierarchical pointer-generator network which copies words from alternative hypotheses of the confusion network instead of being limited to the best ASR hypothesis. We analyse the benefits of using an ASR graph with the hierarchical pointer-generator network for the task of full-length natural answer generation from spoken questions. We experiment over increasingly noisy data to evaluate the system in noisy settings and show that it outperforms the ConfNet2Seq system on the ConfNet2Seq dataset and performs similarly to it over very noisy data. We also perform cross-dataset evaluation to assess the efficacy of our method on new domains.
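A small, hypothetical sketch of the hierarchical copying idea follows (function and variable names are illustrative, not taken from the thesis): slot-level attention is multiplied by within-slot attention over alternatives, so any alternative hypothesis, not just the best one, can receive copy probability, which is then mixed with the generation distribution through a p_gen gate as in a standard pointer-generator.

```python
import torch

def hierarchical_copy_distribution(p_vocab, p_gen, slot_att, alt_att, tokens, vocab_size):
    """Illustrative hierarchical copy step (not the thesis implementation).

    slot_att : (batch, slots)        attention over confnet time slots
    alt_att  : (batch, slots, alts)  attention over alternatives within each slot
    tokens   : (batch, slots, alts)  token ids in the confnet, 0 = padding
    """
    # Copy weight of each confnet token = slot attention * alternative attention.
    copy_att = slot_att.unsqueeze(-1) * alt_att                     # (B, T, A)
    # Scatter the copy weights into a vocabulary-sized distribution.
    copy_dist = torch.zeros(p_vocab.size(0), vocab_size)
    copy_dist.scatter_add_(1, tokens.view(tokens.size(0), -1),
                           copy_att.view(copy_att.size(0), -1))
    # Mix generation and copy distributions with the p_gen gate.
    return p_gen * p_vocab + (1.0 - p_gen) * copy_dist


# Toy check: 1 example, 2 slots, 2 alternatives, vocabulary of 10 words.
p_vocab = torch.full((1, 10), 0.1)                 # uniform generation distribution
tokens = torch.tensor([[[3, 4], [5, 0]]])          # padded alternative carries 0 weight
slot_att = torch.tensor([[0.7, 0.3]])
alt_att = torch.tensor([[[0.9, 0.1], [1.0, 0.0]]])
out = hierarchical_copy_distribution(p_vocab, torch.tensor(0.5), slot_att, alt_att, tokens, 10)
print(out.sum())  # ~1.0: a valid mixture of generate and copy probabilities
```

In this formulation the copy mass assigned to a second-best alternative (token 4 above) is non-zero, which is exactly what allows the decoder to recover words that the best ASR hypothesis misses.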