Suggu Sai Praneeth received his MS Dual Degree in Computational Linguistics (CL). His research work was supervised by Dr. Manish Shrivastava. Here’s a summary of Suggu Sai Praneeth’s thesis, “External Knowledge Sources for Answer Quality Prediction and Sub-Topic Detection”, as explained by him:
Community Question Answering forums have become a popular medium for users to solicit direct answers to specific questions from experts or other experienced users on a given topic. However, for a given question, users sometimes have to sift through a large number of low-quality or irrelevant answers to find the one that satisfies their information need. To alleviate this, the problem of Answer Quality Prediction aims to predict the quality of an answer posted in response to a forum question. Current Answer Quality Prediction systems learn models using either a) various hand-crafted features or b) deep learning techniques that automatically learn the required feature representations.

In the recent past, the web has become not only the predominant source of information but also a crucial player in the evolution of events. People reporting information about real-time events have made it a significant indicator for detecting and estimating the pulse of various communities. Currently, most real-time information about events is generated on platforms such as blogs, Twitter, and Facebook, as they have become major tools for sharing events, expressing opinions, and communicating with friends. Fresh latent sub-topics identified from Twitter feeds at any given point in time could be extremely useful in providing better topic-wise search results relevant to users’ informational needs. The task of sub-topic detection from tweets is challenging because tweets are noisy and ambiguous. These sub-topics cannot be identified manually, given the large scale of Twitter and the velocity with which new sub-topics emerge and become trending.

The last two decades have witnessed an exponential rise in web content from a plethora of genres, which has necessitated the use of genre-specific search engines. The diversity of the crawl is one of the pivotal aspects of a genre-specific search engine. To a large extent, it is governed by the initial set of seed URLs.
To ensure a diverse crawl, there must be diversity within the seed URLs. For selecting seed URLs, most existing approaches rely on manual effort. We automate this process by selectively picking URLs posted on Twitter and using them as seed URLs.

In this thesis, we propose a novel approach for Answer Quality Prediction called the “Deep Feature Fusion Network (DFFN)”, which combines the advantages of both hand-crafted features and deep learning based systems. Given a question-answer pair along with its metadata, the Deep Feature Fusion Network architecture independently a) learns features using a deep neural network and b) computes hand-crafted features using various external resources, and then combines them using a fully connected neural network trained to predict the final answer quality. The Deep Feature Fusion Network is an end-to-end differentiable model and is trained as a single system. We propose two different DFFN architectures, which vary mainly in the way they model the input question-answer pair: a) DFFN-CNN uses a Convolutional Neural Network and b) DFFN-BLNA uses a Bi-directional Long Short-Term Memory with Neural Attention. We achieve a Mean Average Precision (MAP) of 83.91, which is state-of-the-art performance on the standard benchmark datasets, and also outperform baseline approaches that employ either hand-crafted features or deep learning techniques alone.

To detect sub-topics from tweets, we generate a rich semantic representation of tweets using external knowledge bases and combine it with relevant similarity metrics to train a sub-topic detection classifier. We obtain an F-measure of 41.7. Comparisons with various state-of-the-art mechanisms for sub-topic detection on a standard benchmark dataset show that the proposed approach outperforms them by a significant margin.

Finally, we propose a graph-based algorithm to get a set of diverse seed URLs from Twitter. Each vertex in the graph is a URL coming from Twitter.
Two vertices are connected if there is some similarity between them, which is measured using the information in the corresponding tweets: tweet content, URL n-grams, and retweets. Our algorithm then picks URLs from the graph in such a way that the resulting set consists of diverse seed URLs. We experiment with the tourism genre and find that our approach is indeed able to capture a diversity of 63.4 within our seed set. We also propose several metrics for evaluating the diversity of the selected seed URLs.
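
The graph construction described above can be illustrated with a minimal Python sketch. The summary does not spell out the selection procedure, so everything specific here is an assumption made for illustration: the Jaccard measure, character trigrams as the "URL n-grams", the 0.5 threshold, the rule that any single signal can create an edge, and the greedy independent-set heuristic used to pick seeds. The function names, example URLs, and tweet data are hypothetical and are not taken from the thesis.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def char_ngrams(text, n=3):
    """Set of character n-grams of a string (applied to URLs here)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_graph(items, threshold=0.5):
    """Build an undirected similarity graph over URLs.

    `items` maps url -> {"text": tweet text, "retweeters": set of user ids}.
    Two URLs are linked when any one signal (tweet content, URL n-grams,
    shared retweeters) reaches `threshold`. Both the threshold and the
    'any signal' rule are illustrative assumptions.
    """
    graph = {url: set() for url in items}
    for u, v in combinations(items, 2):
        signals = (
            jaccard(set(items[u]["text"].lower().split()),
                    set(items[v]["text"].lower().split())),
            jaccard(char_ngrams(u), char_ngrams(v)),
            jaccard(items[u]["retweeters"], items[v]["retweeters"]),
        )
        if max(signals) >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def pick_diverse_seeds(graph):
    """Greedy independent set: visit low-degree URLs first and skip any URL
    adjacent to one already chosen, so no two seeds are linked as similar.
    This is a stand-in heuristic, not the algorithm from the thesis."""
    chosen, blocked = [], set()
    for url in sorted(graph, key=lambda u: (len(graph[u]), u)):
        if url not in blocked:
            chosen.append(url)
            blocked.add(url)
            blocked |= graph[url]
    return chosen


# Hypothetical tourism-genre tweets, keyed by the URL each tweet shares.
tweets = {
    "http://a.example/goa-beaches": {
        "text": "best beaches in goa", "retweeters": {1, 2}},
    "http://a.example/goa-beach-guide": {
        "text": "guide to the best beaches in goa", "retweeters": {2, 3}},
    "http://b.example/himalaya-trek": {
        "text": "trekking routes in the himalaya", "retweeters": {7}},
}

seeds = pick_diverse_seeds(build_graph(tweets))
print(seeds)
```

In this toy input the two Goa URLs end up connected, so only one of them can be selected alongside the unrelated trekking URL. A real system would need tuned similarity measures and the diversity metrics mentioned above to judge the resulting seed set.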