Kaveri Anuranjana received her MS Dual Degree in Computational Linguistics (CL). Her research work was supervised by Dr. Radhika Mamidi. Here is a summary of Kaveri Anuranjana's MS thesis, "Towards building question answering resources for Hindi", as explained by her:
With the information explosion, we have more resources available than ever. The English-language Wikipedia articles alone take up about 159.69 GB. As the resources we consume keep getting bigger and more complex, we need systems that can query them. Large databases were first queried with simple Information Retrieval techniques, such as extracting keywords and combining them with Boolean logic. Today, search engines accept queries as keywords, combinations of keywords, and even full questions. The retrieval side of these search engines has evolved as well: it started with simple document retrieval methods based on document frequencies and inverse document frequencies, and moved to complex neural ranking methods that can handle entire questions.
Question Answering is a major field in NLP. It combines query formulation, document retrieval and, in some cases, summarization of the retrieved documents. We now have complete end-to-end systems that, after being trained on large datasets, can fetch a suitable answer from a corpus. These datasets can be closed domain or open domain. However, such systems rely heavily on curated datasets, which take considerable effort and resources to build. For Indian languages, where reliable and substantial datasets are scarce, neural networks that rely purely on data cannot be used. Hence, introducing data-independent techniques or providing more data becomes important. We explore both of these aspects.
First, we present a Reading Comprehension task in Hindi, HindiRC. The dataset is divided into grades based on reading proficiency, and we run the baseline experiment on each grade separately; the results show an increasing level of difficulty. Using various linguistic cues and metrics, we further show that the grades reflect linguistic complexity.
To further address data scarcity, we propose a Hindi Question Generation methodology. The method is rule-based and relies on the karaka roles produced by a dependency parser. It requires no additional resources and can be used to increase the number of questions in Hindi datasets. We also show that the method tends to overgenerate questions, which further increases their number. With the dataset and the question generation method, we aim to provide more resources for Question Answering in Hindi.
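As a rough illustration of how karaka roles can drive rule-based question generation, here is a minimal Python sketch. It is not the thesis's actual rule set: the chunked sentence is annotated by hand rather than produced by a dependency parser, the mapping from karaka labels (`k1`, `k2`, `k7p`, `k7t`) to question words is an assumed simplification, and the names `QUESTION_WORDS`, `question_word` and `generate_questions` are hypothetical.

```python
from typing import List, Tuple

# A chunk is (chunk text, karaka label). In practice the labels would come
# from a dependency parser; here they are written by hand (assumption).
Chunk = Tuple[str, str]

# Assumed, simplified mapping from karaka roles to Hindi question words.
QUESTION_WORDS = {
    "k1": "कौन",    # karta (doer): who
    "k2": "क्या",   # karma (object): what
    "k7p": "कहाँ",  # place: where
    "k7t": "कब",    # time: when
}


def question_word(text: str, role: str) -> str:
    """Pick a question word for a chunk, merging the ergative marker if present."""
    if role == "k1" and text.endswith(" ने"):
        return "किसने"
    return QUESTION_WORDS[role]


def generate_questions(chunks: List[Chunk]) -> List[Tuple[str, str]]:
    """Return one (question, answer) pair per chunk with a mapped karaka role."""
    questions = []
    for i, (text, role) in enumerate(chunks):
        if role not in QUESTION_WORDS:
            continue  # e.g. the verb chunk is never replaced
        words = [
            question_word(t, r) if j == i else t
            for j, (t, r) in enumerate(chunks)
        ]
        questions.append((" ".join(words) + "?", text))
    return questions


# Hand-annotated chunks for "राम ने बगीचे में सेब खाया।" (Ram ate an apple in the garden.)
SENTENCE: List[Chunk] = [
    ("राम ने", "k1"),
    ("बगीचे में", "k7p"),
    ("सेब", "k2"),
    ("खाया", "root"),
]

QUESTIONS = generate_questions(SENTENCE)

if __name__ == "__main__":
    for question, answer in QUESTIONS:
        print(question, "->", answer)
```

In this sketch a single sentence yields one question for every chunk whose role has a mapping, which hints at why such a method can quickly multiply the number of questions in a dataset.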