Pratibha Rani received her doctorate in Computer Science and Engineering (CSE). Her research work was supervised by Dr. Vikram Pudi. Here’s a summary of Pratibha Rani’s thesis, Associative Context Classification for Natural Language Processing of Resource-poor Languages, as explained by her:
Only a few resource-rich languages, such as English and French, have been extensively analyzed for Natural Language Processing (NLP) tasks. Domain expertise is an essential requirement for studying the properties of a language in order to build linguistic resources, prepare annotated training data, select appropriate features, configure parameters and set exceptions in the systems that build models for solving the NLP tasks of that language. In addition, building data-driven models such as deep learning models requires a huge amount of annotated or untagged training data. Handling each of the remaining 6500+ “resource-poor” languages would therefore require the same intensive effort, expertise, expense, time and large training datasets. Hence, we need to build domain-independent and language-independent data-driven systems that work reasonably well with a small amount of training data and without requiring domain expertise. For this purpose, in this thesis, we propose generic concepts and data-driven methods which can be used to build systems for solving the NLP tasks of resource-poor languages. We propose a generic associative classification approach called associative context classification. We have developed it using our proposed context-based list concept, which groups the items of a specific context, together with other proposed parameters and concepts. In our research, we have demonstrated the application of this approach in developing solutions to a few representative NLP tasks. We have focused on developing semi-supervised methods that use small annotated datasets. Our methods perform well even with a small amount of training data, without using domain knowledge explicitly, and hence are especially suitable for resource-poor languages, which lack domain resources.
Our proposed approach is based on associative classification and on the one-sense-per-collocation hypothesis, which states that the sense of a word in a document is effectively determined by its context. Hence, our proposed approach can be applied to NLP tasks which depend on the collocation property. We have validated the utility of our proposed approach for NLP tasks of resource-poor languages by successfully applying it to develop generic methods for the Part-of-Speech tagging and Word Sense Disambiguation tasks. Part-of-speech (POS) tagging is an NLP classification task that assigns a POS tag or other lexical class marker to an item or to each item in a sentence. Here, we use the term “item” to represent all the words and tokens of a language. All the available POS taggers, including the state-of-the-art taggers, require training data and linguistic resources like dictionaries in large quantities. These taggers do not perform well for resource-poor languages. So, there is a need to develop generic semi-supervised tagging methods which use an untagged corpus and require few or no lexical resources. Most of the existing semi-supervised techniques require a large untagged corpus, while for many resource-poor languages, even obtaining a small untagged corpus is hard. Word sense disambiguation (WSD) is a classification task which involves determining the correct meaning of each word in a sentence or phrase based on the neighboring context items. Automated WSD methods use knowledge structures like WordNet and dictionaries, along with features and rules hand-crafted from the training data by domain and linguistic experts. This is a costly and time-consuming process that requires extensive domain resources and linguistic expertise. These requirements make it difficult to design a WSD algorithm for resource-poor languages, and hence domain-independent methods need to be developed for this task as well.
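As an illustration only, the general idea that the class of an item can be inferred from its neighboring context items can be sketched in a few lines of Python. This is not the associative context classification method developed in the thesis. The function names, the choice of the (previous item, next item) pair as the context, the boundary markers and the `UNK` fallback tag are all assumptions made for this toy example.

```python
from collections import Counter, defaultdict

def train_context_tagger(tagged_sentences):
    """Count tags per item and per (previous item, next item) context.

    Hypothetical helper for illustration; not the thesis algorithm.
    """
    context_tags = defaultdict(Counter)  # (prev, next) -> tag counts
    item_tags = defaultdict(Counter)     # item -> tag counts
    for sentence in tagged_sentences:
        # Pad with assumed boundary markers so every item has two neighbors.
        items = ["<s>"] + [item for item, _ in sentence] + ["</s>"]
        for i, (item, tag) in enumerate(sentence, start=1):
            context_tags[(items[i - 1], items[i + 1])][tag] += 1
            item_tags[item][tag] += 1
    return context_tags, item_tags

def tag_sentence(words, context_tags, item_tags, default_tag="UNK"):
    """Tag known items by their most frequent tag, unknown items by context."""
    items = ["<s>"] + list(words) + ["</s>"]
    tags = []
    for i, word in enumerate(words, start=1):
        context = (items[i - 1], items[i + 1])
        if word in item_tags:
            tags.append(item_tags[word].most_common(1)[0][0])
        elif context in context_tags:
            # Unseen item: borrow the tag most often seen in this context.
            tags.append(context_tags[context].most_common(1)[0][0])
        else:
            tags.append(default_tag)
    return tags

# Tiny made-up training set, for demonstration only.
training_data = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
context_tags, item_tags = train_context_tagger(training_data)
predicted = tag_sentence(["the", "bird", "sleeps"], context_tags, item_tags)
print(predicted)
```

In this sketch the unseen item "bird" is tagged from the context it shares with "cat" in the training data. A real system for a resource-poor language would need smoothing, wider contexts and a way to exploit untagged text, none of which is shown here.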
For our experiments, we have cleaned and prepared datasets for resource-rich English, resource-moderate Hindi and resource-poor Bengali, Marathi, Tamil, Telugu and Urdu for POS tagging, and datasets for English, Hindi and Marathi for WSD. As part of our research work, we have developed:
- Two semi-supervised POS tagging methods using the proposed associative context classification approach.
- Various ensemble POS tagging methods using the proposed associative context classification approach, Support Vector Machine, Conditional Random Field, Decision Tree and the Semi-supervised Condensed Nearest Neighbor method.
- One semi-supervised associative WSD model using the proposed associative context classification approach.
- One ensemble WSD model using the proposed associative context classification approach and Support Vector Machine.
- One negation rule finding algorithm that detects annotation errors in POS-tagged data.