[month] [year]

G V K Reddy – Tagging of resource-poor languages

G Vamshi Krishna Reddy received his MS Dual Degree in Computer Science and Engineering (CSE). His research work was supervised by Prof. Vikram Pudi. Here’s a summary of his research work on Semi-supervised ensemble approaches for parts-of-speech tagging of resource-poor languages:

Natural Language Processing (NLP) systems have attracted substantial attention in recent years for their potential commercial value. Part-of-speech (POS) tagging is a well-studied task and a fundamental step in many NLP tasks. Numerous supervised, unsupervised, and semi-supervised POS taggers exist for different languages, each with its own strengths. Ensemble methods are a good way to combine those strengths: they build multiple base models and combine their outputs, which usually produces more accurate results than any single model.
Our study investigates ensemble methods that exploit the complementary characteristics of part-of-speech taggers. We present an ensemble technique that enhances the performance of supervised taggers with the help of semi-supervised taggers. We propose a two-layer static ensemble with a decision tree classifier as the second layer. The outputs of the first-layer base POS taggers form the input feature set of the second-layer decision tree. One of the first-layer taggers is the semi-supervised context-based lists (CBLs) tagger, which uses rich contextual information, helps in tagging both seen and unseen words, and requires no domain knowledge; the supervised taggers, in turn, perform well on words present in the training data. The improved performance of the proposed ensemble over the base methods suggests that it combines the strengths of the base taggers.
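
The two-layer idea above can be illustrated with a minimal sketch. It is not the thesis implementation: the tag sequences are invented toy data, the three columns stand in for hypothetical first-layer taggers (two supervised taggers and a CBL tagger), and the second layer is a small ID3-style decision tree written from scratch so that the example has no dependencies.

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a list of tags."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def build_tree(rows, labels, features):
    """ID3-style tree over categorical features (here: base-tagger outputs).

    A leaf is a tag string; an internal node is (feature, branches, majority).
    """
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority

    def split_entropy(f):
        # Weighted entropy of the gold tags after splitting on feature f.
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[f], []).append(y)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

    best = min(features, key=split_entropy)
    rest = [f for f in features if f != best]
    branches = {}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], rest
        )
    return (best, branches, majority)

def predict(tree, row):
    """Walk the tree; fall back to the node's majority tag on unseen values."""
    while isinstance(tree, tuple):
        feature, branches, majority = tree
        if row[feature] not in branches:
            return majority
        tree = branches[row[feature]]
    return tree

# Toy held-out data (assumption): each row holds the tags predicted for one
# word by (supervised tagger A, supervised tagger B, CBL tagger).
rows = [
    ("NN", "NN", "NN"),
    ("VB", "NN", "VB"),
    ("NN", "VB", "VB"),
    ("JJ", "JJ", "NN"),
    ("VB", "VB", "VB"),
    ("NN", "JJ", "JJ"),
]
gold = ["NN", "VB", "VB", "JJ", "VB", "JJ"]

# Second layer: the decision tree learns how to combine the base outputs.
tree = build_tree(rows, gold, features=[0, 1, 2])
print(predict(tree, ("NN", "VB", "VB")))
```

In practice the second layer would be trained on base-tagger predictions for held-out annotated words, so that it learns which tagger to trust in which situation rather than memorising the first layer's training data.
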
The thesis then develops a dynamic classifier selection approach suited to POS tagging. In Dynamic Classifier Selection (DCS) techniques, the most appropriate set of classifiers is selected dynamically for each test sample, based on the performance of the base classifiers on words that occur in similar contexts. DCS methods assume that a classifier that performs well in the region of competence (the validation samples most similar to the test instance) is likely to perform well on the test instance itself. Word2Vec, a word embedding algorithm trained on unlabeled data, is used to pick these similar-context words from the validation set. On five languages, our proposed DCS approach outperforms the base POS taggers. These results suggest that the approach selects an appropriate classifier for a given test word.
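
The selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the two-dimensional vectors are invented stand-ins for Word2Vec embeddings, the tagger names `supervised` and `cbl` and the helper names `select_tagger` and `dcs_tag` are hypothetical, and competence is measured as plain accuracy on the k nearest validation words.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_tagger(word_vec, validation, k=3):
    """Pick the tagger with the most correct tags on the k nearest validation words.

    validation: list of (vector, gold_tag, {tagger_name: predicted_tag}).
    """
    neighbours = sorted(
        validation, key=lambda item: cosine(word_vec, item[0]), reverse=True
    )[:k]
    scores = {}
    for _, gold_tag, preds in neighbours:
        for name, tag in preds.items():
            scores[name] = scores.get(name, 0) + (tag == gold_tag)
    return max(scores, key=scores.get)

def dcs_tag(word_vec, test_preds, validation, k=3):
    """Return the tag proposed by the locally most competent tagger."""
    return test_preds[select_tagger(word_vec, validation, k)]

# Toy validation set (assumption): embeddings, gold tags, and base-tagger outputs.
validation = [
    ([1.0, 0.1], "NN", {"supervised": "NN", "cbl": "VB"}),
    ([0.9, 0.2], "NN", {"supervised": "NN", "cbl": "NN"}),
    ([0.8, 0.0], "NN", {"supervised": "NN", "cbl": "JJ"}),
    ([0.1, 1.0], "VB", {"supervised": "NN", "cbl": "VB"}),
    ([0.2, 0.9], "VB", {"supervised": "NN", "cbl": "VB"}),
    ([0.0, 0.8], "VB", {"supervised": "VB", "cbl": "VB"}),
]

# A test word whose embedding lies near the second group of validation words.
print(dcs_tag([0.0, 1.0], {"supervised": "NN", "cbl": "VB"}, validation))
```

With real data, the vectors would come from an embedding model trained on unlabeled text, and the neighbourhood size k would be a tunable parameter.
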
Overall, we propose ensemble methods for part-of-speech tagging of resource-poor languages. These methods work with less annotated data by exploiting unlabeled data, without relying on any domain knowledge. They can be an excellent way to integrate and leverage the benefits of various classifiers, and can help the large number (6500+) of “resource-poor languages” that do not have much annotated training data.