Aamir Farhan - Urdu Text Processing

Aamir Farhan received his Master of Science – Dual Degree in Computer Science and Engineering (CSE). His research work was supervised by Prof. Dipti M Sharma. Here’s a summary of his research work on Development and Enhancement of Tools and Resources for Urdu Text Processing:

The Urdu writing system is derived from the Perso-Arabic writing system and has therefore adopted orthographic and morphological characteristics similar to those of other Perso-Arabic languages. The first task in most NLP applications is Word Segmentation, which involves identifying the boundaries of words in written text. Identifying these boundaries accurately is crucial because all downstream NLP tasks depend on it. Urdu adopts a continuous writing style that has no explicit, unambiguous marker for word boundaries. Furthermore, the inherent non-joining attributes of certain Urdu characters create spaces within a word when it is written in digital format. Thus, Urdu has not only space omission but also space insertion issues, which make the word segmentation task challenging. We have studied and categorized the issues observed in the inconsistent usage of the space character in Urdu script, along with the orthographic and morphological reasons behind them. Another challenge in the computational processing of Urdu is the lack of benchmark resources and corpora for word boundary identification.
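As a rough illustration of why space usage becomes inconsistent, the sketch below checks whether a word ends in a letter that does not connect to the following letter. After such a letter, leaving out the space does not change how the text looks, which is one plausible source of space omission. The letter set `NON_JOINERS` and the helper name `space_is_visually_optional` are assumptions made for this example, not part of the thesis, and the list is not claimed to be exhaustive.

```python
# Assumed, non-exhaustive set of Urdu letters that do not join to the
# letter that follows them (hypothetical name, for illustration only).
NON_JOINERS = set("اآدڈذرڑزژوے")


def space_is_visually_optional(word: str) -> bool:
    """Return True if `word` ends in a non-joining letter.

    After such a letter the next word starts a fresh ligature anyway,
    so a missing space is hard to notice in rendered text.
    """
    return bool(word) and word[-1] in NON_JOINERS


if __name__ == "__main__":
    for w in ["اور", "کتاب"]:
        print(w, space_is_visually_optional(w))
```

Here "اور" ends in a non-joining letter while "کتاب" does not, so only the first is flagged.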

Leveraging the findings from the orthographic study of the Urdu writing system, we built a benchmark corpus for Urdu Word Segmentation through manual annotation, using white space as the word boundary and the Zero-Width Non-Joiner (ZWNJ) character as the sub-word boundary. A Conditional Random Field (CRF) based sequence model was then trained to predict a label for each character in a sequence of Urdu characters. Our model achieved state-of-the-art results with an F1 score of 0.98 for word boundary identification. Furthermore, we applied our word segmentation model to study the sociological phenomenon of diglossia in Urdu.
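The annotation convention above (white space for word boundaries, ZWNJ for sub-word boundaries) can be turned into character-level training labels along the following lines. This is a minimal sketch under stated assumptions: the label names `W`, `S` and `O`, the function names, and the context-window features are hypothetical and are not the feature set or label scheme used in the thesis.

```python
ZWNJ = "\u200c"  # Zero-Width Non-Joiner


def char_labels(annotated: str):
    """Convert annotated text into (character, label) pairs.

    Assumed label scheme (hypothetical):
      'W' = a word boundary follows this character
      'S' = a sub-word boundary (ZWNJ) follows this character
      'O' = no boundary follows
    """
    pairs = []
    for ch in annotated:
        if ch == " ":
            if pairs:
                pairs[-1] = (pairs[-1][0], "W")
        elif ch == ZWNJ:
            if pairs and pairs[-1][1] == "O":
                pairs[-1] = (pairs[-1][0], "S")
        else:
            pairs.append((ch, "O"))
    if pairs:
        # The end of the text is treated as a word boundary.
        pairs[-1] = (pairs[-1][0], "W")
    return pairs


def char_features(chars, i, window=2):
    """Build a simple feature dict for chars[i] from neighbouring characters."""
    feats = {"char": chars[i]}
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        feats[f"char[{off:+d}]"] = chars[j] if 0 <= j < len(chars) else "<PAD>"
    return feats


if __name__ == "__main__":
    pairs = char_labels("ab" + ZWNJ + "cd ef")
    chars = [c for c, _ in pairs]
    for i, (c, label) in enumerate(pairs):
        print(c, label, char_features(chars, i, window=1))
```

Feature dictionaries and label sequences of this shape are the kind of input a CRF toolkit typically consumes; at prediction time the spaces and ZWNJs are stripped and the model restores them from the predicted labels.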

As part of this thesis, we carried out a thorough study and analysis of Urdu's writing system, orthography, morphology and sociology. We identified and classified the writing rules and conventions that give rise to various text processing and segmentation challenges. The major problem solved in this research is Word Segmentation in Urdu, which involves identifying word boundaries in written Urdu while handling the fundamental issues that make this task both challenging and critical. The key contributions of this research work are:

A manually annotated benchmark corpus for the Urdu Word Segmentation task, which is the largest so far in terms of number of sentences.
A state-of-the-art Urdu word tokenization model with an optimally crafted feature set.
A first-of-its-kind handling and annotation of special grammatical constructions in Urdu, such as Izafa constructions.
A social study and analysis of the Urdu language which validates the existence of a diglossic situation for Urdu in South Asian countries.

April 2023
