April 2025

Jobin K V

Jobin K V, supervised by Prof. Anoop Namboodiri, received his doctorate in Electronics and Communication Engineering (ECE). Here is a summary of his research on Document Image Layout Segmentation and Applications:

A document conveys information not only through its textual content but also through physical entities or regions such as headings, paragraphs, figures, captions, tables, and backgrounds. To decipher a document, a human reader draws on a variety of additional cues, such as context, conventions, and knowledge of language, script, and location, together with a complex reasoning process. Millions of documents are created and distributed daily over the Internet and in printed media, and understanding, analyzing, sorting, and comparing such a massive collection in limited time is a tedious job for humans. Automatic document image understanding systems (DIUS) help carry out this task within a limited time. A DIUS typically consists of a document image layout segmentation module and information extraction modules. This thesis focuses on the challenging problems of document image layout segmentation across various types of documents, and on their applications.

In this thesis, we first analyse various document images using deep features, i.e., features extracted with a pretrained deep neural network. To study deep texture features, we propose a deep network architecture that independently learns texture patterns, discriminative patches, and shapes to solve various document image analysis tasks: document image classification, genre identification from book covers, scientific document figure classification, and script identification. The network learns global, texture, and discriminative features and combines them judiciously according to the nature of each problem. We compare the proposed approach with state-of-the-art techniques on multiple publicly available datasets, such as Book covers, RVL-CDIP, CVSI, and DocFigure. Experiments show that our approach obtains significantly better performance than the state of the art on all tasks.
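To give a feel for what a texture descriptor captures, here is a minimal NumPy sketch of a classic hand-crafted one, the local binary pattern (LBP) histogram. The thesis network learns its texture features end-to-end, so this is only an analogy to the kind of local pattern statistics involved, not the proposed method.

```python
import numpy as np

def lbp_histogram(gray):
    """256-bin local binary pattern histogram of a 2-D grayscale image.
    Each interior pixel is encoded by comparing it with its 8 neighbours
    (bit = 1 if neighbour >= centre, else 0)."""
    g = np.asarray(gray, dtype=np.float64)
    c = g[1:-1, 1:-1]                      # centre pixels
    # 8 neighbour offsets, clockwise from top-left
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.int64)
    for bit, (dy, dx) in enumerate(shifts):
        nb = g[1 + dy : g.shape[0] - 1 + dy, 1 + dx : g.shape[1] - 1 + dx]
        code |= (nb >= c).astype(np.int64) << bit
    hist = np.bincount(code.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()               # normalised descriptor

# A flat patch and a striped patch yield very different histograms,
# which is exactly the cue a texture-based classifier exploits.
flat = np.full((16, 16), 100)
stripes = np.tile(np.array([[0], [255]]), (8, 16))
```

On the flat patch every pixel's code is 255 (all neighbours equal the centre), so the histogram collapses to a single bin; the striped patch spreads its mass over several bins.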

Next, we focus on document image layout segmentation and propose solutions for several classes of document images: historical, scientific, and classroom slide images. We model historical document image segmentation as a pixel-labeling task, where each pixel is classified into one of a set of predefined labels such as text, comment, decoration, and background. The method first extracts deep features from the superpixels of the document image and then trains an SVM classifier on these features to segment the image. For scientific document images, we use a pyramid pooling module to extract the logical regions. In classroom slide images, the logical regions are distributed according to their location within the slide. To exploit this positional cue for slide image segmentation, we propose the Classroom Slide Segmentation Network (CSSN), whose design differs from most other semantic segmentation networks. We validate the performance of our segmentation architectures on publicly available benchmark datasets.
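The pyramid pooling idea mentioned above can be sketched in a few lines of NumPy: a feature map is average-pooled over grids of several sizes and the pooled values are concatenated, giving the classifier both coarse global context and finer local context. This is a simplified illustration of the general technique (in the style of PSPNet-like modules), not the exact module used in the thesis.

```python
import numpy as np

def pyramid_pool(feature_map, grid_sizes=(1, 2, 4)):
    """Average-pool a 2-D feature map over coarse-to-fine grids and
    concatenate the results into one multi-scale context vector."""
    f = np.asarray(feature_map, dtype=np.float64)
    h, w = f.shape
    pooled = []
    for g in grid_sizes:
        # split rows/cols into g roughly equal bins, average each cell
        ys = np.linspace(0, h, g + 1, dtype=int)
        xs = np.linspace(0, w, g + 1, dtype=int)
        for i in range(g):
            for j in range(g):
                pooled.append(f[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean())
    return np.array(pooled)    # length = sum(g*g for g in grid_sizes)

vec = pyramid_pool(np.arange(64.0).reshape(8, 8))
```

With grids of 1x1, 2x2, and 4x4 the output has 1 + 4 + 16 = 21 entries, the first of which is the global mean of the map.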

Next, we analyze the regions produced by document layout segmentation. Figures are among the most difficult regions for a DIUS to decipher, which makes document figure classification (DFC) an important stage of the system. Designing a DFC system requires well-defined figure categories and datasets, but existing figure-classification datasets are limited in both size and category coverage. We therefore introduce a scientific figure classification dataset named DocFigure.

The dataset consists of 33K annotated figures across 28 categories, drawn from scientific articles published over the last several years. Since manually annotating such a large number of figures is time-consuming and expensive, we design a web-based annotation tool that efficiently assigns category labels to many figures with minimal effort from human annotators. To benchmark the dataset on the classification task, we propose three baseline techniques using deep features, deep texture features, and their combination. Our analysis shows that combining deep and texture features is more effective for document figure classification than either feature type alone.
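One common way to combine two descriptors of different scales, and a plausible baseline for the fusion described above, is to L2-normalise each one separately before concatenating, so that neither feature type dominates purely by magnitude. This is a hedged sketch of that generic recipe; the exact fusion used in the thesis baselines may differ.

```python
import numpy as np

def fuse_features(deep_feat, texture_feat):
    """L2-normalise each descriptor separately, then concatenate,
    so neither feature type dominates purely by scale."""
    def l2(v):
        v = np.asarray(v, dtype=np.float64)
        n = np.linalg.norm(v)
        return v / n if n > 0 else v
    return np.concatenate([l2(deep_feat), l2(texture_feat)])

# Toy vectors standing in for a deep feature and a texture histogram.
fused = fuse_features([3.0, 4.0], [0.0, 5.0, 0.0])
```

The fused vector feeds a standard classifier (an SVM or a small fully connected head) exactly as a single descriptor would.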

Finally, we propose an application backed by the research in this thesis. Slide presentations are an effective and efficient tool for classroom communication, widely used by the teaching community. However, this teaching model can be challenging for blind and visually impaired (VI) students, who require personal human assistance to understand the presented slides. This shortcoming motivates us to design a Classroom Slide Narration System (CSNS) that generates audio descriptions corresponding to the slide content. We pose the problem as an image-to-markup-language generation task. First, we extract logical regions such as title, text, equation, figure, and table from the slide image using CSSN. We then extract the content of each region using four well-established modules: optical character recognition (OCR), figure classification, equation description, and table structure recognition. With this information, the CSNS helps VI students understand the slide content. Users gave better feedback on the output quality of the proposed CSNS than on existing systems such as Facebook's Automatic Alt-Text (AAT) and Tesseract.
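The narration pipeline described above can be outlined as a simple dispatcher: each segmented region is routed to the module that can read it, and the per-region outputs are joined in reading order into narration text. The stub functions and the region record format here are illustrative stand-ins for the real OCR, figure classification, equation description, and table recognition modules, not the thesis implementation.

```python
# Illustrative sketch of the slide-narration pipeline: route each
# segmented region to a content-extraction stub, then join the results
# in reading order into one narration string.

def read_text(region):      # stand-in for an OCR module
    return region["content"]

def read_figure(region):    # stand-in for a figure-classification module
    return f"A figure of type '{region['content']}'."

def read_equation(region):  # stand-in for an equation-description module
    return f"An equation: {region['content']}."

def read_table(region):     # stand-in for a table-structure recogniser
    return f"A table with {region['content']} cells."

DISPATCH = {
    "title": read_text,
    "text": read_text,
    "figure": read_figure,
    "equation": read_equation,
    "table": read_table,
}

def narrate(regions):
    """Turn segmented slide regions into one narration string.
    Regions are assumed to already be sorted in reading order."""
    return " ".join(DISPATCH[r["label"]](r) for r in regions)

slide = [
    {"label": "title", "content": "Gradient Descent"},
    {"label": "equation", "content": "x = x - lr * grad"},
    {"label": "figure", "content": "line plot"},
]
narration = narrate(slide)
```

In the real system the resulting string would be passed to a text-to-speech engine to produce the audio description.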
