Faculty and students of IIITH presented the following papers at the 16th International Conference on Document Analysis and Recognition (ICDAR 2021), held from 5 to 10 September 2021 in Lausanne, Switzerland:
- PALMIRA: A Deep Deformable Network for Instance Segmentation of Dense and Uneven Layouts in Handwritten Manuscripts – S P Sharan, Sowmya Aitha, Amandeep Kumar, Abhishek Trivedi, Aaron Augustine and Ravi Kiran Sarvadevabhatla.
Research work as explained by the authors: Handwritten documents are often characterized by dense and uneven layouts. Despite advances, standard deep network-based approaches for semantic layout segmentation are not robust to the complex deformations seen across semantic regions. This phenomenon is especially pronounced for the low-resource Indic palm-leaf manuscript domain. To address the issue, we first introduce Indiscapes2, a new large-scale diverse dataset of Indic manuscripts with semantic layout annotations. Indiscapes2 contains documents from four different historical collections and is 150% larger than its predecessor, Indiscapes. We also propose Palmira, a novel deep network for robust, deformation-aware instance segmentation of regions in handwritten manuscripts. We additionally report the Hausdorff distance and its variants as a boundary-aware performance measure. Our experiments demonstrate that Palmira provides robust layouts and outperforms strong baseline approaches and ablative variants. We also include qualitative results on Arabic, South-East Asian and Hebrew historical manuscripts to showcase the generalization capability of Palmira.
Link to project page: https://ihdia.iiit.ac.in/Palmira/
Link to full paper: https://arxiv.org/pdf/2108.09436
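The Hausdorff distance reported above measures the worst-case deviation between a predicted region boundary and its ground-truth annotation. A minimal NumPy sketch of the idea, on toy point sets (an illustration only, not the paper's evaluation code):

```python
import numpy as np

def directed_hausdorff(a, b):
    """Max over points in a of the distance to the nearest point in b."""
    # Pairwise Euclidean distances between the two point sets.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).max()

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two boundary point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Two toy polygon boundaries, given as Nx2 arrays of (x, y) corner points.
pred = np.array([[0, 0], [0, 1], [1, 1], [1, 0]], dtype=float)
gt   = np.array([[0, 0], [0, 1], [2, 1], [2, 0]], dtype=float)
print(hausdorff(pred, gt))  # 1.0: the worst-case boundary deviation
```

Unlike area-overlap measures such as IoU, this penalizes even a single stray stretch of boundary, which is why it is a useful boundary-aware complement.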
- An Attentive Deep Network with Fast Marching Distance Maps for Semi-automatic Layout Annotation [Oral presentation] – Abhishek Trivedi and Ravi Kiran Sarvadevabhatla.
Research work as explained by the authors: Precise boundary annotations of image regions can be crucial for downstream applications which rely on region-class semantics. Some document collections contain densely laid out, highly irregular and overlapping multi-class region instances with a large range of aspect ratios. Fully automatic boundary estimation approaches tend to be data-intensive, cannot handle variable-sized images, and produce sub-optimal results for such images. To address these issues, we propose BoundaryNet, a novel resizing-free approach for high-precision semi-automatic layout annotation. The variable-sized user-selected region of interest is first processed by an attention-guided skip network. The network optimization is guided via Fast Marching distance maps to obtain a good quality initial boundary estimate and an associated feature representation. These outputs are processed by a Residual Graph Convolution Network optimized using Hausdorff loss to obtain the final region boundary. Results on a challenging image manuscript dataset demonstrate that BoundaryNet outperforms strong baselines and produces high-quality semantic region boundaries. Qualitatively, our approach generalizes across multiple document image datasets containing different script systems and layouts, all without additional fine-tuning. We integrate BoundaryNet into a document annotation system and show that it provides high annotation throughput compared to manual and fully automatic alternatives.
Link to project page: https://ihdia.iiit.ac.in/BoundaryNet/
Link to full paper: https://arxiv.org/pdf/2108.09433
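The distance maps mentioned above assign each pixel inside a region its distance to the region boundary, and this per-pixel signal guides the network's optimization. As a rough stand-in for intuition, here is a plain BFS distance transform on a binary mask; the fast marching method used in the paper instead solves the eikonal equation and yields sub-pixel geodesic distances:

```python
from collections import deque
import numpy as np

def bfs_distance_map(mask):
    """4-connected BFS distance (in pixels) from the region boundary.

    A crude illustration of a boundary distance map; not the
    fast-marching computation used by BoundaryNet.
    """
    h, w = mask.shape
    dist = np.full((h, w), -1, dtype=int)
    q = deque()
    # Seed the queue with foreground pixels touching the background
    # (or the image edge): these form the region boundary, distance 0.
    for y in range(h):
        for x in range(w):
            if mask[y, x] and any(
                ny < 0 or ny >= h or nx < 0 or nx >= w or not mask[ny, nx]
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
            ):
                dist[y, x] = 0
                q.append((y, x))
    # Standard BFS flood fill toward the region interior.
    while q:
        y, x = q.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and dist[ny, nx] < 0:
                dist[ny, nx] = dist[y, x] + 1
                q.append((ny, nx))
    return dist

region = np.ones((5, 5), dtype=bool)
print(bfs_distance_map(region))  # centre pixel is 2 steps from the boundary
```

Supervising with such a map tells the network not just whether a pixel is inside the region but how far it is from the boundary, which sharpens the initial boundary estimate.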
- DocVisor: A Multi-purpose Web-based Interactive Visualizer for Document Image Analytics [Oral presentation] – Khadiravana Belagavi, Pranav Tadimeti, Ravi Kiran Sarvadevabhatla at 3rd ICDAR Workshop on Open Services and Tools for Document Analysis (ICDAR-OST).
Research work as explained by the authors: The performance for many document-based problems (OCR, Document Layout Segmentation, etc.) is typically studied in terms of a single aggregate performance measure (Intersection-Over-Union, Character Error Rate, etc.). While useful, such aggregation comes at the expense of instance-level analysis of predictions, which may shed better light on a particular approach’s biases and performance characteristics. To enable a systematic understanding of instance-level predictions, we introduce DocVisor – a web-based multi-purpose visualization tool for analyzing the data and predictions related to various document image understanding problems. DocVisor provides support for visualizing data sorted using custom-specified performance metrics and display styles. It also supports the visualization of intermediate outputs (e.g., attention maps, coarse predictions) of the processing pipelines. This paper describes the appealing features of DocVisor and showcases its multi-purpose nature and general utility. We illustrate DocVisor’s functionality for four popular document understanding tasks – document region layout segmentation, tabular data detection, weakly-supervised document region segmentation and optical character recognition. DocVisor is available as a documented public repository for use by the community.
Link to project page: https://github.com/ihdia/docvisor
Link to full paper: https://www.springerprofessional.de/docvisor-a-multi-purpose-web-based-interactive-visualizer-for-do/19631320
- MediTables: A New Dataset and Deep Network for Multi-category Table Localization in Medical Documents [Oral presentation] – Akshay Praveen Deshpande, Vaishnav Rao Potlapalli, Ravi Kiran Sarvadevabhatla – Conference: 14th IAPR International Workshop on Graphics Recognition (GREC-2021)
Research work as explained by the authors: Localizing structured layout components such as tables is an important task in document image analysis. Numerous layout datasets with document images from various domains exist. However, healthcare and medical documents represent a crucial domain that has not been included so far. To address this gap, we contribute MediTables, a new dataset of 200 diverse medical document images with multi-category table annotations. MediTables contains a wide range of medical document images with variety in capture quality, layouts, skew, occlusion and illumination. The dataset images include pathology, diagnostic and hospital-related reports. In addition to document diversity, the dataset includes implicitly structured tables that are typically not present in other datasets. We benchmark state-of-the-art table localization approaches on the MediTables dataset and introduce a custom-designed U-Net which exhibits robust performance while being drastically smaller in size compared to strong baselines. Our annotated dataset and models represent a useful first step towards the development of focused systems for medical document image analytics, a domain that mandates robust systems for reliable information retrieval.
Link to project page: https://github.com/atmacvit/meditables
Link to full paper: https://link.springer.com/chapter/10.1007/978-3-030-86198-8_9
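Table localization benchmarks of this kind are conventionally scored with Intersection-over-Union (IoU) between predicted and annotated table regions. A minimal sketch of the box-level metric (an illustration, not the paper's benchmark code):

```python
def box_iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Overlap rectangle, clamped to zero when the boxes are disjoint.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A predicted table box vs. its annotation: half of each box overlaps.
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.3333... (= 1/3)
```

A prediction typically counts as a correct localization when its IoU with an annotated table exceeds a fixed threshold such as 0.5.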
- IIIT-INDIC-HW-Words: A Dataset for Indic Handwritten Text Recognition [Poster presentation] – Santhoshini Gongidi and C V Jawahar.
Research work as explained by the authors: Handwritten text recognition (HTR) for Indian languages is not yet a well-studied problem. This is primarily due to the unavailability of large annotated datasets in the associated scripts. Existing datasets are small in size and use small lexicons. Such datasets are not sufficient to build robust solutions to HTR using modern machine learning techniques. In this work, we introduce a large-scale handwritten dataset for Indic scripts containing 868K handwritten instances written by 135 writers in 8 widely-used scripts. A comprehensive dataset of ten Indic scripts, referred to as IIIT-INDIC-HW-Words, is derived by combining the newly introduced dataset with the earlier datasets developed for Devanagari (IIIT-HW-DEV) and Telugu (IIIT-HW-TELUGU).
We further establish a high baseline for text recognition in eight Indic scripts. Our recognition scheme follows contemporary design principles from other recognition literature and yields competitive results on English. IIIT-INDIC-HW-Words, along with the recognizers, is available publicly. We further (i) study the reasons for changes in HTR performance across scripts and (ii) explore the utility of pre-training for Indic HTR. We hope our efforts will catalyze research and fuel applications related to handwritten document understanding in Indic scripts.
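HTR baselines like the ones above are typically evaluated with error rates derived from edit distance, such as the Character Error Rate (CER). A minimal sketch of that standard metric (illustrative only, not the paper's evaluation code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, via a rolling DP row."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = dp[0]          # dp[i-1][j-1] for the inner loop
        dp[0] = i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # delete from ref
                        dp[j - 1] + 1,      # insert into ref
                        prev + (r != h))    # substitute (free if equal)
            prev = cur
    return dp[-1]

def cer(ref, hyp):
    """Character Error Rate: edits needed, normalized by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(edit_distance("kitten", "sitting"))  # 3
print(cer("namaste", "namaste"))           # 0.0
```

The word-level analogue (Word Error Rate) applies the same distance over token sequences instead of characters.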
- Asking Questions on Handwritten Document Collections – Minesh Mathew, Lluis Gomez, Dimosthenis Karatzas and C V Jawahar (ICDAR-IJDAR Journal track).
Research work as explained by the authors: This work addresses the problem of Question Answering (QA) on handwritten document collections. Unlike typical QA and Visual Question Answering (VQA) formulations where the answer is a short text, we aim to locate a document snippet where the answer lies. The proposed approach works without recognizing the text in the documents. We argue that the recognition-free approach is suitable for handwritten documents and historical collections where robust text recognition is often difficult. At the same time, for human users, document image snippets containing answers act as a valid alternative to textual answers. The proposed approach uses an off-the-shelf deep embedding network which can project both textual words and word images into a common sub-space. This embedding bridges the textual and visual domains and helps us retrieve document snippets that potentially answer a question. We evaluate results of the proposed approach on two new datasets: (i) HW-SQuAD: a synthetic, handwritten document image counterpart of SQuAD1.0 dataset and (ii) BenthamQA: a smaller set of QA pairs defined on documents from the popular Bentham manuscripts collection. We also present a thorough analysis of the proposed recognition-free approach compared to a recognition-based approach which uses text recognized from the images using an OCR. Datasets presented in this work are available to download at docvqa.org.
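Once textual words and word images are projected into a common subspace, retrieving candidate answer snippets reduces to nearest-neighbour search over embeddings, for example by cosine similarity. A toy sketch under that assumption (the 4-d vectors below are made up for illustration; the paper uses an off-the-shelf deep embedding network):

```python
import numpy as np

def cosine_retrieve(query_vec, snippet_vecs, k=2):
    """Rank word-image snippet embeddings by cosine similarity to a query."""
    q = query_vec / np.linalg.norm(query_vec)
    s = snippet_vecs / np.linalg.norm(snippet_vecs, axis=1, keepdims=True)
    scores = s @ q                     # cosine similarity per snippet
    order = np.argsort(-scores)[:k]    # top-k indices, best first
    return order, scores[order]

# Toy embeddings: one query word vs. three word-image snippets.
query = np.array([1.0, 0.0, 0.0, 0.0])
snippets = np.array([
    [0.9, 0.1, 0.0, 0.0],   # near match
    [0.0, 1.0, 0.0, 0.0],   # orthogonal, unrelated
    [0.5, 0.5, 0.0, 0.0],   # partial match
])
idx, scores = cosine_retrieve(query, snippets)
print(idx)  # [0 2]: the near match ranks first
```

Because matching happens in embedding space, no text recognition is needed: the retrieved snippet image itself is returned as the answer region.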
- Towards Boosting the Accuracy of Non-Latin Scene Text Recognition – Sanjana Gunna, Rohit Saluja and C V Jawahar (ICDAR-ASAR 2021, 4th edition).
Research work as explained by the authors: Scene-text recognition is remarkably better in Latin languages than in non-Latin languages due to several factors like multiple fonts, simplistic vocabulary statistics, updated data generation tools, and writing systems. This paper examines the possible reasons for low accuracy by comparing English datasets with non-Latin languages. We compare various features like the size (width and height) of the word images and word length statistics. Over the last decade, generating synthetic datasets with powerful deep learning techniques has tremendously improved scene-text recognition. Several controlled experiments are performed on English by varying the number of (i) fonts used to create the synthetic data and (ii) created word images. We discover that these factors are critical for scene-text recognition systems. The English synthetic datasets utilize over 1400 fonts, while Arabic and other non-Latin datasets utilize fewer than 100 fonts for data generation. Since some of these languages are spoken across different regions, we garner additional fonts through a region-based search to improve the scene-text recognition models for Arabic and Devanagari. We improve the Word Recognition Rates (WRRs) on the Arabic MLT-17 and MLT-19 datasets by 24.54% and 2.32% compared to previous works or baselines. We achieve WRR gains of 7.88% and 3.72% for the IIIT-ILST and MLT-19 Devanagari datasets.
ICDAR is a premier international event for scientists and practitioners involved in document analysis and recognition, a field of growing importance in the current age of digital transition. In 2021, the 16th edition of this flagship conference was held for the first time in Switzerland.
Link to conference page: https://icdar2021.org/