[month] [year]

CVIP-2021

Faculty and students presented the following papers at the 6th IAPR International Conference on Computer Vision & Image Processing (CVIP 2021), held at IIT Ropar in hybrid mode from 3–5 December:

  • Evaluation of Detection and Segmentation Tasks on Driving Datasets – Deepak Kumar Singh, Ameet Rahane, Ajoy Mondal, Anbumani Subramanian and C V Jawahar. Research work as explained by the authors:

Object detection, semantic segmentation, and instance segmentation form the basis for many computer vision tasks in autonomous driving. The complexity of these tasks increases as we shift from object detection to instance segmentation. The state-of-the-art models are evaluated on standard datasets such as PASCAL VOC and MS COCO, which do not consider the dynamics of road scenes. Driving datasets such as Cityscapes and Berkeley Deep Drive (BDD) are captured in a structured environment with better road markings and fewer variations in the appearance of objects and background. However, the same does not hold for Indian roads. The Indian Driving Dataset (IDD) is captured in unstructured driving scenarios and is highly challenging for a model due to its diversity. This work presents a comprehensive evaluation of state-of-the-art models for object detection, semantic segmentation, and instance segmentation on road-scene datasets. We present our analyses and compare the quantitative and qualitative performance of these models on structured driving datasets (Cityscapes and BDD) and the unstructured driving dataset (IDD); understanding their behavior on these datasets helps in addressing various practical issues and in creating real-life applications.

Keywords: Object detection · Semantic segmentation · Instance segmentation

Link to full paper: https://cdn.iiit.ac.in/cdn/cvit.iiit.ac.in/images/ConferencePapers/2021/Evaluation-Detection.pdf 
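Comparisons of detection performance across datasets such as those above typically rest on the intersection-over-union (IoU) overlap criterion between predicted and ground-truth boxes, which underlies mean average precision. As an illustrative sketch (not code from the paper), a minimal IoU computation for axis-aligned boxes looks like:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

In standard benchmark protocols, a prediction usually counts as a true positive when its IoU with a ground-truth box exceeds a threshold such as 0.5.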

 

  • Towards Label-Free Few-Shot Learning: How Far Can We Go? – Aditya Bharti, Vineeth N B, and C V Jawahar presented the paper. Research work as explained by the authors:

Few-shot learners aim to recognize new categories given only a small number of training samples. The core challenge is to avoid overfitting to the limited data while ensuring good generalization to novel classes. Existing literature makes use of vast amounts of annotated data by simply shifting the label requirement from novel classes to base classes. Since data annotation is time-consuming and costly, reducing the label requirement even further is an important goal. To that end, our paper presents a more challenging few-shot setting with almost no class label access. By leveraging self-supervision to learn image representations and similarity for classification at test time, we achieve competitive baselines while using almost zero (0-5) class labels. Compared to existing state-of-the-art approaches which use 60,000 labels, this is a four orders of magnitude (10,000 times) difference. This work is a step towards developing few-shot learning methods that do not depend on annotated data. Our code is publicly released at https://github.com/adbugger/FewShot.

Keywords: Few Shot · Self-supervised · Deep Learning

Link to full paper:  https://cdn.iiit.ac.in/cdn/cvit.iiit.ac.in/images/ConferencePapers/2021/Label-Free-Few-Shot.pdf 
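Similarity-based classification at test time, as described above, can be sketched as a nearest-class-prototype rule over frozen self-supervised embeddings: average the few support features per class, then assign each query to the most cosine-similar class mean. This is an illustrative simplification, not the authors' exact method, and the feature arrays below are synthetic:

```python
import numpy as np

def nearest_centroid_predict(support_feats, support_labels, query_feats):
    """Classify queries by cosine similarity to per-class mean ("prototype") features."""
    support = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    queries = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    classes = np.unique(support_labels)
    protos = np.stack([support[support_labels == c].mean(axis=0) for c in classes])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    sims = queries @ protos.T  # cosine similarity, shape (num_queries, num_classes)
    return classes[np.argmax(sims, axis=1)]
```

Because the classifier is built from a handful of support examples at test time, no class labels are needed during representation learning, which is the setting the paper targets.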

 

  • Classroom Slide Narration System – Jobin K V, Ajoy Mondal, and C V Jawahar. Research work as explained by the authors:

Slide presentations are an effective and efficient tool used by the teaching community for classroom communication. However, this teaching model can be challenging for blind and visually impaired (VI) students. A VI student requires personal human assistance to understand the presented slide. This shortcoming motivates us to design a Classroom Slide Narration System (CSNS) that generates audio descriptions corresponding to the slide content. We pose this problem as an image-to-markup language generation task. The initial step is to extract logical regions such as title, text, equation, figure, and table from the slide image. In classroom slide images, the logical regions are distributed based on their location within the image. To utilize the location of the logical regions for slide image segmentation, we propose an architecture, the Classroom Slide Segmentation Network (CSSN). The unique attributes of this architecture differ from those of most other semantic segmentation networks. Publicly available benchmark datasets such as WiSe and SPaSe are used to validate the performance of our segmentation architecture. We obtained a 9.54% improvement in segmentation accuracy on the WiSe dataset. We extract content (information) from the slide using four well-established modules: optical character recognition (OCR), figure classification, equation description, and table structure recognition. With this information, we build the CSNS to help VI students understand the slide content. Users gave better feedback on the output quality of the proposed CSNS than on existing systems such as Facebook's Automatic Alt-Text (AAT) and Tesseract.

Keywords: Slide image segmentation · Logical regions · Location encoding · Classroom slide narration

 Link to full paper: https://cdn.iiit.ac.in/cdn/cvit.iiit.ac.in/images/ConferencePapers/2021/Classroom-Slide.pdf 
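After segmentation and content extraction, the per-region outputs have to be stitched into a single spoken description. A toy sketch of that final narration step follows; the region dictionary keys, labels, and the simple top-to-bottom, left-to-right reading order are all assumptions for illustration, not the CSNS implementation:

```python
def narrate_slide(regions):
    """Stitch extracted region content into one narration string,
    reading regions top-to-bottom, then left-to-right by box origin."""
    # Each region: {"label": str, "box": (x1, y1, x2, y2), "content": str}
    ordered = sorted(regions, key=lambda r: (r["box"][1], r["box"][0]))
    return ". ".join(f"{r['label']}: {r['content']}" for r in ordered)
```

In a full system, the resulting string would be passed to a text-to-speech engine to produce the audio description.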

 

  • Handwritten Text Retrieval from Unlabeled Collections – Santhoshini Gongidi and C V Jawahar presented the paper. Research work as explained by the authors:

Handwritten documents from domains such as cultural heritage, the judiciary, and modern journals remain largely unexplored even today. To a great extent, this is due to the lack of retrieval tools for such unlabeled document collections. This work considers such collections and presents a simple, robust retrieval framework for easy information access. We achieve retrieval on unlabeled novel collections through invariant features learned for handwritten text. These feature representations enable zero-shot retrieval for novel queries on unlabeled collections. We improve the framework further by supporting search via both text and exemplar queries. Four new collections written in English, Malayalam, and Bengali are used to evaluate our text retrieval framework. These collections comprise 2957 handwritten pages and over 300K words. We report promising results on these collections, despite the zero-shot constraint and the huge collection size. Our framework allows the addition of new collections without any need for specific finetuning or labeling. Finally, we also present a demonstration of the retrieval framework.

Keywords: Document retrieval · Keyword spotting · Zero-shot retrieval

Link to full paper: https://cdn.iiit.ac.in/cdn/cvit.iiit.ac.in/images/ConferencePapers/2021/Handwritten-Text.pdf
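Zero-shot retrieval of the kind described above typically reduces, at query time, to ranking precomputed word-image embeddings by similarity to the query embedding, whether the query came from a text encoder or an exemplar image. A minimal sketch under that assumption (the vectors here are synthetic stand-ins, not the paper's learned features):

```python
import numpy as np

def retrieve(query_vec, index_vecs, top_k=3):
    """Rank indexed word-image embeddings by cosine similarity to the query.

    Returns (indices, scores) of the top_k most similar entries."""
    q = query_vec / np.linalg.norm(query_vec)
    idx = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    sims = idx @ q                     # cosine similarity per indexed word image
    order = np.argsort(-sims)[:top_k]  # highest similarity first
    return order, sims[order]
```

Because only the index of embeddings is consulted, new collections can be added by embedding their pages once, with no labels or finetuning, which mirrors the property the abstract highlights.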