Faculty and students presented the following papers at the 12th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP 2021), hosted virtually by IIT Jodhpur from 19 to 22 December.

  • Intelligent Video Editing: Incorporating Modern Talking Face Generation Algorithms in a Video Editor – Anchit Gupta; Faizan Farooq Khan; Rudrabha Mukhopadhyay; Vinay Namboodiri, University of Bath and C V Jawahar. Research work as explained by the authors:

This paper proposes a video editor based on OpenShot with several state-of-the-art facial video editing algorithms as added functionalities. Our editor provides an easy-to-use interface to apply modern lip-syncing algorithms interactively. Apart from lip-syncing, the editor also uses audio and facial re-enactment to generate expressive talking faces. The manual control improves the overall experience of video editing without missing out on the benefits of modern synthetic video generation algorithms. This control enables us to lip-sync complex dubbed movie scenes, interviews, television shows, and other visual content. Furthermore, our editor provides features that automatically translate lectures, covering the spoken content, the lip-sync of the professor, and background content like slides. While doing so, we also tackle the critical aspect of synchronizing background content with the translated speech. We qualitatively evaluate the usefulness of the proposed editor by conducting human evaluations. Our evaluations show a clear improvement in the efficiency of human editors and an improved video generation quality.

Link to full paper: 



  • Translating Sign Language Videos to Talking Faces – Seshadri Majumder; Rudrabha Mukhopadhyay; Vinay Namboodiri, University of Bath and C V Jawahar. Research work as explained by the authors:

Communication with the deaf community relies profoundly on the interpretation of sign languages performed by the signers. In light of the recent breakthroughs in sign language translation, we propose a pipeline that we term “Translating Sign Language Videos to Talking Faces”. In this context, we improve the existing sign language translation systems by using POS tags to improve language modeling. We further extend the challenge to develop a system that can interpret a video from a signer to an avatar speaking in spoken languages. We focus on the translation systems that attempt to translate sign languages to text without glosses, an expensive annotation form. We critically analyze two state-of-the-art architectures and, based on their limitations, improve on them. We propose a two-stage approach that translates sign language into intermediate text, followed by a language model to get the final predictions. Quantitative evaluations on the challenging RWTH-PHOENIX-Weather 2014 T benchmark show that the translation accuracy of the texts generated by our translation model improves on state-of-the-art models by approximately 3 points. We then build a working text to talking face generation pipeline by bringing together multiple existing modules. The overall pipeline is capable of generating talking face videos with speech from sign language poses. Additional materials about this project, including the code and a demo video, can be found at https://seshadri-c.github.io/SLV2TF/

Link to full paper: 



  • NTU-X: An Enhanced Large-scale Dataset for Improving Pose-based Recognition of Subtle Human Actions – Neel Trivedi, Anirudh Thatipelli, Ravi Kiran Sarvadevabhatla. Research work as explained by the authors: 

The lack of fine-grained joints (facial joints, hand fingers) is a fundamental performance bottleneck for state-of-the-art skeleton action recognition models. Despite this bottleneck, the community’s efforts seem to be invested only in coming up with novel architectures. To specifically address this bottleneck, we introduce two new pose-based human action datasets: NTU60-X and NTU120-X. Our datasets extend the largest existing action recognition dataset, NTU-RGBD. In addition to the 25 body joints for each skeleton as in NTU-RGBD, the NTU60-X and NTU120-X datasets include finger and facial joints, enabling a richer skeleton representation. We appropriately modify the state-of-the-art approaches to enable training using the introduced datasets. Our results demonstrate the effectiveness of these NTU-X datasets in overcoming the aforementioned bottleneck and improving state-of-the-art performance, both overall and on the previously worst-performing action categories. Code and pretrained models can be found at https://github.com/skelemoa/ntu-x.

Project page: https://skeleton.iiit.ac.in/ntux

Paper pdf: https://arxiv.org/pdf/2101.11529

Code Repository: https://github.com/skelemoa/ntu-x
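As a rough illustration of the richer skeleton representation described in the NTU-X abstract above, the sketch below concatenates body, hand, and face joints into a single per-frame skeleton. Only the 25 body joints come from the abstract; the hand and face joint counts, the function names, and the (x, y, z) tuple layout are assumptions made for illustration and do not describe the dataset's actual file format.

```python
BODY_JOINTS = 25   # stated in the abstract for NTU-RGBD skeletons
HAND_JOINTS = 42   # assumption for illustration (both hands combined)
FACE_JOINTS = 51   # assumption for illustration


def make_joints(count, value=0.0):
    """Return `count` placeholder joints, each an (x, y, z) tuple."""
    return [(value, value, value) for _ in range(count)]


def extend_skeleton(body, hands, face):
    """Concatenate body, hand, and face joints into one skeleton."""
    if len(body) != BODY_JOINTS:
        raise ValueError(f"expected {BODY_JOINTS} body joints, got {len(body)}")
    return list(body) + list(hands) + list(face)


body = make_joints(BODY_JOINTS)
hands = make_joints(HAND_JOINTS)
face = make_joints(FACE_JOINTS)

skeleton = extend_skeleton(body, hands, face)
print(len(skeleton), "joints per frame")
```

A model written for body-only skeletons would need its input layer and joint adjacency adjusted to accept the longer joint list, which is the kind of modification the abstract mentions.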



  • Deformable Deep Networks for Instance Segmentation of Overlapping Multi Page Handwritten Documents – Aitha Sowmya, Bollampalli Sindhu, Sarvadevabhatla Ravi Kiran. Research work as explained by the authors:


Digitizing the physical artifact by scanning is often the first step in preserving historical handwritten manuscripts. To maximally utilize scanner surface area and minimize manual labor, multiple manuscripts are usually scanned together into a single image. Therefore, the first crucial task in manuscript content understanding is to ensure that each of the individual manuscripts within a scanned image can be isolated (segmented) on a per-instance basis. Existing deep network based approaches for manuscript layout understanding implicitly assume one or two manuscripts per image. Since this assumption may be routinely violated, there is a need for a precursor system which extracts individual manuscripts before downstream processing. Another challenge is the highly curved and deformed boundaries of manuscripts, which often cause them to overlap with each other. To tackle such challenges, we introduce a new document image dataset called IMMI (Indic Multi Manuscript Images). To improve the efficiency of the dataset and aid deep network training, we also propose an approach which generates synthetic images to augment sourced non-synthetic images. We conduct experiments using modified versions of existing document instance segmentation frameworks. The results demonstrate the efficacy of the new frameworks for the task. Overall, our contributions enable robust extraction of individual historical manuscript pages. This, in turn, could potentially enable better performance on downstream tasks such as region-level instance segmentation within handwritten manuscripts and optical character recognition.

Paper pdf: https://doi.org/10.1145/3490035.3490278

Code Repository: https://github.com/ihdia/im2pages


  • Automatic Quantification and Visualization of Street Trees – Arpit Bahety, Rohit Saluja, Ravi Kiran Sarvadevabhatla, Anbumani Subramanian, C V Jawahar. Research work as explained by the authors:

Assessing the number of street trees is essential for evaluating urban greenery and can help municipalities employ solutions to identify tree-starved streets. It can also help identify roads with different levels of deforestation and afforestation over time. Yet, there has been little work in the area of street tree quantification. This work first explains a data collection setup carefully designed for counting roadside trees. We then describe a unique annotation procedure aimed at robustly detecting and quantifying trees. We work on a dataset of around 1300 Indian road scenes annotated with over 2500 street trees. We additionally use five held-out videos covering 25 km of roads for counting trees. We finally propose a street tree detection, counting, and visualization framework using current object detectors and a novel yet simple counting algorithm made possible by the thoughtful collection setup. We find that the high-level visualizations based on the density of trees on the routes and Kernel Density Ranking (KDR) provide a quick, accurate, and inexpensive way to recognize tree-starved streets. We obtain a tree detection mAP of 83.74% on the test images, which is a 2.73% improvement over our baseline. We propose Tree Count Density Classification Accuracy (TCDCA) as an evaluation metric to measure tree density. We obtain a TCDCA of 96.77% on the test videos, a remarkable improvement of 22.58% over the baseline, and demonstrate that our counting module’s performance is close to human level.

Paper pdf: https://dl.acm.org/doi/10.1145/3490035.3490280, http://cdn.iiit.ac.in/cdn/cvit.iiit.ac.in/images/ConferencePapers/2021/Automatic_tree.pdf

Code Repository: https://github.com/iHubData-Mobility/public-tree-counting
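As a rough illustration of the density-based view described in the street trees abstract above, the sketch below computes trees per kilometre for each route and flags low-density routes. The route names, tree counts, lengths, and the 20 trees/km threshold are all hypothetical, and the paper's own counting algorithm, KDR visualization, and TCDCA metric are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Route:
    name: str          # hypothetical route label
    tree_count: int    # trees counted along the route
    length_km: float   # route length in kilometres


def tree_density(route: Route) -> float:
    """Return trees per kilometre for one route."""
    if route.length_km <= 0:
        raise ValueError("route length must be positive")
    return route.tree_count / route.length_km


def tree_starved(routes, threshold_per_km=20.0):
    """Return names of routes whose density falls below the threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return [r.name for r in routes if tree_density(r) < threshold_per_km]


routes = [
    Route("route_a", tree_count=150, length_km=5.0),  # 30 trees/km
    Route("route_b", tree_count=40, length_km=4.0),   # 10 trees/km
    Route("route_c", tree_count=90, length_km=6.0),   # 15 trees/km
]

for r in routes:
    print(f"{r.name}: {tree_density(r):.1f} trees/km")
print("tree-starved:", tree_starved(routes))
```

With these made-up numbers, route_b and route_c fall below the assumed threshold and would be the ones flagged for attention.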


  • Monocular Multi-Layer Layout Estimation for Warehouse Racks – Meher Shashwat Nigam, Avinash Prabhu, Tanvi Karandikar, Puru Gupta, Sai N Shankar, Ravi Kiran Sarvadevabhatla, K Madhava Krishna. Research work as explained by the authors:

Given a monocular color image of a warehouse rack, we aim to predict the bird’s-eye view layout for each shelf in the rack, which we term ‘multi-layer’ layout prediction. To this end, we present RackLay, a deep neural network for real-time shelf layout estimation from a single image. Unlike previous layout estimation methods, which provide a single layout for the dominant ground plane alone, RackLay estimates the top-view and front-view layout for each shelf in the considered rack populated with objects. RackLay’s architecture and its variants are versatile and estimate accurate layouts for diverse scenes characterized by a varying number of visible shelves in an image, a large range in shelf occupancy factor, and varied background clutter. Given the extreme paucity of datasets in this space and the difficulty involved in acquiring real data from warehouses, we additionally release a flexible synthetic dataset generation pipeline, WareSynth, which allows users to control the generation process and tailor the dataset to the application at hand. The ablations across architectural variants and comparison with strong prior baselines vindicate the efficacy of RackLay as an apt architecture for the novel problem of multi-layered layout estimation. We also show that fusing the top-view and front-view enables 3D reasoning applications such as metric free space estimation for the considered rack.

Paper pdf: https://arxiv.org/pdf/2103.09174

Code Repository: https://github.com/Avinash2468/RackLay
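To make the idea of metric free space estimation from the RackLay abstract above more concrete, here is a minimal sketch that assumes each shelf's top-view and front-view layouts are available as binary occupancy grids (0 = free, 1 = occupied) with a known cell size. The grids, the 0.5 m resolution, and all function names are hypothetical, and the volume estimate is a deliberate simplification rather than the fusion method used in the paper.

```python
def free_area_m2(top_view, cell_m):
    """Area of unoccupied cells in a top-view grid (0 = free, 1 = occupied)."""
    free_cells = sum(row.count(0) for row in top_view)
    return free_cells * cell_m * cell_m


def shelf_height_m(front_view, cell_m):
    """Shelf height taken as the number of rows in the front-view grid."""
    return len(front_view) * cell_m


def free_volume_m3(top_view, front_view, cell_m):
    """Volume above free floor cells; space above objects is ignored."""
    return free_area_m2(top_view, cell_m) * shelf_height_m(front_view, cell_m)


# Hypothetical 4 x 5 top view (depth x width) and 3 x 5 front view (height x width).
top_view = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
front_view = [
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
]
cell_m = 0.5  # assumed grid resolution: each cell spans 0.5 m

print(free_area_m2(top_view, cell_m))
print(free_volume_m3(top_view, front_view, cell_m))
```

The top view alone gives free floor area; the front view adds the vertical extent, which is why combining the two views allows a volume estimate in metric units.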


The Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP) is India’s premier conference in Computer Vision, Graphics, Image Processing and related fields. Started in 1998, it is a biennial international conference providing a forum for presentation of technological advances and research findings in these areas. ICVGIP 20-21, the 12th conference in this series, was organized by IIT Jodhpur in association with the Indian Unit for Pattern Recognition and Artificial Intelligence (IUPRAI), an affiliate of the International Association for Pattern Recognition (IAPR).
ICVGIP is dedicated to fostering the community of computer vision, graphics and image processing researchers and enthusiasts in India and abroad.