Aditya Arun

Aditya Arun, supervised by Prof. Jawahar C V, received his doctorate in Computer Science and Engineering (CSE). Here is a summary of his research work on Learning with Weak Supervision for Visual Scene Understanding:

In recent years, computer vision has made remarkable progress in understanding visual scenes, including tasks such as object detection, human pose estimation, semantic segmentation, and instance segmentation. These advancements are largely driven by high-capacity models, such as deep neural networks, trained in fully supervised settings with large-scale labeled data sets. However, reliance on extensive annotations poses scalability challenges because of the significant human effort required to create these data sets. Fine-grained annotations, such as pixel-level segmentation masks, keypoint coordinates for pose estimation, or detailed object instance boundaries, provide the high precision needed for many tasks but are extremely time-consuming and costly to produce. Coarse annotations, such as image-level labels or approximate scribbles, are much easier and faster to create but lack the granularity required for detailed model supervision.
To address these challenges, researchers have increasingly explored alternatives to traditional supervised learning, with weakly supervised learning emerging as a promising approach. It reduces annotation costs by training on coarse annotations, which are cheaper and less detailed, instead of the fine-grained annotations that match the outputs required at test time. Despite its potential, weakly supervised learning struggles to transfer information from coarse annotations to fine-grained predictions, and this process is often ambiguous and uncertain. Existing methods rely on various priors and heuristics to refine annotations, which are then used to train models for specific tasks. This involves managing uncertainty in latent variables during training and ensuring accurate predictions for both latent and output variables at test time.
This thesis introduces a unified approach to weakly supervised learning in computer vision, addressing tasks such as human pose estimation, object detection, and instance segmentation. Central to this work is a framework based on the dissimilarity coefficient loss, which models uncertainty in the location of objects and human poses using coarse annotations. The approach employs two key probability distributions:

Conditional Distribution: Captures output probabilities using coarse annotations (e.g., action labels, image-level labels, object counts), modeled with deep generative models for efficient sampling.
Prediction Distribution: Provides test-time predictions independent of coarse annotations.

The framework minimizes the difference between these distributions using the dissimilarity coefficient loss, facilitating the transfer of information from coarse annotations to accurate predictions. This methodology is consistently applied across diverse computer vision tasks, showcasing its versatility. The efficacy of the proposed framework is demonstrated across three progressively complex visual scene recognition tasks:

Human Pose Estimation: A probabilistic framework is introduced for learning human poses from still images using data sets with costly ground-truth pose annotations and inexpensive action labels. By aligning the conditional and prediction distributions through the dissimilarity coefficient loss, the method achieves significant improvements over baselines on the MPII and JHMDB data sets, effectively leveraging action information.
Object Detection: The framework addresses weakly supervised object detection (WSOD) by modeling uncertainty in object locations using a dissimilarity coefficient-based objective. Leveraging discrete generative models, it efficiently samples from annotation-aware conditional distributions and integrates coarse annotations, such as image-level labels, object counts, points, and scribbles. Spatial cluster regularization and curriculum learning further enhance performance, achieving state-of-the-art results on benchmarks like PASCAL VOC and MS COCO.
Instance Segmentation: The framework models uncertainty in pseudo-label generation using semantic class-aware, boundary-aware, and annotation-consistent higher-order terms. By aligning conditional and prediction distributions, it generates accurate pseudo-labels and trains Mask R-CNN-like architectures effectively. Experiments on the PASCAL VOC 2012 data set demonstrate state-of-the-art performance, with improved object boundary alignment and significant gains over baselines.
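
To make the idea of aligning the conditional and prediction distributions more concrete, here is a minimal sketch of a sample-based dissimilarity coefficient. The summary above does not state the formula, so the sketch assumes the form usually associated with Rao's dissimilarity coefficient: the expected loss between samples of the two distributions, minus a weighted combination of each distribution's own expected self-loss. The function names, the `gamma` weight, and the L2 task loss are illustrative assumptions and are not taken from the thesis.

```python
import numpy as np


def diversity(samples_a, samples_b, loss):
    """Monte Carlo estimate of the expected task loss between two sample sets.

    Every sample in samples_a is compared with every sample in samples_b.
    When both arguments are the same set, identical pairs (loss 0) are
    included, which is a simplification made for this sketch.
    """
    return float(np.mean([loss(a, b) for a in samples_a for b in samples_b]))


def dissimilarity_coefficient(pred_samples, cond_samples, loss, gamma=0.5):
    """Assumed form: DIV(p, c) - gamma * DIV(c, c) - (1 - gamma) * DIV(p, p).

    pred_samples: samples from the prediction distribution (hypothetical input)
    cond_samples: samples from the conditional distribution (hypothetical input)
    gamma:        hypothetical weight balancing the two self-diversity terms
    """
    cross = diversity(pred_samples, cond_samples, loss)
    self_cond = diversity(cond_samples, cond_samples, loss)
    self_pred = diversity(pred_samples, pred_samples, loss)
    return cross - gamma * self_cond - (1.0 - gamma) * self_pred


def l2_loss(a, b):
    """Illustrative task loss: Euclidean distance between two 2D locations."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))


# Toy usage: sampled object locations from each distribution.
prediction_samples = [[0.0, 0.0], [1.0, 0.0]]
conditional_samples = [[5.0, 5.0], [6.0, 5.0]]
score = dissimilarity_coefficient(prediction_samples, conditional_samples, l2_loss)
print(score)
```

Under these assumptions the score is zero when both sample sets are identical and grows as the two sets move apart, which is the quantity a training procedure of this kind would try to minimize. In practice the samples would come from neural networks and the loss would be task-specific (for example, a pose or detection loss), not the toy distance used here.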

June 2025