Gaurav Singh

Gaurav Singh, supervised by Prof. K Madhava Krishna, received his Master of Science – Dual Degree in Computer Science and Engineering (CSD). Here's a summary of his research work on Learning Improved 3D Representations for Robotic Perception:

Understanding 3D structure from sparse visual observations is a foundational capability for robotic perception. In many real-world settings, objects are observed in arbitrary poses, under occlusion, and from limited viewpoints. Inferring complete 3D shape and appearance from such inputs enables downstream tasks like robotic manipulation, scene understanding, and 3D generation. This thesis explores representation learning methods that operate over both explicit and implicit 3D representations to recover structured geometry from minimal input, enabling category-level generalization across unseen instances in challenging conditions.

To enable robust 3D inference in such scenarios, we first focus on completing object geometry from partial point cloud observations when objects appear in arbitrary orientations. Traditional shape completion methods assume inputs are aligned to a canonical frame, which limits their applicability in robotics, where objects are often in non-canonical poses. To address this, we propose SCARP, a method for Shape Completion in Arbitrary Poses. SCARP disentangles shape and pose using a multi-task formulation: it learns rotation-equivariant features for pose estimation and rotation-invariant features for shape reasoning, and jointly predicts a canonical full point cloud together with its 6D pose (a rough illustrative sketch of this design is shown below). Unlike multi-stage pipelines that depend on external canonicalization, SCARP is a single network that learns to complete shapes directly in the observed pose. We demonstrate SCARP's utility in robotic grasping: completing shapes before grasp prediction reduces invalid and colliding grasps by over 70%, and SCARP outperforms prior methods by 45% on completion metrics across several categories.

While point clouds offer a compact, geometric view of 3D structure, they are inherently sparse and do not capture appearance or high-frequency detail. To model richer, dense 3D structure directly from images, we turn to implicit neural fields, specifically Neural Radiance Fields (NeRFs). We introduce HyP-NeRF, a framework for learning category-level priors over NeRFs. Our key insight is to use a hypernetwork not just to parameterize a neural network (the NeRF MLP) but also to generate its learnable input encodings, capturing higher-frequency detail while reducing computation cost. Conditioned on an instance code, the hypernetwork predicts both the NeRF MLP weights and the parameters of a multi-resolution hash encoding (MRHE), enabling efficient synthesis of NeRFs for unseen objects within a class. To overcome rendering artifacts and improve fidelity, we propose a denoising and finetuning pipeline that applies a learned 2D denoiser followed by view-consistent NeRF optimization. This formulation supports diverse downstream tasks, including single-view NeRF generation, text-to-NeRF synthesis via CLIP, and retrieval from real-world images. Experiments on high-resolution (512×512) object datasets show that HyP-NeRF achieves state-of-the-art performance in generalization, compression, and instance retrieval while maintaining photorealistic quality and fast inference.
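To make the multi-task formulation behind SCARP a little more concrete, here is a minimal PyTorch-style sketch of a network with one shared point cloud encoder and two heads, one regressing a completed cloud in a canonical frame and one regressing a 6D pose. All class names, layer sizes, and the plain shared encoder below are illustrative assumptions made for readability; the actual SCARP architecture relies on dedicated rotation-equivariant and rotation-invariant feature layers rather than this generic encoder.

```python
# Illustrative sketch only: a multi-task completion head in the spirit of SCARP.
# Names, sizes, and the encoder are hypothetical, not the released implementation.
import torch
import torch.nn as nn


class MultiTaskCompletionNet(nn.Module):
    """Predicts a canonical full point cloud and a 6D pose from a partial cloud."""

    def __init__(self, feat_dim=256, num_out_points=2048):
        super().__init__()
        # Placeholder shared encoder; in practice this would be built from
        # rotation-equivariant layers so that pose features transform with the
        # input while shape features stay invariant.
        self.encoder = nn.Sequential(
            nn.Conv1d(3, 128, 1), nn.ReLU(),
            nn.Conv1d(128, feat_dim, 1),
        )
        # Shape head: decodes a global feature into a canonical completed cloud.
        self.shape_head = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, num_out_points * 3),
        )
        # Pose head: a 6D rotation representation plus a 3D translation.
        self.pose_head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 6 + 3),
        )
        self.num_out_points = num_out_points

    def forward(self, partial):                          # partial: (B, N, 3)
        feats = self.encoder(partial.transpose(1, 2))    # (B, C, N)
        global_feat = feats.max(dim=2).values            # (B, C)
        canonical = self.shape_head(global_feat).view(
            -1, self.num_out_points, 3)                  # completed shape, canonical frame
        pose = self.pose_head(global_feat)               # (B, 9): 6D rotation + translation
        return canonical, pose


# Toy usage: a batch of 2 partial clouds with 512 points each.
net = MultiTaskCompletionNet()
canonical, pose = net(torch.randn(2, 512, 3))
print(canonical.shape, pose.shape)  # torch.Size([2, 2048, 3]) torch.Size([2, 9])
```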
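The core hypernetwork idea behind HyP-NeRF, conditioning on an instance code to generate both the NeRF MLP weights and the entries of a learnable input encoding, can be sketched in a similarly simplified way. The single lookup table below is only a stand-in for a true multi-resolution hash encoding, and every name, size, and the toy hashing scheme are assumptions made for this sketch, not the paper's configuration.

```python
# Illustrative sketch only: a hypernetwork that emits NeRF parameters from an
# instance code. The "hash encoding" here is a single toy lookup table.
import torch
import torch.nn as nn


class NeRFHypernetwork(nn.Module):
    def __init__(self, code_dim=64, enc_entries=4096, enc_dim=2, hidden=64, out_dim=4):
        super().__init__()
        self.enc_entries, self.enc_dim = enc_entries, enc_dim
        self.hidden, self.out_dim = hidden, out_dim
        # One head per generated parameter group: encoding table and MLP weights.
        self.to_encoding = nn.Linear(code_dim, enc_entries * enc_dim)
        self.to_w1 = nn.Linear(code_dim, hidden * enc_dim)
        self.to_w2 = nn.Linear(code_dim, out_dim * hidden)

    def forward(self, instance_code):                    # instance_code: (code_dim,)
        table = self.to_encoding(instance_code).view(self.enc_entries, self.enc_dim)
        w1 = self.to_w1(instance_code).view(self.hidden, self.enc_dim)
        w2 = self.to_w2(instance_code).view(self.out_dim, self.hidden)
        return table, w1, w2


def query_generated_nerf(points, table, w1, w2, enc_entries):
    """Evaluate the generated NeRF at 3D points, returning density + RGB."""
    # Toy stand-in for a multi-resolution hash encoding: map each point to one
    # table row and look up its generated feature vector.
    idx = (points * 1e3).long().abs().sum(dim=-1) % enc_entries  # (P,)
    feat = table[idx]                                            # (P, enc_dim)
    hidden = torch.relu(feat @ w1.t())                           # (P, hidden)
    return hidden @ w2.t()                                       # (P, 4): sigma + rgb


# Toy usage: generate one object's NeRF parameters, then query 1024 points.
hyper = NeRFHypernetwork()
table, w1, w2 = hyper(torch.randn(64))
out = query_generated_nerf(torch.rand(1024, 3), table, w1, w2, hyper.enc_entries)
print(out.shape)  # torch.Size([1024, 4])
```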
In summary, this thesis presents methods for learning robust and generalizable 3D structure from sparse inputs. SCARP addresses shape completion in arbitrary poses through rotation-aware modeling of partial point clouds, while HyP-NeRF develops scalable priors over neural fields for instance-conditioned NeRF generation. Together, these approaches offer complementary pathways toward unified, learnable 3D perception in robotic environments. Moreover, the representation learning ideas and methods presented in this work are designed to scale and can be integrated into future 3D foundation models, which require sufficient inductive biases to compensate for the limited availability of training data.

June 2025