
Madhavaram Vivek Vardhan

Madhavaram Vivek Vardhan, supervised by Dr. Charu Sharma, received his Master of Science in Computer Science and Engineering (CSE). Here’s a summary of his research work on "Towards a Training Free Approach for 3D Scene Editing":

Imagine an interior designer working on a new project who plans to add a few objects to a home to enhance its look. He makes all the purchases, only to find that the objects do not fit the designed environment. Should he still use them? Instead, he could check their suitability beforehand by placing them in the environment virtually, and buy them only if he is satisfied. Or consider a scenario in a game where a player is surrounded by enemies approaching from all directions. The player must strategically place obstacles to hold off the enemies, but has no assistance in this task. Can the player survive under these conditions? Can he perform these actions through voice control or text instructions alone? Motivated by these ideas, in this thesis we devised a method to modify a scene based on a text prompt.

Text-driven diffusion models have shown remarkable capabilities in editing images: substituting objects in a scene, inserting new instances, deleting unwanted things, and changing textures. In images, these edits execute quickly and need little guidance. When editing 3D scenes, however, existing works mostly rely on training a Neural Radiance Field (NeRF). Recent NeRF editing methods perform edit operations by deploying 2D diffusion models and projecting the resulting edits into 3D space. They require strong positional priors alongside the text prompt to identify the edit location; in the absence of this guidance, edits drift away from the target location and differ from view to view. When the changes are inconsistent across views, 3D reconstruction becomes inaccurate, changing the overall aesthetics of the scene. These methods operate only on small 3D scenes and are specialized to a particular scene: they require training for each specific edit, and since each training cycle takes a substantial amount of time, they cannot be exploited for real-time edits. These limitations make the NeRF approach a bottleneck for real-time editing.

To address these limitations, we propose a novel method, FreeEdit, which makes edits in a training-free manner using mesh representations as a substitute for NeRF. Training-free methods are now possible because of advances in foundation models, and we leverage these models to build a training-free alternative with solutions for insertion, replacement, and deletion. We consider insertion, replacement, and deletion the basic building blocks for performing intricate edits, which are certain combinations of these operations. Given a text prompt and a 3D scene, our model identifies which object should be inserted, replaced, or deleted, as well as the location where the edit should be performed. We also introduce, as part of FreeEdit, a novel algorithm to find the optimal location on the grounding object for placement. We evaluate our model by comparing it with baseline models on a wide range of scenes using quantitative and qualitative metrics, and we showcase the merits of our method with respect to others. We hope this work motivates future research on text-driven 3D scene editing using mesh representations. As this is an initial step towards a training-free approach for scene edits, adding a few more complex components could enhance the user experience and drive the research hereafter.
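To make the decomposition into insertion, replacement, and deletion concrete, here is a minimal Python sketch of such an edit loop. It is an illustration under stated assumptions, not FreeEdit's actual implementation: the names `EditCommand`, `apply_edit`, and `find_placement` are hypothetical, objects are reduced to point sets, and the foundation models that parse the prompt and segment the scene are abstracted away.

```python
# Hypothetical sketch of a training-free edit loop: three primitive
# operations (insert / replace / delete) plus a clearance-based search
# for a placement spot on the grounding object. Not FreeEdit's code.
from dataclasses import dataclass
import numpy as np

@dataclass
class EditCommand:
    op: str       # "insert" | "replace" | "delete"
    target: str   # object named in the prompt, e.g. "vase"
    anchor: str   # grounding object for insertion/replacement, e.g. "table"

def find_placement(anchor_pts: np.ndarray, scene_pts: np.ndarray,
                   n_candidates: int = 64, seed: int = 0) -> np.ndarray:
    """Pick a point on the anchor's top surface that maximizes clearance
    from the rest of the scene -- a simple stand-in for the placement
    algorithm, which the summary describes only at a high level."""
    rng = np.random.default_rng(seed)
    # Restrict to the top 10% of the anchor by height (z-axis up).
    top = anchor_pts[anchor_pts[:, 2] > np.percentile(anchor_pts[:, 2], 90)]
    idx = rng.choice(len(top), size=min(n_candidates, len(top)), replace=False)
    candidates = top[idx]
    # Clearance = distance from each candidate to its nearest scene point.
    d = np.linalg.norm(candidates[:, None, :] - scene_pts[None, :, :], axis=-1)
    return candidates[d.min(axis=1).argmax()]

def apply_edit(scene: dict, cmd: EditCommand,
               new_points: np.ndarray = None) -> dict:
    """Dispatch the three primitive operations that serve as building
    blocks for more intricate edits."""
    if cmd.op == "delete":
        scene.pop(cmd.target, None)
        return scene
    if cmd.op == "replace":
        scene.pop(cmd.target, None)   # remove the old instance first
    anchor = scene[cmd.anchor]
    others = [p for k, p in scene.items() if k != cmd.anchor]
    obstacles = np.concatenate(others) if others else anchor
    spot = find_placement(anchor, obstacles)
    # Center the new object and rest its base on the chosen spot.
    placed = new_points - new_points.mean(axis=0)
    placed[:, 2] -= placed[:, 2].min()
    scene[cmd.target] = placed + spot
    return scene
```

As a usage example, `apply_edit(scene, EditCommand("insert", "vase", "table"), vase_points)` would drop a vase point set onto the emptiest region of the table's top surface. The real system would instead operate on full meshes and use foundation models to resolve the target and anchor from the text prompt.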

February 2025