Darshana S

Darshana S, supervised by Dr. Vineet Gandhi, received her Master of Science in Computer Science and Engineering (CSE). Here’s a summary of her research work, "Towards understanding Compositionality in Vision Language Models":

Human intelligence relies on compositional generalization: the ability to interpret novel situations by flexibly combining familiar concepts and relational structures. This thesis investigates compositionality in vision-language models (VLMs), focusing on their ability to understand and generalize across visual (image and video) and linguistic inputs. In the first part, we introduce VELOCITI, a benchmark for evaluating compositional understanding in video-language models through a suite of entailment tasks. Unlike prior compositionality benchmarks, which are constrained to single-agent videos, VELOCITI captures the complexity of real-world videos involving multiple agents and dynamic interactions. It assesses how well models recognize and bind agents, actions, and temporal events using both text-inspired and in-video counterfactual negations. In the second part, we probe the internal activations of VLMs to understand how concepts in an image are bound to their attributes and to their references in text. Extending the Binding ID mechanism from language models, we demonstrate that VLMs construct binding ID vectors in the activations of both image tokens and their textual references, enabling in-context concept association. Together, these contributions advance our understanding of compositional reasoning in VLMs and offer tools for probing their capabilities.

June 2025