Mohd Hozaifa Khan, supervised by Dr. Santosh Ravi Kiran, received his Master of Science in Computer Science and Engineering (CSE). Here is a summary of his research work on Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback:
Human communication is inherently multimodal, asynchronous, and goal-driven, relying on a fluid interplay of speech, gesture, and sketching to establish shared understanding. While Artificial Intelligence has made substantial progress in static vision and turn-based dialogue, the challenge of modeling continuous, real-time collaboration remains largely unaddressed. This thesis introduces SKETCHTOPIA, a comprehensive framework designed to benchmark and investigate these complex dynamics.

First, we present the SKETCHTOPIA dataset, a large-scale corpus of over 20,000 collaborative Pictionary-style sessions. Unlike prior resources, this dataset captures the complete temporal evolution of interaction, including 263,000 vector strokes, 56,000 timestamped guesses, and 19,400 iconic feedback events, thereby enabling the study of communication as a continuous process rather than a discrete exchange. To quantify this process, we introduce a novel suite of evaluation metrics, including the Feedback Responsiveness Score (FRS) and Multimodal Action Timing Similarity (MATS), which measure the “collaborative rhythm” and responsiveness of an agent, moving beyond simple task success rates.

Second, we propose and implement foundational AI agents capable of operating within this asynchronous environment. We introduce a novel hierarchical architecture centered on the ACTIONDECIDER module, which decouples low-latency state monitoring from high-level action execution, allowing agents to break free from rigid turn-taking and engage in proactive, event-driven interaction. We develop DRAWBOT, a generative agent that uses conditioned diffusion models for incremental sketching, and GUESSBOT, a retrieval-based agent that emulates human interpretation strategies.

Finally, we conduct an empirical evaluation comparing these agents against human baselines and state-of-the-art Vision-Language Models (VLMs). The results highlight role-specific differences in agent performance: while GUESSBOT achieves win rates close to those observed in human–human sessions (81.7%), DRAWBOT struggles to communicate abstract concepts, achieving a 30% success rate on adjectives. We further observe that general-purpose VLMs face practical limitations in this setting due to their turn-taking nature, inference latency, and verbosity. Overall, this thesis establishes an experimental foundation, a dataset, and analytical tools for studying dynamic, feedback-rich collaboration, enabling more systematic investigation of asynchronous multimodal interaction in human–AI systems.
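To give a flavour of what a timing-aware metric like FRS measures, here is a purely hypothetical sketch of a responsiveness score that credits an agent for acting soon after feedback events. The function name, exponential-decay form, and parameter tau are illustrative assumptions, not the definitions used in the thesis.

```python
import math

# Hypothetical illustration only: scores how quickly an agent acts after
# each feedback event, with exponentially decaying credit. The actual FRS
# defined in the thesis may be computed differently.
def responsiveness_score(feedback_times, action_times, tau=2.0):
    """Return a value in [0, 1]; 1.0 means every feedback gets an instant reaction."""
    credits = []
    for t_fb in feedback_times:
        later = [t for t in action_times if t >= t_fb]
        if not later:
            credits.append(0.0)          # feedback that never drew a reaction
            continue
        latency = min(later) - t_fb      # seconds until the next agent action
        credits.append(math.exp(-latency / tau))
    return sum(credits) / len(credits) if credits else 0.0

# Toy usage: feedback at t=3s and t=10s; the agent acts at t=3.5s and t=14s.
print(responsiveness_score([3.0, 10.0], [3.5, 14.0]))  # high credit, then low
```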
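The decoupling attributed to the ACTIONDECIDER module can be pictured as two concurrent loops: a fast monitor that turns the evolving session state into events, and a decision loop that reacts to those events rather than waiting for a turn. The following minimal sketch assumes a simulated event stream; GameEvent, monitor, and decider are illustrative names, not the thesis's code.

```python
import asyncio
import random
from dataclasses import dataclass

# Minimal stand-in for the decoupled monitor/decider pattern described above.
@dataclass
class GameEvent:
    kind: str        # e.g. "stroke", "guess", "iconic_feedback"
    payload: str

async def monitor(queue: asyncio.Queue) -> None:
    """Low-latency loop: watch the shared session state and emit events."""
    for i in range(5):                               # simulated event stream
        await asyncio.sleep(random.uniform(0.05, 0.2))
        kind = random.choice(["stroke", "guess", "iconic_feedback"])
        await queue.put(GameEvent(kind, f"event-{i}"))
    await queue.put(None)                            # sentinel: session over

async def decider(queue: asyncio.Queue) -> None:
    """High-level loop: act as soon as an event arrives, without turn-taking."""
    while (event := await queue.get()) is not None:
        if event.kind == "iconic_feedback":
            print(f"revise the drawing in response to {event.payload}")
        elif event.kind == "guess":
            print(f"confirm or reject {event.payload}")
        else:
            print(f"continue the current stroke plan ({event.payload})")

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(monitor(queue), decider(queue))  # run both loops concurrently

asyncio.run(main())
```

Because the monitor never blocks on the decider, the agent can notice a new guess or iconic feedback while a drawing action is still in flight, which is the event-driven behaviour the abstract contrasts with rigid turn-taking.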
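Similarly, a retrieval-based guesser in the spirit of GUESSBOT can be pictured as embedding the partial sketch and voting among its nearest labelled neighbours. The sketch below assumes precomputed embeddings and cosine similarity; the thesis's actual retrieval pipeline and interpretation strategies are not reproduced here.

```python
import numpy as np

# Illustrative retrieval step for a GuessBot-like agent: compare the current
# canvas embedding against a labelled gallery and take a majority vote.
# The embeddings, gallery, and k below are placeholder assumptions.
def guess(canvas_emb: np.ndarray, gallery_embs: np.ndarray,
          gallery_labels: list[str], k: int = 5) -> str:
    """Return the majority label among the k most similar reference sketches."""
    a = canvas_emb / np.linalg.norm(canvas_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ a                                  # cosine similarities
    top_k = np.argsort(-sims)[:k]                 # indices of nearest sketches
    votes = [gallery_labels[i] for i in top_k]
    return max(set(votes), key=votes.count)

# Toy usage with 3-D embeddings and a three-sketch gallery.
gallery = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]])
labels = ["cat", "cat", "tree"]
print(guess(np.array([0.95, 0.05, 0.0]), gallery, labels, k=2))  # -> cat
```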
January 2026

