
Kalakonda Sai Shashank

Kalakonda Sai Shashank, supervised by Dr. Santosh Ravi Kiran, received his Master of Science – Dual Degree in Computer Science and Engineering (CSE). Here's a summary of his research work on "Advancing Motion with LLMs: Leveraging Large Language Models for Enhanced Text-Conditioned Motion Generation and Retrieval":

In the field of artificial intelligence, the generation of human-like motion from natural language descriptions has garnered increasing attention across several research domains. Computer vision focuses on understanding and replicating visual cues for motion, while computer graphics aims to create and edit visually realistic animations. Multimedia research explores the intersection of data modalities such as text, motion, and image to enhance user experiences. Robotics and human-computer interaction are further areas where language-driven motion systems improve the autonomy and responsiveness of machines, enabling more efficient and meaningful human-robot interaction.

Despite this interest, existing approaches still face considerable difficulties, particularly when generating motions from unseen or novel text descriptions. These models often fail to capture the intricate, low-level motion nuances that go beyond basic action labels. The limitation stems from a reliance on brief, simplistic textual descriptions, which cannot convey the complex, fine-grained characteristics of human motion; the result is less diverse and less realistic output. Consequently, the generated motions frequently lack the subtlety and depth required for dynamic, context-specific applications. This thesis introduces two key contributions to overcome these limitations and advance text-conditioned human motion generation.

First, we present Action-GPT, a framework that enhances text-based action generation models by incorporating Large Language Models (LLMs). Motion-capture datasets tend to provide action descriptions that are brief and minimalistic, often failing to convey the full complexity of human movement, and such sparse descriptions limit a model's ability to generate diverse and nuanced motion sequences. Action-GPT uses LLMs to produce richer, more detailed descriptions of actions that capture finer aspects of movement. This improves the alignment between the text and motion spaces, allowing models to generate more precise and contextually accurate motion sequences. The framework is designed to work with both stochastic models (e.g., VAE-based) and deterministic models, offering flexibility across different motion generation architectures. Experimental results show that Action-GPT not only improves the realism and diversity of synthesized motions but also excels in zero-shot generation, handling previously unseen text descriptions effectively.
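To make the description-enrichment idea concrete, here is a minimal sketch in Python. The prompt wording, the `llm` and `encode_text` callables, and the averaging of sentence embeddings are illustrative assumptions rather than the exact design used in Action-GPT.

"""Illustrative sketch of LLM-based action-description enrichment.

Assumptions (not taken from the thesis): the `llm` and `encode_text`
callables, the prompt wording, and the embedding-averaging step.
"""
from typing import Callable, List
import numpy as np


def enrich_action_label(label: str, llm: Callable[[str], str], n: int = 4) -> List[str]:
    """Ask an LLM for several detailed, body-part-aware descriptions of a brief action label."""
    prompt = (
        f"Describe in detail how a person performs the action '{label}', "
        "mentioning the movement of the arms, legs and torso in one or two sentences."
    )
    # Query the LLM several times to obtain diverse paraphrases of the same action.
    return [llm(prompt) for _ in range(n)]


def text_condition(label: str,
                   llm: Callable[[str], str],
                   encode_text: Callable[[str], np.ndarray],
                   n: int = 4) -> np.ndarray:
    """Build one conditioning vector by averaging embeddings of the enriched descriptions."""
    descriptions = enrich_action_label(label, llm, n)
    embeddings = np.stack([encode_text(d) for d in descriptions])
    return embeddings.mean(axis=0)  # averaged embedding passed to the motion generator


if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs without external services or models.
    dummy_llm = lambda prompt: "The person bends the knees, swings the arms back and jumps forward."
    dummy_encoder = lambda text: np.random.default_rng(abs(hash(text)) % 2**32).standard_normal(512)

    z = text_condition("jump", dummy_llm, dummy_encoder)
    print(z.shape)  # (512,) conditioning vector for a text-to-motion model

The key design point illustrated here is that a single terse label ("jump") is expanded into several detailed descriptions before encoding, so the text embedding carries more of the low-level movement information the generator needs.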
Second, we introduce MoRAG, a retrieval-augmented generation strategy designed to improve the performance of motion diffusion models. MoRAG adopts a multi-part fusion retrieval mechanism that improves the generalization of motion retrieval across a wide range of language inputs, addressing the limitations of current retrieval methods that struggle with unseen or atypical descriptions. By incorporating low-level, part-specific motion details into the retrieval process (sketched at the end of this summary), MoRAG constructs more accurate and varied motion sequences. The retrieval step is further refined by prompting LLMs to handle spelling errors, rephrasing, and ambiguous language, ensuring that the retrieved motions are contextually relevant and diverse. These retrieved motion samples are then used as additional knowledge within the motion generation pipeline, strengthening the system's ability to generate complex, realistic motions from diverse textual inputs. This retrieval-augmented approach increases both the robustness and the generalization capacity of motion generation models, making them more adaptable to complex and unseen scenarios.

Together, these contributions represent a substantial advance in text-conditioned human motion generation. By enriching action descriptions and improving the motion retrieval strategy, this work enhances the ability of models to generate diverse, realistic motions from natural language inputs, particularly in zero-shot settings and when handling detailed, complex descriptions.
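As referenced above, the following is a minimal sketch of the multi-part retrieval-and-fusion idea. The database layout, the cosine-similarity search, and the frame-wise concatenation of part sequences are simplified assumptions for illustration, not the exact MoRAG pipeline; in the actual system the fused sequences would serve as additional conditioning for a motion diffusion model, and an LLM would first rewrite the raw query (fixing spelling, rephrasing) before retrieval.

"""Illustrative sketch of multi-part motion retrieval and fusion.

Assumptions (not taken from the thesis): the database layout, the cosine
similarity search, and the way part sequences are fused into one motion.
"""
from typing import Callable, Dict, List, Tuple
import numpy as np

PARTS = ("torso", "hands", "legs")


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def retrieve_per_part(query: str,
                      encode_text: Callable[[str], np.ndarray],
                      databases: Dict[str, List[Tuple[np.ndarray, np.ndarray]]]
                      ) -> Dict[str, np.ndarray]:
    """For each body part, return the stored motion whose text embedding best matches the query."""
    q = encode_text(query)
    best = {}
    for part in PARTS:
        _, motion = max(databases[part], key=lambda item: cosine(q, item[0]))
        best[part] = motion
    return best


def fuse_parts(part_motions: Dict[str, np.ndarray]) -> np.ndarray:
    """Concatenate per-part joint features frame by frame into one full-body sequence."""
    frames = min(m.shape[0] for m in part_motions.values())
    return np.concatenate([part_motions[p][:frames] for p in PARTS], axis=1)


if __name__ == "__main__":
    # Toy databases: each entry pairs a text embedding with a (frames x features) motion array.
    rng = np.random.default_rng(0)
    enc = lambda text: rng.standard_normal(64)
    dbs = {p: [(rng.standard_normal(64), rng.standard_normal((40, 22))) for _ in range(5)] for p in PARTS}
    fused = fuse_parts(retrieve_per_part("a person walks in a circle", enc, dbs))
    print(fused.shape)  # (40, 66): frames x concatenated per-part features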

February 2025