
Recognition at EMNLP conference

At the 30th Annual Conference on Empirical Methods in Natural Language Processing (EMNLP 2025), held in Suzhou, China on 7 November, a team of IIIT Hyderabad researchers received the Senior Area Chair (SAC) Highlights recognition for their paper "Aligning Text/Speech Representations from Multimodal Models with MEG Brain Activity During Listening". The conference was organized by the Association for Computational Linguistics (ACL).

The authors of this work are Padakanti Srijith (IIIT Hyderabad); Khushbu Pahwa (Rice University, USA); Radhika Mamidi and Bapi Raju Surampudi (IIIT Hyderabad); Manish Gupta (Microsoft, India); and Subba Reddy Oota (Technische Universität Berlin, Germany).

Here is a summary of the paper, as explained by the authors:

Although speech language models are expected to align well with brain language processing during speech comprehension, recent studies have found that they fail to capture brain-relevant semantics beyond low-level features. Surprisingly, text-based language models exhibit stronger alignment with brain language regions, as they better capture brain-relevant semantics. However, no prior work has examined how effectively text and speech representations from multimodal models align with the brain. This raises several key questions: Can speech embeddings from such multimodal models capture brain-relevant semantics through cross-modal interactions? Which modality takes advantage of this synergistic multimodal understanding to improve alignment with brain language processing? Can text/speech representations from such multimodal models outperform those of unimodal models? To address these questions, we systematically analyze multiple multimodal models, extracting both text- and speech-based representations and assessing their alignment with MEG brain recordings collected during naturalistic story listening. We find that text embeddings from both multimodal and unimodal models significantly outperform speech embeddings from these models. Specifically, multimodal text embeddings exhibit a peak around 200 ms, with heightened activity during this period, suggesting that they benefit from the speech embeddings. However, speech embeddings from these multimodal models show alignment similar to that of their unimodal counterparts, suggesting that they gain no meaningful semantic benefit from the text modality. These results highlight an asymmetry in cross-modal knowledge transfer, where the text modality benefits from speech information, but not vice versa.
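
For readers curious about how such brain alignment is typically measured, below is a minimal, hypothetical Python sketch of a standard encoding-model analysis: stimulus embeddings are mapped to MEG sensor responses with ridge regression, and alignment is scored by Pearson correlation on held-out data. The variable names, data shapes, and random inputs are purely illustrative assumptions, not the authors' pipeline; their actual implementation is available at the repository linked below.

# Minimal illustrative sketch (not the authors' code) of a brain-encoding analysis:
# predict MEG sensor activity from stimulus embeddings with ridge regression
# and score alignment by Pearson correlation on held-out data.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: one row per stimulus time point.
# X: embeddings from a (multimodal) text or speech model; Y: MEG sensor responses.
n_samples, n_features, n_sensors = 1000, 768, 208
X = rng.standard_normal((n_samples, n_features))
Y = rng.standard_normal((n_samples, n_sensors))

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

# Cross-validated ridge regression maps embeddings to each MEG sensor.
model = RidgeCV(alphas=np.logspace(-2, 4, 7))
model.fit(X_tr, Y_tr)
Y_pred = model.predict(X_te)

# Alignment score: per-sensor Pearson correlation between observed and predicted MEG.
def pearson_per_column(a, b):
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    return (a * b).sum(axis=0) / (np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0))

scores = pearson_per_column(Y_te, Y_pred)
print(f"mean alignment (Pearson r) across {n_sensors} sensors: {scores.mean():.3f}")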

The code has been made publicly available at: https://github.com/srijith9862/MEG_multimodal

November 2025