[month] [year]

ICMPC 2025

Faculty and students participated in the International Conference on Music Perception and Cognition (ICMPC 2025), held in São Paulo, Brazil, from 21–25 July.

Dr. Vinoo Alluri gave a keynote talk on Bridging Expression and Experience: Situated Music Cognition through Digital Footprints 

 

  • Pratyaksh Gautam made a presentation on Auditory CNN Analysis: What Do Layers Encode? Summary of the research work as explained by the authors Pratyaksh Gautam, Makarand Tapaswi and Vinoo Alluri:

Background 

Despite showing remarkable performance on tasks across modalities, deep neural networks remain opaque, with limited insight into how they internally organize information. Learned hierarchies have been suggested in deep learning models for vision (Zhou et al., 2015) and language (Tenney et al., 2019). Bregman (1990) describes the process of auditory scene analysis (ASA), in which the auditory system abstracts information based on perceptual properties such as pitch and timbre, among others, to construct auditory streams that are then integrated to form higher-order percepts such as musical genre. This implies that the representation of audio passes through a series of hierarchical abstractions. While hierarchical representations are indicated in human auditory processing (Kell et al., 2018), such a hierarchy has not been thoroughly demonstrated in deep learning models for audio. 

Aims 

We study how Convolutional Neural Networks (CNNs) represent the hierarchy of audio tasks, hypothesizing that representations at earlier layers of a CNN perform better at lower-level tasks, while those at later layers perform better at higher-level tasks. 

Methods 

On the basis of Bregman’s ASA model, we choose tasks from the domains of speech and music that are arranged in a hierarchy. We select three hierarchical musical tasks – note identification, instrument classification, and genre classification – representative of low-, mid-, and high-level tasks. We use class-balanced subsets of NSynth (Engel et al., 2017) with 1800 instances across 12 classes, Medley-solos-DB (Lostanlen et al., 2019) with 965 instances across 7 classes, and GTZAN (Tzanetakis et al., 2001) with 1000 instances across 10 classes for the respective tasks. We inspect three models – VGGish (Hershey et al., 2017), CLAP (Elizalde et al., 2023), and MobileNetV3 (Schmid et al., 2023). We use k-Nearest Neighbour classifiers with class-balanced five-fold cross-validation to assess each model’s accuracy on each task, with intermediate representations extracted at six equally spaced convolutional blocks and the first fully-connected layer. We repeat this for three hierarchically related speech tasks – consonant classification, keyword recognition, and speaker count estimation – using similarly class-balanced subsets of PCVC (Malekzadeh et al., 2020) with 1794 instances across 23 classes, Speech Commands (Warden, 2018) with 1750 instances across 35 classes, and LibriCount (Stöter et al., 2018) with 1100 instances across 11 classes, respectively. 
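
As a rough illustration of this layer-probing setup, the Python sketch below fits a k-Nearest Neighbour classifier with stratified (class-balanced) five-fold cross-validation on the representations extracted from each layer. The variable names, the choice of k = 5, and the scikit-learn pipeline are assumptions for illustration, not the authors' actual code.

# Illustrative sketch only (assumed variable names; not the authors' code).
# `embeddings` maps a layer name to an (n_samples, n_features) array of
# intermediate representations extracted from a pretrained audio CNN, and
# `labels` holds the task labels (notes, instruments, or genres).
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def probe_layers(embeddings: dict, labels: np.ndarray) -> dict:
    """Mean five-fold kNN accuracy for each layer's representations."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # class-balanced folds
    scores = {}
    for layer, feats in embeddings.items():
        clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
        scores[layer] = cross_val_score(clf, feats, labels, cv=cv).mean()
    return scores

Comparing the resulting per-layer accuracies across the three musical (and three speech) tasks would surface the kind of low-to-high-level trend reported in the Results.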

Results 

Our results show that the CNN’s early layers learned low-level tasks and later layers learned high-level tasks, even without explicit training on them. On CLAP, for note identification, the first layer’s accuracy (47%) is greater than that of later layers (<30%). For instrument classification, we see peak accuracy at the last layer (97%), with a marked jump at the fourth layer (71%) compared to the third layer (55%). Genre classification accuracy at the last layer (75%) is better than at earlier layers (<66%). These trends hold across all models, and we see similar trends for the speech tasks. 

Discussion 

The results support our hypothesis, providing strong evidence that CNNs implicitly learn a hierarchy in sound. This mirrors the human brain’s hierarchical encoding, and whether other deep learning architectures and training methodologies encode similar hierarchies deserves further investigation. We also see that low-level tasks are implicitly learnt.

  • Jatin Agarwala presented a poster on Empathy and Music Preferences: Exploring Valence-Arousal Patterns, and Sequential Listening Behaviors in Naturalistic Settings 

Summary of the research work as explained by the authors Jatin Agarwala, Jonna Vuoskoski (University of Oslo), and Vinoo Alluri:

Background and Aims 

Trait empathy, measured by the Interpersonal Reactivity Index (IRI; Davis, 1980), influences musical preferences and emotional responses to music. Individuals high on the IRI subscales Empathic Concern (EC) and Fantasy Seeking (FS) are theorised to prefer sad music (characterised by negative valence and low arousal) as it elicits feelings of sympathy and increases listeners’ absorption, respectively (Huron & Vuoskoski, 2020). We examine these hypotheses in a naturalistic setting (music streaming platforms). We also explore dynamic music listening patterns by examining song sequencing. 

Methods 

Data from 290 Indian university students (age M = 20.35, SD = 2.08; 211 males) included IRI scores and one year of Spotify listening history. Tracks played for less than 15 seconds were removed, and users with fewer than 500 listening events were excluded. Each event logged track name, artist, duration, and timestamp. The final dataset comprised 1.7M events, 136K tracks, and 86K hours of listening. Valence and arousal (the latter approximated by Spotify’s energy feature) were extracted for each track and correlated with EC and FS scores. 

Participants were divided into groups based on median splits of EC and FS scores. We performed a quadrant-based analysis by dividing music into four quadrants: positive valence–high arousal (Q1), negative valence–high arousal (Q2), negative valence–low arousal (Q3), and positive valence–low arousal (Q4). Quadrant Prevalence Scores (QPS), the percentage of songs in a participant’s listening history belonging to each quadrant, were calculated per participant. Finally, transition probabilities (TP) between quadrants were calculated based on sequential track plays. Permutation tests assessed group differences in QPS and TP. 
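
A minimal Python sketch of how the QPS and transition-probability measures could be computed per participant is given below; the 0.5 split point for the quadrants and the column names are illustrative assumptions rather than the authors' exact pipeline.

# Illustrative sketch only. Assumes a per-user, time-ordered DataFrame with
# `valence` and `energy` (arousal proxy) columns in [0, 1]; the 0.5 split
# point is an assumption for illustration.
import pandas as pd

def assign_quadrant(valence: float, arousal: float) -> str:
    if valence >= 0.5:
        return "Q1" if arousal >= 0.5 else "Q4"
    return "Q2" if arousal >= 0.5 else "Q3"

def quadrant_prevalence(events: pd.DataFrame) -> pd.Series:
    """QPS: percentage of a user's plays falling in each quadrant."""
    quads = events.apply(lambda r: assign_quadrant(r.valence, r.energy), axis=1)
    order = ["Q1", "Q2", "Q3", "Q4"]
    return quads.value_counts(normalize=True).reindex(order, fill_value=0) * 100

def transition_probabilities(events: pd.DataFrame) -> pd.DataFrame:
    """Row-normalised probabilities of moving from one quadrant to the next."""
    quads = events.apply(lambda r: assign_quadrant(r.valence, r.energy), axis=1)
    pairs = pd.DataFrame({"src": quads.values[:-1], "dst": quads.values[1:]})
    counts = pd.crosstab(pairs["src"], pairs["dst"])
    return counts.div(counts.sum(axis=1), axis=0)

Permutation tests would then compare these per-participant values between the high and low EC/FS groups.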

Results 

Significant negative correlations were observed between EC and the average valence (r = -0.15, p = 0.011) and arousal values (r = -0.17, p = 0.003) of the tracks played by a user. Overall, each group had the highest QPS in Q1, followed by Q2. Significant between-group differences were observed only in Q3. High EC (p = 0.001), high FS (p = 0.024), and high Perspective Taking (PT, another IRI subscale; p = 0.001) were associated with high QPS in Q3. High-EC individuals showed higher transition probabilities from Q1 to Q3 and within Q3 (p < .05), while low-EC individuals more often transitioned into Q1 from Q2, Q3, and Q4. No significant transition differences were observed for FS. PT differences emerged in transitions into Q3 from Q1, Q2, and Q4 (p < .05). 

Discussion 

This study extends music and empathy research to a naturalistic setting. The findings support our hypotheses: individuals with greater empathic concern and fantasy seem to listen to sad music more than their low-empathy counterparts. While music from Q1 is most frequently consumed, the increased chances that highly empathic individuals transition to Q3 might reflect their need for deeper emotional engagement through music. Despite the ecological validity of using real-world Spotify data, limitations include Spotify’s coarse emotion metrics and the lack of control over contextual factors like time of day or social setting. However, long-term trends may help mitigate transient effects and capture stable preferences. Future research could explore these factors using experience sampling (Larson & Csikszentmihalyi, 2014) and wearable sensors to capture real-time emotional responses to music. 

  • Sriharsha Medicherla presented a poster on Evening-Worsening Effect and Music Choices in Depressive Individuals. Summary of the research work as explained by the authors Sriharsha M S S, Atharva Gogate, Jatin Agarwala and Vinoo Alluri:

Background 

Heggli et al. (2021) noted that circadian rhythms and individual traits influence diurnal patterns in music preference. Depressive individuals often experience more negative moods in the evening, known as the evening-worsening effect (Rusting & Larsen, 1998). As music is often used for emotion-focused coping during such mood states (Stewart et al., 2019), it remains unclear how depressive individuals’ music consumption changes in the evening. 

Aims 

This study explores the evening-worsening effect in music listening, examining if depressive individuals prefer low-valence, low-arousal music in the evening. 

Methods 

A survey of 290 Indian participants (age M = 20.35 years, SD = 2.08; 211 males) collected Kessler Psychological Distress Scale (K10) scores and Healthy-Unhealthy Music Scale (HUMS) scores along with Spotify listening histories over one year. Additional measures such as trait empathy and life satisfaction were collected but not analysed. Participants were classified as No-Risk (K10 < 20, n = 114, 82 males) or At-Risk (K10 ≥ 29, n = 81, 52 males) of depression (Andrews & Slade, 2001). Spotify valence and energy features for each participant’s top 100-250 songs over the 4-8 weeks preceding the survey were analysed alongside timestamps. K-Means clustering divided the day into early hours, morning, afternoon, and evening, and group differences in hourly valence and energy averages were assessed using two-way ANOVA and t-tests. 
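
One plausible way to set up the hourly analysis is sketched below in Python; the column names, the choice to cluster 24 hourly valence/energy averages, and the use of Welch's t-test are assumptions for illustration and may differ from the authors' exact pipeline.

# Illustrative sketch only; the authors' exact features and pipeline may differ.
import pandas as pd
from scipy import stats
from sklearn.cluster import KMeans

def hourly_profile(events: pd.DataFrame) -> pd.DataFrame:
    """Mean valence and energy per hour of day (0-23) from timestamped plays."""
    hours = pd.to_datetime(events["timestamp"]).dt.hour
    return events.assign(hour=hours).groupby("hour")[["valence", "energy"]].mean()

def time_of_day_bins(profile: pd.DataFrame, k: int = 4) -> pd.Series:
    """Cluster the 24 hourly profiles into k time-of-day bins with K-Means."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    return pd.Series(km.fit_predict(profile), index=profile.index, name="bin")

def group_difference(at_risk_hourly: pd.Series, no_risk_hourly: pd.Series):
    """Welch's t-test on hourly averages between the At-Risk and No-Risk groups."""
    return stats.ttest_ind(at_risk_hourly, no_risk_hourly, equal_var=False)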

Results 

The At-Risk group had significantly higher HUMS Unhealthy scores than the No-Risk group (t = 8.15, p < 0.001). Significant differences in valence and energy between At-Risk and No-Risk were found only in the afternoon (valence: t = -9.47, energy: t = -6.53) and evening (valence: t = -2.98, energy: t = -11.56) (all p ≤ 0.05). Two-way ANOVA showed significant main effects of time and group, as well as their interaction, for valence (time: F(3, 40) = 22.86; group: F(1, 40) = 12.36; interaction: F(3, 40) = 6.65) and energy (time: F(3, 40) = 23.22; group: F(1, 40) = 98.73; interaction: F(3, 40) = 13.57) (all p < 0.001), consistent across the top 100-250 songs and the past 4-8 weeks of listening histories. 

Conclusions 

This study confirms diurnal patterns in music preferences, with At-Risk individuals favouring low-valence, low-energy music, especially in the afternoon and evening. These results align with Rusting and Larsen’s (1998) evening-worsening effect, linking depressive tendencies to heightened negative moods in the evening. The findings point to potentially maladaptive listening behaviors that warrant further research. 

  • Vamshi Krishna Bonagiri presented a poster on Dark Side of the Tune: Investigating the maladaptive outcomes of excessive music consumption in the age of unlimited music access. Summary of the research work as explained by the authors Vamshi Krishna Bonagiri and Vinoo Alluri:

Background 

The advent of music streaming platforms has transformed music consumption by providing instant, unlimited access to vast music libraries. While this improves accessibility, it has also promoted excessive music listening (Datta et al., 2017). Although music’s positive effects on psychological well-being are well documented, limited research examines the potential negative consequences of excessive music consumption, particularly regarding maladaptive engagement and its implications. 

Aim

To examine how excessive music consumption may lead to maladaptive outcomes and evaluate the need for music regulation. 

Methodology 

A pilot study involving semi-structured interviews was conducted with 10 young adults (mean age = 21.5 years, SD = 2.34) with an average listening time of more than 16 hours a week. The interviews covered general music consumption patterns and contextual factors; maladaptive tendencies, assessed using constructs derived from the Music Addiction Scale (Ahrends, 2022) and the Healthy-Unhealthy Music Scale (Saarikallio & Erkkilä, 2015); and attitudes toward music regulation interventions. A thematic analysis was then performed to identify disruptions caused by excessive music listening. 

Results

While participants initially emphasized music’s positive role in their lives, thematic analysis revealed three primary patterns of maladaptive engagement: cognitive interference (lyrical content disrupted focus during complex tasks, and some participants reported that even instrumental music had a similar effect); emotional dysregulation (music sometimes reinforced negative emotional states or triggered unwanted memories); and compulsive consumption (several participants reported addiction-like behaviors, with some explicitly identifying as “addicted” and expressing concerns about withdrawal). Most importantly, several participants who initially characterized their consumption as unproblematic later acknowledged, after examining their behaviors during the interview, the potential benefits of regulation, particularly for improving productivity and maintaining emotional equilibrium. 

Discussion and conclusion

Results suggest that while music listening is beneficial overall, excessive listening may lead to unfavorable outcomes. While our pilot study using thematic analysis revealed these initial patterns, we propose that future research employ large-scale streaming data analysis and controlled studies of excessive and maladaptive music consumption to establish music regulation frameworks for maintaining healthy listening habits.

  • Aditya Raghuvanshi presented a poster on More than Words: Music, Not Lyrics or Vocals at the Heart of Emotional Expression. Summary of the research work as explained by the authors Aditya Raghuvanshi and Vinoo Alluri:

 Background 

Music Emotion Recognition (MER) systems primarily rely on audio features (Wang et al., 2021), with recent approaches incorporating lyrics to analyze sentiment and structure for improved accuracy (Agrawal et al., 2021). Musical content, particularly melody, has been found to have a stronger capacity to convey perceived emotional expression than lyrics, although lyrics tend to enhance negative emotions more easily than positive emotions (Ali & Peynircioğlu, 2006). This variability raises important questions: Which component carries the greatest emotional information? Do listeners rely on music, vocals, or lyrics when assessing emotional content? 

Aim 

Here, we systematically analyse the contribution of music, vocals, and lyrics to the perception of emotion in music. 

Methods 

For this study, we used the DEAM (Soleymani et al., 2013) and PMEmo (Zhang et al., 2018) datasets, comprising a total of 2,700 songs. The DEAM dataset includes 1,802 audio items annotated for both dynamic and static emotion using continuous valence and arousal ratings. Participants utilised a two-dimensional interface based on the Self‑Assessment Manikins (SAM) to continuously rate the emotional content at a 2‑Hz sampling rate. The PMEmo dataset contains 794 popular music choruses sourced from international music charts (Billboard Hot 100, iTunes Top 100, and UK Top 40) and was annotated using a similar interface to capture continuous dynamic ratings for valence and arousal at the same sampling rate. 

Songs were source-separated using Ultimate Vocal Remover (Takahashi et al., 2017) to isolate music and vocals, while lyrics were transcribed using OpenAI’s Whisper model (Radford et al., 2022) and verified manually. Deep learning models trained on diverse song and speech datasets predicted VA values for musical and vocal components, while lyrics were analyzed using models trained on general and lyrically annotated texts (Çano, 2017). Spearman’s correlation was calculated between predicted and human-annotated (HA) VA values. Additionally, quadrant-based (Q1: positive V, high A; Q2: negative V, high A; Q3: negative V, low A; Q4: positive V, low A) concurrency analyses evaluated prediction alignment with human-identified emotional quadrants. 
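
The evaluation step could look roughly like the Python sketch below; the column names and the assumption that valence/arousal values are centred at zero for the quadrant mapping are illustrative choices, not the authors' exact code.

# Illustrative sketch only (assumed column names and value ranges).
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def quadrant(v: float, a: float) -> str:
    """Map a (valence, arousal) pair, centred at 0, to Q1-Q4."""
    if v >= 0:
        return "Q1" if a >= 0 else "Q4"
    return "Q2" if a >= 0 else "Q3"

def evaluate(pred: pd.DataFrame, human: pd.DataFrame) -> dict:
    """Spearman correlations and quadrant concurrency between predicted and human VA values."""
    rho_v, _ = spearmanr(pred["valence"], human["valence"])
    rho_a, _ = spearmanr(pred["arousal"], human["arousal"])
    pred_q = [quadrant(v, a) for v, a in zip(pred["valence"], pred["arousal"])]
    human_q = [quadrant(v, a) for v, a in zip(human["valence"], human["arousal"])]
    concurrency = 100 * np.mean([p == h for p, h in zip(pred_q, human_q)])
    return {"valence_rho": rho_v, "arousal_rho": rho_a, "quadrant_concurrency_pct": concurrency}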

Results 

The musical component exhibited the highest correlation with HA ratings (valence: r = 0.70, arousal: r = 0.74), followed by vocals (valence: r = 0.54, arousal: r = 0.65; all p < .001), while lyrics contributed the least (valence: r = 0.11, arousal: r = 0.01). In terms of overall quadrant concurrency, the musical component showed the highest concurrency (60.41%) followed by vocals (48.84%) and then lyrics (31.45%). Specifically, the highest quadrant concurrency was observed in Q3 (musical: 90.53%, vocal: 93.53%, lyrics: 66.16%) followed by Q1 for musical (64.51%) and vocals (39.10%), and Q4 for lyrics (42.50%). 

Discussion 

Our findings emphasize the dominant role of the musical component in shaping perceived emotional expression in Western tonal music, aligning with prior work highlighting melody’s emotional salience (Ali & Peynircioğlu, 2006). Furthermore, for music signifying negative valence and low arousal (Q3), there is a higher degree of congruence between the components in conveying sadness and its related emotions. These results suggest that MER systems could prioritize musical and vocal components over lyrical content (Wang et al., 2021; Agrawal et al., 2021), as they appear to primarily drive the perception of emotional expression. 

  • Utsav Shekhar presented a poster on The Language of Protest: A Computational Analysis of Lyrical and Musical Features. Summary of the research work as explained by the authors Utsav Shekhar and Vinoo Alluri:

Background 

Protest music has historically shaped collective identity and transformed struggle into shared resistance. While prior research has examined its usage and impact (Bianchi, 2018; Mondak, 1988), little is known about the linguistic and acoustic markers that drive its perceptual distinctiveness. This study investigates whether protest songs exhibit consistent structural and expressive patterns that differentiate them from non-protest music. 

Methodology 

We analyzed 458 protest songs extracted from Wikipedia (Jiang & Jin, 2022) and 370 non-protest songs matched by time period. The following linguistic features were computed from lyrics: repetition rate (repeated bigrams/trigrams per lyrical line), lexical diversity (type-token ratio, i.e. unique word types divided by total word tokens, reflecting vocabulary variation), unique word ratio (unique word types divided by number of lyrical lines, capturing how many new words appear per line), valence (emotional positivity or negativity), and rhyme density (number of rhyming word pairs per line). We also analyzed acoustic features obtained from the Spotify API: speechiness (degree of spoken-word content), danceability (rhythmic stability and beat regularity, which make the beat easy to follow), acousticness (likelihood of acoustic over electric timbre), and instrumentalness (absence of vocals). 
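
As an illustration of the lyric features, a minimal Python sketch is given below; the tokenisation and the exact counting conventions are assumptions, and valence and rhyme density are omitted since they require a sentiment model and a pronunciation dictionary, respectively.

# Illustrative sketch of the lyric features (assumed conventions, not the authors' code).
import re
from collections import Counter

def lyric_features(lyrics: str) -> dict:
    lines = [l for l in lyrics.lower().splitlines() if l.strip()]
    tokens = re.findall(r"[a-z']+", lyrics.lower())
    bigrams = Counter(zip(tokens, tokens[1:]))
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    repeats = sum(c - 1 for c in bigrams.values() if c > 1) \
            + sum(c - 1 for c in trigrams.values() if c > 1)
    return {
        "repetition_rate": repeats / max(len(lines), 1),              # repeated n-grams per line
        "lexical_diversity": len(set(tokens)) / max(len(tokens), 1),  # type-token ratio
        "unique_word_ratio": len(set(tokens)) / max(len(lines), 1),   # unique types per line
    }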

Results 

Protest songs showed significantly higher rhyme density, repetition rate, and unique word ratio, lower lexical diversity, and more negative valence than non-protest songs (all p < .001). A logistic regression classifier trained on these features achieved 90% accuracy, confirming their discriminative strength. Protest songs also had significantly higher values for all acoustic features (p < .001). Deep learning models were also used to classify protest and non-protest songs: a lyrics-based XLRoBERTa model (94% accuracy) outperformed the audio model CLAP (89% accuracy). However, both perform well above chance, suggesting that both the message and the medium contribute to the distinctive character of protest music. Genre may act as a confounding factor, as protest songs tend to cluster in rock, metal, country, and hip-hop, unlike non-protest songs, which skew toward pop and disco. This limitation can be addressed in future work. 
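
For the feature-based classifier mentioned above, one plausible setup is sketched below in Python (a scikit-learn logistic regression with stratified cross-validation); the exact preprocessing and validation scheme used by the authors may differ.

# Illustrative sketch only; not the authors' exact pipeline.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def classify_protest(features: pd.DataFrame, is_protest: pd.Series) -> float:
    """Mean cross-validated accuracy of a logistic-regression classifier on the features."""
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    return cross_val_score(clf, features, is_protest, cv=cv).mean()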

Discussion 

The highly negative valence suggests that protest songs revolve mostly around negative themes. High repetition aids attention and recall. Low lexical diversity with high unique word ratio reflects figurative, chant-like repetition—where key phrases recur for impact, while new words enrich new lines. High rhyme density indicates phonetic structuring and lyrical crafting. Protest songs exhibit high speechiness, indicating a spoken-word style delivery. Their high danceability reflects stable, rhythmic patterns that facilitate collective engagement. Elevated acousticness suggests a preference for raw, organic textures. Meanwhile, high instrumentalness points to moments of vocal sparseness, where instruments alone carry emotional weight. 

Conclusion 

This study sheds light on how protest songs use structural and acoustic cues to create emotionally charged experiences. It contributes to a deeper understanding of the perceptual features of protest music and offers empirical grounding for its key distinguishing features, providing valuable insights for research in music, linguistics, and social movements.

  • Pritha Ghosh, Aruneek Biswas, and Vishnu Sreekumar presented a poster virtually on Modalities and Music: Is music more than what we hear? Summary of the research work as explained by the authors:

For centuries, people have wondered: why does music make us feel things? One influential idea, called the Resemblance Theory, suggests that music expresses emotion because it mimics the way emotions look and sound in real life. For example, sadness is often shown through drooping body language or low, slow speech—and many sad songs share these same features.

But does this theory explain the full range of emotions we hear in music?

At the Memory and Neurodynamics Lab (www.mandalab.org), Pritha Ghosh (MS by research) and Aruneek Biswas (Research Associate), working under the guidance of Dr. Vishnu Sreekumar, conducted a pilot study. They put the Resemblance Theory to an empirical test by asking participants to identify emotions across four different formats:

  • Silent film clips (visual only)
  • Film soundtracks without images (audio only)
  • Original film clips (audio + visual)
  • Short pieces of instrumental music

Note that they used foreign-language clips to avoid the influence of language on the results. The researchers expected that emotions strongly expressed in sight and/or sound (like sadness or anger) would also be easiest to recognize in music. Surprisingly, they found that people could also identify “amodal” emotions, such as hope, affection, confusion, and determination, in music, even though these emotions were not recognized well in the audio-only or visual-only film clips. 

In fact, recognition rates in music (about 45%) were higher than expected and closely mirrored recognition in the full audiovisual clips. This suggests that music may tap into a deeper, cross-sensory way of perceiving emotion—more than just mimicking speech or facial expressions.

In short: music may express emotions that words, faces, or voices alone cannot. While the study was small and exploratory, the findings hint that musical emotion perception cannot be fully explained by the resemblance theory. Pritha and Aruneek are exploring a theoretical framework drawn from ethology, the study of animal behavior, to provide a more complete explanation of what makes some emotions easily identifiable in music compared to others. The inspiration for this approach comes from Prof. David Huron, who recently passed away. Prof. Huron was an important mentor to Dr. Sreekumar and was responsible for the genesis of this project.

August 2025