Researchers demonstrate how a wireless stethoscope converts behind-the-ear vibrations – captured as non-audible murmurs – into intelligible speech, even in a ‘zero-shot’ setting.
Stephen Hawking knew the importance of being heard. He wrote in his memoir, “One’s voice is very important. If you have a slurred voice, people are likely to treat you as mentally deficient.” While his own speech synthesiser worked by converting letters that he chose from a computer screen into speech – at first with the help of a hand switch and later by twitching his cheek – researchers at IIITH have experimented with a silent speech interface (SSI) that can convert non-audible speech into a vocalised output.
The team led by Neil Shah, TCS researcher and PhD student at the Centre for Visual Information Technology (CVIT), IIITH, along with Neha Sahipjohn and Vishal Tambrahalli, under the guidance of Dr. Ramanathan Subramanian and Prof. Vineet Gandhi, has published its findings in a paper titled “StethoSpeech: Speech Generation Through a Clinical Stethoscope Attached to the Skin”. The paper was presented at the prestigious UbiComp/ISWC – an interdisciplinary conference whose papers appear in the Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies – held in Melbourne, Australia, from 5-9 October 2024.
Traditional SSI
SSI, as Neil explains, is a form of communication where an audible sound is not produced. “The most popular and simplest of SSI techniques is lip reading,” he says. Some of the other SSI techniques include Ultrasound Tongue Imaging, real-time MRI (rtMRI), Electromagnetic Articulography, Permanent Magnet Articulography, Electrophysiology, Electrolarynx, and Electropalatography, where vibrations across the vocal folds are analysed to comprehend articulation. According to the researchers, these techniques fall short due to their extremely invasive nature (think of coils attached to the lips and tongue for measuring movement) and the fact that they don’t work in real time.
What They Did
The aim of the innovation was to improve social interactions for those with voice disorders. For this, the team used an off-the-shelf stethoscope attached to the skin behind the ear to convert behind-the-ear vibrations into intelligible speech. “Such vibrations are referred to as Non-Audible Murmurs (NAM),” says Prof. Gandhi. The IIITH team curated a dataset of NAM vibrations, which they’ve labelled the Stethotext corpus, collected under noisy conditions such as an everyday office environment as well as high-noise scenarios, the kind experienced at a concert. These vibrations were paired with their corresponding text. “We asked people to read out some text – all while murmuring. So we know the text and we captured the vibrations. In this way, we trained our model to convert the vibrations into speech,” says Prof. Gandhi.
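To make the pairing concrete, here is a minimal Python sketch of how such NAM-text pairs could be organised before training. The manifest format, field names, and the NAMRecording structure are illustrative assumptions, not the team’s actual data pipeline.

```python
# Illustrative sketch only: one way to organise paired NAM/text recordings.
# The file layout, field names, and "noise_setting" labels are assumptions,
# not the published Stethotext corpus format.
from dataclasses import dataclass
from pathlib import Path
import json

@dataclass
class NAMRecording:
    """One corpus entry: a murmured utterance captured through the stethoscope."""
    nam_wav: Path       # behind-the-ear vibration signal recorded while murmuring
    text: str           # the prompt the speaker was asked to read out
    noise_setting: str  # recording condition, e.g. "office" or "concert"

def load_corpus(manifest_path: str) -> list[NAMRecording]:
    """Read a JSON-lines manifest pairing each NAM recording with its text."""
    entries = []
    with open(manifest_path) as f:
        for line in f:
            row = json.loads(line)
            entries.append(NAMRecording(Path(row["nam_wav"]), row["text"], row["noise"]))
    return entries

if __name__ == "__main__":
    # In-memory example entry; a real corpus would be loaded via load_corpus().
    sample = NAMRecording(Path("speaker01_utt042.wav"), "The weather is pleasant today.", "office")
    print(sample)
```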
How Is It Unique
What sets IIITH’s research apart from previous approaches is its very minimalistic design built around an ordinary stethoscope. NAM vibrations are transmitted from the stethoscope to a mobile phone over Bluetooth, and clear speech is played back through the phone’s speaker. Earlier approaches assumed that paired whisper-speech data was available for speech conversion. “We demonstrated that converting NAM vibrations into speech can happen even in a ‘zero-shot’ setting, which means that it works even for novel speakers whose data has not been used for training the model,” explains Neil. Additionally, vibration-to-speech conversion happens in real time. According to Neil, “Translating a 10-second NAM vibration takes less than 0.3 seconds.” The researchers also demonstrated that the translation works well even when there is movement, such as when the user is walking.
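To put that number in perspective, converting a 10-second clip in under 0.3 seconds corresponds to a real-time factor of roughly 0.03. The sketch below shows how such a factor could be measured; convert_nam_to_speech is a placeholder stand-in, not the actual model running on the phone.

```python
# Hedged sketch of checking the real-time factor (processing time / audio duration).
# "convert_nam_to_speech" is a placeholder; a real system would invoke the trained
# NAM-to-speech model here.
import time

def convert_nam_to_speech(nam_samples: list[float], sample_rate: int) -> list[float]:
    # Placeholder pass-through standing in for the actual model.
    return nam_samples

def real_time_factor(nam_samples: list[float], sample_rate: int) -> float:
    clip_seconds = len(nam_samples) / sample_rate
    start = time.perf_counter()
    convert_nam_to_speech(nam_samples, sample_rate)
    elapsed = time.perf_counter() - start
    return elapsed / clip_seconds  # values below 1.0 mean faster than real time

if __name__ == "__main__":
    sr = 16_000
    ten_second_clip = [0.0] * (10 * sr)  # silent dummy clip for illustration
    print(f"Real-time factor: {real_time_factor(ten_second_clip, sr):.4f}")
```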
What’s also unique about this solution is that users can choose the kind of output voice they want to be heard in. For instance, they can pick an accent, say English spoken with a pronounced South Indian accent, and a gender, a male or female voice, and speech is produced accordingly. “We’ve also demonstrated through this research that we can build person-specific models,” remarks Prof. Gandhi. This means that with just 4 hours of murmuring data recorded from a person, a specialised model that converts NAM into speech can be built just for that person. “Other researchers have also converted whispers into speech, but we were able to get great accuracy in our output,” says Neil.
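One simple way such a choice could be exposed is as a lookup from the desired accent and gender to a reference voice that the synthesiser conditions on. The sketch below is purely illustrative; the catalogue entries and the pick_reference_voice helper are assumptions, not part of the published system.

```python
# Illustrative sketch: selecting an output voice by accent and gender.
# The VoiceProfile fields, catalogue entries, and reference-speaker IDs are
# hypothetical, not taken from the StethoSpeech system.
from dataclasses import dataclass

@dataclass(frozen=True)
class VoiceProfile:
    accent: str  # e.g. "south_indian_english"
    gender: str  # e.g. "female"

# Hypothetical catalogue mapping a profile to a reference speaker the
# synthesiser would condition on when generating speech.
VOICE_CATALOGUE = {
    VoiceProfile("south_indian_english", "female"): "ref_speaker_03",
    VoiceProfile("south_indian_english", "male"): "ref_speaker_07",
}

def pick_reference_voice(accent: str, gender: str) -> str:
    profile = VoiceProfile(accent, gender)
    if profile not in VOICE_CATALOGUE:
        raise ValueError(f"No reference voice available for {profile}")
    return VOICE_CATALOGUE[profile]

if __name__ == "__main__":
    print(pick_reference_voice("south_indian_english", "female"))
```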
ML for Accessibility
For Prof. Gandhi and his team, the first foray into the accessibility space began with their experiments on text-to-speech (TTS) models. “We have a very good text-to-speech system known as Parrot TTS,” says the professor, explaining how they also converted lip movements into clear speech. While research in the TTS space is not entirely novel, what sets Prof. Gandhi’s model apart is the quality of the output. “Most ML algorithms directly convert text into speech, but that’s not how humans learn to speak. Newborns interact with audio, listen to the world around them and directly start speaking before learning to read,” says Prof. Gandhi. To mimic this natural way of learning to speak, the team first built a speech-to-speech system. In the second stage, they mapped text onto that intermediate sound representation, instead of going directly from text to speech, which is how other ML models work. The major advantage of this is that one can make any speaker ‘speak’ in any language. With the goal of giving a voice to those who can’t speak, the first solution they worked on was lip-to-speech conversion. Now, with whisper-to-speech conversion, they’re working on creating the best model that can give a voice to the speech-impaired.
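The decoupling described above can be pictured as two stages that meet at a shared, discrete sound representation: speech is encoded into units and decoded back into speech, and text is then mapped onto the same units. The toy sketch below illustrates only that structure; the quantiser, unit vocabulary, and vocoder stand-ins are assumptions and bear no resemblance to the actual Parrot TTS components.

```python
# Toy sketch of a two-stage pipeline meeting at a discrete unit representation.
# Every function below is a simplistic placeholder used only to show the structure.

def speech_to_units(audio_samples: list[float]) -> list[int]:
    """Stage 1 encoder stand-in: compress audio into discrete units."""
    # Placeholder quantiser: bucket every 160-sample frame into one of 100 units.
    frame = 160
    return [int(abs(sum(audio_samples[i:i + frame])) * 1000) % 100
            for i in range(0, len(audio_samples), frame)]

def units_to_speech(units: list[int]) -> list[float]:
    """Stage 1 decoder (vocoder stand-in): turn units back into a waveform."""
    return [u / 100.0 for u in units for _ in range(160)]

def text_to_units(text: str) -> list[int]:
    """Stage 2 stand-in: map text onto the same unit vocabulary the decoder reads."""
    return [ord(c) % 100 for c in text.lower() if c.isalpha()]

if __name__ == "__main__":
    waveform = units_to_speech(text_to_units("hello world"))
    recovered_units = speech_to_units(waveform)
    print(f"Synthesised {len(waveform)} samples from text via {len(recovered_units)} units")
```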
Significant Breakthrough
For the wireless stethoscope, another use case the team has demonstrated is communication in high-noise environments like a rock concert, where even normal speech is unintelligible. The researchers also mention that it can come in handy for deciphering the discreet communication typically used by security personnel such as the Secret Service.
“Our work is a game changer in the sense that all previous studies have assumed that clean speech is available corresponding to the vibrations one is recording. But if someone is disabled or speech-impaired, we won’t have his corresponding speech. That is the fundamental difference in our case – we don’t assume that clean speech of a speech-impaired person is available in order to train our models. The other thing is that previous works were rather experimental in nature. The output is nowhere close to the kind of performance our models are demonstrating in terms of clean speech,” emphasises Prof. Gandhi. While the team has not conducted any experiments on medical patients yet, they are actively looking for collaborations with hospitals to record data from patients. “At this point, it’s super exciting to think that we can give a voice to someone who has lost his own,” muses Prof. Gandhi.