Rudrabha Mukhopadhyay, supervised by Prof. Jawahar C V, received his doctorate in Computer Science and Engineering (CSE). Here is a summary of his research work on Lip-to-Speech Synthesis:
This thesis explores the development and advancement of lip-to-speech synthesis techniques, addressing the challenge of generating speech directly from visual lip movements. Unlike text-to-speech systems, which rely on explicit linguistic information in the form of text tokens, lip-to-speech synthesis must interpret ambiguous visual cues: visually similar lip shapes can produce different sounds, making the mapping inherently one-to-many. Inspired by the chronological advancements in text-to-speech synthesis, the research goals are divided into single-speaker lip-to-speech, where a dedicated model is trained for each speaker using a large amount of speaker-specific data, followed by multi-speaker approaches, which aim to train a single model that works for any speaker in the wild. The first work presented in this thesis addresses the lip-to-speech generation problem for a large vocabulary in unconstrained settings, albeit with a model trained for particular speakers. A novel sequence-to-sequence model was introduced that leveraged spatio-temporal convolutional architectures to capture the fine-grained temporal dynamics of lip movements, together with a monotonic attention mechanism that more accurately aligned the visual features with the corresponding speech parameters. Testing on the LRS2 dataset showed a 24% improvement in intelligibility metrics over baseline methods. As part of this work, a new dataset was released providing sufficient speaker-specific data with a diverse vocabulary of around 5,000 words to support the development of accurate, speaker-specific models. While this approach showed promise, it was limited to single-speaker scenarios and did not scale to sentence-level multi-speaker tasks, necessitating further research.
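The summary does not spell out the exact form of the monotonic attention mechanism, but its core property — each decoder step attends at or after the previously attended video frame, so speech never aligns backwards in time — can be sketched in a minimal, greedy, hard-attention form. The window size and energy inputs here are illustrative assumptions, not the thesis's actual parameterization:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def monotonic_attention(energies, window=3):
    """Greedy hard monotonic alignment (illustrative sketch).

    energies[t] holds decoder step t's compatibility scores with every
    encoder (video-frame) position. Each step may only attend within a
    small forward window starting at the previously attended position,
    so the resulting alignment is non-decreasing in time.
    """
    alignment = []
    prev = 0
    for scores in energies:
        hi = min(len(scores), prev + window + 1)
        weights = softmax(scores[prev:hi])
        # pick the most probable position in the forward window
        step = max(range(len(weights)), key=weights.__getitem__)
        prev = prev + step
        alignment.append(prev)
    return alignment
```

Because the attended position can only move forward, visually similar frames later in the clip cannot pull the alignment backwards, which is the property that makes monotonic attention a natural fit for aligning lip frames with speech parameters.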
To address these limitations, a Variational Autoencoder-Generative Adversarial Network (VAE-GAN) architecture was developed for multi-speaker synthesis in unconstrained settings with a vocabulary exceeding 50,000 words. This model was designed to overcome the inherent stochasticity of the lip-to-speech mapping and to handle multiple speaker identities without speaker-specific training, requiring only about 3 minutes of data per speaker, compared to the over 600 minutes per speaker required by the previous approach. A key contribution of this work was the use of variational autoencoders to predict separate distributions encoding speech content and lip movements, tied together with a KL-divergence loss. Additionally, a Wasserstein GAN was used to enhance the speech quality. Extensive ablation studies validated the architecture's components, showing that the model produced more intelligible and realistic speech than existing approaches. However, significant quality limitations remained: Automatic Speech Recognition tests revealed a Word Error Rate of approximately 90%, rendering the approach impractical for real-world applications. Building upon these findings and acknowledging parallel advancements in lip-to-text technologies, a third approach was developed that utilizes noisy text supervision. This method integrated a state-of-the-art lip-to-text network to generate intermediate text from lip movements, followed by a visual text-to-speech network that conditioned not only on the noisy text but also on the lip movements, producing speech that follows the text content while remaining synchronized with the original lip movements. The key contribution here was a novel cross-attention mechanism in the visual TTS module that effectively aligned the visual features with the text tokens.
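A KL-divergence loss of the kind described above has a closed form when both predicted distributions are diagonal Gaussians, which is the standard VAE parameterization. The sketch below shows that closed form; the specific choice of which distribution plays which role (lip-movement encoding vs. speech-content encoding) is an assumption for illustration:

```python
import math

def kl_diag_gaussians(mu_a, logvar_a, mu_b, logvar_b):
    """Closed-form KL(N_a || N_b) for diagonal Gaussians, summed over
    dimensions. A loss of this shape could tie a lip-movement encoding
    distribution (a) to a speech-content one (b), as in the VAE-GAN
    described above (illustrative sketch, not the thesis's exact loss).
    """
    kl = 0.0
    for ma, la, mb, lb in zip(mu_a, logvar_a, mu_b, logvar_b):
        # 0.5 * (log(var_b/var_a) + (var_a + (mu_a - mu_b)^2)/var_b - 1)
        kl += 0.5 * (lb - la + (math.exp(la) + (ma - mb) ** 2) / math.exp(lb) - 1.0)
    return kl
```

Minimizing this term pulls the two predicted distributions together, so that samples drawn from the visual encoding stay consistent with the speech-content encoding despite the ambiguity of the lip-to-speech mapping.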
By addressing the synchronization challenges that would arise from a simple lip-to-text followed by text-to-speech pipeline, this approach successfully maintained temporal alignment with the original visual input. Comprehensive experimentation demonstrated consistent superiority across multiple challenging benchmarks, including LRW, LRS2, and LRS3 datasets, with the model achieving notable improvements in all speech quality metrics (PESQ, STOI, ESTOI). Human evaluations further validated these findings, with particularly strong performance in intelligibility, content clarity, and synchronization accuracy. The approach scored 3.31/5 in overall perceptual quality compared to 2.96/5 for the baseline lip-to-text + TTS approach, and achieved a Word Error Rate of 0.26 compared to 0.36 for competing methods. Detailed analysis through various ablation studies provided deeper insights into the model’s behavior. Phoneme error rate analysis revealed that the model primarily struggled with phonemes having minimal lip visibility (D, EH, K, N, and ER), an inherent limitation of visual-only approaches. Additional testing across different demographics (gender, age, race) and varied conditions (emotions, head poses) demonstrated the model’s robustness while identifying specific areas for improvement. Most significantly, this approach was successfully demonstrated on an ALS patient who could mouth words but had limited vocal cord function, generating intelligible speech with a Word Error Rate of approximately 37%. This real-world application represents the first demonstration of automatic lip-to-speech synthesis for an unseen speaker in an entirely out-of-domain scenario, highlighting its transformative potential for assistive technology applications. The research extended beyond basic lip-to-speech synthesis to explore practical applications, particularly in Audio-Visual Speech Enhancement. 
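The Word Error Rates quoted above (0.26 vs. 0.36, and roughly 37% for the ALS demonstration) follow the standard definition: word-level edit distance between the ASR transcript and the reference, normalized by reference length. A minimal stdlib implementation:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / len(reference),
    computed with word-level Levenshtein distance over whitespace-split
    tokens. This is the standard metric; any real evaluation would also
    normalize case and punctuation first.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[-1][-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why the roughly 90% figure reported for the VAE-GAN approach still leaves room to be worse.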
Two interconnected problems were investigated: audio-visual speech super-resolution and audio-visual speech denoising, both conceptualized as extensions of lip-to-speech synthesis that incorporate an additional noisy audio input. This exploration demonstrated how integrating lip-movement information with traditional speech processing techniques can significantly improve speech signal quality and intelligibility in challenging environments where conventional audio-only methods fall short. The visual cues from lip movements were leveraged to reconstruct and augment low-quality audio, yielding higher-resolution speech output and more effective noise reduction in heavily contaminated environments. Throughout all approaches, extensive experimentation was conducted to optimize model parameters. Resolution analysis revealed that 96×96 pixel inputs provided the optimal balance between performance and computational efficiency, with higher resolutions (256×256) showing decreased performance due to increased computational complexity and noise sensitivity. The studies demonstrated that temporal modeling capacity was more critical than spatial resolution for accurate lip-to-speech synthesis. All models were rigorously evaluated using multiple objective metrics (PESQ, STOI, ESTOI, and WER) and subjective listening tests, with particular attention to their potential in assistive technology applications. Overall, this thesis contributes significant advancements to the field of lip-to-speech synthesis across single-speaker and multi-speaker domains, progressively addressing the limitations of each approach and establishing new benchmarks in this rapidly evolving field. The research demonstrates the potential for creating more accessible assistive technologies for individuals who retain lip mobility despite speech impairment, as well as applications in media enhancement, silent communication interfaces, and audio-visual processing systems.
December 2025

