Dr. Joon Son Chung, a research scientist at Naver Corp, and Triantafyllos Afouras, a second-year Ph.D. student in the Visual Geometry Group at the University of Oxford under the supervision of Prof. Andrew Zisserman, gave a talk on visual recognition of human communications on 5 September.
Joon Son is a recent graduate from the Visual Geometry Group at the University of Oxford. His research interests are in computer vision and machine learning.
Triantafyllos is currently working on computer vision for understanding human communication, which includes lip reading, audio-visual speech recognition and enhancement, and body language modeling.
The objective of their research is the visual recognition of human communication. Solving this problem would open up a host of applications, such as transcribing archival silent films or resolving multi-talker simultaneous speech. Most importantly, it would help advance the state of the art in speech recognition by enabling machines to take advantage of the multi-modal nature of human communication.
Training a deep learning algorithm requires a large amount of data. They proposed a method to automatically collect and process a large-scale audio-visual corpus from television videos, temporally aligned with the transcripts. To build such a dataset, it is essential to know 'who' is speaking 'when'. To this end, they developed a ConvNet model that learns a joint embedding of sound and mouth images from unlabeled data, and applied this network to the tasks of audio-to-video synchronization and active speaker detection. They showed that these methods can be extended to generating talking faces from audio and still images, and to re-dubbing videos with audio samples from different speakers.
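As a rough illustration of this idea, the sketch below builds a two-stream network that embeds a short audio clip and the corresponding mouth crops into a shared space and trains it with a contrastive loss, so that in-sync pairs lie close together and out-of-sync pairs are pushed apart. This is only a minimal sketch: the layer sizes, input shapes, and margin are illustrative assumptions, not the architecture presented in the talk.

```python
# Minimal sketch (not the speakers' exact architecture) of a two-stream
# audio-visual embedding network. All sizes below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualEmbedder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        # Audio stream: operates on a 13 x 20 MFCC "image" (~0.2 s of audio).
        self.audio_net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Visual stream: operates on 5 grayscale mouth frames stacked as channels.
        self.visual_net = nn.Sequential(
            nn.Conv2d(5, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, audio, frames):
        return self.audio_net(audio), self.visual_net(frames)

def contrastive_loss(a, v, same, margin=1.0):
    """Pull genuine (in-sync) audio/video pairs together, push shifted
    (out-of-sync) pairs at least `margin` apart."""
    d = F.pairwise_distance(a, v)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

# Usage with dummy data: 8 clips of MFCC audio and 5 mouth frames each.
model = AudioVisualEmbedder()
audio = torch.randn(8, 1, 13, 20)
frames = torch.randn(8, 5, 112, 112)
labels = torch.randint(0, 2, (8,)).float()   # 1 = in sync, 0 = shifted
emb_audio, emb_video = model(audio, frames)
loss = contrastive_loss(emb_audio, emb_video, labels)
loss.backward()
```

At test time, the audio-to-video offset can be estimated by sliding the audio window relative to the video and picking the offset with the smallest embedding distance; thresholding that distance over a clip then gives a simple active speaker detector.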
They also proposed a number of deep learning models that recognize visual speech at the sentence level; the lip-reading performance beats that of a professional lip reader on videos from BBC television. They demonstrated that when audio is available, visual information helps to improve speech recognition performance, and they proposed methods to enhance noisy audio and to resolve multi-talker simultaneous speech using visual cues.
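To make the enhancement idea concrete, the following sketch conditions a mask-prediction network on per-frame lip features: the predicted time-frequency mask keeps the target speaker's speech in the noisy spectrogram and suppresses the rest. The module names, feature dimensions, and fusion strategy here are assumptions for illustration rather than the speakers' actual model.

```python
# Minimal sketch of visually conditioned speech enhancement: fuse lip-region
# features with the noisy magnitude spectrogram and predict a soft mask that
# keeps the target speaker. Shapes and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class VisualSpeechEnhancer(nn.Module):
    def __init__(self, n_freq=257, visual_dim=512, hidden=256):
        super().__init__()
        self.audio_proj = nn.Linear(n_freq, hidden)
        self.visual_proj = nn.Linear(visual_dim, hidden)
        self.fusion = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_mag, visual_feats):
        # noisy_mag:    (batch, time, n_freq) magnitude spectrogram
        # visual_feats: (batch, time, visual_dim) per-frame lip features,
        #               assumed already upsampled to the audio frame rate
        x = torch.cat([self.audio_proj(noisy_mag),
                       self.visual_proj(visual_feats)], dim=-1)
        h, _ = self.fusion(x)
        mask = self.mask_head(h)           # values in [0, 1] per bin
        return mask * noisy_mag            # enhanced magnitude spectrogram

# Usage with dummy tensors: 2 clips, 100 audio frames each.
model = VisualSpeechEnhancer()
enhanced = model(torch.rand(2, 100, 257), torch.rand(2, 100, 512))
print(enhanced.shape)  # torch.Size([2, 100, 257])
```

A common recipe is to recombine the enhanced magnitude with the phase of the noisy input and invert it back to a waveform; in the multi-talker setting, the visual stream indicates which talker's speech the mask should keep.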
Finally, they explored the problem of speaker recognition. Whereas previous work on speaker identification has been limited to constrained conditions, they built a new large-scale speaker recognition dataset collected from 'in the wild' videos using an automated pipeline, and they proposed a number of ConvNet architectures that outperform traditional baselines on this dataset.
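A minimal sketch of spectrogram-based speaker identification with a ConvNet is given below: the utterance spectrogram is treated as an image and mapped to a fixed-size embedding, which a linear head classifies into speaker identities. The layer configuration and the number of speakers are placeholders, not the architectures evaluated in the talk.

```python
# Minimal sketch of spectrogram-based speaker identification with a small
# ConvNet. Architecture, sizes, and speaker count are illustrative assumptions.
import torch
import torch.nn as nn

class SpeakerIdCNN(nn.Module):
    def __init__(self, n_speakers=1000):   # placeholder number of identities
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            # Pool over both frequency and time so variable-length utterances
            # map to a fixed-size speaker embedding.
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(128, n_speakers)

    def forward(self, spectrogram):
        # spectrogram: (batch, 1, n_freq, n_frames)
        embedding = self.features(spectrogram)   # fixed-size speaker embedding
        return self.classifier(embedding)        # per-speaker logits

# Usage: 4 utterances, 512 frequency bins, 300 frames (~3 s of audio).
model = SpeakerIdCNN()
logits = model(torch.randn(4, 1, 512, 300))
print(logits.shape)  # torch.Size([4, 1000])
```

For speaker verification rather than identification, the classification head can be dropped and the embeddings of two utterances compared directly, for example with a cosine or Euclidean distance.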