    Categories: Applied research

ASR makes the world your oyster

IIITH is poised to disrupt Automatic Speech Recognition (ASR) in India by building a six-language multi-lingual system by August 2021. Prof. Anil Kumar Vuppala from IIITH's Speech Processing Lab helps deconstruct this emerging industry.

Industry pundits forecast that the Indian speech domain is destined to be a multi-million-dollar market, with international conglomerates lining up to dive into its uncharted acoustic waves. The multi-lingual nation of India, with her 22 major languages and innumerable dialects, still uses English as the lingua franca. But despite their different orthographies, Indic languages have a big advantage: our Dravidian and Indo-Aryan languages share a common phonetic space! This is the magical Open Sesame that has paved the way for building a multi-lingual ASR using a common phone set.

Dial me intrigued! What is a Phone set?

To recognise speech, a word is first converted into an acoustic signal. A phone is the smallest unit of speech sound, and a phone set is the inventory of such sounds. When you want to utter a word, your lungs pump out air via the vocal tract. "What we do is to split the word in the manner in which we pronounce it and transcribe it into the corresponding acoustic sounds, which are then given a phonetic representation. For a word, its corresponding phonetic representation is called a phone sequence", explains research scholar Ganesh S. Mirishkar. "We train the neural network, much like a child is taught the alphabet". Mapping is done between the waveform and the matching transcript, and this pairing is fed to the neural network, essentially a mathematical model that trains the system.
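The word-to-phone-sequence mapping described above can be illustrated with a toy pronunciation dictionary. This is a hand-written sketch, not the lab's actual lexicon or phone labels; the entries and label symbols are invented for illustration.

```python
# Toy pronunciation dictionary: each word maps to its phone sequence
# (its phonetic representation). Entries here are illustrative only.
pronunciation_dict = {
    "namaste": ["n", "a", "m", "a", "s", "t", "e"],
    "hello":   ["hh", "ah", "l", "ow"],
}

def to_phone_sequence(word):
    """Look up the phone sequence for a word in the dictionary."""
    phones = pronunciation_dict.get(word.lower())
    if phones is None:
        raise KeyError(f"'{word}' is not in the pronunciation dictionary")
    return phones

print(to_phone_sequence("namaste"))  # ['n', 'a', 'm', 'a', 's', 't', 'e']
```

In a real system this dictionary covers the full vocabulary, and the (audio, phone sequence) pairs are what the neural network is trained on.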

Aah! Indic languages are Syllabic

Many Indian words have the sound aa in them. Indian languages are essentially syllabic in nature, each syllable a combination of two or three phones, whereas English is phone-based. "The sounds in all languages are the same, which led us to the hypothesis of using a common label set for Indian languages. The orthography is converted into Latin. We built a pronunciation dictionary and fed it into a neural network to be learnt", said Mirishkar. This is factored for 11 of the major languages spoken in India. As of now, IIITH has built a framework for 3 languages in a single block – Telugu, Tamil and Gujarati.
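The common-label-set idea can be sketched as a mapping from script-specific characters to one shared Latin label. The table below is a hypothetical three-entry fragment (the actual label set and conversion rules used by the lab are not given in the article); it shows how the same consonant from the Telugu, Tamil and Gujarati scripts can collapse to one acoustic class.

```python
# Hypothetical common label set: characters from different Indic scripts
# that represent the same sound map to a single shared Latin label,
# so the acoustic model learns that sound class once, across languages.
common_label = {
    "క": "ka",   # Telugu ka
    "க": "ka",   # Tamil ka
    "ક": "ka",   # Gujarati ka
}

def to_common_labels(text):
    """Convert script characters to the shared label set (pass-through otherwise)."""
    return [common_label.get(ch, ch) for ch in text]

print(to_common_labels("క"))  # ['ka']
print(to_common_labels("க"))  # ['ka']
```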

A Joint Acoustic model approach

To break it down to basics, ASR has three modules, starting with feature extraction from the input signal. First, the acoustic model is trained on the sounds, mapping the extracted features to the acoustic model space. Finally, the language model works on the prediction of the words, trained on a text corpus.
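The three-stage pipeline above can be sketched end to end with stub models. Everything here is a placeholder: real systems extract MFCC or filterbank features and use trained neural networks, while these stubs only show how the stages compose.

```python
# Minimal sketch of the three ASR stages, with stubs standing in for
# real feature extraction, acoustic model and language model.
def extract_features(signal):
    # Stub: real systems compute e.g. MFCC features; here we just
    # slice the signal into fixed-size frames.
    frame_size = 4
    return [signal[i:i + frame_size] for i in range(0, len(signal), frame_size)]

def acoustic_model(frames):
    # Stub: map each frame of features to a phone label ("sil" = silence).
    return ["a" if sum(f) > 0 else "sil" for f in frames]

def language_model(phones):
    # Stub: collapse the phone sequence into a word prediction.
    return "".join(p for p in phones if p != "sil")

signal = [0.1, 0.3, -0.2, 0.5, 0.0, 0.0, 0.0, 0.0]
word = language_model(acoustic_model(extract_features(signal)))
print(word)  # a
```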

Traditionally in multi-lingual ASR, the input signal was passed through a language identification (LID) block that would identify the language. The team's idea was to remove the LID block and instead build a unified Joint Acoustic model in which the language is learnt implicitly, and the signal is then processed to get the corresponding transcript. IIITH is currently working on a multi-lingual model with nine languages: Hindi, Marathi, Urdu, Bengali, Tamil, Telugu, Kannada, Malayalam and Gujarati.
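The difference between the two pipelines can be sketched as follows. Both `identify_language` and `transcribe` are hypothetical stubs; the point is only the control flow: the traditional pipeline routes the signal through LID first, while the joint model transcribes directly.

```python
# Contrast of the traditional LID-first pipeline with the joint acoustic
# model approach. Both functions are illustrative stubs.
def identify_language(signal):
    # Stub LID block: a real one classifies the language from the audio.
    return "te"

def transcribe(signal, lang=None):
    # With lang=None the joint model learns the language by itself;
    # a per-language model needs the LID decision first.
    return f"<transcript:{lang or 'joint'}>"

signal = [0.0]
traditional = transcribe(signal, lang=identify_language(signal))  # LID, then ASR
joint = transcribe(signal)                                        # no LID block
print(traditional, joint)
```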

Of Data, Orthographies and Code-Mixing

A speech recognition system requires thousands of hours of properly labelled data. One key learning from the research is that what a neural network learns from a data-rich language can be effectively transferred to resource-scarce languages. In spite of their different orthographies, Indian languages share a common lexicon. "To make up for the scarcity of data, we decided to combine languages and build an ASR that solves both the data scarcity and the code-mixing problem", explained Prof. Vuppala. Hence, with 60 hours of Tamil and Telugu data and only 10 hours of Kannada data, one can still build a good Kannada ASR.
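The data-pooling arithmetic from the Kannada example can be made concrete. The per-language hour figures are taken from the article; the pooling itself is a trivial sketch of the idea that all languages' data feed one shared model.

```python
# Hours of labelled speech per language (figures from the article).
hours = {"tamil": 60, "telugu": 60, "kannada": 10}

# With a common label set, all languages contribute to one training pool,
# so the 10-hour Kannada set rides on 120 extra hours of related speech.
pooled_training_hours = sum(hours.values())
print(pooled_training_hours)  # 130
```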

Training acoustic and language models to deal with language combinations like Hinglish or Tenglish has the benefit of handling code mixing. "For instance, in the statement Mein iPhone kharid raha hoon, you are switching between two languages, L1 and L2. L1 is the native language; L2 is English. When words from L2 are borrowed into L1, that is called code mixing. In this instance, we postulate that you can collect proper mono-lingual systems for Telugu, Tamil, Hindi and English and build a multi-lingual system around them." The thought process started 4-5 years back, when Dr. Hari Krishna Vydana, one of Prof. Vuppala's first students, started working on ASR at the IIITH Lab at the Language Technologies Research Center.

A mono-lingual format for Telugu

Come August 2021, IIITH will be the first laboratory in India to have a mono-lingual system for the Telugu language with a 2000-hour corpus. Data will be crowd-sourced from different backgrounds, from the regions of Andhra, Telangana and Rayalaseema, each with its own distinctive dialect. The target study group is adults in the age group of 18-50 years, the reason being that their vocal tract characteristics are stable. Dr. Vishnu Vidyadhar Raju, an alumnus of IIITH (2020), is planning to extend this technology to commercial use by applying emotion recognition to voice analytics in call centres.

Unravelling contextual quandaries

How does the system recognise a word, e.g. weather versus whether, where the pronunciation is the same? "When we train the system, we categorize word endings and even train silence", explains Mirishkar. The language model predicts the next word. "We rely on the acoustic model that generates the word, then assign probability scores in the language model, and the candidate with the highest probability is taken".
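That rescoring step can be illustrated with a toy example. The probability table below is entirely invented; a real language model would be trained on a large text corpus, but the selection rule (take the candidate with the highest score given the context) is the same.

```python
# Toy language-model rescoring of acoustically identical candidates
# ("weather" vs "whether"). The probabilities are invented for illustration.
def pick_word(candidates, previous_word):
    """Return the candidate the language model scores highest in context."""
    lm_scores = {  # hypothetical P(word | previous_word)
        ("the", "weather"): 0.80,
        ("the", "whether"): 0.05,
        ("ask", "whether"): 0.70,
        ("ask", "weather"): 0.02,
    }
    return max(candidates, key=lambda w: lm_scores.get((previous_word, w), 0.0))

print(pick_word(["weather", "whether"], "the"))  # weather
print(pick_word(["weather", "whether"], "ask"))  # whether
```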

Avenues for commercialization are especially strong in geographies like India, Europe, SEA and Africa, which offer greater opportunities for multi-lingual vernacular voice search across multiple needs. The speech domain could be the next gold rush, and industry headliners from Google to Amazon, Microsoft and IBM are investing heavily in Indian languages in the ASR domain, chiefly due to the initiatives taken by the Modi government in e-governance and Digital India. That is the power of ASR, the technology of tomorrow that IIITH is revolutionizing!

Online testing of the ASRs is available at https://asr.iiit.ac.in/

Deepa Shailendra is a freelance writer for interior design publications; an irreverent blogger, consultant editor and author of two coffee table books. A social entrepreneur who believes that we are the harbingers of transformation who can bring about change to better our world.