Public access to the digital archives has not only been enhanced by a search engine but also made more inclusive through integration with audiobooks. Next step: bridging the language divide with multilingual translation.
The search engine is an initiative of the National Language Translation Mission, Bhashini. It was developed in collaboration with the International Institute of Information Technology, Hyderabad (IIITH), along with Punjabi University, Patiala, and C-DAC, Noida, under the guidance of Prof. Gurpreet Lehal, Consultant, Punjabi University, and Prof. C.V. Jawahar, IIITH.
“The debates, dating back to 1947, had been digitised in 2023 as part of the Punjab Digital Library project, which aims to preserve the cultural heritage of Punjab. But those PDFs were not searchable images,” notes Prof. Lehal, adding that a single PDF often contained three different languages – English, Hindi and Punjabi – written in their distinct scripts: Roman, Devanagari and Gurmukhi respectively. “The first challenge was to develop an OCR that could recognise the appropriate script and then convert it into text with high accuracy. The next was to make them searchable so that anyone who wanted to go through historical debates could retrieve the appropriate text. For instance, if I type ‘Punjabi Suba’ – the movement that ultimately led to the creation of Haryana – in Hindi, the engine will search through the two-lakh-page database and pull out all references to the movement in the three languages,” he says.
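To give a feel for what handling three scripts in one document involves, here is a minimal sketch in Python. It is not the project's OCR, which has to recognise scripts from page images; it only shows how text that is already in Unicode can be sorted by script using the standard Unicode block ranges for Devanagari (U+0900 to U+097F) and Gurmukhi (U+0A00 to U+0A7F). The function names `script_of` and `dominant_script` are made up for this illustration.

```python
from collections import Counter
from typing import Optional


def script_of(ch: str) -> Optional[str]:
    """Return the script of a single character, or None for digits, spaces, etc."""
    cp = ord(ch)
    if 0x0900 <= cp <= 0x097F:   # Unicode Devanagari block (used for Hindi)
        return "Devanagari"
    if 0x0A00 <= cp <= 0x0A7F:   # Unicode Gurmukhi block (used for Punjabi)
        return "Gurmukhi"
    if ch.isascii() and ch.isalpha():  # basic Roman letters (used for English)
        return "Roman"
    return None


def dominant_script(text: str) -> Optional[str]:
    """Return the script that most characters in `text` belong to."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    return counts.most_common(1)[0][0] if counts else None


# The same phrase typed in each of the three scripts.
for phrase in ["Punjabi Suba", "पंजाबी सूबा", "ਪੰਜਾਬੀ ਸੂਬਾ"]:
    print(phrase, "->", dominant_script(phrase))
```

A real cross-script search would additionally need transliteration or translation to link the three spellings to one another; this sketch covers only the identification step.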
Accessibility and Inclusion
One of the search engine's distinctive features is its ability to handle fuzziness in search queries, such as similar-sounding words or names. “Suppose you are typing in ‘Prakash Singh Badal’, you could type it as ‘Parkash’ as well and the engine will auto-correct for minor spelling errors and retrieve the correct output. Essentially, it reveals insights and fosters accountability in governance when one can search for and retrieve all topics debated by any MLA, along with their frequency of participation and so on,” states Prof. Lehal. Another inclusive feature is that the legislative archives have been made accessible to the visually impaired by converting them into audiobooks. Krishna Tulsyan, a researcher at IIITH who is part of Bhashini’s efforts to convert Indian language books into audiobooks, says, “We use consortium OCR to extract Unicode text from the PDFs and then use the Bhashini TTS to convert the text to speech that can either be played on-the-spot in the application itself or downloaded in a reader-compatible format like MP3 or DAISY.”
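The ‘Parkash’ versus ‘Prakash’ example can be illustrated with a few lines of Python. This is only a sketch of the general idea of fuzzy matching, using the standard library's `difflib.SequenceMatcher`; the article does not say which technique the actual engine uses. The `SPEAKERS` list, the `fuzzy_lookup` helper and the 0.8 threshold are all assumptions made for the example, not data or settings from the archive.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_lookup(query: str, names: list[str], threshold: float = 0.8) -> list[str]:
    """Return the names whose similarity to `query` meets the threshold, best first."""
    scored = sorted(((similarity(query, n), n) for n in names), reverse=True)
    return [name for score, name in scored if score >= threshold]


# Hypothetical, hand-typed sample index; not drawn from the actual archive.
SPEAKERS = ["Prakash Singh Badal", "Partap Singh Kairon", "Giani Zail Singh"]

# The transposed letters in "Parkash" still resolve to the intended name.
print(fuzzy_lookup("Parkash Singh Badal", SPEAKERS))
```

Character-level similarity of this kind copes with minor spelling slips; matching genuinely similar-sounding words would more likely call for a phonetic approach, which this sketch does not attempt.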
Text and Audio Translation
“The ultimate aim is to make the legislative archives accessible in all Indian languages, so that if a debate is in Punjabi, it can be made available in say, Marathi, to a native speaker of Marathi,” states Prof. Lehal, referring to Phase 2 of the project. According to him, conversion of textual matter into Unicode has helped lay the foundation for all other language services such as search, translation, conversion to speech and so on. “With Unicode conversion, integrating the search engine with Bhashini’s machine translation system will become very easy,” he says. Similarly, the availability of audiobooks in any of the Indian languages will enhance digital accessibility of these archives. Additionally, integration with a Large Language Model could enable intelligent, conversational search capabilities. “Users might be able to ask questions in natural language, such as ‘What were the key discussions on agricultural reforms in the 1980s?’ or ‘Compare political stances on Punjabi Suba across party lines’, and receive context-aware, summarized responses,” describes Prof. Lehal.
With Punjab leading the way in transparent governance, it is only a matter of time before the other 31 Vidhan Sabhas follow suit.

Sarita Chebbi is a compulsive early riser. Devourer of all news. Kettlebell enthusiast. Nit-picker of the written word especially when it’s not her own.