Indic LLMs and LLMs in Education

Prof. Vasudeva Varma explores two use cases of Generative AI and LLMs: Wikipedia development in Indian languages, and education and learning. This article is based on key parts of his talk from the TechForward Research Seminar Series on Foundational and Large Language Models.

GenAI and LLMs are the latest buzzwords in the technology world. The terms are used interchangeably, but they are in fact distinct. GenAI can be defined as artificial intelligence capable of producing original content such as text, images and music. LLMs, or large language models, are a specialised class of AI models that use natural language processing to understand and generate human-like text. Both have been hugely disruptive, and the world of LLMs is changing dramatically with the frantic pace of innovation. This has also caused a lot of confusion among academic researchers about which problems to work on, because LLMs seem to be solving everything. Indian LLMs, or Indic LLMs, hold a lot of contextual and cultural relevance, and it will be interesting to see how the efforts of BharatGPT, Sarvam, Krutrim and others unfold.

Most of the progress in AI so far has been on the part of the iceberg that is above the water. There is yet a lot of progress to be made on the part that is below the water. The biggest opportunity for India is to look at the sectors where the Indian economy is already strong and figure out the applications of AI to those sectors that enable it to maintain its advantages. Here’s a look at the two main use cases from India where our efforts have yielded a bigger impact: Wikipedia development in Indian languages and adopting LLMs in education and learning.

Wikipedia
Just as ancient civilizations depended upon and developed along the banks of great rivers, our modern society depends on a ‘river’ of knowledge flowing through our language and culture. The ‘river’ in question today is Wikipedia. The existence of Wikipedia is crucial for everything we do. But it’s not just Wikipedia in isolation; there’s an entire Wiki ecosystem comprising Wikidata, Wikisource, Wikimedia Commons and other projects that leverage each other, and that synergy is the combined power of Wikipedia.

As a Cultural Tool
Despite its potential as a ‘knowledge river’, Wikipedia’s presence in Indian languages is currently inadequate. While English has more than 7 million articles on Wikipedia, Hindi, the best represented of the Indian languages, has only around 150,000. At IIIT-H, we have set out to correct this anomaly and enhance Indian language content in Wikipedia with the IndicWiki project. Creating encyclopaedic content is a cognitively demanding task: it involves a lot of research and the collection of relevant material and references before the article can be penned. In addition, the material should conform to Wikipedia’s ‘five pillars’, which include writing from a neutral point of view, producing content that anyone can use, edit and distribute, and editors treating each other with respect and civility.

Telugu Wikipedia Content
We came up with a two-fold solution for this: the first is template-based development of Wikipedia articles and the second is automated generation of these articles. The first solution starts with creating a structured manual. We invited people who can read and write really well in their mother tongues, such as Telugu, Oriya or Hindi. They were also expected to have a working knowledge of Python. Equipped with their language and Python, and with the help of the manual, they were able to create thousands of articles in a particular domain. Multiple data sources brought together large amounts of data and within 6 weeks, we were able to create around a million articles in Telugu and about 200,000 articles in Hindi in each of those domains. While this approach was a great success in itself, we didn’t want to stop there. We explored the generation of encyclopaedic articles with the help of newer models and approaches using Generative AI. This is our second solution.
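To make the template-based idea concrete, here is a minimal Python sketch. It assumes each article is produced by filling a fixed sentence template from one structured record. The template text, field names and sample records are hypothetical placeholders, not material from the IndicWiki project, and the real templates would be written in the target language rather than English.

```python
from string import Template

# Hypothetical template for a village article. Real templates would be
# written in Telugu, Hindi, etc. and would cover many more sections.
VILLAGE_TEMPLATE = Template(
    "$name is a village in $district district of $state. "
    "As per the $census_year census, it has a population of $population."
)


def render_article(record: dict) -> str:
    """Fill the template from one structured record (one row of source data)."""
    return VILLAGE_TEMPLATE.substitute(record)


def render_all(records: list[dict]) -> dict[str, str]:
    """Generate one article per record, keyed by article title."""
    return {record["name"]: render_article(record) for record in records}


# Illustrative placeholder records, not real data.
records = [
    {"name": "Village A", "district": "District X", "state": "State Y",
     "census_year": 2011, "population": 1200},
    {"name": "Village B", "district": "District X", "state": "State Y",
     "census_year": 2011, "population": 845},
]

articles = render_all(records)
print(articles["Village A"])
```

Because the template is separate from the data, one well-written template can yield an article for every row in a dataset, which is one plausible reason the approach scales to very large numbers of articles.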

We looked at three sub-problems in automatically creating a credible Wikipedia article in Indian languages. First, we created a cross-lingual dataset of facts aligned with text. This means that for each fact that exists in Wikidata, we have a mapping to the text that expresses it in any of the Wikipedia articles in any of the Indian languages. This dataset is being used to train a model, which in turn is being used to generate text in any language from any newly published fact. In other words, we’ve done cross-lingual fact-to-text generation of encyclopaedic sentences. It gives us short pieces of text that are worthy of being in an encyclopaedic article. Second, given a set of documents on a specific topic, we explored creating an outline of an article that can go into Wikipedia. Together, these two sub-modules produce the skeleton of an article. The third problem we addressed relates to the references listed at the end of articles: we tried to augment the article outlines using those references. The foundational large language models could not help us much here, so we had to use a new kind of approach to generate encyclopaedic articles in Indian languages using resources from the Wikipedia ecosystem.
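One way to picture the aligned fact-to-text dataset is as a collection of rows, each pairing a fact with a sentence in one language. The sketch below is only an illustration of that shape: the class name, field names and sample rows are assumptions, and the sentences are placeholders rather than real Wikipedia text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedFact:
    """One (fact, sentence) pair; all field names here are hypothetical."""
    subject: str    # entity the fact is about
    predicate: str  # property name
    obj: str        # property value
    lang: str       # language code of the aligned sentence
    sentence: str   # sentence from an article that expresses the fact


def group_by_fact(rows):
    """Map each (subject, predicate, obj) triple to its sentence per language."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[(row.subject, row.predicate, row.obj)][row.lang] = row.sentence
    return dict(grouped)


# Placeholder rows: the sentences stand in for real article text.
rows = [
    AlignedFact("Entity A", "occupation", "writer", "te", "<Telugu sentence>"),
    AlignedFact("Entity A", "occupation", "writer", "hi", "<Hindi sentence>"),
    AlignedFact("Entity A", "birth year", "1950", "te", "<another Telugu sentence>"),
]

grouped = group_by_fact(rows)
for fact, sentences in grouped.items():
    print(fact, sorted(sentences))
```

Grouping by fact makes the cross-lingual aspect visible: the same triple can be paired with sentences in several languages, which is what a fact-to-text model would learn from.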

GenAI and Learning
GenAI is also being tried out in the learning space. One demonstration is a mobile app-based course developed by ISB for the Rajiv Gandhi University of Knowledge Technologies (RGUKT) to improve the English-speaking abilities of RGUKT students, who hail from rural backgrounds. A text is displayed on the screen, prompting the learner to read it aloud. As he or she reads, the voice is recorded and analysed.

The app then gives feedback on mispronunciations, omissions from the text, additions to the original text and so on. It was found that by using the app repeatedly, the students gained a lot more confidence in speaking English. Similarly, an AI model powers a negotiation course taught by ISB, which has also become very successful. This model acts as an agent with whom the learner needs to negotiate to close a given deal. ISB is looking at extending such models and tools to other areas of management education. However, there are issues of scalability and adaptability that are still being worked on.
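The kind of read-aloud feedback described above can be approximated by comparing the displayed text with a transcript of what the learner said. The sketch below is not the app’s actual method: it assumes a speech recogniser has already produced the transcript, and it uses word-level alignment from Python’s standard `difflib` module. The function name and the feedback categories are illustrative.

```python
from difflib import SequenceMatcher


def reading_feedback(expected: str, spoken: str) -> dict[str, list]:
    """Compare the displayed text with a transcript, word by word."""
    ref = expected.lower().split()  # words the learner was asked to read
    hyp = spoken.lower().split()    # words recognised from the recording
    feedback = {"omitted": [], "added": [], "substituted": []}
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":       # in the text but not spoken
            feedback["omitted"].extend(ref[i1:i2])
        elif tag == "insert":     # spoken but not in the text
            feedback["added"].extend(hyp[j1:j2])
        elif tag == "replace":    # spoken differently: possible mispronunciation
            feedback["substituted"].append((ref[i1:i2], hyp[j1:j2]))
    return feedback


result = reading_feedback("the quick brown fox jumps", "the quick fox really jumps")
print(result)
```

Substituted words are only candidates for mispronunciation: a real system would need acoustic or phoneme-level analysis to tell a mispronounced word from a recognition error.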

Middleware For Education
With LLMs on one side and learners on the other, one way to bridge the gap between the two is to build middleware. The idea is also to make the middleware useful to teachers for instructional design. Another thought is to make it a conversational tool that improves meta-cognitive skills and enhances accountable talk, that is, free-flowing discussions in the classroom. But along with the middleware come responsibility layers too.

A great example is Khanmigo, a bot created by Khan Academy. It patiently guides students and helps them arrive at the right knowledge levels, not by giving them answers but by letting them use their own reasoning abilities to come up with the required answers. Another example is from OpenAI, where they’re currently looking for model teachers who will train the models with the right kind of behaviours. Similarly, Google’s LearnLM is a family of models fine-tuned for learning and teaching experiences. It helps reduce the cognitive load while watching educational content, ignites curiosity, adapts to the learner and deepens metacognition. There are other new tools, such as Illuminate, which uses the LearnLM model. Illuminate breaks down research papers into short audio conversations that provide an overview of the key insights from the papers. One can also chat with the model and ask follow-up questions. Another tool is the “Learn About” experience. Here again, one can quiz the model, and it helps guide you through any topic at your own pace.

All of these are excellent examples of how foundational models can be fine-tuned to meet the demands of education and learning. However, technological innovations notwithstanding, tech will always be incidental to the learning process.

No technology can ever replace the magic of a teacher, but technology applied in deliberate and thoughtful ways can help augment the teacher’s capacity, giving them time to invest back in themselves.

Conclusions
The key point I want to make is that the last mile in applying GenAI is the hardest. While technology may enhance productivity, domain knowledge is the cornerstone of success. For example, if classroom insights from teachers and students are not captured, we will not be able to create truly impactful learning experiences. Finally, one needs to create better responsibility layers for these foundational models before applying or adopting them in a domain.

This article was initially published in the August edition of TechForward Dispatch 

Prof. Vasudeva Varma is the Head of the Language Technologies Research Centre at IIITH. His research interests are in the broad areas of information retrieval, extraction and access and more specifically social media analysis, summarization, semantic search, text generation, and cloud computing.
