The idea is to have an entire system in place eventually so that speakers can access information or use technology in their own language by speaking, listening or typing into their phones.
These days, a Hindi speaker who wants to search for content on the Internet can type a query in Devanagari script on their phone or simply speak it as a voice command. But what about those who communicate in languages spoken by just a few hundred thousand people, or languages with little to no online presence? These are the languages that Microsoft Research is supporting in India through its Project ELLORA (Enabling Low Resource Languages).
“We work on low-resource languages with technology, but we take the view that these communities have some idea about their needs and desires because they are marginalised in other ways as well. So we work with them to understand their pain points and see how technology can help,” Kalika Bali of Microsoft Research told indianexpress.com. Bali is an expert in Natural Language Processing, where linguistics and artificial intelligence come together to train computers to understand spoken and written languages.
Bali explained that ELLORA’s core aim is to make sure these languages, which have very few written resources, let alone a digital presence, are not left behind by the advances that language technology is seeing thanks to artificial intelligence (AI) and advanced natural language models. More importantly, a digital presence could help some of these languages survive the threat of extinction.
Microsoft Research (MSR) has chosen to focus on three of these languages for now: Gondi, which has close to three million speakers in Madhya Pradesh, Maharashtra, Chhattisgarh, Andhra Pradesh and Telangana; Mundari, which is spoken in Jharkhand, Odisha and West Bengal; and Idu Mishmi from Arunachal Pradesh.
According to Bali, Gondi is the language the company has worked on the longest, with CGNet Swara as its partner in Chhattisgarh. CGNet Swara is an online portal that lets Gondi speakers report local news in their own language via phone calls.
“We have helped out with things like Adivasi radio, which was a hub for accessing the information on the phone in Gondi. We have also been working with them to create a machine translation system because one of the biggest problems is access to information in their own languages,” Bali said.
MSR plans to test this machine translation system in the field soon. If it works well, it will let Gondi speakers access, in their own language, any information that is available in Hindi. In Arunachal Pradesh, MSR is working to create a digital dictionary for the Idu Mishmi language and has partnered with Pratham Books.
For Mundari, MSR has partnered with IIT-Kharagpur and GIZ, the German development agency. In Mundari’s case the task is specific: create educational material for children, as very few resources are available. “The idea is to create the entire pipeline. We are working on creating a text-to-speech model which would be able to make the system talk in Mundari. We are also working on a machine translation model. In fact, we have a small machine translation model ready,” Bali said, adding that they are testing the model now and will work on the speech recognition part as well.
The idea is to eventually have an entire system in place for Mundari so that speakers can access information or use technology in their own language by speaking, listening or typing into their phones. Bali also stressed that for languages like Mundari, their models don’t rely on word-for-word translations. Instead, the team asks native speakers to translate Hindi sentences into their language, and uses those translations as the data set for training the computer model.
One tool they developed as part of their efforts is called Interactive Neural Machine Translation (INMT), which predicts the next word while someone is translating between these languages, say from Hindi to Mundari. “It gives me predictive suggestions in Mundari itself. It’s like the predictive text you get in smartphone keyboards, except that it does it across two languages,” Bali explained, adding that such tools will make human translators more efficient as well.
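The predictive behaviour Bali describes can be pictured with a small sketch. This is not how INMT is built, since the real tool uses a neural translation model; the lookup below simply ranks the next target-language word by how often earlier translations of the same source sentence continued the words typed so far. The function name `suggest_next`, the `memory` list and the placeholder tokens (`s1`, `t1` and so on) are all hypothetical and stand in for real Hindi and Mundari text.

```python
from collections import Counter


def suggest_next(source, prefix, memory, k=3):
    """Rank candidate next target words for a partially typed translation.

    source: the sentence being translated (placeholder tokens here).
    prefix: the target-language words the translator has typed so far.
    memory: hypothetical list of (source, target) pairs translated earlier.
    """
    prefix_tokens = prefix.split()
    n = len(prefix_tokens)
    counts = Counter()
    for src, tgt in memory:
        tgt_tokens = tgt.split()
        # Keep only earlier translations of this source that begin with the prefix.
        if src == source and tgt_tokens[:n] == prefix_tokens and len(tgt_tokens) > n:
            counts[tgt_tokens[n]] += 1
    # Most frequent continuations first.
    return [word for word, _ in counts.most_common(k)]


# Placeholder data only: s* stands for source words, t* for target words.
memory = [
    ("s1 s2 s3", "t1 t2 t3"),
    ("s1 s2 s3", "t1 t2 t4"),
    ("s1 s2 s3", "t1 t2 t3"),
    ("s4 s5", "t5 t6"),
]
suggestions = suggest_next("s1 s2 s3", "t1 t2", memory)
```

A neural system replaces the exact-match lookup with a model that scores every word in the target vocabulary given the source sentence and the typed prefix, so it can offer suggestions for sentences it has never seen.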
Of course, there is also the challenge of ensuring that the models work on low-end phones. Because people in marginalised communities typically have access only to lower-end phones, the models have to be optimised with this constraint in mind. “One of the big problems is that we want these models to work on devices like phones. We have spent a lot of time working on how to make, distil and quantise these models into smaller models that can actually work on the phone,” Bali explained.
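As a rough illustration of what quantising a model means, the sketch below stores a list of floating-point weights as 8-bit integers plus a single scale factor, which is one common way of shrinking a model for a phone. It is a generic example and not MSR's pipeline; the function names and sample weights are made up for illustration, and distillation, the other technique Bali mentions, is not shown.

```python
def quantise(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    # Avoid dividing by zero when every weight is zero.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantised = [round(w / scale) for w in weights]
    return quantised, scale


def dequantise(quantised, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantised]


# Made-up weights, for illustration only.
weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantise(weights)
restored = dequantise(q, scale)
```

Each weight now needs one byte instead of four, at the cost of a rounding error of at most about half the scale factor per weight.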
On the current buzz around Large Language Models (LLMs) and their role in translation tools, Bali said they had also tested some publicly available LLMs for some of their research. But getting these models to work with languages that have little to no data will require more work. “It’s an open research question on how we can pivot these LLMs to work for some of the smaller languages. And, you know, the answer may lie in creating a separate layer on top of this technology. Or it may lie in actually having enough data to pump into the base models. I don’t think we are very sure. It is an open research thing to see how we do this,” she said.
For now, the ultimate aim of Project ELLORA remains clear: “That the gap between linguistically haves and have-nots does not increase further.”