Building Chandamama Kathalu
"Language is not a mere string of words. It has a suggestive power well beyond the lexical meaning"-Ngũgĩ wa Thiong'o
When we were first pitching for volunteers to help us build the dataset for Chandamama, one of the first questions I was asked was, “Why? Why do we need a Telugu LLM? Isn’t ChatGPT with English good enough?”
I thought it was a very good question. The most important question is always “Why?” We can answer questions like “What is life?” or “How is life formed?”, but asking “Why is life formed?” takes us into the philosophical domain. The same applies here: why should we spend so much time, effort, and money trying to build an LLM in Telugu?
Rather than philosophy, I will share a couple of quotes that resonated with me and that also answer the question of why.
"Language is not a mere string of words. It has a suggestive power well beyond the lexical meaning" -Ngũgĩ wa Thiong'o
“Learning for a colonial child became a cerebral activity. Not an emotionally felt experience” -Ngũgĩ wa Thiong'o
The first quote is about the suggestive power of language. No matter how good you are at English, the suggestive power of your mother tongue will always be greater, because you learn about the natural world in your mother tongue. And that is why, for me,
పదండి ముందుకు ,
పదండి త్రొసుకు,
పొదాం పై పై కి,
will always be more powerful than
Go forward,
Go pushing,
Let's go higher.
The second quote is much more impactful. It finally puts into words exactly what I felt while learning in an English-medium school. It was not an emotional experience. It was a cerebral activity. Too many neurons were spent converting the medium of instruction into my own language before I could understand the concept. I became better at this as I grew up, but I will always be a colonial child unless I am free of the colonial language.
I am beyond saving, but thanks to AI, I believe that future generations will be able to choose their own language, and that people will not be discriminated against based on which language they know best.
With this in mind, we set about building a small language model in Telugu as a proof of concept of the pipeline. We proved three things during this exercise.
The power of students in India is immense. It just has to be guided in the right direction. This was evident in the fact that we planned and executed an event that brought 10,000 students from various colleges onto a single platform and had them create 40,000 stories in 4 hours. We were pleasantly surprised by its success. We are now planning multiple datathons this year to create many more open source datasets.
Creating open datasets is the way to go. Releasing the dataset as open source gives it a much wider reach. Now academia, startups, the IT industry, and government can all use this dataset and build models for their own needs. This also proved that small language models are good enough for specific tasks.
Love for local languages is immense. The whole exercise took around a month, and seeing people from different industries come together to create and launch this model gives us great hope that almost everyone wants local language models to succeed. A side effect of this experiment was renewed interest in Telugu stories and in Chandamama.
Thanks to the success of the Chandamama Kathalu experiment, we will now be seeing a lot more activity in the open source AI movement. Gaurav Raina (one of the original idea creators of Chandamama Kathalu) and team from IIT-M will now take this experiment pan-India.
Chandamama is not just a word. It’s an emotion.