Data is the new oil, apparently. And just as with oil, we are seeing plenty of shady techniques being used to collect it: https://time.com/6247678/openai-chatgpt-kenya-workers/
But are there better ways to collect data? What are the alternatives?
Karya is doing something in this domain. They are trying to pay decent wages to the people who collect and clean data.
In this regard, Swecha tried a unique approach to data collection. What if we make the people part of the data collection effort? Sort of an AI by the people, for the people.
So, how do we go about making people a part of the data collection process?
People have to be told clearly what the data is being used for.
People should be happy to share their data.
Some incentive for sharing the data has to be provided.
The data has to be released as open source so that it can benefit the commons.
Open source is the binding force here. Without the model being released as open source, why would anyone share their data?
Doing this at a scale where the collected data is useful for AI needs a huge grassroots effort. And that's where Swecha comes in. The Swecha team already has experience in pulling off such open source campaigns. It was the first to launch an Indian-language operating system when it launched Telugu Ubuntu. That, too, was community driven.
So, in a similar fashion, Swecha activated its volunteer network to go to different parts of the two Telugu-speaking states, Andhra Pradesh and Telangana, and collect voice samples from the people. The goal was to cover different accents and different locations. So the volunteers went to villages and schools, and collected data even on the road :)
The volunteers were able to collect 1.5 million voice samples, amounting to around 1,000 hours of Telugu ASR data. A similar effort would cost somewhere between 50 lakh and 1 crore rupees, but the Swecha team was able to achieve this through a community effort.
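As a rough back-of-the-envelope check on these figures, the sketch below works out the average clip length and the implied cost per hour of audio. It takes the numbers quoted above at face value and assumes the cost estimate is in Indian rupees (1 lakh = 100,000; 1 crore = 10,000,000). The variable names are only for illustration.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the post.
# Assumption: the cost estimate is in Indian rupees.
LAKH = 100_000
CRORE = 10_000_000

samples = 1_500_000   # voice samples collected
hours = 1_000         # approximate hours of audio

# Average length of one voice sample, in seconds.
avg_clip_seconds = hours * 3600 / samples

# Implied cost per hour of audio for a comparable paid effort.
cost_per_hour_low = 50 * LAKH / hours
cost_per_hour_high = 1 * CRORE / hours

print(f"Average clip length: {avg_clip_seconds:.1f} s")
print(f"Implied cost per hour: {cost_per_hour_low:,.0f} to {cost_per_hour_high:,.0f} rupees")
```

So each sample averages roughly 2.4 seconds, and a comparable paid effort would work out to roughly 5,000 to 10,000 rupees per hour of audio.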
The data was collected, but what was the incentive given to the people who shared their voice samples?
It was entry to a concert by Ram Miryala. So you bought tickets to the concert with your voice samples instead of money. Thanks to Ram Miryala for accommodating the request and doing the concert for free.
The volunteers donated their time.
The people donated their voice.
Artists like Ram Miryala donated their talent.
The community came together to create the dataset.
Now it was the technologists' turn. My team from Ozonetel and the Swecha engineers got together and started to build a Telugu ASR model from this dataset. The goal was to build a small model that can run on a mobile phone.
We were able to build the model quickly and, to our pleasant surprise, it worked well beyond our expectations, with accuracy above 95% in most cases.
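How the accuracy figure was measured is not covered here. One common way to score an ASR model is word error rate (WER), with word accuracy taken as 1 - WER, and the sketch below shows that calculation under that assumption. The function names and the example sentences are hypothetical placeholders, not taken from the Swecha dataset or model.

```python
def word_edit_distance(reference: list, hypothesis: list) -> int:
    """Levenshtein distance between two sequences of words."""
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref_words, hyp_words) / len(ref_words)


# Made-up example: a human transcript versus a model's output.
reference = "we collected voice samples from many villages"
hypothesis = "we collected voice sample from many villages"

wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.3f}, word accuracy: {1 - wer:.3f}")
```

Here one of the seven reference words is wrong, so the WER is 1/7 and the word accuracy is 6/7, or about 86%. An accuracy above 95% on this measure would mean fewer than one word error in every twenty reference words.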
The output of all this effort is below.
Once the proper open source license is finalized, we will release the model and the datasets so that more people can innovate.