AI expert pioneers breakthrough methods for African language data collection

Dr. Ogun during a research presentation

In a significant contribution to the field of Artificial Intelligence, Dr. Sewade Ogun, whose innovative filtering system has been adopted by research teams across three continents, is addressing one of AI’s most pressing challenges: the critical shortage of African language data.

“We need data to train our own speech recognisers and language models,” explained Ogun, whose groundbreaking work in data curation has earned recognition from both academia and industry leaders. “The disparity between resource-rich languages like English and typical African languages directly impacts AI performance. Data is indeed the new oil.”

Ogun’s reputation in the field stems from his development of a revolutionary filtering system that has transformed how researchers approach dataset quality – an innovation now cited in several research papers and implemented by AI labs in Europe, North America, and Asia. His method, which employs deep-learning-based quality estimation models, has become the standard first-stage approach for filtering low-quality recordings out of large speech corpora.
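As a rough illustration of what a first-stage quality filter of this kind can look like, the Python sketch below keeps only recordings whose estimated quality score clears a threshold. It is not Ogun’s published implementation: the `Recording` type, the file names, the scores, the 3.0 cut-off and the `fake_estimator` function are all hypothetical stand-ins, and a real pipeline would call a trained quality-estimation model (for example, one predicting a mean opinion score on a 1 to 5 scale) in place of the lookup table.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Recording:
    path: str          # hypothetical file name, for illustration only
    duration_s: float  # clip length in seconds


def filter_by_quality(
    recordings: Iterable[Recording],
    estimate_quality: Callable[[Recording], float],
    threshold: float = 3.0,  # assumed cut-off on a 1-5 quality scale
) -> List[Recording]:
    """Keep recordings whose estimated quality score meets the threshold."""
    return [r for r in recordings if estimate_quality(r) >= threshold]


# Stand-in scores: a real system would obtain these from a trained model.
_FAKE_SCORES = {"clip_a.wav": 4.2, "clip_b.wav": 1.8, "clip_c.wav": 3.0}


def fake_estimator(rec: Recording) -> float:
    """Hypothetical scorer that looks up a pre-assigned score by file name."""
    return _FAKE_SCORES[rec.path]


clips = [
    Recording("clip_a.wav", 5.1),
    Recording("clip_b.wav", 3.4),
    Recording("clip_c.wav", 7.9),
]
kept = filter_by_quality(clips, fake_estimator, threshold=3.0)
kept_paths = [r.path for r in kept]
print(kept_paths)
```

Passing the scorer in as a function keeps the filtering step independent of whichever quality model is used, so the same loop works whether the scores come from a neural estimator or from human ratings.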

“The breakthrough came after experimenting with numerous complicated filtering methods,” Ogun recounts. “I conceptualised a fundamentally different approach, and the results were exceptional.” This methodological innovation has dramatically improved training efficiency for speech recognition systems, reducing processing time by up to 40% while increasing accuracy by 15-25% for low-resource languages.

His contributions extend beyond methodology to actual dataset creation. Ogun has personally developed and published the most comprehensive Yoruba speech dataset currently available for AI research – a collection now used by researchers at seven major universities and three leading technology companies for training speech recognition and synthesis systems.

At a recent presentation to machine learning specialists and data scientists in Lagos, Nigeria, Ogun emphasised the critical importance of community-driven data collection efforts. “Platforms like Common Voice support recordings in many African languages, but the scale remains insufficient,” he noted, referencing the success stories of Kinyarwanda and Kabyle – each with over 100 hours of recorded data and more than 2,000 community contributors.

“If Africa can prioritise systematic data curation now, the future of AI applications will be transformative,” Ogun stated. “We have numerous challenges well-suited to AI solutions, but without addressing the fundamental data gap, progress will remain limited.”

His pioneering approach to quality estimation in speech datasets has established a new paradigm in the field – one that many researchers describe as “Ogun’s MOS Method”. This recognition reflects Ogun’s standing among the top tier of experts specialising in resource-constrained AI development.

For researchers and technology developers working on African language applications, Ogun’s innovations have created entirely new possibilities. His methods have been implemented in speech recognition systems for five previously unsupported African languages, potentially benefiting millions of speakers previously excluded from voice-based technologies.
