Going head-to-head with the best for machine translation of African languages
November 1, 2022 · 3 min read
Recently, one of the most-used machine translation services in the world released online translation for a number of ‘low-resource’ African languages (languages with relatively little training data available for machine learning applications), including Luganda, Oromo, and Tigrinya. These languages are spoken in Northern and Eastern Africa, including Uganda, Kenya, Ethiopia, Eritrea, Egypt, and Somalia. Here at VoxCroft we took that as a challenge! So we pitted our own machine translation models for those same languages against theirs in a head-to-head battle to see which would outperform the other.
Before we declare a winner, it may be helpful to understand VoxCroft’s bespoke methodology. VoxCroft builds machine translation models by using Proteus, our sophisticated crowd platform, to quickly obtain high-quality translations of sentences that are relevant to the client and their needs. All candidate translations are vetted for quality. This process ensures that the data is well suited to the customer’s specific use case, since it is well known in machine learning that relevant training data is key (see, for example, our earlier analysis of this point).
Some commercial machine translation services, in contrast to this highly targeted approach, have chosen to use a different method based on using only monolingual data for rare languages, i.e. data without any translations. When combined with bilingual translations for common languages, this allows them to build impressive single models covering large numbers of languages.
How do these models compare to our bespoke models trained on highly relevant news articles? To be specific, let’s consider three low-resource African languages: Luganda, which is spoken by 5 million people in Uganda; Oromo, which is used by about 40 million people in countries such as Ethiopia and Kenya; and Tigrinya, which is spoken by 8.5 million people in Ethiopia and Eritrea.
For these languages, one leading commercial machine translation provider used 6 million monolingual sentences of Oromo, 4 million of Tigrinya, and 2 million of Luganda, as well as more than 10 billion monolingual sentences for English, Spanish, and German. The VoxCroft models are trained on just 80,000 high-quality translated sentences for each of the three languages, without any other data. This certainly seems to give our one-on-one showdown David and Goliath proportions. This is where the story gets particularly interesting.
Testing all the models on exactly the same news stories shows that, at the time of writing, VoxCroft models significantly outperform the competition’s model on all metrics. (We test using the BLEU, METEOR, and BERT scores.) In particular, for the BLEU score, a standard metric of translation performance, we find that a model trained on just 80,000 relevant, high-quality translations outperforms their state-of-the-art system by 26%, 46%, and 52% for Tigrinya, Oromo, and Luganda respectively.
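For readers who want to see how such a comparison can be computed, here is a minimal Python sketch. It is not VoxCroft's evaluation code: it assumes a simplified corpus-level BLEU (one reference per sentence, whitespace tokenization, no smoothing), and it assumes the percentages above are relative improvements in BLEU, which the post does not state explicitly. The function names and the toy sentences are hypothetical and are not our test data. A real evaluation would more likely rely on an established tool such as sacreBLEU.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale.

    Assumptions: one reference per hypothesis, whitespace tokenization,
    uniform n-gram weights, and no smoothing.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: an n-gram counts at most as often as it
            # appears in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty punishes output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


def relative_improvement(ours, theirs):
    """Percentage by which score `ours` exceeds score `theirs`."""
    return 100 * (ours - theirs) / theirs


# Toy data, purely illustrative (not taken from our evaluation set).
references = ["the president will visit the capital city on monday"]
system_a = ["the president will visit the capital city on monday morning"]
system_b = ["president will go to the capital city on monday"]

bleu_a = corpus_bleu(system_a, references)
bleu_b = corpus_bleu(system_b, references)
print(f"System A BLEU: {bleu_a:.1f}")
print(f"System B BLEU: {bleu_b:.1f}")
print(f"Relative improvement of A over B: {relative_improvement(bleu_a, bleu_b):.0f}%")
```

Both systems must be scored against the same references for the relative improvement to be meaningful, which is why all models are tested on exactly the same news stories.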
Of course, the typical commercial machine translation model has a very different purpose: providing a general-purpose translation service that supports as many people in the world as possible. In contrast, VoxCroft’s goal is to provide our clients with data and models of the highest possible quality in the specific languages of interest. Both have a place, but at VoxCroft we are excited to have developed an efficient way to make voices in these low-resource languages heard so we can help tell the world’s stories.
We include our score data below, and challenge other MT teams out in the world to join the competition. Show us what you’ve got!
BLEU score performance comparison for Oromo, Tigrinya, and Luganda, where higher is better. VoxCroft’s models trained on just 80,000 sentences outperform the leading commercial model by 26%, 46%, and 52% for Tigrinya, Oromo, and Luganda respectively.
Table 1: Tigrinya-To-English Translation Examples
Table 2: Luganda-To-English Translation Examples
Table 3: Oromo-To-English Translation Examples
For information on this machine translation service or other VoxCroft solutions, please contact firstname.lastname@example.org.