
Prof Solomon Gizaw
Solomon Gizaw gave us a fascinating look into gender bias in natural language processing, using Amharic, the official federal language of Ethiopia, as the main case study. The big question behind the talk was simple but powerful: can language data itself carry gender bias before an AI model even starts making decisions?
The answer was yes. Gizaw explained that many AI systems rely on large collections of text, called corpora, drawn from websites, news, public documents and other written sources. Because society itself can carry gender bias, the language collected from it can reflect those same patterns.
The wow factor here is how words can be turned into mathematics. Word embeddings represent each word as a vector, so computers can measure relationships between words as distances and directions. This makes it possible to capture analogies such as king − man + woman ≈ queen, and the same arithmetic can expose harmful associations in the data: men placed closer to roles like programmer, engineer or scientist, women placed closer to domestic roles.
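To make the vector arithmetic concrete, here is a minimal Python sketch with toy three-dimensional vectors. Everything in it (the vocabulary, the numbers) is invented for illustration; real embeddings are learned from millions of sentences and have hundreds of dimensions.

```python
import numpy as np

# Toy 3-dimensional "embeddings", invented purely for illustration;
# real models learn hundreds of dimensions from large corpora.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: closer to 1.0 means more similar directions."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The classic analogy: king - man + woman should land nearest to queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
candidates = [w for w in vectors if w not in {"king", "man", "woman"}]
print(max(candidates, key=lambda w: cosine(vectors[w], target)))  # queen
```

The same nearest-neighbour test, run with gendered words on one side and profession words on the other, is what surfaces the biased associations.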
Gizaw showed how his team trained word embedding models on an Amharic corpus of more than 1.6 million sentences. They used gender-related words such as equivalents of “he”, “she”, “man”, “woman”, “father”, “mother”, “son” and “daughter”, then compared these with professional terms such as doctor, lawyer, scientist, engineer, nurse and maid. The results showed that many professions were male-leaning, while female-leaning terms were often linked to domestic work or jobs requiring lower levels of formal education.
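As a rough illustration of that kind of comparison (not the team's actual pipeline or data), the sketch below trains a tiny gensim Word2Vec model on a made-up English stand-in corpus and scores each profession by how much closer it sits to male words than to female words. The corpus, the word lists and the scoring function are all assumptions made for this example; the real study used Amharic terms and a corpus of more than 1.6 million sentences.

```python
from gensim.models import Word2Vec

# A tiny made-up corpus as a stand-in; the actual study trained on an
# Amharic corpus of 1.6M+ sentences. Word lists are English placeholders.
sentences = [
    ["he", "works", "as", "a", "doctor"],
    ["the", "man", "is", "an", "engineer"],
    ["she", "works", "as", "a", "nurse"],
    ["the", "woman", "works", "as", "a", "maid"],
] * 100  # repeated so the toy model sees each pattern often enough

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1,
                 epochs=30, seed=1, workers=1)  # workers=1 for repeatability

MALE, FEMALE = ["he", "man"], ["she", "woman"]

def gender_lean(word):
    """Positive score: nearer the male words; negative: nearer the female."""
    m = sum(model.wv.similarity(word, g) for g in MALE) / len(MALE)
    f = sum(model.wv.similarity(word, g) for g in FEMALE) / len(FEMALE)
    return m - f

for job in ["doctor", "engineer", "nurse", "maid"]:
    print(f"{job:9s} {gender_lean(job):+.3f}")
```

On such a contrived corpus the numbers themselves mean little; the point is the shape of the test: a profession's score reflects which gendered words it tends to co-occur with in the training data.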
This matters because if an AI system is trained on biased language data, it can reproduce biased outcomes in real-world applications. A model may appear mathematically objective, but the data it learns from may already contain social inequalities.
The talk also highlighted why this work is especially important for African languages. Amharic is morphologically rich: words change form according to gender, number, tense and other features. This means bias detection cannot simply be copied over from English or other widely studied languages; it needs local linguistic knowledge, cultural context and careful technical evaluation.
The key takeaway from Gizaw’s seminar was clear: before building AI systems, we need to check the data first. If the data is biased, the model may also become biased.
For Quantum@SUN, this seminar was a strong reminder that computation is not only about algorithms and accuracy. It is also about responsibility. AI does not only learn language — it can also learn society.
And that is why we must teach it carefully.
For more information, follow this link.