How Mozilla is crowdsourcing speech to diversify voice recognition

This piece was posted to The Next Web on August 27, 2019.

The immediate future of human-machine interaction lies in voice control, what with smart speakers, home appliances, and phones listening for commands to do our bidding.

But voice assistants like Amazon’s Alexa and Apple’s Siri reflect their overwhelmingly white, male developers, which leads to gender and racial bias. For example, if you have an accent, or your native language is something other than English, chances are you won’t get what you’re asking for.

To solve this, Mozilla, a free software community, created “Common Voice” in 2017. The tool crowdsources voices into a dataset meant to diversify AI and represent the global population, not just the West.

Common Voice releases its growing dataset publicly so any company can use it for research and to build and train its own voice-enabled applications, ultimately working to improve vocal recognition for all regardless of language, gender, age, or accent. Currently, there are more than 2,400 hours of voice data and 29 languages represented, including English, French, German, Traditional Mandarin Chinese, Welsh, and Kabyle.

“Existing speech recognition services are only available in languages that are financially profitable,” Kelly Davis, the Head of Machine Learning at Mozilla, told TNW. “They also tend to work better for men than women and struggle to understand people with different accents, all of which are a result of biases within the data on which they are trained.”

Speech is becoming a preferred way to interact with tech, and that’s contributed to the growth of services like Amazon Alexa and Google Assistant. These voice assistants have revolutionized the way we communicate with tech, “however, the innovative potential of this technology is widely untapped because developers, researchers, and startups around the globe working on voice-recognition technology face one problem alike: a lack of publicly available voice data in their respective language to train speech-to-text engines,” Davis explained.

Davis believes that representation in AI is starting to improve, albeit slowly, but it is still far from where it should be. In late 2017, Amazon introduced an Indian-English accent for Alexa, allowing it to pronounce Indian words and understand some Indian nuances. But the voice assistant still caters mostly to the West, with six of its seven language options being European.

At the start of 2018, Google announced Hindi support for its voice assistant, but the feature was limited to just a few queries. A few months after the initial release, the tech giant updated its feature so the Google Assistant can now have a conversation in Hindi — the third most spoken language in the world.

“Largely, the efforts to address the race gap in AI have fallen on non-corporate hands,” Davis said. For example, Black In AI, a project creating ways to increase the representation of people of color in AI, was launched by ex-Googlers in 2017. However, it didn’t launch as an official extension of their company’s work; it was launched to address what they saw as a need in the community.

In April, a study by New York University’s AI Now Institute found that a lack of diverse representation at major technology companies such as Microsoft, Google, and Facebook causes AI to cater more readily to white men. The report highlighted that only 15 percent of Facebook’s AI staff are women, and the problem is even more substantial at Google where just 10 percent are female.

Today’s speech recognition technologies are largely tied up in a few products: think Amazon’s Alexa and Google’s own assistant. These major voice assistants are driven by commercial interests and only serve the majority languages, mainly English. “Most speech databases are trained with an overrepresentation of certain demographics which results in a bias towards male and white and middle class,” Davis added. “Accents and dialects tend to be under-represented in training datasets. Many machines struggle with understanding female voices or voices of elderly people.”

Davis argues only a fraction of people benefit from vocal recognition technology. “Think about how speech recognition could be used by minority language speakers to enable more people to have access to technology and the services the internet can provide, even if they never learned how to read?” Davis said. “The same is true for visually impaired or physically handicapped people, but regular market forces will not help them.”

Common Voice hopes to speed up the process of collecting data in all languages around the globe, regardless of accent, gender, or age. “By making this data available – and developing a speech recognition engine in the open, project Deep Speech – we can empower entrepreneurs and communities to address existing gaps on their own,” Davis added.

Anyone can help diversify the vocal recognition in Mozilla’s project. Just head over to Common Voice and record yourself reading out sentences, or listen to others’ recordings and verify whether they’re accurate. It’s projects like this that’ll eventually close the racial gap in AI, and large corporations should take note.
