“Clow-dia,” I say once. Twice. A third time. Defeated, I say the Americanized version of my name: “Claw-dee-ah.” Finally, Siri recognizes it.
Having to adapt our way of speaking to interact with speech recognition technologies is a familiar experience for people whose first language is not English or who do not have conventionally American-sounding names. I have even stopped using Siri because of it.
The rollout of speech recognition technologies over the last few decades has revealed a deeply problematic issue ingrained in them: racial bias. One recent study, published in PNAS, showed that speech recognition programs are biased against Black speakers. On average, all five programs from leading technology companies like Apple and Microsoft showed significant race disparities; they were twice as likely to incorrectly transcribe audio from Black speakers as from white speakers.
In normal conversations with other people, we might choose to code-switch, alternating between languages, accents or ways of speaking, depending on our audience. But with automated speech recognition programs, there is no code-switching: either you assimilate, or you are not understood. This effectively censors voices that are not part of the “standard” languages or accents used to create these technologies.
“I don't get to negotiate with these devices unless I adapt my language patterns,” says Halcyon Lawrence, an assistant professor of technical communication and information design at Towson University who was not part of the study. “That is problematic.” Specifically, the problem goes beyond just having to change your way of speaking: it means having to adapt your identity and assimilate.
For Lawrence, who has a Trinidad and Tobagonian accent, and for others, part of our identity comes from speaking a particular language, having an accent, or using a set of speech forms such as African American Vernacular English (AAVE). For me as a Puerto Rican, saying my name in Spanish, rather than trying to translate the sounds to make it understandable for North American listeners, means staying true to my roots. Having to change such an integral part of an identity in order to be recognized is inherently cruel, Lawrence adds: “The same way one wouldn’t expect that I would take the color of my skin off.”
The inability to be understood by speech recognition programs also affects other marginalized communities. Allison Koenecke, a computational graduate student and first author of the study, points out a uniquely vulnerable community: people with disabilities who rely on voice recognition and speech-to-text tools. “This is only going to work for one subset of the population who is able to be understood by [automated speech recognition] systems,” she says. For someone who has a disability and is dependent on these technologies, being misunderstood could have serious consequences.
There are probably many culprits for these disparities, but Koenecke points to the most likely: training data. Across the board, the “standard” data used to train speech recognition technologies come predominantly from white speakers. By using speech corpora that are narrow both in the words used and in how those words are said, systems exclude accents and other ways of speaking that have unique linguistic features, such as AAVE. In fact, the study found that with increased use of AAVE, the likelihood of misunderstanding also increased. Specifically, the disparities found in the study were mainly due to the way words were said, since even when speakers said identical phrases, Black speakers were again twice as likely to be misunderstood compared to white speakers.
Additionally, accent and language bias lives in the humans who create these technologies. For example, research shows that the presence of an accent affects whether jurors find people guilty and whether patients find their doctors competent. Recognizing these biases would be an important step toward avoiding building them into technologies.
Safiya Noble, associate professor of information studies at the University of California, Los Angeles, admits that language is tricky to incorporate into a technology. “Language is contextual,” says Noble, who was not involved in the study. “Certain words mean certain things when certain bodies say them, and these [speech] recognition systems really don't account for a lot of that.” But that doesn’t mean that companies shouldn’t strive to decrease bias and disparities in their technologies. To do so, however, they need to appreciate the complexities of human language. For this reason, solutions can come not only from the field of technology but also from the humanities, linguistics, and social sciences.
Lawrence argues that developers have to be aware of the implications of the technologies they create, and that people have to question what purpose and whom these technologies serve. The only way to do this is to have humanists and social scientists at the table and in dialogue with technologists to ask the important question of whether these recognition technologies could be co-opted as weapons against marginalized communities, similar to certain harmful developments with facial recognition technologies.
From the tech side, feeding more diverse training data into the programs could close this gap, says Koenecke. “I think at least increasing the share of non-standard English audio samples in the training data set will take us towards closing the race gap,” she adds. Companies should also test their products more widely and have more diverse workforces so people from different backgrounds and perspectives can directly influence the design of speech technologies, says Noble.
Both sides agree that tech companies must be held accountable and should aim to change. Koenecke suggests that automated speech recognition companies use her team's study as a preliminary benchmark and continue using it to assess their systems over time.
With these strategies, tech companies and developers may be able to make speech recognition technologies more inclusive. But if they continue to be disconnected from the complexities of human language and society without recognizing their own biases, there will continue to be gaps. In the meantime, many of us will continue to struggle between identity and being understood when interacting with Alexa, Cortana or Siri. But Lawrence chooses identity every time: “I’m not switching, I'm not doing it.”
Claudia Lopez Lloreda is a freelance science writer and neuroscience graduate student at the University of Pennsylvania.