AI Model Biases: What went wrong with Whisper by OpenAI?

Published on Sep 1, 2024

When you start working with an AI model, however powerful, you can never be 100% sure of what will happen with it in practice. We've worked with Whisper ASR by OpenAI since its release in 2022 – and what we discovered is nothing short of surprising.

As the title suggests, this post is about (hidden) biases in AI, some of which may surprise you, too. We've handed the floor to our Founder and CEO, Jean-Louis Quéguiner, to break it down below.

Evaluating Whisper Large-v3’s speech recognition performance

Let's start from the beginning. In November 2023, OpenAI released Whisper Large-v3, an ambitious speech-to-text model to add to the Whisper family. Billed as a solution to the problem of “low-resource languages,” it promised unparalleled multilingual support. But did it deliver?

In speech recognition today, many languages suffer from a lack of annotated training data, especially online. Before Whisper Large-v3, notable efforts like Mozilla's Common Voice and Meta's "No Language Left Behind" project, spearheaded by Yann LeCun, made strides in addressing this gap.

Large v2, considered the most accurate of the Whisper models before v3, already supported 99 languages with varying levels of word error rate (WER) – a metric used to assess how accurate the model is in capturing speech. The team at OpenAI aimed to push this further with v3 – especially with non-English data, which represented only ⅓ of the original training set.

Despite the initial excitement around the release, the new version introduced or worsened several widely reported issues:

  • Broken punctuation
  • Increased hallucinations
  • (Still) unreliable accuracy in under-represented languages

Having optimized the model at Gladia, I can testify that these issues are very real and affect the model’s performance in real-life use cases.

All of them trace back to how a) the original model was trained and b) the latest model was fine-tuned.

Not many people know this, but the fine-tuning happened in a very particular way. It took me almost a year to figure out how. Let's break it down together.

Hallucinations = training bias

Whisper is loved for many things – but it has a bad reputation for hallucinations, which introduce random words and repetitions and can ruin the transcript. Just recently, it made headlines again for hallucinating violent language.

The reason why it happens stems from its training data: Whisper was essentially trained on YouTube and movies available on the internet. Why? These are large volumes of manually annotated audio with timestamps — which is perfect material for training an ASR model.

As a result, when it encounters silence in an audio recording, Whisper is likely to hallucinate classic YouTube sign-offs like "Thank you for watching [till the end]" or "Subscribe to my channel!"

YouTube-inspired hallucinations from GitHub

Degraded punctuation further exacerbates this, as Whisper processes audio in 30-second chunks – meaning it can easily ‘miss’ punctuation in between the chunks. These flaws have been with Whisper since the start.
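To make the chunk-boundary problem concrete, here is a minimal sketch of fixed 30-second windowing. It is a simplification for illustration only: the function names are hypothetical, and real pipelines may overlap or shift their windows rather than cutting at exact multiples of 30 seconds.

```python
CHUNK_SECONDS = 30.0  # window length mentioned above


def chunk_bounds(duration_s: float, chunk_s: float = CHUNK_SECONDS) -> list[tuple[float, float]]:
    """Return (start, end) times of consecutive fixed-size windows covering the audio."""
    bounds = []
    start = 0.0
    while start < duration_s:
        bounds.append((start, min(start + chunk_s, duration_s)))
        start += chunk_s
    return bounds


def sentence_is_split(start_s: float, end_s: float, chunk_s: float = CHUNK_SECONDS) -> bool:
    """True if a spoken sentence starts in one window and ends in a later one."""
    first_chunk = int(start_s // chunk_s)
    # Subtract a tiny epsilon so a sentence ending exactly on a boundary is not counted as split.
    last_chunk = int((end_s - 1e-9) // chunk_s)
    return first_chunk != last_chunk


print(chunk_bounds(75.0))             # three windows: 0-30, 30-60, 60-75
print(sentence_is_split(27.5, 33.0))  # sentence straddles the 30 s boundary
print(sentence_is_split(5.0, 12.0))   # sentence sits inside one window
```

A sentence that straddles a boundary is seen by the model as two unrelated fragments, which is where the full stop or comma between them can go missing.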

The fine-tuning controversy of low-resource languages

Now, back to the fine-tuned Whisper v3. Let's say you have scraped all of YouTube and potentially all the movies out there, but you don't have any more human-annotated data (with high-quality ground truth) to train on, especially for low-resource languages, as ⅔ of all the data you have is in English.

The cheapest way to improve despite this limitation is to use your current AI to automatically annotate unannotated data and feed it back into training to increase the weights and representations in your model for these languages. This way, 5-6 times more data was added.

So, this is how Whisper v3 was fine-tuned: by adding this new training data to the original dataset of low-resource languages.

The only problem is that the biases introduced in your original model (hallucinations and slightly degraded punctuation) will now be replicated in your new "AI auto-labeled" unsupervised dataset. So you end up multiplying the bias 5-6 times for non-English languages!
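The mechanism can be sketched with a toy example. Everything here is an illustrative assumption rather than OpenAI's actual pipeline: `toy_transcribe` is a hypothetical stand-in for the existing model that transcribes speech correctly but hallucinates a YouTube-style ending on silence, and the clip contents and counts are made up.

```python
from dataclasses import dataclass

HALLUCINATION = "thank you for watching"


@dataclass
class Clip:
    speech: str  # what was actually said; an empty string means silence


def toy_transcribe(clip: Clip) -> str:
    """Hypothetical stand-in model: right on speech, hallucinates on silence."""
    return clip.speech if clip.speech else HALLUCINATION


def pseudo_label(clips: list[Clip]) -> list[tuple[Clip, str]]:
    """Auto-annotate unlabeled clips with the current model's own output."""
    return [(clip, toy_transcribe(clip)) for clip in clips]


def wrong_label_share(dataset: list[tuple[Clip, str]]) -> float:
    """Fraction of training pairs whose label differs from what was actually said."""
    wrong = sum(1 for clip, label in dataset if label != clip.speech)
    return wrong / len(dataset)


# Made-up data: 2 human-annotated clips, then 5x as many auto-labeled ones.
human_labeled = [(Clip("bonjour"), "bonjour"), (Clip(""), "")]
unlabeled = [Clip("merci"), Clip("")] * 5

combined = human_labeled + pseudo_label(unlabeled)
print(wrong_label_share(human_labeled))  # clean ground truth
print(wrong_label_share(combined))       # the model's own errors are now training labels
```

In this toy setup the human-annotated set contains no wrong labels, while the combined set inherits every mistake the model made during auto-labeling, so the next round of training treats those mistakes as ground truth.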

And this didn't go unnoticed by the users.

Reactions to v3 on GitHub

The misleading WER (and more hidden biases)

So, we end up with a model that performs exceptionally well on paper despite having several hidden biases. How does that happen?

Among the most widely used ways to assess WER today are benchmarks involving datasets like FLEURS. These benchmarks are mostly one-sentence-based, with text being read by the speaker into the microphone in noise-less environments. Performing well against these benchmarks is much simpler than dealing with messy real-life audio.

Having worked with many benchmarks myself, I can say that WER is misleading: it fails to capture real-life limitations or reveal biases, including the punctuation problem, because it is computed against normalized ground truth, which ignores punctuation, casing, and readability.

Perfect WER, dirty readability.
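Here is a minimal sketch of why that happens. It assumes a simplified normalizer (lowercasing and stripping punctuation); real benchmark normalizers are more elaborate, and the example sentences are made up for illustration.

```python
import re


def normalize(text: str) -> list[str]:
    """Simplified normalizer: lowercase, drop punctuation, split into words."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return text.split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(ref: list[str], hyp: list[str]) -> float:
    """Word error rate: edit distance divided by the number of reference words."""
    return edit_distance(ref, hyp) / len(ref)


reference = "Thank you. How are you today?"
hypothesis = "thank you how are you today"  # no casing, no punctuation

print(wer(normalize(reference), normalize(hypothesis)))  # normalized WER
print(wer(reference.split(), hypothesis.split()))        # raw, unnormalized comparison
```

After normalization the two strings are identical, so the normalized WER is 0.0 even though the transcript has lost all casing and punctuation. The raw comparison is well above zero, which is closer to what a human reader would notice.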

Official WER of Whisper v3 per language

And it gets worse. Based on my experience, many training datasets are gender-biased, age-biased, and lack prosodic diversity. Males typically achieve over 90% accuracy, females around 80%, and children as low as 40%.
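One way to surface this kind of gap is to report error rates per speaker group instead of a single aggregate number. The sketch below assumes you already have per-utterance error counts and a group label for each speaker; the group names and counts are invented placeholders, not measurements.

```python
from collections import defaultdict


def wer_by_group(results: list[tuple[str, int, int]]) -> dict[str, float]:
    """Aggregate WER per group from (group, word_errors, reference_words) rows."""
    errors: dict[str, int] = defaultdict(int)
    words: dict[str, int] = defaultdict(int)
    for group, n_errors, n_ref_words in results:
        errors[group] += n_errors
        words[group] += n_ref_words
    return {group: errors[group] / words[group] for group in words}


# Placeholder rows for illustration only.
results = [
    ("adult_male", 1, 10),
    ("adult_male", 0, 10),
    ("child", 6, 10),
]

overall = sum(e for _, e, _ in results) / sum(w for _, _, w in results)
print(overall)                # a single aggregate number hides the gap
print(wer_by_group(results))  # the per-group breakdown exposes it
```

Summing errors and words per group before dividing, rather than averaging per-utterance rates, keeps long and short utterances weighted by their length.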

If we go back to what I said about using internet data for training, it all starts to make sense: a typical profile of abundant internet audio sources is a male dev/content creator working in a quiet, well-insulated environment with silent air conditioning and using a high-quality $300 headset.

The point of this post is not to criticize Whisper: it's still the leading speech recognition model and a key component of our product at Gladia. I’m thrilled to have contributed to optimizing it for better performance in enterprise use cases, including mitigating hallucinations and improving recognition of accents with Whisper-Zero.

The point is that there are some inherent limitations to the ways we can currently train models. Having these reflections and addressing these biases is crucial to building more inclusive AI systems, wherever we are in the value chain.

Learn more about Gladia

Want to learn more about Gladia’s API? You can try it for yourself for free, or book a demo to learn more. If you’re more interested in the latest news and trends, sign up for our newsletter.

