Word Error Rate (WER) is a metric that evaluates the performance of ASR systems by measuring the accuracy of speech-to-text results. The WER metric allows developers, scientists, and researchers to assess ASR performance: the lower the WER, the better the ASR performance. This assessment helps optimize ASR technologies over time and compare speech-to-text models and providers for commercial use.
However, there’s a growing concern that benchmarks fail to represent real-world use cases, resulting in disappointing ASR performance in enterprise products and apps. In a market flooded with contradictory private benchmarks, there’s also the risk of overlooking great speech recognition alternatives due to commercial bias.
This article discusses key ASR benchmarks and highlights their limitations in assessing ASR systems in real-life scenarios. We’ll also review the key ASR terms to better understand the concepts, and provide hands-on insights and tips on how to work around the benchmarks' limitations when assessing a speech-to-text model or provider.
Understanding Word Error Rate (WER)
Automatic speech recognition (ASR) technology detects human language and converts it into text. Google Assistant and Siri are two of the most common virtual assistants that use ASR technologies to understand spoken queries.
To convert spoken language into text, ASR systems combine several techniques: feature extraction and acoustic modeling for sound analysis, and NLP for language modeling to make sure recognized words form clear sentences. Together, these steps enable ASR to convert speech with different pronunciations into text. If you’re curious to learn more, we wrote a deep-dive on how ASR models work.
Word Error Rate (WER) is a metric that measures the inconsistencies between the original spoken words and the ASR transcript. It does so by calculating the number of errors an ASR system makes against a reference transcript. Humans make the reference transcript, which is supposed to be completely error-free. WER evaluates ASR accuracy in the following steps:
The ASR system generates a transcript based on speech-to-text conversion.
The generated transcript is compared against an error-free reference transcript.
The WER is obtained by dividing the total number of errors by the total number of words in the reference transcript.
The mathematical formula for calculating WER is:
WER = (S + D + I) / N
Let's break it down in more detail:
S (Substitution error): Total number of words substituted by an ASR system. For example, ASR recognizes “bad” as “bat”.
D (Deletion error): Total number of words an ASR system leaves out. For example, the actual phrase “The black cat is sleeping” is converted to “The cat is sleeping” by ASR.
I (Insertion error): Total number of words inserted by an ASR in the transcript that are absent in the reference transcript. For example, the actual phrase “The cat is sleeping” is recognized as “The black cat is sleeping”.
N: Total number of words in the reference transcript.
Consider an ASR system recognizing the phrase “The girl is born and raised in Tunisia” as “The girl is born and raised in Asia”. The ASR has substituted Tunisia with Asia, so there’s one substitution error in the ASR-generated transcript. Therefore,
WER = (1 + 0 + 0) / 8
WER = 0.125 × 100
WER = 12.5%
A WER of 12.5% indicates the ASR recognized 87.5% of the eight reference words accurately, with one substitution error and no insertion or deletion errors.
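The calculation above can be sketched in a few lines of Python. This is a minimal illustration rather than a production scorer: the function names `wer_counts` and `wer` are our own, the alignment is a plain word-level Levenshtein distance, and lowercasing the text before comparison is an assumption made for the example.

```python
def wer_counts(reference: str, hypothesis: str):
    """Return (S, D, I, N) from a word-level Levenshtein alignment."""
    ref = reference.lower().split()   # assumption: case-insensitive comparison
    hyp = hypothesis.lower().split()
    # dp[i][j] = (total_errors, S, D, I) for ref[:i] versus hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)       # every reference word deleted
    for j in range(len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)       # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]   # words match, no extra cost
                continue
            t, s, d, ins = dp[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, ins)
            t, s, d, ins = dp[i - 1][j]
            deletion = (t + 1, s, d + 1, ins)
            t, s, d, ins = dp[i][j - 1]
            insertion = (t + 1, s, d, ins + 1)
            dp[i][j] = min(substitution, deletion, insertion, key=lambda c: c[0])
    _, s, d, ins = dp[len(ref)][len(hyp)]
    return s, d, ins, len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, with N counted on the reference transcript."""
    s, d, ins, n = wer_counts(reference, hypothesis)
    if n == 0:
        raise ValueError("reference transcript is empty")
    return (s + d + ins) / n


# Deletion example from the definitions above: "black" is dropped.
print(wer_counts("The black cat is sleeping", "The cat is sleeping"))  # (0, 1, 0, 5)
print(wer("The black cat is sleeping", "The cat is sleeping"))         # 0.2
```

Because N is counted on the reference, a transcript with many insertions can produce a WER above 100%.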
Importance of WER as a metric for ASR performance
WER is a standard metric for evaluating the performance of ASR systems. It serves as a guide for improving ASR systems by tracking errors in a system. The three advantages that make WER an important metric are:
Tracking improvements: WER allows developers and researchers to track ASR improvements by comparing performance before and after a change.
Better user experience: A low WER means the ASR generates accurate text, which translates to a better user experience.
Objective benchmarking: WER is a widely agreed-upon metric, making it a standardized measure of accuracy for ASR systems.
Key benchmarks for calculating WER
Factors influencing WER
Certain factors lead to substitution, insertion, and deletion errors in ASR systems, influencing their performance. These factors include:
1. Training datasets
Datasets used for the training of ASR systems significantly influence the WER as the system learns to recognize sound in training datasets. Generally, large and diverse datasets are preferred because they contain a wider range of speech patterns and noise levels. The most common datasets used in ASR systems are:
Few-shot Learning Evaluation of Universal Representations of Speech (Fleurs)
Fleurs is a multilingual speech recognition dataset developed by Google, containing approximately 12 hours of speech supervision per language across 102 languages. Fleurs pairs expert-generated transcripts with the corresponding audio recordings and is suitable for the development of multilingual ASR systems.
LibriSpeech
The LibriSpeech dataset contains approximately 1000 hours of 16 kHz English speech. LibriSpeech consists of public-domain book readings by volunteers. It is suitable for the development of general-purpose English ASR systems.
Common Voice
Common Voice by Mozilla contains over 9,283 hours of speech in 60 languages, and comes in several open-source versions. The content consists of recordings of phrases and sentences spoken by volunteers. Common Voice is suitable for training multilingual ASR systems.
Rev16
The Rev16 dataset contains podcast recordings focusing on conversational speech. This dataset is suitable for training ASR systems specifically designed for real-world conversational speech.
Meanwhile
The Meanwhile dataset contains 64 segments from The Late Show with Stephen Colbert, representing conversational speech. This dataset is suitable for robust ASR systems designed to handle noise and laughter.
2. Acoustic variability
Acoustic variability refers to differences in sound quality between recordings. Background noise, microphone quality, speaker clarity, and pitch all impact WER. For example, WER might be higher for audio recorded in noisy environments like a football stadium than for audio recorded in quiet environments.
3. Speaker variability
Pronunciation, dialects, voice quality, and speaking speed impact WER in ASR. For example, WER in English ASR might be higher for an elderly non-native English speaker than for a young native English speaker.
4. Language complexity
Language variations, a vast vocabulary, homophones, and other factors such as lexical complexity might impact WER. For example, WER might be higher in tonal languages like Thai than in English, a non-tonal language.
Challenges in benchmarking ASR systems
While WER helps identify the error rate in ASR systems, it doesn’t capture all the factors influencing ASR performance. Similarly, other benchmarking tools used to evaluate ASR systems don’t necessarily represent the real-life performance of a system. As an enterprise customer in search of the best speech-to-text provider for your needs, it's important to be aware of the many factors affecting WER assessment to conduct the most objective evaluation of accuracy.
The key problems and challenges in benchmarking ASR systems we've encountered are:
1. Variations in ground truth
The ground truth, which represents the correct transcription of spoken utterances, is fundamental in computing WER. Variations in the ground truth can significantly impact WER results, and, as our own experience with enterprise accounts has shown, there may be mistakes in the ground truth itself, especially where private enterprise datasets are involved.
2. Limited dataset representation
The benchmark datasets used for ASR evaluation often don’t capture factors like speaker variability, domain specificity, and background noise. This leads to ASR systems performing well on benchmark datasets but showing a higher WER on real-world data.
3. Benchmark specificity and bias
Benchmarks are a way to measure the performance of AI systems in specific conditions. However, systems encounter varying conditions in the real world, and a system tuned to benchmarks may fail to perform well there. Similarly, ASR benchmarks are prone to bias if they rely on domain-specific datasets or preprocessing techniques. For example, deep learning models trained on benchmark datasets tend to make more errors in the real world than during the training phase.
Moreover, the ASR system's scoring may be unfairly affected by variations in elements like name spelling, pauses, numbers being written out as digits or as words, filler words, punctuation, regional spelling (e.g., British vs. American), and more.
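One common way to keep such formatting differences from inflating the error count is to normalize both the reference and the ASR output before scoring. The sketch below is a simplified illustration, not the normalizer of any particular benchmark: the `normalize` function and the small `FILLERS`, `NUMBER_WORDS`, and `SPELLING` tables are hypothetical and would need to be extended for real use.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only.
FILLERS = {"um", "uh", "uh-huh", "er"}
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}
SPELLING = {"colour": "color", "organise": "organize"}  # British -> American


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, drop fillers, unify numbers and spelling."""
    text = text.lower()
    # Replace punctuation with spaces, keeping apostrophes and hyphens inside words.
    text = re.sub(r"[^\w\s'-]", " ", text)
    tokens = []
    for token in text.split():
        token = token.strip("-'")
        if not token or token in FILLERS:
            continue
        token = NUMBER_WORDS.get(token, token)
        token = SPELLING.get(token, token)
        tokens.append(token)
    return tokens


reference = "Um, the colour costs three dollars."
hypothesis = "the color costs 3 dollars"

print(reference.split() == hypothesis.split())        # False: raw tokens differ
print(normalize(reference) == normalize(hypothesis))  # True: same words once normalized
```

Whatever rules you choose, applying the same normalization to every model under comparison matters more than the exact rules themselves.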
4. Failure to capture real-world conditions
Many datasets are generated as part of an academic or research activity. These datasets don’t represent real-world scenarios like daily activities and conversations. Thus, benchmarks might ignore the user intent and focus on technical aspects of speech like grammar and tone.
5. Lack of standardization in benchmarking
ASR benchmarking lacks universally accepted standards. Different benchmarks use different datasets and metrics to evaluate systems. This variation leads to inconsistent performance evaluations and makes results hard to compare.
6. Benchmarking costs
Designing and running custom benchmarks incurs significant costs due to factors such as data acquisition, computational resources, and development time. This compels users to depend on vendor benchmarks, irrespective of their limitations, leading to potentially misleading conclusions about ASR performance.
We at Gladia believe there's hardly a one-size-fits-all solution when it comes to WER, and advocate for more transparent and nuanced understanding of this metric, as applied to the most crucial elements of the transcript, with a ground truth that is clearly defined and avoids penalizing the model based on factors that are trivial for business use cases.
Case studies and examples
An ASR-benchmarks repository on GitHub compares different ASR benchmarking methods on widely used datasets. The repository compares various benchmarking elements, including dataset, model architecture, evaluation metric, and sub-datasets. However, to keep our focus on WER, we’ll have a look at three ASR benchmarking examples in the table below:
Significant differences in WER for an ASR system with the same acoustic model and data augmentation but different datasets indicate how strongly WER depends on the evaluation data.
Here’s another example. The following table displays the WERs of the Whisper large-v2 model on the five common ASR datasets:
The Whisper large-v2 model has 1550 million parameters, and is considered among the most accurate in the Whisper family. However, the same model yields different WERs on different datasets. This is because the model size and architecture aren’t the only factors influencing the WER. Due to differences in speech quality and variation, datasets also play a significant role in the measured performance of an ASR model.
Real-world scenarios where benchmarks fail to reflect ASR performance
ASR systems that score well on benchmarks can still struggle to handle language context and to generalize beyond their training data. A few well-known real-world examples are:
1. Whisper model hallucinates in real-world applications
The Whisper model, developed by OpenAI, is a speech-to-text model trained on a large amount of audio and text data. The model is primarily used for transcribing audio containing speech in English and other languages. However, due to the large amount of noisy data in its training datasets, Whisper may generate hallucinated text. This is because the model attempts to predict the next word based on language patterns, which can result in text that wasn’t spoken in the audio.
Furthermore, the Whisper model displays lower accuracy when transcribing languages with limited training data. The model also struggles to accurately transcribe varying accents and dialects across different races, genders, ages, and other demographic factors. Additionally, the model tends to generate repetitive text, especially in low-resource languages.
2. Microsoft compares human and machine errors in conversational speech
Researchers at Microsoft compared ASR transcripts with those of human transcribers to see how well ASR performs. They concluded that although ASR technology is improving at recognizing spoken words, it struggles with filled pauses and backchannels like “um”, “uh”, “uh-huh”, and “like”. This highlights the limitation of benchmark datasets that lack real-world speech variations and noise.
3. Benchmarks fail at children's speech recognition
Benchmark datasets mostly contain adult speech, yet they are also used to develop speech recognition systems for children or systems that children also use. Since children use filler words, backchannels, and varying pitches and are often surrounded by noise, ASR systems built on these benchmarks struggle to convert their speech to text.
Actionable tips for improving ASR evaluation
ASR remains a valuable technology despite these challenges. Although the limitations of benchmarks make it harder to develop and select robust ASR systems, the following tips can improve ASR evaluation:
Use relevant datasets for your application. For example, if you’re developing a school speech recognition system, a dataset of classroom settings and playgrounds would be relevant. Using datasets that reflect different speaking styles and background noises also increases the likelihood of an ASR performing equally well in real-life scenarios.
Incorporate real-world conditions such as noise and different speaking styles in your evaluation. In practice, this means using your own datasets whenever possible. Evaluating on your own data reflects the scenarios the ASR system will actually face and gives a more reliable picture of its performance in real use.
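As a rough illustration of the tips above, the sketch below scores a small in-house test set and reports WER separately for each recording condition. The `samples` data and the condition labels are made up for the example, `wer_by_condition` is a hypothetical helper, and the per-condition WER is computed as total errors divided by total reference words.

```python
from collections import defaultdict


def word_errors(ref_words, hyp_words):
    """Word-level Levenshtein distance (S + D + I), computed row by row."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            curr.append(min(
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
            ))
        prev = curr
    return prev[-1]


def wer_by_condition(samples):
    """samples: iterable of (condition, reference, hypothesis) tuples."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for condition, reference, hypothesis in samples:
        ref, hyp = reference.lower().split(), hypothesis.lower().split()
        errors[condition] += word_errors(ref, hyp)
        words[condition] += len(ref)
    return {c: errors[c] / words[c] for c in words if words[c]}


# Made-up transcripts for a school use case, grouped by recording condition.
samples = [
    ("quiet classroom", "please open the math book", "please open the math book"),
    ("quiet classroom", "read the next page", "read the next page"),
    ("playground", "please open the math book", "please open the bath book"),
    ("playground", "read the next page", "read next page"),
]

for condition, score in wer_by_condition(samples).items():
    print(f"{condition}: {score:.1%}")
```

Breaking results down this way shows whether a model that looks strong on average degrades in the specific conditions your product depends on.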
Final remarks
WER is an important metric used to evaluate an ASR system’s performance by monitoring its error rate. However, relying on benchmark metrics and datasets doesn't always help to anticipate the ASR system's actual performance in real-world business scenarios.
Public benchmarks generally involve certain datasets that lack speaker diversity, real-world conditions, noise, and domain specificity. The ground truth used for testing is often opaque, and may include errors to begin with.
Designed as a way to assess accuracy against some 'absolute' ideal, WER is not always helpful in professional use cases, where accuracy of key elements for further use (e.g. CRM enrichment) is the single most important indicator to measure. The standardized benchmarks can hardly help with the latter, so caution must be taken when discarding models on the grounds of WER alone, as they may, with some customization involved, yield excellent results in your specific use case.
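To make the idea of key-element accuracy concrete, here is a minimal sketch that checks how many business-critical terms survive transcription. The `key_term_recall` function and the sample call transcript are hypothetical, and the sketch assumes single-word terms and exact matching after lowercasing, which real names and numbers will often need to relax.

```python
def key_term_recall(reference: str, hypothesis: str, key_terms):
    """Share of key terms spoken in the reference that also appear in the hypothesis."""
    ref_words = set(reference.lower().split())
    hyp_words = set(hypothesis.lower().split())
    # Only score terms that were actually said, according to the reference.
    expected = [term.lower() for term in key_terms if term.lower() in ref_words]
    if not expected:
        return None  # nothing to score for this utterance
    found = sum(1 for term in expected if term in hyp_words)
    return found / len(expected)


reference = "my name is dana and my order number is 48213"
hypothesis = "my name is donna and the order number is 48213"

# Two word errors out of ten words, but one of them hits the customer's name.
print(key_term_recall(reference, hypothesis, ["dana", "48213"]))  # 0.5
```

A metric like this can rank two models differently than WER does, which is the point: it weights the words your downstream workflow depends on.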
Moving forward, using niche-specific datasets and data augmentation techniques like adding background noise and altering speech, alongside benchmarks, leads to a more reliable assessment of ASR systems. We recommend exploring industry best practices through online resources and contributing to ASR projects like Kaldi and Deep Speech to deepen your understanding of speech recognition systems. To help, we've got a dedicated piece on the best open-source speech AI models, as well as a review of the best commercial ASR engines.
About Gladia
At Gladia, we built an optimized version of Whisper in the form of an API, which takes the best of the core model, removes its shortcomings and extends its feature set at scale for enterprise use. Our latest hallucinations-free model, Whisper-Zero, is distinguished by exceptional accuracy in noisy and multilingual environments, including with diverse dialects and accents. You can try our API directly or book a call with us to discuss WER or other topics.