STT API Benchmarks: How to measure accuracy, latency, and real-world performance

Published on June 3, 2025

Every product that depends on voice input lives or dies by its speech-to-text performance. Whether you're enriching CRM data from support calls, powering live captions in meetings, or triggering downstream actions via LLMs, transcription accuracy and speed aren’t just nice-to-haves. They’re essential to product functionality. If your STT engine stalls on latency or mistranscribes a customer’s request, it can break automations, derail user experiences, and create costly manual work downstream.

That’s why smart teams don’t treat STT as a one-time integration. They benchmark. Regularly. Rigorously. And with real-world audio that reflects how their users actually talk.

In this guide, we’ll walk through the core metrics to measure, common pitfalls to avoid, and best practices for stress-testing STT APIs.

Understanding core STT metrics

Before you start comparing providers, it’s important to understand the numbers you’ll be looking at, and where they can mislead you. Most vendors advertise metrics like Word Error Rate (WER) or low latency, but raw scores often fail to capture what actually matters in production.

Word Error Rate (WER) and Word Accuracy Rate (WAR)

WER is the standard industry metric. It calculates how many words in a transcript differ from a reference transcript, using three key types of errors:

  • Insertions: A word is transcribed that wasn’t actually said.
  • Deletions: A word that was spoken doesn’t appear in the transcript.
  • Substitutions: A word is replaced with an incorrect one.

Here’s the standard WER formula, where S, D, and I are the counts of substitutions, deletions, and insertions, and N is the number of words in the reference transcript:

WER = (S + D + I) / N
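To make this concrete, here’s a minimal, dependency-free sketch of a word-level WER calculation in Python. The two transcripts at the bottom are made-up examples, not real benchmark data:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()

    # Word-level Levenshtein distance via dynamic programming.
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)


print(wer("show me my order status", "show me order status"))  # 0.2 -> one deletion out of five words
```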

While WER is simple and standardized, it doesn’t always reflect real-world performance. We’ll explore this in more detail shortly. 

WAR (Word Accuracy Rate) is the flip side of WER: WAR = 1 − WER. It’s often a more intuitive metric to use when talking to stakeholders because it communicates performance in positive terms ("94% accurate" instead of "6% error rate").

Why WER isn’t always enough

WER gives you a starting point, but it leaves out a lot of context that impacts real-world usability. From critical content errors to language normalization and speaker bias, here’s why relying on WER alone is risky.

WER treats all errors equally, but not all errors are equal

In most applications, some words matter a lot more than others.

In healthcare, mistranscribing a dosage (“fifty” instead of “fifteen”) can lead to life-threatening outcomes. In finance, omitting a currency or decimal point (“two thousand” vs “twenty thousand”) can cause major compliance issues. And in customer support, missing an order number or customer name could have a direct impact on retention.

That’s why it’s important to look for an STT vendor that optimizes for high-value content — like names, numbers, and identifiers — not just overall averages.

Want to learn more about how to evaluate vendors? Check out our STT Buyer’s Guide.

Normalization issues skew results 

One of the biggest misconceptions about WER is that it’s a purely objective score. In reality, it’s only as reliable as the normalization process used to calculate it.

WER normalization means converting both the reference transcript (the “ground truth”) and the machine-generated transcript into a common format before comparing them. This includes how things like punctuation, numerals, contractions, and filler words are treated.

The problem? Normalization rules vary between vendors, datasets, and teams. That makes WER results inconsistent and hard to compare.

Some examples:

  • "twenty-five" vs "25" — numeral formatting counts as a difference.
  • "it is" vs "it’s" — contractions register as errors.
  • "uh I don’t know" vs "I don’t know" — filler words inflate WER even if they don’t change meaning.
  • "color" vs "colour" — UK vs US spelling can distort results.

Without consistent rules, two models might perform the same, but get very different WER scores. 

For example:

  • Vendor A may normalize timestamps and numerals (“12 o’clock” → “12:00”)
  • Vendor B might not
  • Result: Same spoken input, different scores… but not because one model is worse.

This is especially problematic in multilingual or accent-heavy contexts, where spelling variants and contractions are more common. In these cases, small formatting differences can add up to large accuracy penalties.
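To see how much normalization choices move the number, here’s a rough sketch that scores the same hypothesis against the same reference under two different rule sets. The normalization rules and transcripts are illustrative, not any vendor’s actual pipeline, and wer() is the function from the earlier sketch:

```python
import re

FILLERS = {"uh", "um", "er"}
CONTRACTIONS = {"it's": "it is", "don't": "do not", "i'm": "i am"}

def normalize(text: str, expand_contractions: bool, drop_fillers: bool) -> str:
    """Illustrative normalizer: lowercase, strip punctuation, optionally
    expand contractions and drop filler words."""
    text = text.lower()
    if expand_contractions:
        for spoken, expanded in CONTRACTIONS.items():
            text = text.replace(spoken, expanded)
    text = re.sub(r"[^\w\s']", " ", text)  # strip punctuation and hyphens
    words = text.split()
    if drop_fillers:
        words = [w for w in words if w not in FILLERS]
    return " ".join(words)

reference = "Uh, it's twenty-five dollars."
hypothesis = "it is 25 dollars"

# Loose rules: contractions expanded, fillers removed before scoring.
loose_ref = normalize(reference, expand_contractions=True, drop_fillers=True)
loose_hyp = normalize(hypothesis, expand_contractions=True, drop_fillers=True)

# Strict rules: transcripts compared almost verbatim.
strict_ref = normalize(reference, expand_contractions=False, drop_fillers=False)
strict_hyp = normalize(hypothesis, expand_contractions=False, drop_fillers=False)

print(wer(loose_ref, loose_hyp))    # 0.4: only "twenty five" vs "25" is penalized
print(wer(strict_ref, strict_hyp))  # 0.8: the filler, the contraction, and the numeral all count
```

Same model output, same reference, wildly different scores. That gap is entirely an artifact of the normalization rules.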

Real-time transcription has different constraints

WER doesn’t account for whether a model is working with the full context of the audio (as in batch or async processing) or only a narrow, rolling window of the conversation (as in real-time streaming).

That’s because real-time systems must begin generating output within a few hundred milliseconds. This limited view means the model doesn't yet have the full context of the sentence, conversation, or speaker intent. As a result, it may misinterpret ambiguous phrases, misplace punctuation, or choose incorrect grammatical structures — especially in the case of similar-sounding words (like 'their' vs 'there') or incomplete phrases that rely on context to interpret correctly.

Async systems, on the other hand, process audio after the fact, with access to longer sequences and full conversations. The result? More accurate decisions on phrasing, punctuation, and word choice.

Because real-time systems generate output with limited context, their WER scores often appear worse than async systems, even if they're performing as expected. That’s why teams should avoid comparing WER across different processing modes and instead evaluate each system within its intended use case.

Biases in training data distort performance for real users

WER averages don’t reveal how well a model performs for different groups of people.

Many STT systems underperform for women, non-native speakers, and people of color because the training data lacks prosodic diversity. Male speakers typically achieve over 90% accuracy, female speakers around 80%, and children as low as 40%. Whisper in particular has known accuracy gaps tied to speaker demographics.

This matters: a model that nails the average but fails for key user groups can’t be considered truly accurate. 

Formatting and output quality matter

Accuracy isn’t just about getting the right words. It’s also about how they’re delivered.

For many products, especially in customer experience, formatting directly affects usability. A phone number transcribed as “six six five eight four nine four nine nine eight” instead of “665-849-4998” can’t be copied, clicked, or used to trigger an automation. Similarly, a missed capital letter in a name (“gladia” instead of “Gladia”) can look unprofessional in a support transcript or email follow-up.

Captions, summaries, and structured data extraction all require readable and reliable formatting to function correctly.

Latency metrics: Time to First Byte vs Final Latency

There are two latency metrics that matter, and they’re often conflated.

  • Time to First Byte (TTFB): How fast the system starts returning words after you begin speaking. For example, if you send a 3-second audio clip and the API starts responding after 250ms, that’s your TTFB. It affects how responsive a voice interface feels, but doesn’t reflect the completeness of the result.
  • Latency to Final: How long it takes to receive the final version of the transcript after the speaker finishes talking. For that same 3-second clip, you might get the full output 700ms after it ends. This is the metric that actually matters for most production workflows.

We believe latency to final is the more meaningful benchmark. TTFB can be artificially fast in controlled environments or for trivial audio snippets. But if your automation is waiting on final output, that’s what you need to measure.
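As a rough illustration, here’s where the timers go in a streaming benchmark. The client object and its methods (send_chunk, end_of_audio, events) are placeholders for whatever SDK or WebSocket wrapper you actually use, not a specific vendor’s API, and a production harness would pace chunks at real-time speed and read events concurrently with sending audio:

```python
import time

def measure_latency(client, audio_chunks):
    """Capture TTFB and latency to final for one streaming session.

    `client` and its methods are illustrative placeholders, not a real SDK.
    """
    start = time.perf_counter()
    ttfb = None
    latency_to_final = None

    # Simplified: send all audio, then read results.
    for chunk in audio_chunks:
        client.send_chunk(chunk)
    client.end_of_audio()
    end_of_speech = time.perf_counter()

    for event in client.events():
        now = time.perf_counter()
        if ttfb is None and event.type == "partial":
            ttfb = now - start                      # first words back after audio starts
        if event.type == "final":
            latency_to_final = now - end_of_speech  # what downstream automations wait on
            break

    return ttfb, latency_to_final
```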

To get a true sense of performance, make sure you:

  • Test on realistic audio samples (not just short clips or clean single-speaker files)
  • Simulate real network conditions and concurrent usage
  • Run tests multiple times and average the results — latency can vary widely depending on load

We followed these exact best practices when benchmarking our newest model, Solaria. The result: Solaria consistently outperforms the competition on both TTFB (~270ms) and latency to final (~698ms).

Real-world performance factors

Clean audio and perfect pronunciation are great for demos. But real-world applications are messy, and that’s where most transcription models fall apart.

Let's take a closer look at the variables that actually affect STT performance in production: background noise, accent diversity, and real-time speaker variation. These are the challenges most models struggle with, and the ones your benchmarks should reflect.

Environmental noise & device quality

A customer is calling from a busy kitchen on speakerphone. Their dog’s barking, someone else is talking in the background, and the agent’s headset isn’t great either. This is the reality of contact center audio, and it’s brutal for most transcription models. These same acoustic challenges show up elsewhere, too. In sales calls, virtual meetings, and anywhere else voice interfaces are expected to perform reliably.

The environment a model was trained on directly impacts its performance. 

Solaria wasn’t trained on cherry-picked samples. It was built on noisy, real-world, multi-speaker conversations: the kind you’d actually expect in production.

Language, accent, and code-switching

It’s easy to claim support for “100+ languages.” But does the model work equally well for an accented speaker from Bangalore or Dakar? Can it handle users switching between Spanish and English mid-sentence?

Many models degrade significantly outside of clean, US-accented English. Gladia supports 100 languages — including 42 underserved ones — and is specifically designed for accent robustness and real-time code-switching.

See the full list of Gladia languages here.

Model bias and demographic fairness

As we mentioned earlier, research has repeatedly shown that many STT models underperform for women, non-native speakers, and people of color. That's because many training datasets are gender-biased, age-biased, and lack prosodic diversity. 

Gladia takes a deliberate approach to mitigating bias by using diverse, customer-approved data and real-world recordings in our training pipeline.

Best practices for benchmarking STT APIs

We’ve talked a lot about why benchmarking STT APIs is so complex. But what can teams do about it? Based on our work with hundreds of startups and enterprise companies, here are some tried-and-tested best practices.

Use diverse, realistic datasets

Start with public datasets like Mozilla Common Voice and Google FLEURS. They offer accent and dialect diversity. But don’t stop there...

Use your own audio, too.

The real value comes from testing transcripts of your customer calls, your audio environments, and your domain-specific language. Also be sure to test across multiple versions of public datasets. Some vendors overfit to a single release.

Measure what actually matters

Track not just overall WER or WAR, but accuracy on high-value content: names, numbers, locations, and spelled-out email addresses. These entity-level errors are the ones most likely to break downstream automations.
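One lightweight way to do this is to annotate the entities you care about in your reference transcripts and check whether each one survives transcription. This is a minimal sketch, and the entity list and sample transcript below are invented for illustration:

```python
def entity_recall(hypothesis: str, reference_entities: list[str]) -> float:
    """Fraction of annotated reference entities (names, order numbers, dosages, ...)
    that appear verbatim in the hypothesis transcript."""
    hyp = hypothesis.lower()
    found = sum(1 for entity in reference_entities if entity.lower() in hyp)
    return found / len(reference_entities) if reference_entities else 1.0

# Illustrative example: entities annotated by hand in the reference transcript.
reference_entities = ["Gladia", "order 48213", "fifteen milligrams"]
hypothesis = "thanks for calling gladia your order 48213 ships with fifty milligrams"

print(entity_recall(hypothesis, reference_entities))  # ~0.67 -> the dosage error is caught
```

A transcript can score a respectable WER and still fail this check, which is exactly the kind of gap averages hide.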

Benchmark for both latency metrics

Run streaming tests and measure both TTFB and how long it takes to return the finalized transcript (final latency). Both matter: one affects responsiveness, the other determines when downstream processes can start.

Also test under concurrency, running 50 or more simultaneous audio streams, to ensure performance scales. For startups, this prevents early architecture choices from becoming a bottleneck. For enterprise teams, it’s how you ensure reliability under production load and avoid unexpected latency spikes at scale.
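Here’s a minimal sketch of such a concurrency test using asyncio. The transcribe_stream() coroutine is a placeholder for whatever client you are benchmarking, and the file names and stream count are arbitrary examples:

```python
import asyncio
import statistics
import time

async def transcribe_stream(audio_path: str) -> None:
    """Placeholder: stream one file to the STT API and await the final transcript."""
    ...

async def run_concurrency_test(audio_files: list[str], concurrency: int = 50) -> None:
    sem = asyncio.Semaphore(concurrency)   # cap the number of simultaneous streams
    latencies: list[float] = []

    async def one_stream(path: str) -> None:
        async with sem:
            start = time.perf_counter()
            await transcribe_stream(path)
            latencies.append(time.perf_counter() - start)

    await asyncio.gather(*(one_stream(p) for p in audio_files))

    print(f"p50 end-to-end: {statistics.median(latencies):.3f}s")
    print(f"p95 end-to-end: {statistics.quantiles(latencies, n=20)[18]:.3f}s")

# Illustrative run: 200 transcriptions of the same file, 50 at a time.
asyncio.run(run_concurrency_test(["call_001.wav"] * 200, concurrency=50))
```

Pay attention to the tail (p95/p99) rather than the average: that’s where latency spikes under load show up first.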

Fine-tune for your use case

Most APIs are configurable. For example, Gladia offers:

  • VAD (Voice Activity Detection) control: This allows you to set thresholds for when the system should start and stop listening based on detected speech, helping filter out silence or background noise. It's especially useful in environments with intermittent speech or variable audio quality.
  • Language presets: Pre-selecting the language a speaker is likely to use helps improve recognition accuracy and reduces the risk of misclassification, particularly in multilingual or accent-heavy environments.
  • Punctuation/casing modes: Some applications require grammatically clean output, like captions or meeting summaries, while others prefer raw output for keyword extraction or intent parsing. Tuning punctuation and casing modes lets you adapt output formatting to fit your product needs.

Take the time to experiment with these settings early. Tuning them to match your real-world inputs can dramatically improve accuracy and save you hours of post-processing down the line.
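As an illustration of what such a configuration might look like in code, here’s a generic sketch. The key names below are placeholders we made up for readability, not Gladia’s actual parameter names; check the API reference for the exact fields and values:

```python
# Illustrative configuration sketch: key names are placeholders, not a real API schema.
stt_config = {
    "vad": {
        "enabled": True,
        "silence_threshold_ms": 500,   # how long a pause ends an utterance
    },
    "language": {
        "presets": ["en", "es"],       # restrict detection to the languages you expect
        "code_switching": True,
    },
    "output": {
        "punctuation": True,           # grammatically clean output for captions and summaries
        "casing": True,
    },
}
```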

Don’t just test once! Monitor over time.

Models evolve. So do your users. A provider that performs well today may underperform next month after a model update or shift in audio mix.

Most of our customers benchmark on a quarterly basis. They track latency, WAR, and entity accuracy over time to help them catch regressions early and ensure consistent performance as their audio environments evolve.
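A simple way to operationalize this is to store each benchmark run and flag regressions against the previous baseline. The thresholds and sample numbers below are arbitrary examples, not recommendations:

```python
def check_regression(current: dict, baseline: dict,
                     accuracy_drop_pts: float = 1.0, latency_rise_ms: float = 100.0) -> list[str]:
    """Compare the latest benchmark run against a stored baseline and list regressions.
    Thresholds are illustrative; tune them to your own tolerance."""
    alerts = []
    if baseline["war"] - current["war"] > accuracy_drop_pts:
        alerts.append(f"WAR dropped {baseline['war'] - current['war']:.1f} points")
    if current["latency_to_final_ms"] - baseline["latency_to_final_ms"] > latency_rise_ms:
        alerts.append("latency to final regressed beyond threshold")
    if baseline["entity_accuracy"] - current["entity_accuracy"] > accuracy_drop_pts:
        alerts.append("entity accuracy regressed")
    return alerts

baseline = {"war": 94.0, "latency_to_final_ms": 698, "entity_accuracy": 96.0}
current  = {"war": 92.3, "latency_to_final_ms": 845, "entity_accuracy": 95.5}
print(check_regression(current, baseline))
```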

Why Gladia stands out in real-world benchmarks

When we built Solaria, we weren’t optimizing for leaderboard scores. We were optimizing for reality. From noisy call centers to multilingual conversations with overlapping speakers, Gladia is designed to perform in the conditions that actually challenge most speech-to-text systems.

Gladia delivers:

  • 94% Word Accuracy Rate with exceptional performance for high-value terms like names, numbers, and identifiers
  • Broad language coverage with real-time code-switching across 100+ languages — including 42 underserved by providers like Deepgram and AssemblyAI
  • Ultra-low latency: 270ms TTFB / 698ms to final transcript

Unlike providers who only test on clean public datasets, we validate performance across:

  • Multiple benchmark versions (FLEURS, Common Voice, and more)
  • Anonymized, permissioned customer data
  • Noisy, real-world audio from call centers, meetings, and live applications

Our benchmarks reflect how STT performs in production — not just in ideal lab conditions. That’s how we ensure our customers get consistent, reliable results.

Want to see how Gladia performs on your own audio?

Start testing today with free credits and dedicated 1:1 support from our team. We’ll help you run benchmarks, interpret results, and fine-tune the model to your needs so you can ship faster, with confidence.

Contact us

