Note: This article was originally published on Medium in February 2023.
Today, we're thrilled to release the alpha version of our audio transcription API into the world! Powered by our Speech-to-Text AI, it can transcribe 1 hour of audio in as little as 10 seconds, with a word error rate (WER) [1][2] as low as 1% [3], making it one of the biggest breakthroughs in audio transcription tech to date.
We’re very excited about what this release means for audio intelligence, and more broadly for future applications of AI to all kinds of tasks, made possible thanks to plug-and-play APIs.
You can sign up to try our audio transcription alpha here. Test it, play with it, and let us know what you think. We can't wait to get your feedback and shape this into a great product.
Now, let’s dive into our story!
The audio intelligence market is broken
Language is the primary way we interact with those around us, and the one that carries the most information: the meaning of our words, the emotions in our tone of voice, the context and layers embedded in what we choose to say or not to say.
Until today, the mass adoption of voice tech has been held back by several factors. In contrast to images, which reduce naturally to fixed mathematical representations, audio is an especially difficult modality for machines to process: recording quality, background noise, and the diversity of audio sources, formats, and compression types all affect the signal, and together they have been quite a headache for the AI industry.
Fortunately, the latest developments in natural language processing (NLP) — especially with transformers — have led to a new generation of robust neural networks capable of capturing and translating speech more accurately.
Despite this, the market for audio recognition has remained broken for a while, with traditional providers failing to strike the right balance between quality and price, which limits widespread access to the technology.
Here’s what a typical customer experience in voice tech currently looks like:
We thought there was a better way — and went on to prove it.
Audio transcription should be a commodity, not a luxury
Speaking with our users, we realized that transcription is as fundamental to audio processing as encoding/transcoding is to video processing.
In contrast to audio, the video market has reached an optimal price for its essential commodity, close to the cost of production, with balanced unit economics. Speech transcription, on the other hand, is about 6x as expensive.
In addition, we kept hearing that existing Speech-to-Text APIs were complicated to implement, requiring in-depth expertise on things like audio formats, sampling rates, bit depth [4], and stereo/mono/mixed channels.
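To give a sense of that complexity, here is the kind of preprocessing many providers expect users to handle before uploading anything: converting to mono, 16 kHz, 16-bit audio. This is a minimal sketch using the open-source pydub library, our choice for illustration rather than any particular provider's requirement:

```python
# Sketch: the audio-normalization chore many Speech-to-Text APIs leave
# to their users. Requires pydub, plus ffmpeg for non-WAV inputs.
from pydub import AudioSegment

def normalize_for_stt(src_path: str, dst_path: str) -> str:
    """Convert any ffmpeg-readable file to mono, 16 kHz, 16-bit WAV."""
    audio = AudioSegment.from_file(src_path)  # format inferred from the file
    audio = (
        audio.set_channels(1)        # stereo/mix -> mono
             .set_frame_rate(16000)  # resample to 16 kHz
             .set_sample_width(2)    # 16-bit samples
    )
    audio.export(dst_path, format="wav")
    return dst_path

normalize_for_stt("interview.mp3", "interview_16k_mono.wav")
```

In our view, none of this should be the user's problem in the first place.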
So when we set out to find the most pressing use cases to address with Gladia’s knowledge infrastructure tech, audio was the obvious place to start.
We believe that democratizing access to Speech-to-Text AI is not only a matter of cost, but also of simplifying the underlying complexity of the tools.
Our goal is to make state-of-the-art audio intelligence accessible by fundamentally challenging the approach to model deployment optimization.
Because using AI should be simple. Be it for audio or other tasks, we want our users to come as they are, whatever their native tech stack or file formats.
Our driving conviction is that everything we build as a company should put the user at most 3 clicks away from enjoying cutting-edge AI.
What core technology did we use to build our Speech-to-Text AI?
Leveraging the latest NLP, ML, and deep learning research, we created a unique Speech-to-Text API powered by OpenAI’s Whisper models, including the Large-v2.
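For context, here is what running Whisper directly looks like with the open-source openai-whisper package; our API wraps this class of models behind a single call, so users never have to manage weights or GPUs themselves:

```python
# Sketch: local transcription with the open-source Whisper package
# (pip install openai-whisper). Our hosted API abstracts this layer away.
import whisper

model = whisper.load_model("large-v2")    # downloads the weights on first run
result = model.transcribe("meeting.mp3")  # handles decoding and resampling
print(result["text"])                     # human-readable transcript
```

Running Large-v2 this way demands serious GPU memory and compute time, which is exactly where our optimization work comes in.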
Building on a proprietary approach to neural network optimization, we improved inference speed for high-quality speech recognition by around 60x compared to major Speech-to-Text providers. We expect to further improve on those numbers soon.
Our know-how makes it possible to achieve unmatched performance for audio transcription in terms of speed, accuracy, and price.
While we’re not able to share final pricing yet, we are committed to making Gladia APIs among the most affordable on the market, while maintaining the highest quality standards.
At the moment, we're working with over 250 models to create a holistic audio intelligence solution capable of performing more than 45 tasks, including translation, conversation summarization, gender detection, and sentiment analysis.
The now-live alpha API unlocks access to its core feature: transcription. We’re on track to release the rest shortly, following the feedback from our alpha users.
A word about speed, quality, and benchmarks
Word Error Rate
With the latest advances in AI-powered automatic speech recognition (ASR), we have entered a new era in quality standards, requiring new datasets and benchmarks.
For instance, Whisper is a technology that generates human-readable transcriptions [5], whereas previous technologies were limited to machine-readable ones. Human-readable means that the Speech-to-Text system outputs commas, periods, quotation marks, hyphens, and case-sensitive spellings, resulting in much higher-quality output.
This means that benchmark datasets need new normalized standards to establish apples-to-apples comparisons, as human-readable transcription is on its way to becoming the norm.
In our case, we use human-readable transcription as a baseline for calculating Word Error Rate (WER), and we manage to go as low as 1% on the higher-quality transcriptions.
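For readers new to the metric: WER counts the word substitutions (S), deletions (D), and insertions (I) needed to turn a hypothesis into the reference, divided by the number of reference words (N), i.e. WER = (S + D + I) / N. The sketch below, using illustrative strings of our own rather than benchmark data, shows why normalization matters for apples-to-apples comparisons: without it, punctuation and casing alone can make a perfectly good human-readable transcript score terribly.

```python
# Sketch: WER = (S + D + I) / N via word-level edit distance.
# The example strings are illustrative, not benchmark data.
import re

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn hyp[:j] into ref[:i]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

def normalize(text: str) -> str:
    # Lowercase and strip punctuation so formatting is not penalized.
    return re.sub(r"[^\w\s]", "", text.lower())

ref = "Hello, Dr. Smith. How are you today?"
hyp = "hello dr smith how are you today"
print(wer(ref, hyp))                        # ~0.71: punctuation and casing count as errors
print(wer(normalize(ref), normalize(hyp)))  # 0.0 after normalization
```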
Inference Speed
Another metric we look at to assess how well we perform compared to traditional providers is inference speed.
We established a baseline by measuring the inference speed of major STT providers as follows: we transcribed 1 hour of audio in both mono and stereo configurations, at a 16 kHz sampling rate with 16-bit encoding, and compared the results with how fast our models could complete the same task under the same parameters.
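For the curious, a benchmark like this can be scripted in a few lines. In the sketch below, transcribe_file is a hypothetical stand-in for any provider's client call, not an actual Gladia or competitor SDK:

```python
# Sketch: measuring inference speed as a real-time factor (RTF).
# transcribe_file is a hypothetical stand-in for a provider's client.
import time

AUDIO_SECONDS = 3600.0  # 1 hour of audio at 16 kHz, 16-bit, mono or stereo

def benchmark(transcribe_file, path: str) -> float:
    start = time.perf_counter()
    transcribe_file(path)              # provider-specific call goes here
    elapsed = time.perf_counter() - start
    rtf = elapsed / AUDIO_SECONDS      # below 1.0 means faster than real time
    print(f"{elapsed:.1f}s for {AUDIO_SECONDS:.0f}s of audio "
          f"(RTF {rtf:.4f}, {AUDIO_SECONDS / elapsed:.0f}x real time)")
    return rtf
```

On this scale, transcribing 1 hour of audio in 10 seconds corresponds to an RTF of roughly 0.003, or a 360x real-time speed-up.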
What happens now?
For Gladia, audio is just the first step.
Our long-term ambition is to help individuals and organizations build knowledge infrastructure platforms to connect all their internal text, audio or visual data and make it discoverable in real-time through AI-powered semantic search. More on that soon.
In the meantime, we look forward to your feedback, both so we can debug and iterate on our first product and so we can see what the collective creativity of users out there comes up with. We can't wait to hear from you, and to bring better, more accessible AI to people everywhere.
Some useful additional materials:
Video demo on using the API key in just a few clicks.