Redefining what’s possible with speech-to-text AI
Published on
Mar 2024

Today, we're thrilled to release the alpha version of our audio transcription API into the world! Powered by our speech-to-text AI, it can transcribe one hour of audio in as little as 10 seconds, with a word error rate (WER) [1][2] as low as 1% [3], making it one of the biggest breakthroughs in audio transcription technology to date.

We’re very excited about what this release means for audio intelligence, and more broadly for future applications of AI to all kinds of tasks, made possible thanks to plug-and-play APIs.

You can sign up to try our audio transcription alpha here. Test, play, and let us know what you think. We can’t wait to hear your feedback so we can perfect the product.

Now, let’s dive into our story!

The audio intelligence market is broken

Language is the primary way we interact with those around us, and the channel that carries the most information: the meaning of our words, the emotions in our tone of voice, and the layers of context embedded in what we choose to say or leave unsaid.

Until today, mass adoption of voice technology has been held back by several factors. Unlike images, which reduce cleanly to mathematical representations, audio is an especially difficult modality for machines to process: recording quality, background noise, and the diversity of audio sources, formats, and compression types have long made it a headache for the AI industry.

Fortunately, the latest developments in natural language processing (NLP) — especially with transformers — have led to a new generation of robust neural networks capable of capturing and translating speech more accurately.

Despite this, the market for audio recognition has remained broken for a while, with traditional providers failing to strike the right balance between quality and price, which limits widespread access to the technology.

Here’s what a typical customer experience in voice tech currently looks like:

We thought there was a better way — and went on to prove it.

Audio transcription should be a commodity, not a luxury

Speaking with our users, we realized that transcription is as fundamental to audio processing as encoding/transcoding is to video processing.

In contrast to audio, the video market has reached an optimal price for its essential commodity, close to the cost of production, with balanced unit economics. Speech transcription, on the other hand, is about 6x as expensive.

In addition, we kept hearing that existing speech-to-text APIs were complicated to implement, requiring in-depth expertise on things like audio formats, sampling rates, bit depth [4], and stereo/mono/mixed channel layouts.
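To make that list concrete, here is a short sketch using only Python’s standard-library wave module, showing the kind of parameter juggling (sampling rate, bit depth, channel count) that legacy APIs pushed onto developers. The file name and the test tone are purely illustrative:

```python
import math
import struct
import wave

# Write one second of a 440 Hz sine tone as 16 kHz, mono, 16-bit PCM --
# exactly the parameters many STT APIs have traditionally required upfront.
frames = b"".join(
    struct.pack("<h", int(32767 * math.sin(2 * math.pi * 440 * i / 16000)))
    for i in range(16000)
)
with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 2 bytes per sample = 16-bit depth
    w.setframerate(16000)  # 16 kHz sampling rate
    w.writeframes(frames)

# Read the parameters back -- mismatches here (e.g. sending 44.1 kHz stereo
# to an endpoint expecting 16 kHz mono) are a classic integration bug.
with wave.open("tone.wav", "rb") as r:
    print(r.getnchannels(), r.getsampwidth() * 8, r.getframerate())
    # → 1 16 16000
```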

So when we set out to find the most pressing use cases to address with Gladia’s knowledge infrastructure tech, audio was the obvious place to start.

We believe that democratizing access to Speech-to-Text AI is not only a matter of cost, but also of simplifying the underlying complexity of the tools.

Our goal is to make state-of-the-art audio intelligence accessible by fundamentally rethinking how model deployment is optimized.

Because using AI should be simple. Be it for audio or other tasks, we want our users to come as they are, whatever their native tech stack or file formats.

Our driving conviction is that everything we build as a company should put the user at most 3 clicks away from enjoying cutting-edge AI.

What core technology did we use to build our Speech-to-Text AI?

Leveraging the latest NLP, ML, and deep learning research, we created a unique Speech-to-Text API powered by OpenAI’s Whisper models, including the Large-v2.

Building on a proprietary approach to neural network optimization, we improved inference speed for high-quality speech recognition by around 60x compared to major Speech-to-Text providers. We expect to further improve on those numbers soon.

Our know-how makes it possible to achieve unmatched performance for audio transcription in terms of speed, accuracy, and price.
Gladia's key performance indicators

While we’re not able to share final pricing yet, we are committed to making Gladia APIs among the most affordable on the market, while maintaining the highest quality standards.

At the moment, we’re working with over 250 models to create a holistic audio intelligence solution capable of performing more than 45 tasks, including translation, conversation summarization, gender detection, and sentiment analysis.

The now-live alpha API unlocks access to its core feature: transcription. We’re on track to release the rest shortly, following the feedback from our alpha users.

A word about speed, quality, and benchmarks

Word Error Rate

With the latest advances in automatic speech recognition (ASR), we have entered a new era in quality standards, one that requires new datasets and benchmarks.

For instance, Whisper generates human-readable transcriptions [5], whereas previous technologies were limited to machine-readable ones. Human-readable means that the speech-to-text system outputs commas, periods, quotation marks, hyphens, and case-sensitive spellings, resulting in much higher-quality output.

This means that benchmark datasets need new normalized standards to establish apples-to-apples comparisons, as human-readable transcription is on its way to becoming the norm.

In our case, we use human-readable transcription as the baseline for calculating Word Error Rate (WER), and we reach rates as low as 1% on higher-quality audio.
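For reference, WER is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. Below is a minimal sketch in Python; the normalize step is a simplified stand-in for the fuller text normalizers used in published benchmarks, illustrating why human-readable output (punctuation, casing) must be normalized before an apples-to-apples comparison:

```python
import re

def normalize(text: str) -> str:
    # Strip punctuation (keeping apostrophes) and lowercase, so that
    # human-readable output can be compared against plain references.
    return re.sub(r"[^\w\s']", "", text.lower()).strip()

def wer(reference: str, hypothesis: str) -> float:
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("Hello, world! How are you?", "hello world how are you"))  # → 0.0
```

Without the normalization step, the punctuated hypothesis above would be penalized for every comma and capital letter, which is exactly the apples-to-oranges problem new benchmarks need to avoid.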

Inference Speed

Another metric we look at to assess how well we perform compared to traditional providers is inference speed.

We established a baseline by measuring the inference speed of major STT providers as follows: we transcribed one hour of audio in both mono and stereo configurations, at a 16 kHz sampling rate with 16-bit encoding, and compared the results against how fast our models completed the same task with the same parameters.

Gladia's performance benchmarks
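Comparisons like this reduce to a real-time factor (seconds of audio transcribed per second of compute) and a speedup ratio between providers. A quick sketch with illustrative numbers only: the one-hour-in-10-seconds figure comes from this post, while the 10-minute baseline is an assumption chosen to match the ~60x speedup mentioned above, not a measured result:

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are transcribed per second of compute."""
    return audio_seconds / processing_seconds

# Illustrative numbers: one hour of audio processed in 10 s, versus a
# hypothetical provider needing 10 minutes for the same file.
fast = real_time_factor(3600, 10)    # 360x faster than real time
slow = real_time_factor(3600, 600)   # 6x faster than real time
print(f"speedup: {fast / slow:.0f}x")  # → speedup: 60x
```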

What happens now?

For Gladia, audio is just the first step.

Our long-term ambition is to help individuals and organizations build knowledge infrastructure platforms to connect all their internal text, audio or visual data and make it discoverable in real-time through AI-powered semantic search. More on that soon.

In the meantime, we welcome your feedback so we can debug and iterate on our first product, and we can’t wait to see what the collective creativity of users out there comes up with.

We’re looking forward to hearing from you, and to bringing better, more accessible AI to people everywhere.

Some useful additional materials:

  • Video demo on using the API key in just a few clicks.

Footnotes

[1] See OpenAI Whisper for more info: https://arxiv.org/pdf/2212.04356.pdf.

[2] Depending on language and benchmark datasets, as stated in the paper.

[3] Based on real-world use cases tested with our clients.

[4] iZotope, “Digital Audio Basics: Audio Sample Rate and Bit Depth.”

[5] https://arxiv.org/pdf/2102.11114.pdf
