Speech-to-text (STT), also known as Automatic Speech Recognition (ASR), is an AI technology that transcribes spoken language into written text. Previously reserved for the privileged few, STT is becoming increasingly leveraged by companies worldwide to embed new audio features in existing apps and create smart assistants for a range of use cases.
If you’re a CTO, CPO, data scientist, or developer interested in getting started with ASR for your business, you’ve come to the right place.
In this article, we’ll introduce you to the main models and types of STT, explain the basic mechanics and features involved, and give you an overview of the existing open-source and API solutions to try, with a comprehensive NLP glossary at the end.
A brief history of speech-to-text models
First, some context. Speech-to-text is part of the natural language processing (NLP) branch in AI. Its goal is to make machines able to understand and transcribe human speech into a written format.
How hard can it be to transcribe speech, you may wonder. The short answer is: very. Unlike images, which can be put into a matrix in a relatively straightforward way, audio data is influenced by background noise, audio quality, accents, and industry jargon, which makes it notoriously difficult for machines to grasp.
Researchers have been grappling with these challenges for several decades now. It all began with Weaver’s memorandum in 1949, which sparked the idea of using computers to process language. Early natural language processing (NLP) models used statistical methods like Hidden Markov Models (HMM) to transcribe speech, but they were limited in their ability to accurately recognize different accents, dialects, and speech styles.
The following decades saw many important developments — from grammar theories to symbolic NLP to statistical models — all of which paved the way for the ASR systems we know today. But the real step change in the field occurred in the 2010s with the rise of machine learning (ML) and deep learning.
Statistical models were replaced by ML algorithms, such as deep neural networks (DNNs) and recurrent neural networks (RNNs), capable of capturing idiomatic expressions and other nuances that were previously difficult to detect. There was still an issue of context, though: the models couldn’t infer the meaning of specific words based on the overall sentence, which inevitably led to mistakes.
The biggest breakthrough of the decade, however, was the introduction of transformers in 2017. Transformers revolutionized ASR with their self-attention mechanism. Unlike earlier models, transformers succeeded at capturing long-range dependencies between different parts of the input, allowing them to take into account the broader context of each transcribed sentence.
The advent of transformer-based ASR models has reshaped the field of speech recognition. Their superior performance and efficiency have empowered various applications, from voice assistants to advanced transcription and translation services.
Many consider that it was at that point that we passed from mere “speech recognition” to the more holistic domain of “language understanding”.
We’re at the stage where Speech AI providers rely on increasingly diverse and hybrid AI-based systems, with each new generation of tools moving closer to mimicking the way the human brain captures, processes, and analyzes speech.
As a result of these breakthroughs, the overall performance of ASR systems, in terms of both speed and quality, has improved significantly over the years, propelled by the availability of open-source repositories, large training datasets from the web, and more affordable GPU/CPU hardware.
How speech-to-text works
Today, cutting-edge ASR solutions rely on a variety of models and algorithms to produce quick and accurate results. But how exactly does AI transform speech into written form?
Transcription is a complex process that involves multiple stages and AI models working together. Here's an overview of key steps in speech-to-text:
Pre-processing. Before the input audio can be transcribed, it often undergoes some pre-processing steps. This can include noise reduction, echo cancellation, and other techniques to enhance the quality of the audio signal.
Feature extraction. The audio waveform is then converted into a more suitable representation for analysis. This usually involves extracting features from the audio signal that capture important characteristics of the sound, such as frequency, amplitude, and duration. Mel-frequency cepstral coefficients (MFCCs) are commonly used features in speech processing.
Acoustic modeling. This involves training a statistical model that maps the extracted features to phonemes, the smallest units of sound in a language.
Language modeling. Language modeling focuses on the linguistic aspect of speech. It involves creating a probabilistic model of how words and phrases are likely to appear in a particular language. This helps the system make informed decisions about which words are more likely to occur, given the previous words in the sentence.
Decoding. In the decoding phase, the system uses the acoustic and language models to transcribe the audio into a sequence of words or tokens. This process involves searching for the most likely sequence of words that correspond to the given audio features.
Post-processing. The decoded transcription may still contain errors, such as misrecognitions or homophones (words that sound the same but have different meanings). Post-processing techniques, including language constraints, grammar rules, and contextual analysis, are applied to improve the accuracy and coherence of the transcription before producing the final output.
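To illustrate how the acoustic and language models work together in the decoding step, here is a minimal sketch in Python. All scores, the candidate sentences, and the function names are made up for illustration; a real decoder searches over a far larger space of hypotheses, typically with beam search.

```python
import math

# Hypothetical acoustic scores: log-probability that the audio matches each
# candidate transcript. The homophones "their"/"there" sound identical, so
# the acoustic model alone cannot tell the two candidates apart.
ACOUSTIC_LOGPROB = {
    ("put", "it", "over", "their"): math.log(0.5),
    ("put", "it", "over", "there"): math.log(0.5),
}

# Hypothetical bigram language model: P(word | previous word).
BIGRAM_PROB = {
    ("over", "there"): 0.30,
    ("over", "their"): 0.02,
}
UNSEEN_PROB = 0.10  # flat fallback for bigrams missing from the toy table


def language_logprob(words):
    """Sum log P(word | previous word) over the sentence."""
    return sum(
        math.log(BIGRAM_PROB.get(pair, UNSEEN_PROB))
        for pair in zip(words, words[1:])
    )


def decode(candidates, lm_weight=1.0):
    """Return the candidate with the best combined acoustic + language score."""
    return max(
        candidates,
        key=lambda words: ACOUSTIC_LOGPROB[words]
        + lm_weight * language_logprob(words),
    )


best = decode(list(ACOUSTIC_LOGPROB))
print(" ".join(best))
```

With these toy numbers, the language model breaks the tie between the two homophones and the decoder settles on "put it over there".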
Key types of STT models
The exact way in which transcription occurs depends on the AI models used. Generally speaking, we can distinguish between legacy acoustic systems and those based on end-to-end deep learning models.
Legacy acoustic systems rely on a combination of traditional models like Hidden Markov Models (HMMs) and deep neural networks (DNNs) to carry out, as a series of sub-processes, the steps described above.
The transcription process here is done via traditional acoustic-phonetic matching, i.e. the system attempts to guess the word based on the sound. Because each step is executed by a separate model, this method is prone to errors and can be rather costly and inefficient, since each model involved has to be trained independently.
In contrast, end-to-end systems, powered by CNNs, RNNs, and/or transformers, operate as a single neural network, with all key steps merged into a single interconnected process. A notable example of this is Whisper ASR by OpenAI.
Designed to address the limitations of legacy systems, this approach allows for greater accuracy thanks to a more elaborate embeddings-based mechanism, enabling contextual understanding of language based on the semantic proximity of each given word.
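To make the idea of semantic proximity more concrete, here is a minimal sketch that compares toy word embeddings using cosine similarity. The three-dimensional vectors and the words are invented for illustration; real models learn embeddings with hundreds of dimensions from data.

```python
import math

# Made-up 3-dimensional embeddings, for illustration only.
EMBEDDINGS = {
    "doctor": [0.90, 0.10, 0.00],
    "physician": [0.85, 0.15, 0.05],
    "guitar": [0.00, 0.20, 0.90],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word):
    """Return the other word whose embedding is closest to the given word."""
    return max(
        (other for other in EMBEDDINGS if other != word),
        key=lambda other: cosine_similarity(EMBEDDINGS[word], EMBEDDINGS[other]),
    )


print(nearest("doctor"))
```

In this toy space, "doctor" sits much closer to "physician" than to "guitar", which is the kind of signal a model can use to prefer contextually plausible words.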
All in all, end-to-end systems are easier to train and more flexible. They also enable more advanced functionalities, such as translation, and generative AI tasks, such as summarization and semantic search.
If you want to learn about the best ASR engines on the market and models that power them, see this dedicated blog post.
Note on fine-tuning
As accurate as the latest generation of transcription models are, thanks to the new techniques and Large Language Models (LLMs) that power them, they still need a little help before they can be applied to specific transcription or audio intelligence tasks without compromising output accuracy.
Fine-tuning consists of adapting a pre-trained neural network to a new application by training it on task-specific data. It is key to making high-quality STT commercially viable.
In audio, fine-tuning is used to adapt models to technical professional domains (e.g. medical vocabulary, legal jargon), accents, languages, levels of noise, specific speakers, and more. In our guide to fine-tuning ASR models, we dive into the mechanics, use cases, and applications of this technique in much more detail.
Thanks to fine-tuning, a one-size-fits-all model can be tailored to a wide variety of specific and niche use cases without the need to retrain it from scratch.
Key features and parameters
All of the above-mentioned models and methodologies unlock an array of value-generating features for businesses. To learn more about the benefits they bring across various industries, check this article.
Beyond core transcription technology, most providers today offer a range of additional features, from speaker diarization to summarization to sentiment analysis, collectively referred to as “audio intelligence.”
Key components of a speech-to-text API
With APIs, the foundational transcription output is not always produced by the same model as the one(s) responsible for the “intelligence” layer. In fact, commercial speech-to-text providers usually combine several models to create high-quality and versatile enterprise-grade STT APIs.
Transcription: key notions
There are a number of parameters that affect the transcription process and can influence one’s choice of an STT solution or provider. Here are the key ones to consider.
Input
Format: Most transcription models deliver different levels of quality depending on the audio file format (m4a, mp3, mp4, mpeg), and some of them will only accept specific formats. Formats will apply differently depending on whether the transcription is asynchronous or live.
Audio encoding: Audio encoding is the process of changing audio files from one format to another, for example, to reduce the number of bits needed to transmit the audio information.
Frequency: This refers to the sampling rate of the audio. There is a minimum sampling rate below which speech becomes hard for speech-to-text models to make out. Most audio files produced today are sampled at 44.1 kHz or more, but some types of audio, such as phone recordings from call centers, are sampled at lower rates, typically 16 kHz or even 8 kHz. Audio recorded at a higher sampling rate than the model expects needs to be resampled. Note that a figure such as 128 for an mp3 file usually refers to its bitrate in kbps, not its sampling rate in kHz.
Bit depth: Bit depth is the number of bits used to store the amplitude of each audio sample. It is a little like image resolution but for sound. A file with a higher bit depth can represent a wider range of sound levels, from very soft to very loud. For example, DVD audio can be recorded at up to 24 bits, while most telephony happens at 8 bits.
Channels: Input audio can come in several channels: mono (single channel), stereo (dual channel), or multi-channel (several tracks). For optimal results, many speech-to-text providers need to know how many channels are in your recording, but some of them will automatically detect the number of channels and use that information to improve transcription quality.
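Before sending audio to a model or provider, it can help to check these input parameters yourself. The sketch below uses Python's standard `wave` module to read the sampling rate, bit depth, and channel count of a WAV file. The helper names and the 16 kHz target rate are assumptions made for illustration, and the example builds a silent file in memory so that it runs without any audio on disk.

```python
import io
import wave

TARGET_SAMPLE_RATE_HZ = 16_000  # assumed model input rate, for illustration


def describe_wav(source):
    """Read sampling rate, bit depth, channels and duration from a WAV file."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "bit_depth": wav.getsampwidth() * 8,  # sample width is in bytes
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
        }


def make_silent_wav(sample_rate=8000, channels=1, seconds=1):
    """Build a silent 16-bit PCM WAV in memory (zero bytes are silence)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00" * (sample_rate * 2 * channels * seconds))
    buffer.seek(0)
    return buffer


info = describe_wav(make_silent_wav())
needs_resampling = info["sample_rate_hz"] != TARGET_SAMPLE_RATE_HZ
print(info, "resample:", needs_resampling)
```

`describe_wav` also accepts a file path, so the same check can be run on a real recording. Compressed formats such as mp3 or m4a need a dedicated decoder, which is outside the scope of this sketch.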
Output
Any transcription output should have a few basic components and will generally come in the form of a series of transcribed text segments with associated IDs and timestamps.
Beyond that, it’s important to consider the format of the transcription output. Most providers will provide, at the very least, a JSON file of the transcript containing at least the data points mentioned above. Some will also provide a plain text version of the transcript, such as a .txt file, or a format that lends itself to subtitling, such as SRT or VTT.
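As an illustration of how these output formats relate to each other, here is a minimal sketch that converts a JSON transcript into SRT subtitles. The payload shape (`segments` with `id`, `start`, `end`, and `text`) is a hypothetical example; each provider defines its own schema, so field names will need adapting.

```python
import json


def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Turn a list of {start, end, text} segments into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):  # SRT cues count from 1
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)  # blank line between cues


# Hypothetical transcript payload, for illustration only.
payload = json.loads("""
{"segments": [
  {"id": 0, "start": 0.0, "end": 2.5, "text": "Hello and welcome."},
  {"id": 1, "start": 2.5, "end": 61.25, "text": "Let's get started."}
]}
""")

srt = segments_to_srt(payload["segments"])
print(srt)
```

VTT is close enough to SRT that a similar function would work, with a `WEBVTT` header line and a period instead of a comma before the milliseconds.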
Performance
Latency
Latency refers to the delay between the moment a model receives an input (i.e., the speech or audio signal) and when it starts producing the output (i.e., the transcribed text). In STT systems, latency is a crucial factor as it directly affects the user experience. Lower latency indicates a faster response time and a more real-time transcription experience.
Inference
In AI, inference refers to the action of ‘inferring’ outputs based on data and previous learning. In STT, during the inference stage, the model leverages its learned knowledge of speech patterns and language to produce accurate transcriptions.
The efficiency and speed of inference can impact the latency of an STT system.
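One simple way to put a number on latency is to time how long a system takes to return its first result and how long it takes to finish. The sketch below does this against a stand-in transcriber; `fake_streaming_transcriber` and its fixed delay are purely hypothetical and would be replaced by a call to a real model or API.

```python
import time


def fake_streaming_transcriber(chunks, delay_s=0.01):
    """Stand-in for a streaming STT model: one partial result per chunk."""
    for chunk in chunks:
        time.sleep(delay_s)  # simulated inference time per chunk
        yield f"text for {chunk}"


def measure_latency(result_stream):
    """Return (seconds to first result, seconds to last result, results)."""
    started = time.perf_counter()
    first_result_at = None
    results = []
    for result in result_stream:
        if first_result_at is None:
            first_result_at = time.perf_counter() - started
        results.append(result)
    total = time.perf_counter() - started
    return first_result_at, total, results


latency, total, results = measure_latency(
    fake_streaming_transcriber(["chunk-1", "chunk-2", "chunk-3"])
)
print(f"first result after {latency:.3f}s, all results after {total:.3f}s")
```

For live use cases the time to first result matters most, while for batch transcription the total processing time is usually the figure to compare.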
Accuracy
The performance of an STT model combines many factors, such as:
Robustness in adverse environments (e.g. background noise or static).
Coverage of complex vocabulary and languages.
Model architecture, training data quantity and quality.
Word Error Rate (WER) is the industry-wide metric used to evaluate the accuracy of a speech recognition or machine translation system. It is calculated as the number of word-level errors in the system's output (substitutions, deletions, and insertions) divided by the number of words in the reference, or ground truth, text. Because insertions count as errors, WER can exceed 100%.
Additional metrics used to benchmark accuracy are Diarization Error Rate (DER), which assesses speaker diarization, and Mean Absolute Alignment Error (MAE) for word-level timestamps.
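For readers who want to compute WER themselves, here is a minimal sketch based on word-level edit distance. It only lowercases and splits on whitespace; real benchmarks usually apply more thorough text normalization (punctuation, numbers, spelling variants), which this example leaves out.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing in output
            insertion = current[j - 1] + 1  # extra word in output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

Here the hypothesis contains two substitutions against a nine-word reference, giving a WER of about 22%.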
Languages
Even state-of-the-art multilingual models like OpenAI’s Whisper skew heavily towards some languages, like English, French, and Spanish. This happens either because of the data used to train them or because of the way the model weighs different parameters in the transcription process.
Additional fine-tuning and optimization techniques are necessary to extend the scope of languages and dialects, especially where open-source models are concerned.
Audio Intelligence
For an increasing number of use cases, transcription alone is not enough. Most commercial STT providers today offer at least some additional features, also known as add-ons, aimed at making transcripts easier to digest and more informative, as well as at extracting speaker insights. Examples include speaker diarization, summarization, and sentiment analysis.
A full list of features available with our own API can be found here.
Security
When it comes to data security, hosting architecture plays a significant role. Companies that want to integrate Language AI into their existing tech stack need to decide where they want the underlying network infrastructure to be located and who they want to own it: cloud multi-tenant (SaaS), cloud single-tenant, on-premise, air-gap.
And don’t forget to inquire about data handling policies and add-ons. After all, you don’t always wish for your confidential enterprise data to be used for training models. At Gladia, we comply with the latest EU regulations to ensure the full protection of user data.
What can you build with speech-to-text
AI speech-to-text is a highly versatile technology, unlocking a range of use cases across industries. With the help of a specialized API, you can embed Language AI capabilities into existing applications and platforms, allowing your users to enjoy transcriptions, subtitling, keyword search, and analytics. You can also build entirely new voice-enabled applications, such as virtual assistants and bots.
Some more specific examples:
Transcription services: Written transcripts of interviews, lectures, meetings, etc.
Call center automation: Converting audio recordings of customer interactions into text for analysis and processing.
Voice notes and dictation: Allow users to dictate notes, messages, or emails and convert them into written text.
Real-time captioning: Provide real-time captions and dubbing for live events, conferences, webinars, or videos.
Translation: Real-time translation services for multilingual communication.
Voice and keyword search: Search for information using voice commands or semantic search.
Speech analytics: Analyze recorded audio for sentiment analysis, customer feedback, or market research.
Accessibility: Develop apps that assist people with disabilities by converting spoken language into text for easier communication and understanding.
Current market for speech-to-text software
If you want to build speech recognition software, you’re essentially confronted with two options — build it in-house on top of an open-source model, or pick a specialized speech-to-text API provider.
Here’s an overview of what we consider to be the best alternatives in both categories.
The best option ultimately depends on your needs and use case. Of all the open-source models, Whisper ASR is generally considered the most performant and versatile, having been trained on 680,000 hours of audio data. It has been selected by many indie developers and companies alike as a go-to foundation for their ASR efforts.
Open source vs API
Here are some factors to consider when deploying Whisper or other open-source alternatives in-house:
Do we possess the necessary AI expertise to deploy a model in-house and make the improvements needed to adapt it at scale?
Do we need just batch transcription, or live transcription as well? Do we need additional features, like summarization?
Are we dealing with multilingual clients?
Is our use case specific enough to require a dedicated industry-specific vocabulary?
How long can we afford to postpone going to market while the in-house solution is brought to production? Do we have the necessary hardware (CAPEX) for it, too?
Based on first-hand experience with open-source models in speech-to-text, here are some of our key conclusions on the topic.
In exchange for full control and adaptability afforded by open source, you have to assume the full burden of hosting, optimizing, and maintaining the model. In contrast, speech-to-text APIs come as a pre-packaged deal with optimized models (usually hybrid architectures and specialized language models), custom options, regular maintenance updates, and client support to deal with downtime or other emergencies.
Open-source models can be rough around the edges (i.e. slow, limited in features, and prone to errors), meaning that you need to have at least some AI expertise to make them work well for you. To be fully production-ready and function reliably at scale, an open-source model will more realistically require a dedicated team to guarantee top performance.
Whenever you pick the open-source route and build from scratch, your time-to-market increases. It’s important to conduct a proper risk-benefit analysis, knowing that your competitors may pick a production-ready option in the meantime and move ahead.
Commercial STT providers
Commercial STT players in the space provide a range of advantages via a plug-and-play API, such as flexible pricing formulas, extended functionalities, optimized models to accommodate niche use cases, and a dedicated support team.
Beyond that, you’ll find a lot of differences between the various providers on the market.
Ever since the market for STT opened up to the general public, solutions provided by Big Tech providers such as AWS, Google, or Microsoft as part of their wider suite of services have stayed relatively expensive and poor in overall performance compared to specialized providers.
Moreover, they tend to underperform on the five key factors used to assess an ASR offering: speed, accuracy, supported languages, extra features, and price. Anyone looking for a provider in the space should carefully consider the following:
When it comes to the speed of transcription, there is a significant discrepancy between providers, ranging from as little as 10 seconds to 30 minutes or more. The latter is usually the case for the Big Tech players listed above.
Speed and accuracy tend to trade off against each other in STT, with some providers striking a significantly better balance between the two than others. Whereas Big Tech providers have a WER of 10%-18%, many startups and specialized providers are within the 1-10% WER range. That means that for every 100 words of transcription with a Big Tech provider, you can expect roughly 10 to 18 erroneous words.
The number of supported languages is another differentiator to consider. Commercial offers range from 12 to 99+ supported languages. It is important to distinguish between APIs that enable multilingual transcription and/or translation and those that extend this support to other features as well.
Availability of audio intelligence features and optimizations, like speaker diarization, smart formatting, custom vocabulary, word-level timestamps, and real-time transcription, is not to be overlooked when estimating your cost-benefit ratio. These can come as part of the core offer, as in the case of Gladia API, or be sold as a separate unit or bundle.
Finally, how does this all come together to affect the price? Once again, the market offers are as varied as you’d expect. On the high end, Big Tech providers charge up to $1.44 per hour of transcription. In contrast, some startup providers charge as little as $0.26. Some will charge per minute, while others have hourly rates or tokens, and others still only offer custom quotes.
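Because providers quote prices in different units, it helps to normalize them before comparing. The sketch below takes the two hourly prices quoted above and works out the cost of a given volume of audio, plus the per-minute equivalent. The 1,000-hour monthly volume is an arbitrary assumption for illustration.

```python
from decimal import Decimal

# Hourly prices quoted in this article, in USD.
PRICE_PER_HOUR_USD = {
    "high end": Decimal("1.44"),
    "low end": Decimal("0.26"),
}
MONTHLY_HOURS = 1000  # hypothetical volume, for illustration only


def transcription_cost(hours_of_audio, price_per_hour):
    """Total cost for a whole number of hours, rounded to cents."""
    return (Decimal(hours_of_audio) * price_per_hour).quantize(Decimal("0.01"))


def per_minute(price_per_hour):
    """Convert an hourly price to a per-minute price."""
    return (price_per_hour / 60).quantize(Decimal("0.0001"))


for label, price in PRICE_PER_HOUR_USD.items():
    cost = transcription_cost(MONTHLY_HOURS, price)
    print(f"{label}: ${cost} per month, ${per_minute(price)} per minute")
```

At this volume the gap between the two price points is $1,180 per month, before accounting for any features that are billed separately.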
Some additional resources to help you navigate the commercial market:
And that’s a wrap! If you enjoyed our content, feel free to subscribe to our newsletter for more actionable tips and insights on Language AI.
Ultimate Glossary of Speech-to-Text AI
Speech-to-Text - also known as automatic speech recognition (ASR), it is the technology that converts spoken language into written text.
Natural Language Processing (NLP) - a subfield of AI that focuses on the interactions between computers and human language.
Machine Learning - a field of artificial intelligence that involves developing algorithms and models that allow computers to learn and make predictions or decisions based on data, without being explicitly programmed for specific tasks.
Neural Network - a machine learning algorithm that is modelled after the structure of the human brain.
Deep Learning - a subset of machine learning that involves the use of deep neural networks.
Acoustic Model - a model used in speech recognition that maps acoustic features to phonetic units.
Language Model - a statistical model used in NLP to determine the probability of a sequence of words.
Large Language Model (LLM) - an advanced AI system, like GPT-3, that is trained on massive amounts of text data to generate human-like text and perform various natural language processing tasks.
Phoneme - the smallest unit of sound in a language, which is represented by a specific symbol.
Transformers - a neural network architecture that relies, among other things, on a multi-head self-attention mechanism, which allows the model to attend to different parts of the input sequence to capture their relationships and dependencies.
Encoder - in the context of neural networks, a component that transforms input data into a compressed or abstract representation, often used in tasks like feature extraction or creating embeddings.
Decoder - a neural network component that takes a compressed representation (often from an encoder) and reconstructs or generates meaningful output data, frequently used in tasks like language generation or image synthesis.
Embedding - a numerical representation of an object, such as a word or an image, in a lower-dimensional space where relationships between objects are preserved. Embeddings are commonly used to convert categorical data into a format suitable for ML algorithms and to capture semantic similarities between words.
Dependencies - relationships between words and sentences in a given text. They can be related to grammar and syntax or to the content’s meaning.
Speaker Diarization - the process of separating and identifying who is speaking in a recording or audio stream. You can learn more here.
Speaker Adaptation - the process of adjusting a speech recognition model to better recognize the voice of a specific speaker.
Language Identification - the process of automatically identifying the language being spoken in an audio recording.
Keyword Spotting - the process of detecting specific words or phrases within an audio recording.
Automatic Captioning - the process of generating captions or subtitles for a video or audio recording.
Speaker Verification - the process of verifying the identity of a speaker, often used for security or authentication purposes.
Speech Synthesis - the process of generating spoken language from written text, also known as text-to-speech (TTS) technology.
Word Error Rate (WER) - a metric used to measure the accuracy of speech recognition systems.
Recurrent Neural Network (RNN) - a type of neural network that is particularly well-suited for sequential data, such as speech.
Fine-Tuning vs. Optimization - fine-tuning involves training a pre-existing model on a specific dataset or domain to adapt it for better performance, while optimization focuses on adjusting the hyperparameters and training settings to maximize the model's overall effectiveness. Both processes contribute to improving the accuracy and suitability of speech-to-text models for specific applications or domains.
Model Parallelism - enables different parts of a large model to be spread across multiple GPUs, allowing the model to be trained in a distributed manner with AI chips. By dividing the model into smaller parts, each part can be trained in parallel, resulting in a faster training process compared to training the entire model on a single GPU or processor.
About Gladia
At Gladia, we built an optimized version of Whisper in the form of an API, adapted to real-life business use cases and distinguished by exceptional accuracy, speed, extended multilingual capabilities, and state-of-the-art features, including speaker diarization and word-level timestamps.