We’re happy to announce the general availability of Gladia’s new real-time audio transcription and insights engine: an easily integrated, multilingual voice API that combines speech recognition and generative AI to provide transcription, insights, and assistance for contact centers, virtual meetings, and editing platforms in real time.
Gladia Real-Time transcribes audio at latency as low as 300 milliseconds, supports 100+ languages interchangeably and includes embedded custom vocabulary, named entity recognition and sentiment analysis.
Highly versatile in its applications, real-time transcription is especially valuable for contact center solutions, software providers, voice AI companies and virtual meeting recorders. We’re thrilled to deliver this upgraded and improved product to customers worldwide.
In this blog, we’ll dive into the hidden mechanisms behind real-time transcription, explore its key challenges and use cases, and explain how to get started with real-time transcription using Gladia’s API.
Understanding live transcription
In a nutshell, live transcription operates by capturing audio input from sources like microphones or streaming services, processing the audio using Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) technology, and providing a near-instant, continuous stream of transcribed text as the speaker talks.
Transcribing speech in real time is rife with technical challenges and requires a hybrid ASR / NLP model architecture to yield accurate results consistently.
At its core, Gladia’s API is based on OpenAI’s Whisper ASR. Because the original version of the model doesn’t support real-time transcription and WebSockets, our approach consists of reengineering Whisper to add top-tier transcription in real time while keeping its core functionality and quality intact. Today, the quality of Gladia’s proprietary transcription engine is attributed to a hybrid architecture, where optimization occurs at all key stages of its end-to-end transcription process.
Reducing transcription latency (the time it takes for the audio input to become text output) to levels as low as 300 milliseconds requires several things to line up. First off, you need to get to grips with your hardware and ensure that your GPU architecture matches your needs. Anything beyond those needs is superfluous and will add latency. You won't get a huge drop from this alone, but a balanced GPU choice that is right for your model is a great start.
Secondly, and perhaps more importantly, you need to optimize the algorithm. Most open source models out there are not production-ready for any service. Despite having been built by some of the sharpest PhD-brains out there, they're not plug-and-play. You need to dive deep into the code and strip it of everything you don't need for your use case. Any extra weight will most likely add to your latency. Let's break this down further into specific sub-stages.
Speech recognition & NLP
First, we implement filtering or other pre-processing techniques to optimize the input audio for real-time processing.
Then, we need the system to accurately transcribe and understand speech. Our enhanced language detection, supporting 100+ languages, comes in handy here by allowing us to automatically determine the language or dialect relevant to your application. We use various NLP techniques to enhance the accuracy of transcription by considering context, grammar, and semantics, as well as adding word-level timestamp metadata if needed.
Our API also includes embedded custom vocabulary, letting you add entries to enhance the precision of transcription, especially for words or phrases that recur often in your audio file. All without compromising on latency. Further, Named Entity Recognition (NER) helps with identifying and extracting keywords and named entities such as organizations, names, locations, events, dates, and many more elements from audio files.
Real-time processing
In a live transcription scenario, audio data is continuously generated as a user speaks. The ability to display the transcript as it’s being said with minimal perceptible delay is a key technical requirement for a satisfying end-user experience.
In ASR, the delay between the time a speaker utters a word or phrase and the time the ASR system produces the corresponding transcription result is known as latency.
The acceptable range for low latency is highly dependent on the specific needs of each application and end-user expectations. Our real-time latency is around 300 milliseconds, making it optimal for most contact center solutions, software providers and AI voice assistants that require real-time control and response.
To ensure a consistent, real-time flow of information, we rely on advanced streaming capabilities and use a combination of WebSocket and VAD technologies.
WebSocket is a protocol that facilitates bidirectional, real-time communication between a client (e.g. a web browser or application) and a server (where our API is hosted), ensuring consistent low-latency audio transmission and updates. Result: immediate access to live transcriptions for end users, with reduced network overhead and resource utilization on both the client and server sides. To learn more about setting up a WebSocket and using it with Gladia, check this Golang tutorial on the topic. Other programming languages are available in the Gladia repository on GitHub, here.
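To make the streaming side more concrete, here is a minimal Python sketch of the client-side step that usually precedes sending audio over a WebSocket: cutting raw audio into small, fixed-duration frames. It is an illustration under stated assumptions, not Gladia's client code. The function name `pcm_frames`, the 16 kHz mono 16-bit PCM format and the 100 ms frame size are all hypothetical choices; the actual audio format and message schema expected by the API are described in the developer documentation.

```python
from typing import Iterator


def pcm_frames(
    pcm: bytes,
    sample_rate: int = 16_000,   # assumed sample rate (Hz), illustrative only
    frame_ms: int = 100,         # assumed frame duration, illustrative only
    bytes_per_sample: int = 2,   # 16-bit mono PCM
) -> Iterator[bytes]:
    """Yield consecutive fixed-duration frames from raw mono PCM audio.

    The last frame may be shorter than the others if the audio length
    is not a multiple of the frame size.
    """
    frame_bytes = sample_rate * frame_ms // 1000 * bytes_per_sample
    if frame_bytes <= 0:
        raise ValueError("frame size must be at least one sample")
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]
```

Each yielded frame would then be sent as one message on the open connection. Smaller frames reach the server sooner but create more messages, so the frame size is a trade-off between latency and overhead.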
Voice Activity Detection (VAD) is a technology used to determine whether there is significant audio activity (speech) in an audio signal. It analyzes incoming audio data and identifies periods of speech and silence. End-pointing is an especially critical step in VAD, where the system identifies the moment when speech ends or transitions into silence or non-speech sounds to produce more accurate end results. By default, 300 milliseconds of “blank” in the voice triggers the transcription, while customers can specify their own duration.
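End-pointing can be illustrated with a simple energy-based sketch in Python. This is a deliberately naive stand-in for a production VAD, not the detector used in Gladia's engine: it treats a frame as speech when its RMS energy exceeds a threshold and closes an utterance once the silence lasts for `silence_ms`. The names `rms` and `find_endpoints`, the threshold value and the frame duration are assumptions made for the example; only the 300 ms default comes from the text above.

```python
import math
from typing import Iterable, List, Optional, Sequence, Tuple


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square energy of one frame of integer samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def find_endpoints(
    frames: Iterable[Sequence[int]],
    frame_ms: int = 30,        # assumed frame duration
    silence_ms: int = 300,     # silence needed to close an utterance
    threshold: float = 500.0,  # assumed energy threshold, tune per input
) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) for each detected utterance."""
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None  # start time of the open utterance, if any
    silent_run = 0               # ms of consecutive silence inside it
    t = 0                        # start time of the current frame
    for frame in frames:
        if rms(frame) >= threshold:
            if start is None:
                start = t
            silent_run = 0
        elif start is not None:
            silent_run += frame_ms
            if silent_run >= silence_ms:
                # The utterance ended where the silence began.
                segments.append((start, t + frame_ms - silent_run))
                start = None
                silent_run = 0
        t += frame_ms
    if start is not None:
        # Audio ended while an utterance was still open.
        segments.append((start, t - silent_run))
    return segments
```

A shorter `silence_ms` makes the system respond faster but risks cutting a speaker off mid-sentence during a natural pause, which is why the duration is worth exposing as a setting.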
Combining WebSockets with VAD enabled us to build an efficient and responsive live transcription machine, delivering great results in real-life professional use cases in terms of both accuracy and latency.
Important to know 💡
What is the difference between partials and finals?
Partial recognition, or ‘partials’, involves transcribing portions of spoken words or phrases as they are received, even before the speaker has finished speaking the entire word or sentence. Transcribing speech “as you go” in this way makes for lower-than-average latency, at the expense of accuracy.
In contrast, final recognition, or ‘finals’, occurs when the ASR system has enough information to transcribe a complete word or phrase. It waits for a clear endpoint before providing a transcription and is powered by a bigger model that “rewrites” the script retrospectively. The delay may be slightly longer, but still provides a near-instant experience for the user.
When to use each?
Gladia API uses a hybrid approach that combines both partial and final recognition. Our system transcribes partial segments for real-time feedback and switches to final recognition when it has enough context to transcribe with high accuracy.
As a rule of thumb, we generally recommend prioritizing finals owing to greater accuracy. That said, partials can be incredibly useful for use cases where a real-time UI display is a must.
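On the client side, the usual way to combine the two is to show the latest partial as provisional text and replace it once the matching final arrives. The Python sketch below illustrates that pattern. The `LiveTranscript` class and the `"partial"` / `"final"` labels are hypothetical and chosen for the example; they are not taken from Gladia's message schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LiveTranscript:
    """Keeps confirmed text plus the current provisional partial."""

    finals: List[str] = field(default_factory=list)
    partial: str = ""

    def update(self, kind: str, text: str) -> None:
        if kind == "partial":
            # A newer partial overwrites the previous provisional text.
            self.partial = text
        elif kind == "final":
            # A final confirms the utterance and clears the provisional text.
            self.finals.append(text)
            self.partial = ""
        else:
            raise ValueError(f"unknown result kind: {kind!r}")

    def render(self) -> str:
        """Text to display: confirmed utterances, then the open partial."""
        parts = self.finals + ([self.partial] if self.partial else [])
        return " ".join(parts)
```

Anything stored or passed to downstream analysis would read from `finals` only, while the UI calls `render()` after every update.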
Scalability and load balancing
Because the bidirectional flow to the WebSocket is constant, the underlying infrastructure needs to be running 100% of the time, which makes it more expensive.
To draw an analogy, audio processed via batch, or asynchronous, transcription can be compared to a ZIP file – since it’s compressed, its storage value for an API provider is significantly lower. With this kind of file, the so-called ‘real-time factor’ of execution is very small (e.g., 1/60 factor in the case of standard hour-long audio without diarization) compared to audio sourced from live streaming scenarios (where it becomes more like 1/1).
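The arithmetic behind these factors is simple enough to write down. The Python sketch below assumes the common definition of real-time factor as the time the infrastructure is busy divided by the duration of the audio. Under that reading, a 1/60 factor means an hour of audio is processed in about a minute, whereas a live stream keeps the connection busy for the full hour. The function name `real_time_factor` is illustrative.

```python
def real_time_factor(busy_seconds: float, audio_seconds: float) -> float:
    """Time the infrastructure is occupied per second of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return busy_seconds / audio_seconds


# Batch: one hour of audio processed in roughly one minute.
batch_rtf = real_time_factor(busy_seconds=60, audio_seconds=3600)

# Live: the stream occupies the connection for as long as the audio lasts.
live_rtf = real_time_factor(busy_seconds=3600, audio_seconds=3600)
```

With these numbers, the same hour of audio keeps the infrastructure busy roughly 60 times longer in the live case than in the batch case.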
As such, the final key challenge of providing a live transcription API consists of finding ways to ease the load on the underlying infrastructure without imposing high costs on the client. To address this, a speech-to-text provider must design an internal infrastructure capable of scaling horizontally.
At Gladia, we implement special load-balancing strategies to distribute transcription requests across multiple servers and instances to handle high volumes of audio input – without making our clients bear an unreasonable cost.
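As a generic illustration of the idea (not a description of Gladia's internal strategy), here is a minimal least-connections balancer in Python. Because live streams are long-lived, routing each new session to the server with the fewest open streams tends to spread load more evenly than plain round-robin. The class name and server names are hypothetical.

```python
from typing import Dict, Iterable


class LeastConnectionsBalancer:
    """Routes each new stream to the server with the fewest open streams."""

    def __init__(self, servers: Iterable[str]) -> None:
        self.active: Dict[str, int] = {name: 0 for name in servers}
        if not self.active:
            raise ValueError("at least one server is required")

    def acquire(self) -> str:
        """Pick a server for a new stream and count it as open."""
        # Ties go to the server that was registered first.
        server = min(self.active, key=lambda name: self.active[name])
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        """Mark one stream on the given server as closed."""
        if self.active.get(server, 0) <= 0:
            raise ValueError(f"no open streams on {server!r}")
        self.active[server] -= 1
```

Scaling horizontally then amounts to adding entries to the server pool as traffic grows, with `acquire` called when a WebSocket session opens and `release` when it closes.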
Use cases for live transcription
Complex as it may be on the technical side, live transcription is an incredibly valuable feature that helps to gain immediate access to speaker insights and enables a delightful user experience.
Real-time transcription is especially useful in scenarios where you need to react directly to what's being said and where very low latency or wait time is required. Conversational bots are another common application, as are real-time captions for conferences and videos.
Here are some specific use cases we’ve worked with at Gladia so far:
Virtual meetings. Documenting time-sensitive meetings without having to wait for the transcript or generating real-time captions in international meetings.
Customer support. Transcribing customer inquiries and agent responses in real-time to assist customer service representatives in providing more accurate and efficient support and conducting quality assurance.
Healthcare. Transcription during both in-person and remote medical consultations, as well as for emergency call services, for more effective allocation of the medical personnel’s valuable time. Can be used for medical conferences, too.
Finance. Providing the stakeholders with immediate access to up-to-date financial information in an industry where speed is key.
Media. Making use of the feature during live broadcasting and events for real-time subtitling and dubbing.
Getting started with Gladia live transcription API
At Gladia, we built an optimized version of Whisper in the form of an API, adapted to real-life professional use cases and distinguished by exceptional accuracy, speed, extended multilingual capabilities and state-of-the-art features, including speaker diarization and word-level timestamps.
To get started with live transcription, you can create a free account on app.gladia.io and consult our developer documentation for more detailed guidance on its implementation.