In most software tools, key steps can run sequentially without users noticing. But to achieve naturally flowing conversations, voice agents need to feel instant, and that is only possible when key steps run in parallel.
This article examines why and how to build voice agents whose key processes run concurrently. We'll explore pipeline designs that minimize perceived latency while maintaining system reliability and accuracy.
We'll also cover the orchestration of STT, LLM, and TTS components, examine common failure modes and their solutions, and provide practical guidance to build systems that can handle the demanding requirements of real-time conversational AI.
Key takeaways
- In real-time systems, audio capture, transcription, LLM processing, and TTS must run in parallel to avoid delays.
- Streaming STT with partials is essential to reduce latency and enable faster, more natural voice agent responses.
- Architectural choices such as async queues, actor models, and thread pools, along with how you handle race conditions, directly impact system responsiveness, fault tolerance, and scalability.
How real-time voice agents are built
Before diving into the architecture, let’s first examine the key stages in a real-time voice agent. At its core, every real-time voice agent implements a five-stage pipeline, though the boundaries between stages often blur in production systems:
1. Audio capture
Raw audio streams enter the system through web browsers, telephony systems, or mobile device microphones. The capture system must deal with variable network conditions, audio quality issues, and different encoding formats while maintaining a consistent internal representation.
2. Real-time speech-to-text (STT)
Streaming STT systems produce partial transcription results as audio arrives, rather than waiting for complete utterances. Early results might be incomplete or inaccurate, but they let downstream components start processing before the user finishes speaking.
The STT stage must balance speed against accuracy, often using multiple models or confidence thresholds to determine when partial results are reliable enough for processing.
3. Natural language understanding and intent recognition
Once text is available, the system extracts meaning and determines actions. This typically involves large language models that can handle context, maintain conversation state, and identify when the user has finished expressing a complete thought.
4. Response generation
Next, the system generates an appropriate response using template-based replies, complex LLM-generated text, retrieval-augmented generation (RAG) from knowledge bases, or function calling to external APIs.
The challenge lies in producing coherent, contextually appropriate responses while minimizing the time between intent recognition and response initiation.
5. Text-to-speech synthesis and audio playback
The final stage converts generated text into natural-sounding speech and delivers it to the user. Modern TTS systems can begin synthesis as soon as the first words of a response are available, streaming audio back to the user while later portions of the response are still being generated.
The goal: 500ms or less
The target for natural-feeling conversation is a total round-trip latency of under 500 milliseconds. That’s the time from when a user stops speaking to when the agent begins responding.
In a sequential system, each stage's latency adds to the total delay. If transcription, LLM inference, and speech synthesis each take a few hundred milliseconds, you quickly blow past this goal time.
Most engineers look for the slowest stages and work to optimize them. But even the fastest sequential flows have their limits. The far more effective solution is to get these steps running in tandem.
Concurrent pipeline architecture
The five stages need not operate sequentially. A well-designed system allows multiple stages to process different parts of the conversation simultaneously:
- STT can deliver partial transcriptions before the customer has finished speaking.
- Intent recognition can operate on partial transcriptions while STT continues refining the complete phrase.
- Retrieval-augmented generation (RAG) can begin as soon as enough context is available, even before the user finishes speaking.
- TTS can prepare the first sentence of a response while the LLM continues generating subsequent sentences.
In a properly concurrent system, stages overlap significantly. This dramatically reduces perceived latency by overlapping operations that would otherwise execute sequentially.
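To make the overlap concrete, here is a minimal sketch of a concurrent pipeline using Python's asyncio. The stage functions and the fake audio source are placeholders standing in for real STT, LLM, and TTS services, not any specific API:

```python
import asyncio

async def stt_stage(audio_chunks, transcripts):
    # Emit partial transcripts as audio arrives instead of waiting for the full utterance.
    async for chunk in audio_chunks:
        partial = f"partial transcript for {chunk}"  # placeholder for a streaming STT call
        await transcripts.put(partial)
    await transcripts.put(None)  # signal end of speech

async def llm_stage(transcripts, responses):
    # Start reasoning over partials while STT is still refining the transcript.
    while (partial := await transcripts.get()) is not None:
        await responses.put(f"draft response to: {partial}")  # placeholder for an LLM call
    await responses.put(None)

async def tts_stage(responses):
    # Begin synthesizing the first sentence while later sentences are still being generated.
    while (text := await responses.get()) is not None:
        print(f"speaking: {text}")  # placeholder for streaming TTS playback

async def fake_audio_source():
    for i in range(3):
        await asyncio.sleep(0.1)  # simulate audio frames arriving in real time
        yield f"chunk-{i}"

async def main():
    transcripts: asyncio.Queue = asyncio.Queue()
    responses: asyncio.Queue = asyncio.Queue()
    # All three stages run concurrently, connected by queues.
    await asyncio.gather(
        stt_stage(fake_audio_source(), transcripts),
        llm_stage(transcripts, responses),
        tts_stage(responses),
    )

asyncio.run(main())
```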
That’s the concept in a nutshell. Now, let’s look in detail at the STT and TTS stages, which tend to be the biggest contributors to latency in voice agent tools.
Streaming STT and partials reduce latency at source
Traditional STT approaches wait for complete audio segments before producing transcriptions. By contrast, streaming STT produces partial transcription hypotheses as audio arrives. This lets downstream components begin processing before users finish speaking.
Initial partial results arrive within 100-200ms of the user starting to speak, and serve as early signals to the rest of the pipeline. They let LLMs begin understanding intent and generating responses while the user is still speaking.
This predictive processing can save 200-400ms in typical interactions.
Managing partial instability
Partials are, by definition, unstable. Thus, blindly sending partials to an LLM can waste compute power and produce erratic behavior.
Instead, developers should:
- Use confidence thresholds: Only forward partials that exceed a chosen certainty score—typically 0.7-0.8.
- Apply stabilization heuristics: Require a minimum partial length (3-5 words), or wait a set period of time (200-300ms), before forwarding partials.
- Track token-level deltas: Forward only new or changed tokens for incremental processing.
These techniques help balance responsiveness with reliability. They minimize jitter while avoiding undue delay.
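As an illustration, here is a minimal sketch of a partial-transcript filter combining the three techniques above. The confidence scale, thresholds, and the PartialFilter class itself are assumptions for the example, not values or types from any particular STT API:

```python
import time

class PartialFilter:
    """Decide whether a partial transcript is stable enough to forward downstream."""

    def __init__(self, min_confidence=0.75, min_words=3, settle_ms=250):
        self.min_confidence = min_confidence   # confidence threshold (assumed 0-1 scale)
        self.min_words = min_words             # stabilization heuristic: minimum length
        self.settle_ms = settle_ms             # stabilization heuristic: quiet period
        self.last_text = ""
        self.last_change = time.monotonic()

    def should_forward(self, text: str, confidence: float) -> bool:
        now = time.monotonic()
        if text != self.last_text:
            self.last_text = text
            self.last_change = now
        settled = (now - self.last_change) * 1000 >= self.settle_ms
        long_enough = len(text.split()) >= self.min_words
        return confidence >= self.min_confidence and (settled or long_enough)

    def delta(self, text: str, previous: str) -> str:
        # Token-level delta: forward only the newly appended words.
        old_tokens = previous.split()
        return " ".join(text.split()[len(old_tokens):])
```

In practice, you would call should_forward on every partial event from your STT stream and pass only the delta of accepted partials to the LLM.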
Balancing speed and accuracy
The fundamental tradeoff in streaming STT is between latency reduction and transcription accuracy. Aggressive partial processing can reduce response times by 30-50%, but early hypotheses may prove incorrect.
Production systems typically implement multiple strategies simultaneously:
- High-confidence partials are processed immediately for common intents (such as greetings or simple requests).
- Medium-confidence partials trigger preparatory actions (loading relevant context, beginning response templates).
- Low-confidence partials are buffered until final transcription or additional context arrives.
Tune these thresholds to your needs. Customer service bots might favor accuracy over speed, while gaming or entertainment applications might prioritize responsiveness over perfect transcription.
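One way to express this tiered handling is a simple dispatcher. The thresholds and action names below are illustrative assumptions, to be tuned per application:

```python
def route_partial(text: str, confidence: float):
    """Route a partial transcript based on confidence (thresholds are illustrative)."""
    if confidence >= 0.85:
        return ("process_now", text)        # high confidence: act immediately
    if confidence >= 0.6:
        return ("prepare_context", text)    # medium: prefetch context, warm templates
    return ("buffer", text)                 # low: wait for the final transcript

print(route_partial("I need to check my account", 0.9))   # -> ("process_now", ...)
```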
Pre-emptive TTS and overlapping speech generation
Just as streaming STT enables early processing of user input, intelligent systems can begin preparing responses before users finish speaking. This helps create agents that feel natural in conversation, not robotic.
Sophisticated voice agents predict likely user intents and begin generating appropriate responses. If a customer says "I need to check my account..." the system can immediately begin preparing account-related responses while continuing to listen for the complete request.
This predictive approach works best for common interaction patterns where user intent becomes clear early in their utterance.
TTS preloading that supports interruptions
Modern TTS systems synthesize audio as soon as the first words are available. This dramatically reduces delays, but requires careful interrupt handling when predictions prove incorrect.
This might involve:
- Multiple candidate responses being synthesized in parallel, with the best option selected based on evolving context.
- Modular response generation where common prefixes are pre-synthesized and combined with dynamic content.
- Fallback mechanisms that can seamlessly transition between predicted and actual responses.
Natural turn-taking
Pre-emptive systems can respond so quickly that they seem to jump naturally into the conversation, coming in before the previous speaker finishes. But this requires sophisticated pause detection and turn-taking logic: there’s a difference between a meaningful pause and a brief hesitation.
Typical implementations use:
- Multi-threshold pause detection (100ms for hesitation, 300ms for turn-taking, 800ms for definitive completion).
- Acoustic analysis to distinguish natural speech rhythm from actual conversation turns.
- Context-aware timing that adjusts pause sensitivity based on conversation state.
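A minimal sketch of multi-threshold pause classification might look like this, assuming you already track the silence duration since the last detected speech frame; the thresholds mirror the list above and the context adjustment is illustrative:

```python
def classify_pause(silence_ms: float, context: str = "default") -> str:
    """Classify a silence gap using multiple thresholds (values are illustrative)."""
    # Context-aware timing: be more patient when the user is mid-answer to an open question.
    turn_threshold = 450 if context == "open_question" else 300
    if silence_ms < 100:
        return "hesitation"             # ignore: natural speech rhythm
    if silence_ms < turn_threshold:
        return "possible_turn"          # prepare a response, but keep listening
    if silence_ms < 800:
        return "turn_taking"            # likely end of turn: start responding
    return "definitive_completion"      # user has clearly finished

print(classify_pause(350))                    # -> "turn_taking"
print(classify_pause(350, "open_question"))   # -> "possible_turn"
```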
Essential guardrails
Pre-emptive response generation comes with the risk of speaking over users or responding to incorrect predictions. Production systems require multiple layers of protection:
- Interrupt detection: Monitor the audio input stream to detect when users resume speaking and immediately halt TTS output, with a graceful transition back to listening mode.
- Confidence gating: Generate responses only when confidence exceeds established thresholds, preventing responses to uncertain or incomplete user input.
- Latency buffers: Small delays (50-100ms) between detecting user pauses and beginning a response.
When implemented correctly, pre-emptive TTS creates the illusion of instantaneous understanding while maintaining robust error handling for edge cases.
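Here is a sketch of the interrupt-detection guardrail, assuming a cancellable TTS playback task and a voice-activity callback from the audio input stream (both faked in the demo):

```python
import asyncio
import contextlib

async def speak_with_barge_in(play_tts, user_is_speaking, poll_ms=20):
    """Play a TTS response, but halt immediately if the user starts talking again."""
    tts_task = asyncio.create_task(play_tts())
    while not tts_task.done():
        if user_is_speaking():
            tts_task.cancel()                          # stop audio output immediately
            with contextlib.suppress(asyncio.CancelledError):
                await tts_task                         # wait for playback to actually stop
            return "interrupted"                       # caller transitions back to listening
        await asyncio.sleep(poll_ms / 1000)            # short polling interval
    return "completed"

async def demo():
    async def fake_tts():
        await asyncio.sleep(2.0)                       # stand-in for streaming audio playback

    loop = asyncio.get_running_loop()
    start = loop.time()
    # Simulate the user barging in roughly 300ms after playback starts.
    result = await speak_with_barge_in(
        fake_tts,
        user_is_speaking=lambda: loop.time() - start > 0.3,
    )
    print(result)                                      # -> "interrupted"

asyncio.run(demo())
```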
Concurrency design patterns for real-time systems
Unlike typical web applications where it’s easy to isolate requests, voice agents must coordinate multiple long-running processes. Capturing audio, transcribing speech, processing context with an LLM, and synthesizing a response all need to happen concurrently, often with overlapping timelines.
Without careful management, you risk blocking calls, race conditions, or resource contention, which degrade latency and reliability.
Below are some of the most effective design models used in modern voice agent systems.
Async task queues
This is the most common concurrency pattern in distributed voice AI systems. Each major process—STT, NLU, RAG, TTS—is an independent task, handled by workers that scale horizontally.
Benefits:
- Natural decoupling of pipeline stages
- Built-in retry, monitoring, and rate limiting (via tools like Celery, RabbitMQ, or Kafka)
- Easier to scale stages independently
Limitations:
Queues introduce latency if not tuned carefully. To mitigate this, use bounded buffers and prioritize non-blocking operations within tasks. When full, bounded buffers can drop the oldest items, reject new inputs, or trigger circuit breakers to prevent cascading failures.
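For example, a bounded, drop-oldest buffer between stages keeps the pipeline moving when a downstream worker falls behind. This sketch uses asyncio.Queue for simplicity; Celery, RabbitMQ, and Kafka expose their own size, TTL, and dead-letter settings for the same purpose:

```python
import asyncio

async def put_drop_oldest(queue: asyncio.Queue, item) -> None:
    """Enqueue an item, discarding the oldest entry instead of blocking when full."""
    while True:
        try:
            queue.put_nowait(item)
            return
        except asyncio.QueueFull:
            try:
                queue.get_nowait()   # drop the oldest item to make room
            except asyncio.QueueEmpty:
                pass                 # another consumer drained it; retry the put

async def main():
    buffer: asyncio.Queue = asyncio.Queue(maxsize=3)   # bounded buffer between stages
    for i in range(5):
        await put_drop_oldest(buffer, f"partial-{i}")
    while not buffer.empty():
        print(buffer.get_nowait())   # prints partial-2, partial-3, partial-4

asyncio.run(main())
```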
Thread pools and green threads
Thread pools or green threading models are common where microsecond-level control is required. These prevent any single component from blocking the entire pipeline.
While most voice AI components are I/O bound, certain operations like audio preprocessing, confidence calculation, or response filtering can consume significant CPU resources. Dedicated computational thread pools prevent these tasks from interfering with latency-sensitive operations like STT streaming or TTS synthesis.
Benefits:
- Low overhead concurrency
- Fine-grained control over task prioritization
- Useful when all components reside in a single process or microservice
Limitations:
Race conditions can creep in. Protect shared memory with locks or actor-based isolation.
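Here is a sketch of offloading CPU-bound work to a dedicated pool so it never blocks the latency-sensitive event loop. The compute_features function is a stand-in for audio preprocessing or confidence scoring; note that for pure-Python CPU work you may need a process pool because of the GIL, but the pattern is the same:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Dedicated pool for CPU-heavy work, sized separately from the I/O event loop.
cpu_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="cpu")

def compute_features(audio_frame: bytes) -> int:
    # Stand-in for audio preprocessing / confidence scoring (CPU-bound).
    return sum(audio_frame) % 256

async def handle_frame(audio_frame: bytes) -> int:
    loop = asyncio.get_running_loop()
    # The event loop stays free to service STT/TTS streams while this runs.
    return await loop.run_in_executor(cpu_pool, compute_features, audio_frame)

async def main():
    results = await asyncio.gather(*(handle_frame(bytes([i] * 1024)) for i in range(4)))
    print(results)

asyncio.run(main())
```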
Actor models
The actor model treats each processing unit as an independent actor with its own state and message queue. A voice agent might use one actor for the STT stream, one for LLM orchestration, and one for TTS.
This isolation prevents cross-talk between concurrent conversations and simplifies error handling—if one session fails, others continue unaffected.
Benefits:
- Isolation of state per actor avoids locking issues
- Scalable across threads or distributed nodes
- Good for persistent or stateful agents (session memory, multi-turn reasoning)
- Makes each component easier to test and debug in isolation
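A minimal sketch of the actor pattern: each actor owns its state and a private mailbox, and communicates only via messages. This uses plain asyncio queues for illustration; frameworks such as Akka, Orleans, or Ray offer production-grade versions of the same idea:

```python
import asyncio

class Actor:
    """Each actor owns its state and processes its mailbox one message at a time."""

    def __init__(self, name: str):
        self.name = name
        self.mailbox: asyncio.Queue = asyncio.Queue()
        self.state: dict = {}   # private state: no locks needed

    async def run(self):
        while (message := await self.mailbox.get()) is not None:
            await self.handle(message)

    async def handle(self, message):
        # Example behavior: count turns for this session and echo the message.
        self.state["turns"] = self.state.get("turns", 0) + 1
        print(f"[{self.name}] turn {self.state['turns']}: {message}")

async def main():
    stt_actor = Actor("stt")
    llm_actor = Actor("llm")
    tasks = [asyncio.create_task(a.run()) for a in (stt_actor, llm_actor)]
    await stt_actor.mailbox.put("partial: I need to check my account")
    await llm_actor.mailbox.put("prompt: check account balance intent")
    for a in (stt_actor, llm_actor):
        await a.mailbox.put(None)   # poison pill to stop the actor
    await asyncio.gather(*tasks)

asyncio.run(main())
```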
Key architectural principles
Regardless of the concurrency model you use, successful real-time systems follow shared principles:
- Non-blocking I/O: Avoid synchronous HTTP calls in the critical path
- Timeouts and fallbacks: Design for failure by bounding execution time
- Shared state management: Use immutable messages or isolated actors to prevent cross-talk
- Observability: Instrument pipelines with fine-grained metrics (latency per stage, queue lengths, dropped messages)
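For instance, the timeouts-and-fallbacks principle can be as simple as bounding every external call and returning a safe default. The 800ms budget and the stand-in call_llm function below are assumptions for the sketch:

```python
import asyncio

FALLBACK_REPLY = "Sorry, could you say that again?"

async def call_llm(prompt: str) -> str:
    await asyncio.sleep(2.0)   # stand-in for a slow LLM API call
    return f"full answer to: {prompt}"

async def bounded_llm_call(prompt: str, budget_s: float = 0.8) -> str:
    try:
        # Bound execution time so one slow call can't stall the whole turn.
        return await asyncio.wait_for(call_llm(prompt), timeout=budget_s)
    except asyncio.TimeoutError:
        return FALLBACK_REPLY   # degrade gracefully instead of going silent

print(asyncio.run(bounded_llm_call("check my balance")))
```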
Common concurrency pitfalls
Despite the virtues of concurrent pipelines, real-time voice AI systems are susceptible to issues that can destroy the user experience:
- Cascading latency from blocking calls: A single blocking operation can ripple through the entire pipeline, causing noticeable conversation delays. Preventive strategies include aggressive timeouts, circuit breakers, and fallback mechanisms.
- Race conditions in conversation state: Multiple pipeline components often need to update shared conversation context simultaneously. Uncoordinated updates can lead to inconsistent state and incorrect responses. Solutions include atomic state updates, optimistic locking, or event-sourced architectures that treat state changes as immutable events.
- Timeout and retry feedback loops: Overly aggressive retry logic can amplify system load during high-traffic periods. Production systems typically implement exponential backoff with jitter, circuit breakers that prevent retries to failing services, and load shedding that gracefully degrades functionality rather than failing completely.
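A sketch of exponential backoff with jitter, which spreads retries out instead of letting them synchronize into load spikes; the base delay, cap, and attempt count are illustrative:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_s=0.1, cap_s=2.0):
    """Retry a flaky operation with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                   # give up: let load shedding take over
            delay = min(cap_s, base_s * (2 ** attempt))
            time.sleep(random.uniform(0, delay))        # "full jitter" avoids retry storms

# Usage: retry_with_backoff(lambda: unreliable_api_call())  # unreliable_api_call is hypothetical
```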
Lessons from real-world voice agent deployment
Even well-architected systems behave unpredictably once they hit production. From observing live deployments of voice AI products, here are some of the most common concurrency-related issues teams encounter:
Audio race conditions
Modern voice AI systems require several hundred milliseconds to fully initialize, but users often begin speaking immediately upon connection. The result is dropped inputs, incomplete transcripts, or system stalls, which hurt the user experience by forcing customers to repeat themselves or wait for unclear signals.
How to fix it: Implement a handshake mechanism between the client and server to confirm readiness before sending audio. The trick is to ensure buffer sizes accommodate realistic initialization times (typically 200-500ms), while preventing memory exhaustion if initialization fails completely.
Other responses to race conditions include:
- A finite state machine (FSM): Track where the agent is in its lifecycle (listening → processing → speaking → ready). This helps prevent overlapping actions and blocks events from occurring out of order (see the sketch after this list).
- Snapshot inputs: For example, when the agent is triggered to generate a response, capture the current transcript buffer as a frozen string, and use that as the prompt input. Later updates to the transcript won’t affect that LLM call.
- Serialize queues per session: Even if upstream events arrive concurrently, they’ll be handled in a controlled order. This prevents TTS generation while a new STT result is still arriving, or overlapping LLM calls for the same prompt.
- Guardrails and debouncing: Some operations should only trigger once per conversation turn. To enforce this, use guard conditions or debouncing logic to suppress rapid-fire events.
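Here is a sketch of the lifecycle FSM and input snapshotting described above. The states mirror the list, and the transcript buffer is a plain string for illustration:

```python
from enum import Enum, auto

class AgentState(Enum):
    LISTENING = auto()
    PROCESSING = auto()
    SPEAKING = auto()
    READY = auto()

class Session:
    def __init__(self):
        self.state = AgentState.READY
        self.transcript_buffer = ""

    def on_partial(self, text: str) -> None:
        if self.state in (AgentState.READY, AgentState.LISTENING):
            self.state = AgentState.LISTENING
            self.transcript_buffer = text      # keep refining while listening

    def start_response(self) -> str | None:
        # Guard: only one response per turn; ignore triggers in the wrong state.
        if self.state is not AgentState.LISTENING:
            return None
        self.state = AgentState.PROCESSING
        prompt = self.transcript_buffer        # snapshot: later STT updates won't change this LLM call
        return prompt

session = Session()
session.on_partial("I need to check my account")
print(session.start_response())   # -> "I need to check my account"
print(session.start_response())   # -> None (duplicate trigger suppressed)
```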
STT flooding downstream models with partials
Streaming STT systems can produce partial updates at extremely high rates—sometimes 20-50 updates per second during active speech. This makes agents highly responsive but can overwhelm downstream LLM systems.
And most STT partials are ultimately discarded. So systems that process every partial waste resources while falling further behind on meaningful processing.
How to fix it: Add debounce thresholds (e.g., only forward partials that haven’t changed for 100–200ms) or use stability scoring to decide when to trigger downstream processing.
Common patterns include:
- Confidence-based filtering: Only process partials above stability thresholds (typically 0.8+)
- Temporal debouncing: Limit processing to one partial per 200-300ms window
- Semantic change detection: Process partials only when content changes meaningfully, not just confidence scores
- Queue depth monitoring: Skip partial processing when downstream queues exceed healthy depths
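As a sketch, a gate in front of the LLM can combine temporal debouncing with queue-depth monitoring; the window, depth limit, and PartialGate class are assumptions for the example:

```python
import time

class PartialGate:
    """Throttle partial transcripts before they reach the LLM."""

    def __init__(self, window_ms=250, max_queue_depth=4):
        self.window_ms = window_ms
        self.max_queue_depth = max_queue_depth
        self.last_forwarded = 0.0
        self.last_text = ""

    def allow(self, text: str, queue_depth: int) -> bool:
        now = time.monotonic()
        if queue_depth >= self.max_queue_depth:
            return False                        # downstream is backed up: skip this partial
        if (now - self.last_forwarded) * 1000 < self.window_ms:
            return False                        # temporal debounce: one partial per window
        if text == self.last_text:
            return False                        # nothing new to process
        self.last_forwarded, self.last_text = now, text
        return True

gate = PartialGate()
print(gate.allow("I need to", queue_depth=0))          # True
print(gate.allow("I need to check", queue_depth=0))    # False (still inside the 250ms window)
```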
Backpressure during API spikes
When many users interact simultaneously, queues can back up. This causes delayed responses or dropped outputs, especially if your infrastructure lacks flow control.
LLMs are often the root of the issue here. While STT and TTS scale horizontally, LLM inference requires expensive GPU resources. When inference queues grow beyond healthy limits, response times climb sharply, causing upstream timeouts and visible delays.
How to fix it: Implement multiple layers of protection against traffic spikes. These include:
- Circuit breakers that fail fast when upstream services are overloaded
- Response complexity limiting that caps LLM output length during high-load periods
- Priority queues that ensure critical user interactions (existing conversations) take precedence over new sessions
- Graceful degradation that switches to faster, lower-quality models when primary systems are overloaded
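A minimal circuit-breaker sketch: after a run of failures, the breaker opens and calls fail fast for a cooldown period, protecting an overloaded LLM backend. The thresholds are illustrative, and the llm_client in the usage comment is hypothetical:

```python
import time

class CircuitBreaker:
    """Fail fast when a downstream service keeps erroring, then probe again after a cooldown."""

    def __init__(self, failure_threshold=5, cooldown_s=10.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: failing fast")   # don't pile onto an overloaded service
            self.opened_at = None                                  # cooldown over: half-open, try one call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()                  # trip the breaker
            raise
        self.failures = 0                                          # success resets the count
        return result

# Usage: breaker.call(lambda: llm_client.generate(prompt))  # llm_client is hypothetical
```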
Concurrency is the new standard for voice agents
In modern voice AI systems, performance bottlenecks are shifting. If your pipeline doesn’t manage concurrency well, even the most advanced models will feel sluggish, erratic, or robotic.
Teams that treat concurrency as a first-class design challenge—and not an afterthought—are building faster, smarter, more responsive voice agents.
Here are a few final suggestions:
- Start small: Test with simple pipelines before layering complexity.
- Observe everything: Instrument every step in the real-time loop.
- Test like it’s real: Simulate real audio, real network conditions, and real user behavior under load.
Low latency is at the core of believable, natural voice AI. And with well-orchestrated processes running in sync, true conversational voice agents are here to stay.
Finally, all of this depends on great speech-to-text tools. To learn how Gladia helps you balance accurate transcriptions with ultra-fast delivery, talk to us today. We help thousands of companies build better voice agents and call center tools. Get started for free or book a demo to see it in action.