Why the hardest part of voice AI isn't the conversation but the orchestration, and how modern platforms like Retell AI architect their stacks for sub-second latency
You’ve shipped chatbots. You’ve optimized recommendation engines. Now your roadmap says “AI Voice Agent,” and suddenly you’re drowning in jargon: ASR, TTS, VAD, WebRTC, barge-in handling.
The brutal truth? Building a voice agent that feels natural requires orchestrating four independent real-time systems in under 500 milliseconds. Get it wrong, and you’ve built an expensive phone tree that pauses awkwardly and interrupts users mid-sentence. Get it right, and you’ve created an interface that converts 3x better than chat.
This is your infrastructure map. We’ll use the architecture patterns from modern conversation platforms (think Retell AI, Vapi, or LiveKit) as our blueprint—not to sell you their product, but to understand how production-grade voice stacks actually work under the hood.
The Four Pillars: What You’re Actually Building
Every voice agent is a pipeline of three components held together by a fourth: the orchestration layer. Think of it like a jazz trio. Each instrument (ASR, LLM, TTS) can be world-class, but without a conductor keeping time, you get cacophony, not music.
1. The Ears: Speech-to-Text (ASR)
What it does: Converts audio streams into text in real-time.
The PM catch: Latency vs. accuracy trade-offs. Streaming ASR (like Deepgram Nova-3 or Whisper streaming) can deliver transcripts in 150–300ms, but accuracy drops in noisy environments or with heavy accents.
Infrastructure insight: Modern platforms don’t wait for the caller to finish speaking. They use “endpointing” algorithms to detect natural pauses and stream partial transcripts to the LLM before the sentence is complete. This parallelization is what separates robotic agents from responsive ones.
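To make that concrete, here's a minimal sketch of the consumer side in Python. TranscriptEvent, forward_to_llm, and the event queue are hypothetical stand-ins for whatever streaming ASR vendor you pick; the point is that finalized fragments reach the LLM the moment an endpoint fires, not when the turn ends.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class TranscriptEvent:
    text: str        # partial or finalized transcript text
    is_final: bool   # True when the endpointing algorithm detects a pause

async def forward_to_llm(utterance: str) -> None:
    # Stand-in for kicking off the LLM request; in production this fires
    # while the caller may still be mid-thought.
    print(f"LLM <- {utterance!r}")

async def consume_asr(events: "asyncio.Queue[TranscriptEvent]") -> None:
    # Stream partials; hand each finalized utterance to the LLM immediately
    # instead of waiting for the turn (or the call) to end.
    pending: list[str] = []
    while True:
        event = await events.get()
        if event.is_final:
            pending.append(event.text)
            await forward_to_llm(" ".join(pending))
            pending.clear()
        # Partial events can still drive live captions or barge-in logic.
```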
2. The Brain: LLM with Conversation Memory
What it does: Interprets intent, maintains context across turns, and generates responses.
The PM catch: Standard LLMs are stateless. Your voice agent needs short-term memory (what did we just say?) and long-term memory (this caller’s order history). Platforms typically implement this via a conversation state manager that maintains a sliding window of recent turns while querying vector databases for historical context.
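Here's a minimal sketch of that state manager, assuming a simple sliding window; the vector-store lookup is deliberately stubbed, since every platform wires retrieval differently.

```python
from collections import deque

class ConversationState:
    # Sliding window of recent turns plus a hook for long-term recall.
    # The vector-store lookup is stubbed; wire your own retrieval
    # (order history, CRM notes) behind recall().

    def __init__(self, max_turns: int = 12):
        self.turns: deque[tuple[str, str]] = deque(maxlen=max_turns)

    def add_turn(self, role: str, text: str) -> None:
        self.turns.append((role, text))

    def recall(self, query: str) -> list[str]:
        return []  # placeholder for a vector-database query keyed on the caller

    def build_prompt(self, user_utterance: str) -> str:
        memories = "\n".join(self.recall(user_utterance))
        history = "\n".join(f"{role}: {text}" for role, text in self.turns)
        return f"{memories}\n{history}\nuser: {user_utterance}".lstrip()
```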
Infrastructure insight: Latency here is your biggest variable. A GPT-4o call might take 700ms, while Gemini Flash hits 300ms. Production systems often implement “model routing”—using faster models for simple intents and slower, smarter models for complex reasoning.
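A router can be as simple as a lookup table keyed on intent. The model IDs and intent names below are placeholders, not any vendor's actual identifiers; in practice you'd add a confidence check and a default route.

```python
FAST_MODEL = "fast-model-id"    # placeholder IDs; substitute your vendors'
SMART_MODEL = "smart-model-id"

SIMPLE_INTENTS = {"hours", "location", "order_status", "balance"}

def route_model(intent: str) -> str:
    # Route simple, high-frequency intents to the low-latency model and
    # everything else to the slower, more capable one.
    return FAST_MODEL if intent in SIMPLE_INTENTS else SMART_MODEL
```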
3. The Voice: Text-to-Speech (TTS)
What it does: Converts LLM output into natural-sounding speech.
The PM catch: The “time-to-first-audio” metric is everything. Traditional TTS waits for the full text, then synthesizes. Modern streaming TTS (ElevenLabs Turbo, Cartesia) starts speaking as soon as the first sentence fragment arrives from the LLM, cutting latency by 60%.
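The core trick is flushing the LLM's token stream to TTS at sentence boundaries. A sketch, assuming token_stream is an async iterator of LLM tokens and synthesize is your streaming TTS call (both stand-ins):

```python
import re

SENTENCE_END = re.compile(r"[.!?]\s*$")

async def stream_llm_to_tts(token_stream, synthesize) -> None:
    # Audio starts at the first sentence boundary, not after the
    # full completion has been generated.
    buffer = ""
    async for token in token_stream:
        buffer += token
        if SENTENCE_END.search(buffer):
            await synthesize(buffer.strip())
            buffer = ""
    if buffer.strip():
        await synthesize(buffer.strip())  # flush any trailing fragment
```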
Infrastructure insight: Voice selection impacts perceived latency. A “fast” voice with lower computational requirements can shave 100ms off your pipeline, which matters more than raw model speed when you’re racing against the 500ms “natural conversation” threshold.
4. The Conductor: Orchestration Layer
This is the invisible infrastructure that separates proof-of-concepts from production. The orchestration layer manages:
WebRTC audio streaming (sub-100ms transport)
Turn-taking logic (who speaks when)
Barge-in handling (detecting when a user interrupts)
Voice Activity Detection, or VAD (knowing when someone starts and stops talking)
State management across the distributed pipeline
Without this layer, your components are just APIs talking past each other. With it, you get an agent that can handle the messy reality of human conversation: interruptions, hesitations, crosstalk, and emotional pacing.
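To see how these responsibilities interact, here's a toy turn-taking state machine. Real orchestrators juggle far more (jitter buffers, partial transcripts, timeouts), but the barge-in branch below is the heart of it.

```python
from enum import Enum, auto

class Turn(Enum):
    IDLE = auto()
    USER_SPEAKING = auto()
    AGENT_SPEAKING = auto()

class Orchestrator:
    # Toy turn-taking core: VAD events drive the state machine, and user
    # speech during agent playback is treated as a barge-in.

    def __init__(self) -> None:
        self.state = Turn.IDLE

    def on_vad(self, user_is_talking: bool) -> None:
        if user_is_talking:
            if self.state is Turn.AGENT_SPEAKING:
                self.cancel_playback()       # barge-in: stop talking first
            self.state = Turn.USER_SPEAKING
        elif self.state is Turn.USER_SPEAKING:
            self.state = Turn.IDLE           # endpoint: agent may now respond

    def cancel_playback(self) -> None:
        print("barge-in: flushing the TTS queue")  # stand-in for real cleanup
```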
Architecture Patterns: Choose Your Latency Budget
As a PM, you’re not wiring these components together—you’re choosing an architecture pattern that determines your user experience and technical debt. Here are your three options:
Pattern A: The Cascading Pipeline (Legacy)
Flow: Audio → STT (complete) → LLM (complete) → TTS (complete) → Audio
Latency: 800–2,000ms
Best for: Internal tools, async voice notes, prototypes
This is the “obvious” approach. Each step finishes completely before the next starts. It’s stable and debuggable, but users will feel the pause. You’ve experienced this with early Alexa skills—that “thinking” delay is the cascading pipeline.
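A sketch of Pattern A with the stages stubbed out as sleeps (the latencies are illustrative) makes the problem obvious: the delays add, and the caller hears silence the whole time.

```python
import time

def stt(audio: bytes) -> str:
    time.sleep(0.3)   # stand-in: simulate a full-utterance ASR call
    return "what are your hours?"

def llm(text: str) -> str:
    time.sleep(0.7)   # stand-in: simulate a blocking completion
    return "We're open nine to five, Monday through Friday."

def tts(text: str) -> bytes:
    time.sleep(0.3)   # stand-in: simulate full-text synthesis
    return b"<pcm audio>"

def cascading_turn(audio: bytes) -> bytes:
    # Each stage blocks until the previous one completes, so the caller
    # hears silence for the sum of all three latencies (~1.3s here).
    return tts(llm(stt(audio)))
```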
Pattern B: The Streaming Architecture (Production Standard)
Flow: Audio chunks → Streaming STT → Streaming LLM → Streaming TTS → Audio
Latency: 300–600ms
Best for: Customer-facing agents, sales calls, support
This is how modern platforms like Retell AI or Vapi achieve conversational fluidity. The magic happens through interleaved processing: the TTS starts speaking before the LLM has finished generating the full response, and the LLM starts reasoning before the STT has delivered the final transcript.
The PM implication: You need to budget for “token streaming” costs. Unlike chat, where every token you pay for is actually delivered, a streaming voice agent keeps generating while audio plays, so your COGS (Cost of Goods Sold) calculations must account for partial generations cut short when a user barges in.
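A back-of-the-envelope helper makes this concrete; the per-token price is a made-up placeholder, not any vendor's rate.

```python
def turn_cost(tokens_generated: int, tokens_spoken: int,
              usd_per_1k_tokens: float = 0.002) -> dict[str, float]:
    # Illustrative COGS math: you pay for every token the LLM generated,
    # including those discarded when the user barged in.
    wasted = max(tokens_generated - tokens_spoken, 0)
    return {
        "billed_usd": tokens_generated / 1000 * usd_per_1k_tokens,
        "wasted_fraction": wasted / tokens_generated if tokens_generated else 0.0,
    }
```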
Pattern C: End-to-End Speech-to-Speech (Emerging)
Flow: Audio → Multimodal Model → Audio
Latency: 200–400ms
Best for: High-empathy scenarios, emotional intelligence requirements
Newer models (GPT-4o Realtime, Hume AI) process audio directly without text intermediaries. They capture tone, emotion, and prosody that text-based pipelines lose. The trade-off? Less control over the “thought process” (no transcript to audit) and vendor lock-in to specific model providers.
The PM’s Build Workflow: From Demo to Production
Here’s how to phase your infrastructure build, using the Retell AI approach as our architectural reference (self-hosted models, streaming orchestration, HIPAA-grade compliance) without getting locked into their specific implementation:
Phase 1: The MVP (Weeks 1–3)
Goal: Prove the conversation flow works.
Infrastructure: Use an all-in-one API (Bland, Vapi, or Retell) to avoid orchestration complexity.
Key Metrics: Task completion rate, not latency. If users can’t finish a booking or support request, sub-second response times don’t matter.
PM Focus: Prompt engineering for voice. Voice agents need shorter, punchier prompts than chatbots. Users can’t reread audio; they need immediate clarity.
Phase 2: The Latency Optimization (Weeks 4–8)
Goal: Get under 500ms end-to-end.
Infrastructure moves:
Swap polling APIs for WebRTC streaming
Implement model routing (fast model for FAQs, slow model for disputes)
Add semantic caching for common responses (“What are your hours?” shouldn’t hit the LLM every time; see the sketch after this list)
Deploy edge compute close to your users (voice is unforgiving of network latency)
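For the semantic cache mentioned above, here's a minimal sketch using brute-force cosine similarity; embed is a stand-in for your embedding model, and the 0.92 threshold is illustrative.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    # Returns a canned answer when a query embeds close enough to one
    # we've already answered, skipping the LLM entirely.

    def __init__(self, embed, threshold: float = 0.92):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query: str) -> "str | None":
        query_vec = self.embed(query)
        for vec, answer in self.entries:
            if cosine(query_vec, vec) >= self.threshold:
                return answer              # cache hit
        return None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))
```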
Key Decision: Build vs. buy the orchestration layer. If you’re a healthcare or finance company with strict data residency requirements, you’ll likely need to self-host the orchestration (similar to Retell AI’s self-hosted model approach for compliance). If you’re a startup validating product-market fit, use a managed orchestrator.
Phase 3: The Resilience Layer (Weeks 9–12)
Goal: Handle the 1% of calls that break everything.
Infrastructure additions:
Fallback cascades (sketched after this list): If Deepgram ASR fails, fail over to Whisper. If GPT-4o times out, fall back to Claude Haiku.
Human-in-the-loop (HITL) triggers: Automatic escalation when sentiment drops, silence exceeds 5 seconds, or confidence scores plummet.
Observability: You need conversation tracing (Langfuse, LangSmith) to debug why the agent interrupted a grieving widow or failed to understand a heavy accent. Text logs aren’t enough—you need audio replay synchronized with model decisions.
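The fallback cascade can be a one-function wrapper. Here, primary and backup are placeholders for your actual vendor clients, and the timeout is illustrative.

```python
import asyncio

async def with_fallback(primary, backup, payload, timeout_s: float = 1.5):
    # primary/backup stand in for, say, a Deepgram client failing over
    # to Whisper, or GPT-4o falling back to Claude Haiku.
    try:
        return await asyncio.wait_for(primary(payload), timeout=timeout_s)
    except Exception:                    # timeout, rate limit, outage...
        return await backup(payload)     # a degraded answer beats dead air
```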
Critical PM Decisions That Make or Break You
1. The Interruption Taxonomy
Not all interruptions are equal. Your orchestration layer must distinguish between:
Barge-in: User wants to change direction (“Actually, check tomorrow instead”)
Affirmation: User is agreeing (“Uh-huh, yes, sure”)
Backchanneling: User is listening (“Mmhmm, okay”)
Misclassifying a backchannel as a barge-in creates a chaotic experience where the agent constantly stops talking. This is why platforms like Retell AI invest heavily in “smart turn detection” algorithms that analyze raw audio waveforms, not just transcript text.
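A first-pass classifier can work from transcript text and utterance duration alone; as noted above, production systems go further and analyze the waveform itself. The word lists and thresholds here are illustrative.

```python
BACKCHANNELS = {"mmhmm", "uh-huh", "okay", "right", "yeah"}
AFFIRMATIONS = {"yes", "yep", "sure", "correct", "exactly"}

def classify_interruption(transcript: str, duration_s: float) -> str:
    # Crude text-plus-duration heuristic with illustrative thresholds.
    words = [w.strip(".,!?") for w in transcript.lower().split()]
    if not words:
        return "backchannel"
    if duration_s < 1.0 and all(w in BACKCHANNELS | AFFIRMATIONS for w in words):
        return "affirmation" if set(words) & AFFIRMATIONS else "backchannel"
    return "barge_in"  # longer or lexically novel speech stops the agent
```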
2. The Latency Budget Allocation
With a 500ms total budget, where do you spend it?
ASR: 150–200ms (streaming)
LLM: 200–300ms (with streaming response)
TTS: 100–150ms (time-to-first-audio)
Network/Overhead: 50ms
If your ASR takes 400ms, your LLM has 100ms left. Good luck. PMs must treat latency as a zero-sum resource and prioritize based on use case. A debt collection agent needs speed more than empathy; a therapy bot needs the opposite.
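If you instrument each stage, a trivial budget check keeps latency regressions visible in dashboards or CI. The allocations below use the tight end of the ranges above, which sum to exactly 500ms; the stage names are arbitrary.

```python
BUDGET_MS = {"asr": 150, "llm": 200, "tts": 100, "network": 50}

def over_budget(measured_ms: dict[str, float]) -> list[str]:
    # Flag any stage that blows its allocation, plus the total.
    issues = [f"{stage}: {measured_ms.get(stage, 0):.0f}ms > {cap}ms"
              for stage, cap in BUDGET_MS.items()
              if measured_ms.get(stage, 0) > cap]
    total = sum(measured_ms.values())
    if total > sum(BUDGET_MS.values()):
        issues.append(f"total: {total:.0f}ms > 500ms")
    return issues
```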
3. The Modalities Question
Do you need text backup? Pure voice agents are elegant until the user is in a noisy airport or has a speech impediment. Production systems typically implement multimodal fallbacks—if voice confidence drops, silently switch to SMS or offer a chat interface without losing conversation context.
The Infrastructure Gotchas
Telephony is not an afterthought. WebRTC works for in-app experiences, but if you’re handling phone calls, you need PSTN (traditional phone network) integration via Twilio or Telnyx. That adds 100–200ms of encoding/decoding latency that web-native agents avoid.
Echo cancellation is your nemesis. If the agent hears itself through the user’s microphone, it will try to respond to itself, creating an infinite loop. Production stacks require acoustic echo cancellation (AEC) hardware or software (like Krisp) before the ASR stage.
Compliance is architectural. If you’re in healthcare, your orchestration layer must support HIPAA business associate agreements—not just the LLM provider, but the STT and TTS providers too. This often forces self-hosting or dedicated single-tenant infrastructure, significantly impacting your cost model.
The Bottom Line
Building a voice agent in 2026 is less about AI model selection and more about real-time systems engineering. The winners aren’t those with the smartest LLM prompts; they’re those with the most reliable orchestration layer managing the messy intersection of human speech and machine speed.
As you scope your roadmap, remember: users don’t compare your voice agent to other AI tools. They compare it to talking to a human. That 500ms latency budget? That’s your competition against a bored customer service rep. Make it count.