Why the hardest part of voice AI isn't the conversation but the orchestration, and how modern platforms like Retell AI architect their stacks for sub-second latency
You’ve shipped chatbots. You’ve optimized recommendation engines. Now your roadmap says “AI Voice Agent,” and suddenly you’re drowning in jargon: ASR, TTS, VAD, WebRTC, barge-in handling.
The brutal truth? Building a voice agent that feels natural requires orchestrating four independent real-time systems in under 500 milliseconds. Get it wrong, and you’ve built an expensive phone tree that pauses awkwardly and interrupts users mid-sentence. Get it right, and you’ve created an interface that converts 3x better than chat.
This is your infrastructure map. We’ll use the architecture patterns from modern conversation platforms (think Retell AI, Vapi, or LiveKit) as our blueprint—not to sell you their product, but to understand how production-grade voice stacks actually work under the hood.
The Four Pillars: What You’re Actually Building
Every voice agent is a pipeline of three components held together by a fourth: the orchestration layer. Think of it like a jazz trio. Each instrument (ASR, LLM, TTS) can be world-class, but without a conductor keeping time, you get cacophony, not music.
1. The Ears: Speech-to-Text (ASR)
What it does: Converts audio streams into text in real-time.
The PM catch: Latency vs. accuracy trade-offs. Streaming ASR (like Deepgram Nova-3 or Whisper streaming) can deliver transcripts in 150–300ms, but accuracy drops in noisy environments or with heavy accents.
Infrastructure insight: Modern platforms don’t wait for the caller to finish speaking. They use “endpointing” algorithms to detect natural pauses and stream partial transcripts to the LLM before the sentence is complete. This parallelization is what separates robotic agents from responsive ones.
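To make that concrete, here's a minimal sketch of the consumer side in Python. TranscriptEvent, forward_to_llm, and the event queue are hypothetical stand-ins for whatever streaming ASR vendor you pick; the point is that finalized fragments reach the LLM the moment an endpoint fires, not when the turn ends.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class TranscriptEvent:
    text: str        # partial or finalized transcript text
    is_final: bool   # True when the endpointing algorithm detects a pause

async def forward_to_llm(utterance: str) -> None:
    # Stand-in for kicking off the LLM request; in production this fires
    # while the caller may still be mid-thought.
    print(f"LLM <- {utterance!r}")

async def consume_asr(events: "asyncio.Queue[TranscriptEvent]") -> None:
    # Stream partials; hand each finalized utterance to the LLM immediately
    # instead of waiting for the turn (or the call) to end.
    pending: list[str] = []
    while True:
        event = await events.get()
        if event.is_final:
            pending.append(event.text)
            await forward_to_llm(" ".join(pending))
            pending.clear()
        # Partial events can still drive live captions or barge-in logic.
```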
2. The Brain: LLM with Conversation Memory
What it does: Interprets intent, maintains context across turns, and generates responses.
The PM catch: Standard LLMs are stateless. Your voice agent needs short-term memory (what did we just say?) and long-term memory (this caller’s order history). Platforms typically implement this via a conversation state manager that maintains a sliding window of recent turns while querying vector databases for historical context.
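Here's a minimal sketch of that state manager, assuming a simple sliding window; the vector-store lookup is deliberately stubbed, since every platform wires retrieval differently.

```python
from collections import deque

class ConversationState:
    # Sliding window of recent turns plus a hook for long-term recall.
    # The vector-store lookup is stubbed; wire your own retrieval
    # (order history, CRM notes) behind recall().

    def __init__(self, max_turns: int = 12):
        self.turns: deque[tuple[str, str]] = deque(maxlen=max_turns)

    def add_turn(self, role: str, text: str) -> None:
        self.turns.append((role, text))

    def recall(self, query: str) -> list[str]:
        return []  # placeholder for a vector-database query keyed on the caller

    def build_prompt(self, user_utterance: str) -> str:
        memories = "\n".join(self.recall(user_utterance))
        history = "\n".join(f"{role}: {text}" for role, text in self.turns)
        return f"{memories}\n{history}\nuser: {user_utterance}".lstrip()
```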
Infrastructure insight: Latency here is your biggest variable. A GPT-4o call might take 700ms, while Gemini Flash hits 300ms. Production systems often implement “model routing”—using faster models for simple intents and slower, smarter models for complex reasoning.
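A router can be as simple as a lookup table keyed on intent. The model IDs and intent names below are placeholders, not any vendor's actual identifiers; in practice you'd add a confidence check and a default route.

```python
FAST_MODEL = "fast-model-id"    # placeholder IDs; substitute your vendors'
SMART_MODEL = "smart-model-id"

SIMPLE_INTENTS = {"hours", "location", "order_status", "balance"}

def route_model(intent: str) -> str:
    # Route simple, high-frequency intents to the low-latency model and
    # everything else to the slower, more capable one.
    return FAST_MODEL if intent in SIMPLE_INTENTS else SMART_MODEL
```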
3. The Voice: Text-to-Speech (TTS)
What it does: Converts LLM output into natural-sounding speech.
The PM catch: The “time-to-first-audio” metric is everything. Traditional TTS waits for the full text, then synthesizes. Modern streaming TTS (ElevenLabs Turbo, Cartesia) starts speaking as soon as the first sentence fragment arrives from the LLM, cutting latency by 60%.
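The core trick is flushing the LLM's token stream to TTS at sentence boundaries. A sketch, assuming token_stream is an async iterator of LLM tokens and synthesize is your streaming TTS call (both stand-ins):

```python
import re

SENTENCE_END = re.compile(r"[.!?]\s*$")

async def stream_llm_to_tts(token_stream, synthesize) -> None:
    # Audio starts at the first sentence boundary, not after the
    # full completion has been generated.
    buffer = ""
    async for token in token_stream:
        buffer += token
        if SENTENCE_END.search(buffer):
            await synthesize(buffer.strip())
            buffer = ""
    if buffer.strip():
        await synthesize(buffer.strip())  # flush any trailing fragment
```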
Infrastructure insight: Voice selection impacts perceived latency. A “fast” voice with lower computational requirements can shave 100ms off your pipeline, which matters more than raw model speed when you’re racing against the 500ms “natural conversation” threshold.
4. The Conductor: Orchestration Layer
This is the invisible infrastructure that separates proof-of-concepts from production. The orchestration layer manages:
WebRTC audio streaming (sub-100ms transport)
Turn-taking logic (who speaks when)
Barge-in handling (detecting when a user interrupts)
Voice Activity Detection, or VAD (knowing when someone starts and stops talking)
State management across the distributed pipeline
Without this layer, your components are just APIs talking past each other. With it, you get an agent that can handle the messy reality of human conversation: interruptions, hesitations, crosstalk, and emotional pacing.
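To see how these responsibilities interact, here's a toy turn-taking state machine. Real orchestrators juggle far more (jitter buffers, partial transcripts, timeouts), but the barge-in branch below is the heart of it.

```python
from enum import Enum, auto

class Turn(Enum):
    IDLE = auto()
    USER_SPEAKING = auto()
    AGENT_SPEAKING = auto()

class Orchestrator:
    # Toy turn-taking core: VAD events drive the state machine, and user
    # speech during agent playback is treated as a barge-in.

    def __init__(self) -> None:
        self.state = Turn.IDLE

    def on_vad(self, user_is_talking: bool) -> None:
        if user_is_talking:
            if self.state is Turn.AGENT_SPEAKING:
                self.cancel_playback()       # barge-in: stop talking first
            self.state = Turn.USER_SPEAKING
        elif self.state is Turn.USER_SPEAKING:
            self.state = Turn.IDLE           # endpoint: agent may now respond

    def cancel_playback(self) -> None:
        print("barge-in: flushing the TTS queue")  # stand-in for real cleanup
```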
Architecture Patterns: Choose Your Latency Budget
As a PM, you’re not wiring these components together—you’re choosing an architecture pattern that determines your user experience and technical debt. Here are your three options:
Pattern A: The Cascading Pipeline (Legacy)
Flow: Audio → STT (complete) → LLM (complete) → TTS (complete) → Audio
Latency: 800–2,000ms
Best for: Internal tools, async voice notes, prototypes
This is the “obvious” approach. Each step finishes completely before the next starts. It’s stable and debuggable, but users will feel the pause. You’ve experienced this with early Alexa skills—that “thinking” delay is the cascading pipeline.
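A sketch of Pattern A with the stages stubbed out as sleeps (the latencies are illustrative) makes the problem obvious: the delays add, and the caller hears silence the whole time.

```python
import time

def stt(audio: bytes) -> str:
    time.sleep(0.3)   # stand-in: simulate a full-utterance ASR call
    return "what are your hours?"

def llm(text: str) -> str:
    time.sleep(0.7)   # stand-in: simulate a blocking completion
    return "We're open nine to five, Monday through Friday."

def tts(text: str) -> bytes:
    time.sleep(0.3)   # stand-in: simulate full-text synthesis
    return b"<pcm audio>"

def cascading_turn(audio: bytes) -> bytes:
    # Each stage blocks until the previous one completes, so the caller
    # hears silence for the sum of all three latencies (~1.3s here).
    return tts(llm(stt(audio)))
```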
Pattern B: The Streaming Architecture (Production Standard)
Flow: Audio chunks → Streaming STT → Streaming LLM → Streaming TTS → Audio
Latency: 300–600ms
Best for: Customer-facing agents, sales calls, support
This is how modern platforms like Retell AI or Vapi achieve conversational fluidity. The magic happens through interleaved processing: the TTS starts speaking before the LLM has finished generating the full response, and the LLM starts reasoning before the STT has delivered the final transcript.
The PM implication: You need to budget for “token streaming” costs. Unlike chat, where every token you pay for is actually delivered, a streaming voice agent keeps generating while audio plays, so your COGS (Cost of Goods Sold) calculations must account for partial generations cut short when a user barges in.
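A back-of-the-envelope helper makes this concrete; the per-token price is a made-up placeholder, not any vendor's rate.

```python
def turn_cost(tokens_generated: int, tokens_spoken: int,
              usd_per_1k_tokens: float = 0.002) -> dict[str, float]:
    # Illustrative COGS math: you pay for every token the LLM generated,
    # including those discarded when the user barged in.
    wasted = max(tokens_generated - tokens_spoken, 0)
    return {
        "billed_usd": tokens_generated / 1000 * usd_per_1k_tokens,
        "wasted_fraction": wasted / tokens_generated if tokens_generated else 0.0,
    }
```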
Pattern C: End-to-End Speech-to-Speech (Emerging)
Flow: Audio → Multimodal Model → Audio
Latency: 200–400ms
Best for: High-empathy scenarios, emotional intelligence requirements
Newer models (GPT-4o Realtime, Hume AI) process audio directly without text intermediaries. They capture tone, emotion, and prosody that text-based pipelines lose. The trade-off? Less control over the “thought process” (no transcript to audit) and vendor lock-in to specific model providers.
The PM’s Build Workflow: From Demo to Production
Here’s how to phase your infrastructure build, using the Retell AI approach as our architectural reference (self-hosted models, streaming orchestration, HIPAA-grade compliance) without getting locked into their specific implementation:
Phase 1: The MVP (Weeks 1–3)
Goal: Prove the conversation flow works.
Infrastructure: Use an all-in-one API (Bland, Vapi, or Retell) to avoid orchestration complexity.
Key Metrics: Task completion rate, not latency. If users can’t finish a booking or support request, sub-second response times don’t matter.
PM Focus: Prompt engineering for voice. Voice agents need shorter, punchier prompts than chatbots. Users can’t reread audio; they need immediate clarity.
Phase 2: The Latency Optimization (Weeks 4–8)
Goal: Get under 500ms end-to-end.
Infrastructure moves:
Swap polling APIs for WebRTC streaming
Implement model routing (fast model for FAQs, slow model for disputes)
Add semantic caching for common responses (“What are your hours?” shouldn’t hit the LLM every time; see the sketch after this list)
Deploy edge compute close to your users (voice is unforgiving of network latency)
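For the semantic cache mentioned above, here's a minimal sketch using brute-force cosine similarity; embed is a stand-in for your embedding model, and the 0.92 threshold is illustrative.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    # Returns a canned answer when a query embeds close enough to one
    # we've already answered, skipping the LLM entirely.

    def __init__(self, embed, threshold: float = 0.92):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query: str) -> "str | None":
        query_vec = self.embed(query)
        for vec, answer in self.entries:
            if cosine(query_vec, vec) >= self.threshold:
                return answer              # cache hit
        return None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))
```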
Key Decision: Build vs. buy the orchestration layer. If you’re a healthcare or finance company with strict data residency requirements, you’ll likely need to self-host the orchestration (similar to Retell AI’s self-hosted model approach for compliance). If you’re a startup validating product-market fit, use a managed orchestrator.
Phase 3: The Resilience Layer (Weeks 9–12)
Goal: Handle the 1% of calls that break everything.
Infrastructure additions:
Fallback cascades (sketched after this list): If Deepgram ASR fails, fail over to Whisper. If GPT-4o times out, fall back to Claude Haiku.
Human-in-the-loop (HITL) triggers: Automatic escalation when sentiment drops, silence exceeds 5 seconds, or confidence scores plummet.
Observability: You need conversation tracing (Langfuse, LangSmith) to debug why the agent interrupted a grieving widow or failed to understand a heavy accent. Text logs aren’t enough—you need audio replay synchronized with model decisions.
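The fallback cascade can be a one-function wrapper. Here, primary and backup are placeholders for your actual vendor clients, and the timeout is illustrative.

```python
import asyncio

async def with_fallback(primary, backup, payload, timeout_s: float = 1.5):
    # primary/backup stand in for, say, a Deepgram client failing over
    # to Whisper, or GPT-4o falling back to Claude Haiku.
    try:
        return await asyncio.wait_for(primary(payload), timeout=timeout_s)
    except Exception:                    # timeout, rate limit, outage...
        return await backup(payload)     # a degraded answer beats dead air
```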
Critical PM Decisions That Make or Break You
1. The Interruption Taxonomy
Not all interruptions are equal. Your orchestration layer must distinguish between:
Barge-in: User wants to change direction (“Actually, check tomorrow instead”)
Affirmation: User is agreeing (“Uh-huh, yes, sure”)
Backchanneling: User is listening (“Mmhmm, okay”)
Misclassifying a backchannel as a barge-in creates a chaotic experience where the agent constantly stops talking. This is why platforms like Retell AI invest heavily in “smart turn detection” algorithms that analyze raw audio waveforms, not just transcript text.
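A first-pass classifier can work from transcript text and utterance duration alone; as noted above, production systems go further and analyze the waveform itself. The word lists and thresholds here are illustrative.

```python
BACKCHANNELS = {"mmhmm", "uh-huh", "okay", "right", "yeah"}
AFFIRMATIONS = {"yes", "yep", "sure", "correct", "exactly"}

def classify_interruption(transcript: str, duration_s: float) -> str:
    # Crude text-plus-duration heuristic with illustrative thresholds.
    words = [w.strip(".,!?") for w in transcript.lower().split()]
    if not words:
        return "backchannel"
    if duration_s < 1.0 and all(w in BACKCHANNELS | AFFIRMATIONS for w in words):
        return "affirmation" if set(words) & AFFIRMATIONS else "backchannel"
    return "barge_in"  # longer or lexically novel speech stops the agent
```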
2. The Latency Budget Allocation
With a 500ms total budget, where do you spend it?
ASR: 150–200ms (streaming)
LLM: 200–300ms (with streaming response)
TTS: 100–150ms (time-to-first-audio)
Network/Overhead: 50ms
If your ASR takes 400ms, your LLM has 100ms left. Good luck. PMs must treat latency as a zero-sum resource and prioritize based on use case. A debt collection agent needs speed more than empathy; a therapy bot needs the opposite.
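If you instrument each stage, a trivial budget check keeps latency regressions visible in dashboards or CI. The allocations below use the tight end of the ranges above, which sum to exactly 500ms; the stage names are arbitrary.

```python
BUDGET_MS = {"asr": 150, "llm": 200, "tts": 100, "network": 50}

def over_budget(measured_ms: dict[str, float]) -> list[str]:
    # Flag any stage that blows its allocation, plus the total.
    issues = [f"{stage}: {measured_ms.get(stage, 0):.0f}ms > {cap}ms"
              for stage, cap in BUDGET_MS.items()
              if measured_ms.get(stage, 0) > cap]
    total = sum(measured_ms.values())
    if total > sum(BUDGET_MS.values()):
        issues.append(f"total: {total:.0f}ms > 500ms")
    return issues
```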
3. The Modalities Question
Do you need text backup? Pure voice agents are elegant until the user is in a noisy airport or has a speech impediment. Production systems typically implement multimodal fallbacks—if voice confidence drops, silently switch to SMS or offer a chat interface without losing conversation context.
The Infrastructure Gotchas
Telephony is not an afterthought. WebRTC works for in-app experiences, but if you’re handling phone calls, you need PSTN (traditional phone network) integration via Twilio or Telnyx. That adds 100–200ms of encoding/decoding latency that web-native agents avoid.
Echo cancellation is your nemesis. If the agent hears itself through the user’s microphone, it will try to respond to itself, creating an infinite loop. Production stacks require acoustic echo cancellation (AEC) hardware or software (like Krisp) before the ASR stage.
Compliance is architectural. If you’re in healthcare, your orchestration layer must support HIPAA business associate agreements—not just the LLM provider, but the STT and TTS providers too. This often forces self-hosting or dedicated single-tenant infrastructure, significantly impacting your cost model.
The Bottom Line
Building a voice agent in 2026 is less about AI model selection and more about real-time systems engineering. The winners aren’t those with the smartest LLM prompts; they’re those with the most reliable orchestration layer managing the messy intersection of human speech and machine speed.
As you scope your roadmap, remember: users don’t compare your voice agent to other AI tools. They compare it to talking to a human. That 500ms latency budget? That’s your competition against a bored customer service rep. Make it count.