Aug 4, 2025


Hugo Podworski
How to Build Lightning-Fast AI Voice Agents
A deep dive on how to achieve 1200 ms response times from your AI voice agents
…
It doesn't matter how smart or accurate your AI voice agent is. If it’s not FAST, the conversation will feel clunky.
Goal
A response time of around 1200 ms marks the dividing line between a conversation that feels natural and one that feels painfully slow.
From experience, once responses drift much beyond that 1200 millisecond point, users start to get frustrated. They mentally tune out of the conversation, and the experience gets much worse.
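To make the target concrete, here's a rough sketch of how a 1200 ms turn budget might be split across the pipeline stages covered in the rest of this post. The per-component numbers are illustrative assumptions, not benchmarks:

```python
# Rough latency budget for one conversational turn, based on the ~1200 ms
# target above. The per-component figures are illustrative assumptions.
budget_ms = {
    "speech to text (final transcript)": 150,
    "turn taking / endpointing": 300,
    "LLM time to first token": 250,
    "text to speech (first audio byte)": 200,
    "telephony / network overhead": 300,
}

total = sum(budget_ms.values())
for component, ms in budget_ms.items():
    print(f"{component:<35} {ms:>5} ms")
print(f"{'total':<35} {total:>5} ms  (target: 1200 ms)")
```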
Components That Affect Speed
Most platforms have already optimised the voice AI pipeline pretty well and squeezed out most of the latency around the core components. The real differentiator is the five components that we as developers can actually control:
Speech to Text
Choose a model that transcribes quickly and accurately, and make sure it supports streaming for real-time conversation.
For instance, OpenAI Whisper was one of the first SOTA transcription models, but it was not designed for real-time streaming, so we couldn't use it for AI voice agents.
You want to balance accuracy and speed. Again, focus on models made for real-time voice AI use cases (a sketch of the streaming pattern follows the table below).
| Provider | Model | Speed | Accuracy | Notes |
|---|---|---|---|---|
| Deepgram | Nova 3 | Very Fast | High | Excellent balance, good pricing |
| — | Sonic Whisper | Very Fast | High | OpenAI Whisper optimised for real-time |
| AssemblyAI | Universal Streaming | Fast | High | Built specifically for AI voice agents |
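As an illustration of what streaming means in practice, here's a minimal sketch of the websocket pattern most streaming STT providers follow: audio frames are sent up continuously while interim and final transcripts come back. The endpoint URL, auth, and message fields are placeholders, not any specific provider's API:

```python
# Minimal sketch of streaming speech-to-text over a websocket.
# The URL, API key handling, and message format are hypothetical placeholders;
# each provider (Deepgram, AssemblyAI, ...) defines its own protocol.
import asyncio
import json
import websockets

STT_URL = "wss://stt.example.com/v1/stream?api_key=<API_KEY>"  # placeholder

async def stream_transcripts(audio_chunks):
    async with websockets.connect(STT_URL) as ws:
        async def send_audio():
            for chunk in audio_chunks:        # raw PCM frames from the caller
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "end"}))  # signal end of audio

        async def receive_text():
            async for message in ws:
                event = json.loads(message)
                # Interim results arrive while the caller is still speaking;
                # only final results should be passed on to the LLM.
                if event.get("is_final"):
                    print("final:", event.get("text"))

        await asyncio.gather(send_audio(), receive_text())
```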
Turn Taking
There are two main approaches with different trade-offs:
Punctuation/VAD-based: Faster but less accurate. Uses punctuation as a trigger (instant response on ?, !, or .) plus a configurable wait time when there is no punctuation (typically 1-2 seconds). No additional model inference required; see the sketch after the table below.
Trained models: More accurate but adds latency. A dedicated model scores whether the speaker has finished, which requires an extra inference step.
| Approach | Speed | Accuracy | Best For |
|---|---|---|---|
| Punctuation/VAD-based | Fastest | Good | Maximum speed priority |
| Trained model | Fast | Very Good | Balanced approach |
| Trained model | Medium | Very Good | Balanced approach |
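Here's a minimal sketch of the punctuation/VAD-based approach described above. The punctuation set and the 1.5 s silence window are illustrative values within the 1-2 second range mentioned earlier:

```python
import time

# Punctuation/VAD-based endpointing sketch: respond immediately when the
# latest transcript ends in terminal punctuation, otherwise wait for a
# configurable silence window. Thresholds are illustrative.
PUNCTUATION = (".", "?", "!")
NO_PUNCTUATION_WAIT_S = 1.5

def should_respond(transcript: str, last_speech_time: float) -> bool:
    """Decide whether the agent should start its reply now."""
    text = transcript.rstrip()
    if not text:
        return False
    if text.endswith(PUNCTUATION):
        return True                    # punctuation trigger: respond instantly
    silence = time.monotonic() - last_speech_time
    return silence >= NO_PUNCTUATION_WAIT_S  # fall back to the silence timeout
```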
Large Language Model
Time to First Token is the most important metric - not tokens per second.
Tokens per second only matters for function calls, which need to be generated in full before they can trigger.
Best providers for low time to first token: Groq, Cerebras, Together AI, Cohere.
OpenAI and Anthropic have higher latency because their traffic is high relative to their hardware capacity. Their main selling point is having the best models, not the fastest inference.
Specialised inference providers serving open-source models tend to be faster: quick inference is their whole business, whereas the major model providers also have to serve products like ChatGPT and Claude.
| Provider | Model | Time to First Token | Availability | Notes |
|---|---|---|---|---|
| — | Qwen3 235B 2507 | ~250ms | Limited quota | Crazy fast tokens/second |
| — | Kimi K2 | ~210ms | Good limits | Excellent balance |
| — | Llama 3.3 70B | ~150ms | Wide availability | Reliable fallback |
| OpenAI | GPT-4o/4.1 | ~400ms | Vapi only | 1000ms latency reduction |
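To see where a provider actually lands, you can measure time to first token directly with a streaming request. This sketch uses the OpenAI Python SDK against an OpenAI-compatible endpoint; the base URL and model id shown are assumptions to swap for your provider's values (Groq, Cerebras, and Together AI all offer OpenAI-compatible endpoints):

```python
import time
from openai import OpenAI

# Measure time-to-first-token against an OpenAI-compatible endpoint.
# Base URL and model id are assumptions; substitute your provider's values.
client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="<API_KEY>")

start = time.monotonic()
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # assumed model id; check your provider
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    stream=True,
)

first_token_at = None
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta and first_token_at is None:
        first_token_at = time.monotonic()
        print(f"time to first token: {(first_token_at - start) * 1000:.0f} ms")
    # ...keep consuming the stream to receive the rest of the response
```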
Text to Speech
You want to choose a model with quick generation speed while still maintaining quality and realism.
It must be built for real-time conversation (a latency-measurement sketch follows the table below).
Avoid models like ElevenLabs V3: amazing quality, but the latency is too high for live calls.
| Provider | Model | Speed | Quality | Pricing |
|---|---|---|---|---|
| Rime | Arcana / Mist V2 | Very Fast | Excellent | Mid-range |
| ElevenLabs | Flash / Turbo | Fast | Very Good | Higher |
| Cartesia | Sonic 2.0 | Very Fast | Good | Mid-range |
| MiniMax | Speech 02 Turbo | Fast | Excellent | Mid-range |
| Vapi | Native Model | Very Fast | Good | Good value |
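The TTS number that matters for perceived latency is time to first audio byte, which you can measure with a streaming request. The endpoint and payload below are placeholders rather than any specific provider's API:

```python
import time
import requests

# Measure time-to-first-audio-byte from a streaming TTS endpoint.
# The URL and payload are placeholders; every provider (ElevenLabs, Cartesia,
# Rime, MiniMax, ...) defines its own streaming API.
TTS_URL = "https://tts.example.com/v1/stream"  # placeholder endpoint

start = time.monotonic()
with requests.post(
    TTS_URL,
    headers={"Authorization": "Bearer <API_KEY>"},
    json={"text": "Thanks for calling, how can I help?", "voice": "default"},
    stream=True,
) as response:
    response.raise_for_status()
    for chunk in response.iter_content(chunk_size=4096):
        if chunk:
            elapsed_ms = (time.monotonic() - start) * 1000
            print(f"first audio byte after {elapsed_ms:.0f} ms")
            break  # a real agent would keep streaming the audio to the caller
```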
Telephony
Telephony adds significant latency compared to web calls (an extra 200-500 ms or more).
Choose telephony providers optimised for low latency rather than general-purpose carriers.
| Provider | Added Latency | Quality | Notes |
|---|---|---|---|
| — | ~300ms | Excellent | Taking AI voice seriously |
| — | ~400ms | Good | Better routing than Twilio |
| Twilio | ~600ms | Good | Slowest due to infrastructure load |
Real-World Performance Comparison
To demonstrate the impact, here's a comparison of two configurations:
Slow Configuration:
LLM: Claude Sonnet 4 (Anthropic)
STT: ElevenLabs Scribe
TTS: Vapi's own
Endpointing: LiveKit default
Result: 3-5 second response latency
Fast Configuration:
LLM: Llama 3.3 70B (Cerebras)
STT: Deepgram Nova 3
TTS: Vapi's own
Endpointing: VAD-based
Result: Sub-1200ms response latency consistently
The difference is immediately noticeable in conversation flow and user engagement.
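For reference, here's roughly what the fast configuration looks like when expressed as an assistant config. The field names approximate Vapi's assistant schema and the provider/model identifiers are assumptions, so verify them against the current docs:

```python
# Sketch of the "fast configuration" as an assistant config (Python dict).
# Field names approximate Vapi's assistant schema; provider and model ids
# are assumptions to verify against the current documentation.
fast_assistant = {
    "name": "fast-voice-agent",
    "transcriber": {"provider": "deepgram", "model": "nova-3"},
    "model": {
        "provider": "cerebras",
        "model": "llama-3.3-70b",
        "messages": [
            {"role": "system", "content": "You are a concise phone assistant."}
        ],
    },
    "voice": {"provider": "vapi", "voiceId": "Elliot"},  # platform-native voice
    # VAD-based endpointing instead of a trained turn-taking model
    "startSpeakingPlan": {"smartEndpointingEnabled": False},
}
```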
…
Try it out here for yourself
…
Key Takeaways
Most platforms have optimised the voice AI pipeline well - the differentiator is component choice
Open source models with specialised inference providers often beat closed source for speed
We've seen major progress in open source non-reasoning models lately, especially from Chinese AI companies.
Sub-1200ms latency is achievable with proper component selection
Speed is as important as intelligence - a 2-3 second delay makes even the smartest agent unusable