Aug 4, 2025


Hugo Podworski
How to Build Lightning-Fast AI Voice Agents
A deep dive on how to achieve 1200 ms response times from your AI voice agents
…
It doesn't matter how smart or accurate your AI voice agent is. If it’s not FAST, the conversation will feel clunky.
Goal
A response time of around 1200 ms marks the dividing line between a conversation that feels natural and one that feels painfully slow.
From experience, once responses drift much beyond that 1200 millisecond point, users start to get frustrated. They mentally tune out of the conversation, and the experience gets much worse.
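To make the target concrete, here's a rough sketch of how a 1200 ms turn budget might be split across the pipeline stages covered in the rest of this post. The per-component numbers are illustrative assumptions, not benchmarks:

```python
# Rough latency budget for one conversational turn, based on the ~1200 ms
# target above. The per-component figures are illustrative assumptions.
budget_ms = {
    "speech to text (final transcript)": 150,
    "turn taking / endpointing": 300,
    "LLM time to first token": 250,
    "text to speech (first audio byte)": 200,
    "telephony / network overhead": 300,
}

total = sum(budget_ms.values())
for component, ms in budget_ms.items():
    print(f"{component:<35} {ms:>5} ms")
print(f"{'total':<35} {total:>5} ms  (target: 1200 ms)")
```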
Components That Affect Speed
Most platforms have already optimised the voice AI pipeline pretty well and squeezed out most of the latency around the core components. The real differentiator is the five components that we as developers can actually control:
Speech to Text
Choose a model that transcribes quickly and accurately, and make sure it supports streaming for real-time conversation.
For instance, OpenAI Whisper was one of the first SOTA transcription models, but it was not designed for real-time streaming, so we couldn't use it for AI voice agents.
You want to balance accuracy and speed. Again, focus on models made for real-time voice AI use cases (a sketch of the streaming pattern follows the table below).
| Provider | Model | Speed | Accuracy | Notes |
|---|---|---|---|---|
| Deepgram | Nova 3 | Very Fast | High | Excellent balance, good pricing |
| — | Sonic Whisper | Very Fast | High | OpenAI Whisper optimised for real-time |
| AssemblyAI | Universal Streaming | Fast | High | Built specifically for AI voice agents |
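As an illustration of what streaming means in practice, here's a minimal sketch of the websocket pattern most streaming STT providers follow: audio frames are sent up continuously while interim and final transcripts come back. The endpoint URL, auth, and message fields are placeholders, not any specific provider's API:

```python
# Minimal sketch of streaming speech-to-text over a websocket.
# The URL, API key handling, and message format are hypothetical placeholders;
# each provider (Deepgram, AssemblyAI, ...) defines its own protocol.
import asyncio
import json
import websockets

STT_URL = "wss://stt.example.com/v1/stream?api_key=<API_KEY>"  # placeholder

async def stream_transcripts(audio_chunks):
    async with websockets.connect(STT_URL) as ws:
        async def send_audio():
            for chunk in audio_chunks:        # raw PCM frames from the caller
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "end"}))  # signal end of audio

        async def receive_text():
            async for message in ws:
                event = json.loads(message)
                # Interim results arrive while the caller is still speaking;
                # only final results should be passed on to the LLM.
                if event.get("is_final"):
                    print("final:", event.get("text"))

        await asyncio.gather(send_audio(), receive_text())
```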
Turn Taking
There are two main approaches with different trade-offs:
Punctuation/VAD-based: Faster but less accurate. Uses punctuation as a trigger (instant response on ?, !, or .) plus a configurable wait time when there is no punctuation (typically 1-2 seconds). No additional model inference required; see the sketch after the table below.
Trained models: More accurate but adds latency. A dedicated model scores whether the speaker has finished, which requires an extra inference step.
| Approach | Speed | Accuracy | Best For |
|---|---|---|---|
| Punctuation/VAD-based | Fastest | Good | Maximum speed priority |
| Trained model | Fast | Very Good | Balanced approach |
| Trained model | Medium | Very Good | Balanced approach |
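Here's a minimal sketch of the punctuation/VAD-based approach described above. The punctuation set and the 1.5 s silence window are illustrative values within the 1-2 second range mentioned earlier:

```python
import time

# Punctuation/VAD-based endpointing sketch: respond immediately when the
# latest transcript ends in terminal punctuation, otherwise wait for a
# configurable silence window. Thresholds are illustrative.
PUNCTUATION = (".", "?", "!")
NO_PUNCTUATION_WAIT_S = 1.5

def should_respond(transcript: str, last_speech_time: float) -> bool:
    """Decide whether the agent should start its reply now."""
    text = transcript.rstrip()
    if not text:
        return False
    if text.endswith(PUNCTUATION):
        return True                    # punctuation trigger: respond instantly
    silence = time.monotonic() - last_speech_time
    return silence >= NO_PUNCTUATION_WAIT_S  # fall back to the silence timeout
```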
Large Language Model
Time to First Token is the most important metric - not tokens per second.
Tokens per second only matters for function calls, which need to be generated in full before they can trigger.
Best providers for low time to first token: Groq, Cerebras, Together AI, Cohere.
OpenAI and Anthropic have higher latency because their traffic is high relative to their hardware capacity. Their main selling point is having the best models, not the fastest inference.
Specialised inference providers serving open-source models tend to be faster: quick inference is their whole business, whereas the major model providers also have to serve products like ChatGPT and Claude.
| Provider | Model | Time to First Token | Availability | Notes |
|---|---|---|---|---|
| — | Qwen3 235B 2507 | ~250ms | Limited quota | Crazy fast tokens/second |
| — | Kimi K2 | ~210ms | Good limits | Excellent balance |
| — | Llama 3.3 70B | ~150ms | Wide availability | Reliable fallback |
| OpenAI | GPT-4o/4.1 | ~400ms | Vapi only | 1000ms latency reduction |
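To see where a provider actually lands, you can measure time to first token directly with a streaming request. This sketch uses the OpenAI Python SDK against an OpenAI-compatible endpoint; the base URL and model id shown are assumptions to swap for your provider's values (Groq, Cerebras, and Together AI all offer OpenAI-compatible endpoints):

```python
import time
from openai import OpenAI

# Measure time-to-first-token against an OpenAI-compatible endpoint.
# Base URL and model id are assumptions; substitute your provider's values.
client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="<API_KEY>")

start = time.monotonic()
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # assumed model id; check your provider
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    stream=True,
)

first_token_at = None
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta and first_token_at is None:
        first_token_at = time.monotonic()
        print(f"time to first token: {(first_token_at - start) * 1000:.0f} ms")
    # ...keep consuming the stream to receive the rest of the response
```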
Text to Speech
You want to choose a model with quick generation speed while still maintaining quality and realism.
It must be built for real-time conversation (a latency-measurement sketch follows the table below).
Avoid models like ElevenLabs V3: amazing quality, but the latency is too high for live calls.
| Provider | Model | Speed | Quality | Pricing |
|---|---|---|---|---|
| Rime | Arcana / Mist V2 | Very Fast | Excellent | Mid-range |
| ElevenLabs | Flash / Turbo | Fast | Very Good | Higher |
| Cartesia | Sonic 2.0 | Very Fast | Good | Mid-range |
| MiniMax | Speech 02 Turbo | Fast | Excellent | Mid-range |
| Vapi | Native Model | Very Fast | Good | Good value |
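The TTS number that matters for perceived latency is time to first audio byte, which you can measure with a streaming request. The endpoint and payload below are placeholders rather than any specific provider's API:

```python
import time
import requests

# Measure time-to-first-audio-byte from a streaming TTS endpoint.
# The URL and payload are placeholders; every provider (ElevenLabs, Cartesia,
# Rime, MiniMax, ...) defines its own streaming API.
TTS_URL = "https://tts.example.com/v1/stream"  # placeholder endpoint

start = time.monotonic()
with requests.post(
    TTS_URL,
    headers={"Authorization": "Bearer <API_KEY>"},
    json={"text": "Thanks for calling, how can I help?", "voice": "default"},
    stream=True,
) as response:
    response.raise_for_status()
    for chunk in response.iter_content(chunk_size=4096):
        if chunk:
            elapsed_ms = (time.monotonic() - start) * 1000
            print(f"first audio byte after {elapsed_ms:.0f} ms")
            break  # a real agent would keep streaming the audio to the caller
```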
Telephony
Telephony adds significant latency compared to web calls (an extra 200-500 ms or more).
Choose telephony providers optimised for low latency rather than general-purpose carriers.
| Provider | Added Latency | Quality | Notes |
|---|---|---|---|
| — | ~300ms | Excellent | Taking AI voice seriously |
| — | ~400ms | Good | Better routing than Twilio |
| Twilio | ~600ms | Good | Slowest due to infrastructure load |
Real-World Performance Comparison
To demonstrate the impact, here's a comparison of two configurations:
Slow Configuration:
LLM: Claude Sonnet 4 (Anthropic)
STT: ElevenLabs Scribe
TTS: Vapi's own
Endpointing: LiveKit default
Result: 3-5 second response latency
Fast Configuration:
LLM: Llama 3.3 70B (Cerebras)
STT: Deepgram Nova 3
TTS: Vapi's own
Endpointing: VAD-based
Result: Sub-1200ms response latency consistently
The difference is immediately noticeable in conversation flow and user engagement.
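For reference, here's roughly what the fast configuration looks like when expressed as an assistant config. The field names approximate Vapi's assistant schema and the provider/model identifiers are assumptions, so verify them against the current docs:

```python
# Sketch of the "fast configuration" as an assistant config (Python dict).
# Field names approximate Vapi's assistant schema; provider and model ids
# are assumptions to verify against the current documentation.
fast_assistant = {
    "name": "fast-voice-agent",
    "transcriber": {"provider": "deepgram", "model": "nova-3"},
    "model": {
        "provider": "cerebras",
        "model": "llama-3.3-70b",
        "messages": [
            {"role": "system", "content": "You are a concise phone assistant."}
        ],
    },
    "voice": {"provider": "vapi", "voiceId": "Elliot"},  # platform-native voice
    # VAD-based endpointing instead of a trained turn-taking model
    "startSpeakingPlan": {"smartEndpointingEnabled": False},
}
```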
…
Try it out here for yourself
…
Key Takeaways
Most platforms have optimised the voice AI pipeline well - the differentiator is component choice
Open source models with specialised inference providers often beat closed source for speed
We've seen major progress in open source non-reasoning models lately, especially from Chinese AI companies.
Sub-1200ms latency is achievable with proper component selection
Speed is as important as intelligence - a 2-3 second delay makes even the smartest agent unusable