Aug 4, 2025

Hugo Podworski

How to Build Lightning-Fast AI Voice Agents

A deep dive on how to achieve sub-1200 ms response times from your AI voice agents

It doesn't matter how smart or accurate your AI voice agent is. If it’s not FAST, the conversation will feel clunky.

Goal

A response time of around 1200 ms marks the threshold between a natural conversation and one that feels painfully slow.

From experience, once responses stretch beyond that 1200 millisecond mark, users start to get frustrated. They mentally tune out of the conversation, and the experience gets much worse.

Components That Affect Speed

Most platforms have optimised the voice AI pipeline itself pretty well: the orchestration overhead around the core components is already low. The differentiator is the five components that we as developers can actually control:
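As a rough mental model, a sub-1200 ms turn has to fit all five of those components into one budget. The numbers below are illustrative assumptions, not benchmarks, but they show how quickly the milliseconds add up:

```python
# Illustrative latency budget for one conversational turn.
# All numbers are rough assumptions for discussion, not measurements.
budget_ms = {
    "speech_to_text (final transcript)": 150,
    "turn_taking (endpoint decision)": 250,
    "llm (time to first token)": 250,
    "text_to_speech (time to first audio)": 200,
    "telephony (round trip)": 300,
}

for component, ms in budget_ms.items():
    print(f"{component:40s} {ms:5d} ms")
print(f"{'total':40s} {sum(budget_ms.values()):5d} ms")  # ~1150 ms, just under target
```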

Speech to Text

Choose a model that transcribes quickly and accurately, and that supports streaming for real-time conversation.

For instance, OpenAI Whisper was one of the first SOTA transcription models, but it was not designed for real-time streaming, so we couldn't use it for AI voice agents.

You want to balance accuracy and speed. Again, focus on models made for real-time voice AI use cases.

| Provider | Model | Speed | Accuracy | Notes |
|---|---|---|---|---|
| Deepgram | Nova 3 | Very Fast | High | Excellent balance, good pricing |
| Cartesia | Sonic Whisper | Very Fast | High | OpenAI Whisper optimised for real-time |
| AssemblyAI | Universal Streaming | Fast | High | Built specifically for AI voice agents |
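To make "real-time with streaming" concrete, here's a minimal sketch of a live transcription client against Deepgram's websocket endpoint. Treat the details as assumptions to verify: the message shapes are from memory, and the `extra_headers` argument was renamed `additional_headers` in newer releases of the `websockets` library.

```python
import asyncio
import json

import websockets  # pip install websockets


async def stream_transcripts(audio_chunks):
    """Stream raw audio to Deepgram and print transcripts as they arrive."""
    url = "wss://api.deepgram.com/v1/listen?model=nova-3&interim_results=true"
    headers = {"Authorization": "Token YOUR_DEEPGRAM_API_KEY"}

    # NOTE: newer `websockets` releases call this argument `additional_headers`.
    async with websockets.connect(url, extra_headers=headers) as ws:

        async def sender():
            for chunk in audio_chunks:  # e.g. 20 ms PCM frames from the mic
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receiver():
            async for message in ws:
                result = json.loads(message)
                channel = result.get("channel")
                if not channel:
                    continue  # skip metadata / non-transcript messages
                transcript = channel["alternatives"][0]["transcript"]
                if transcript:
                    label = "final" if result.get("is_final") else "interim"
                    print(f"{label}: {transcript}")

        await asyncio.gather(sender(), receiver())
```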

Turn Taking

There are two main approaches with different trade-offs:

Punctuation/VAD-based: Faster but less accurate. Uses terminal punctuation as a trigger (respond instantly on ?, !, or .) and a configurable wait time (1-2 seconds) when no punctuation arrives. No additional model inference required; a minimal sketch of this logic follows the table below.

Trained Models: More accurate but adds latency. A dedicated model scores how confident it is that the speaker has finished, which requires an extra inference pass on every turn.

| Approach | Speed | Accuracy | Best For |
|---|---|---|---|
| Punctuation/VAD-based | Fastest | Good | Maximum speed priority |
| LiveKit (Aggressive) | Fast | Very Good | Balanced approach |
| Pipecat Model | Medium | Very Good | Balanced approach |
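Here's a minimal sketch of the punctuation/VAD-based approach. It assumes the STT layer delivers finalised transcript segments and a VAD ticks periodically during silence; the class and method names are hypothetical, and the wait time mirrors the 1-2 second range above.

```python
import time

TERMINAL_PUNCTUATION = (".", "?", "!")
NO_PUNCTUATION_WAIT_S = 1.5  # configurable fallback, typically 1-2 s


class PunctuationEndpointer:
    """Decide when the user has finished their turn, with no extra model inference."""

    def __init__(self):
        self.last_final_transcript = ""
        self.last_speech_time = time.monotonic()

    def on_final_transcript(self, text: str) -> bool:
        """Called for each finalised STT segment. Returns True to respond now."""
        self.last_final_transcript = text.strip()
        self.last_speech_time = time.monotonic()
        # Terminal punctuation triggers an instant response.
        return self.last_final_transcript.endswith(TERMINAL_PUNCTUATION)

    def on_silence_tick(self) -> bool:
        """Called periodically while the VAD reports silence."""
        waited = time.monotonic() - self.last_speech_time
        # No punctuation arrived: fall back to the configured wait time.
        return bool(self.last_final_transcript) and waited >= NO_PUNCTUATION_WAIT_S
```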

Large Language Model

Time to First Token is the most important metric - not tokens per second.

Tokens per second mainly matters for function calls, which have to finish generating in full before they can be executed.

Best providers for low time to first token: Groq, Cerebras, Together AI, Cohere.

OpenAI and Anthropic tend to have higher latency because demand on their APIs outstrips their hardware. Their main selling point is having the best models, not the fastest inference.

Providers that host open-source models tend to be faster, because quick inference is their entire business. The major model providers, by contrast, are also serving consumer products like ChatGPT and Claude on the same infrastructure.

| Provider | Model | Time to First Token | Availability | Notes |
|---|---|---|---|---|
| Cerebras | Qwen3 235B 2507 | ~250ms | Limited quota | Crazy fast tokens/second |
| Groq | Kimi K2 | ~210ms | Good limits | Excellent balance |
| Groq | Llama 3.3 70B | ~150ms | Wide availability | Reliable fallback |
| Vapi/OpenAI | GPT-4o/4.1 | ~400ms | Vapi only | 1000ms latency reduction |
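Time to first token is easy to measure yourself with any OpenAI-compatible streaming endpoint. This sketch points the official `openai` client at Groq's OpenAI-compatible API; the base URL and model id match Groq's docs as I understand them, but verify before relying on them.

```python
import time

from openai import OpenAI  # pip install openai

# Groq exposes an OpenAI-compatible endpoint; swap base_url/model for other providers.
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="YOUR_GROQ_API_KEY",
)

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    stream=True,
)

for chunk in stream:
    # The first chunk may carry only the role; wait for actual content.
    if chunk.choices and chunk.choices[0].delta.content:
        ttft_ms = (time.perf_counter() - start) * 1000
        print(f"time to first token: {ttft_ms:.0f} ms")
        break
```

Run it a few times and at different hours; TTFT varies with provider load.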

Text to Speech

You want a model that generates speech quickly while maintaining quality and realism, and it must be built for real-time conversation.

Avoid models like ElevenLabs V3 (amazing quality, but the latency is too high for live conversation).

| Provider | Model | Speed | Quality | Pricing |
|---|---|---|---|---|
| Rime AI | Arcana/Mist V2 | Very Fast | Excellent | Mid-range |
| ElevenLabs | Flash/Turbo | Fast | Very Good | Higher |
| Cartesia | Sonic 2.0 | Very Fast | Good | Mid-range |
| MiniMax | Speech 02 Turbo | Fast | Excellent | Mid-range |
| Vapi | Native Model | Very Fast | Good | Good value |
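On the TTS side, the number to watch is time to first audio byte, since playback can begin as soon as the first chunk arrives. A rough sketch against ElevenLabs' streaming endpoint (the URL pattern, `xi-api-key` header, and Flash model id are from their public docs, but treat them as assumptions to check):

```python
import time

import requests  # pip install requests

VOICE_ID = "YOUR_VOICE_ID"  # hypothetical placeholder
URL = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream"

start = time.perf_counter()
response = requests.post(
    URL,
    headers={"xi-api-key": "YOUR_ELEVENLABS_API_KEY"},
    json={
        "text": "Hello! How can I help you today?",
        "model_id": "eleven_flash_v2_5",  # Flash: tuned for low latency
    },
    stream=True,
)

# Measure time to first audio chunk rather than total generation time.
for chunk in response.iter_content(chunk_size=1024):
    if chunk:
        ttfa_ms = (time.perf_counter() - start) * 1000
        print(f"time to first audio: {ttfa_ms:.0f} ms")
        break
```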

Telephony

Telephony adds significant latency compared to web calls (an extra 200-500 ms or more).

Choose a provider optimised for low latency rather than a general-purpose one.

| Provider | Added Latency | Quality | Notes |
|---|---|---|---|
| Telnyx | ~300ms | Excellent | Taking AI voice seriously |
| Vonage | ~400ms | Good | Better routing than Twilio |
| Twilio | ~600ms | Good | Slowest due to infrastructure load |

Real-World Performance Comparison

To demonstrate the impact, here's a comparison of two configurations:

Slow Configuration:

  • LLM: Claude Sonnet 4 (Anthropic)

  • STT: ElevenLabs Scribe

  • TTS: Vapi's own

  • Endpointing: LiveKit default

  • Result: 3-5 second response latency

Fast Configuration:

  • LLM: Llama 3.3 70B (Cerebras)

  • STT: Deepgram Nova 3

  • TTS: Vapi's own

  • Endpointing: VAD-based

  • Result: Sub-1200ms response latency consistently

The difference is immediately noticeable in conversation flow and user engagement.
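For concreteness, the fast configuration might be expressed roughly like this as a Vapi assistant created over their REST API. The `POST /assistant` endpoint is documented, but the field names and provider/model identifiers below are best-effort assumptions; check Vapi's API reference before using them.

```python
import requests  # pip install requests

# Hypothetical sketch of the "fast" stack as a Vapi assistant definition.
assistant = {
    "name": "fast-voice-agent",
    "transcriber": {"provider": "deepgram", "model": "nova-3"},
    "model": {"provider": "cerebras", "model": "llama-3.3-70b"},
    "voice": {"provider": "vapi", "voiceId": "YOUR_VOICE_ID"},  # Vapi's native TTS
}

response = requests.post(
    "https://api.vapi.ai/assistant",
    headers={"Authorization": "Bearer YOUR_VAPI_API_KEY"},
    json=assistant,
    timeout=30,
)
print(response.status_code, response.json())
```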

Try it out here for yourself:

  • SLOW: Link HERE

  • FAST: Link HERE

Key Takeaways

  • Most platforms have optimised the voice AI pipeline well - the differentiator is component choice

  • Open-source models served by specialised inference providers often beat closed-source models on speed

  • We've seen major progress in open-source non-reasoning models lately, especially from Chinese AI companies

  • Sub-1200ms latency is achievable with proper component selection

  • Speed is as important as intelligence - a 2-3 second delay makes even the smartest agent unusable