Jan 16, 2026

Introducing the Ultravox Integration for Pipecat

Mike Depinet

Ultravox Realtime is now available as a speech-to-speech service in Pipecat. Keep the deployment stack you’re used to, with a model that doesn’t make you trade intelligence for speed.

If you've built voice agents with Pipecat previously, you've faced a fundamental trade-off.

Speech-to-speech models like GPT Realtime and Gemini Live process audio directly, preserving tone and nuance while delivering fast responses. But when your agent needs to follow complex instructions, call tools reliably, or work with a knowledge base, these models often fall short. You get speed and native audio understanding, but at the cost of reliability.

Cascaded pipelines chain together best-in-class STT, LLM, and TTS services to get the full reasoning power of models like Claude Sonnet or GPT-5. But every hop adds latency, and the transcription step loses the richness of spoken language. You get better model intelligence, but sacrifice speed and naturalness.

Ultravox changes the equation

Like other speech-to-speech models, the Ultravox model is trained to understand audio natively, so incoming audio doesn’t have to be transcribed to text before inference. But unlike other models, Ultravox can match or exceed the intelligence of cascaded pipelines, so you no longer need to choose between conversational experience and model intelligence.

You don’t need to take our word for it. In an independent benchmark built by the Pipecat team, Ultravox v0.7 outperformed every other speech-to-speech model tested:

| Metric | Ultravox v0.7 | GPT Realtime | Gemini Live |
| --- | --- | --- | --- |
| Overall accuracy | 97.7% | 86.7% | 86.0% |
| Tool use success (out of 300) | 293 | 271 | 258 |
| Instruction following (out of 300) | 294 | 260 | 261 |
| Knowledge grounding (out of 300) | 298 | 300 | 293 |
| Turn reliability (out of 300) | 300 | 296 | 278 |
| Median response latency | 0.864s | 1.536s | 2.624s |
| Max response latency | 1.888s | 4.672s | 30s |

The benchmark reflects real-world conditions and needs: it evaluates model performance in multi-turn conversations, covering tool use, instruction following, and knowledge retrieval. These results placed Ultravox ahead of GPT Realtime, Gemini Live, Nova Sonic, and Grok Realtime in head-to-head comparisons using identical test scenarios. Ultravox’s accuracy is on par with text-only models like GPT-5 and Claude Sonnet 4.5, even though it returns audio faster than those models can produce a text response (which, for voice-based use cases, would still need a TTS step before any audio is produced).
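To make the per-category counts easier to compare with the overall accuracy figures, the short sketch below converts them to percentages and expresses median latency relative to Ultravox. The numbers are copied from the results reported above. The dictionary layout and the helper name `pass_rate` are illustrative choices and not part of the benchmark itself.

```python
# Per-category counts (out of 300 turns) as reported in the benchmark results.
TURNS = 300

counts = {
    "Ultravox v0.7": {"tool_use": 293, "instruction_following": 294,
                      "knowledge_grounding": 298, "turn_reliability": 300},
    "GPT Realtime":  {"tool_use": 271, "instruction_following": 260,
                      "knowledge_grounding": 300, "turn_reliability": 296},
    "Gemini Live":   {"tool_use": 258, "instruction_following": 261,
                      "knowledge_grounding": 293, "turn_reliability": 278},
}

# Median response latency in seconds, as reported.
median_latency_s = {
    "Ultravox v0.7": 0.864,
    "GPT Realtime": 1.536,
    "Gemini Live": 2.624,
}


def pass_rate(count: int, total: int = TURNS) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)


# Percentage pass rate for every model and category.
rates = {
    model: {category: pass_rate(n) for category, n in categories.items()}
    for model, categories in counts.items()
}

# How many times longer each model's median latency is than Ultravox's.
baseline = median_latency_s["Ultravox v0.7"]
latency_ratio = {
    model: round(latency / baseline, 2)
    for model, latency in median_latency_s.items()
}

if __name__ == "__main__":
    for model in counts:
        print(model, rates[model], f"latency x{latency_ratio[model]}")
```

Read this way, the one category where Ultravox does not lead is knowledge grounding, where GPT Realtime passed all 300 turns against Ultravox's 298.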

What this means for your Pipecat application

If you're using a speech-to-speech model today, switching to Ultravox will give you significantly better accuracy on complex tasks (tool calls that actually work, instructions that stick across turns, knowledge retrieval you can rely on) without giving up the low latency and native speech understanding you need.

If you're using a cascaded pipeline, you can switch to Ultravox and unlock the benefits of direct speech processing (faster responses, no lossy transcription, preserved vocal nuance) without sacrificing intelligence.

In either case, our new integration is designed to slot into your existing Pipecat application with minimal friction. 

  • For users with existing speech-to-speech pipelines, the new Ultravox integration should work as a drop-in replacement. 

  • For applications currently built using cascaded pipelines, you’ll replace your current STT, LLM, and TTS services with a single Ultravox service that handles the complete speech-to-speech flow.
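The change for a cascaded pipeline can be pictured as collapsing three stages into one. The sketch below is library-independent and deliberately does not use Pipecat's real classes: the stage names (`stt`, `llm`, `tts`, `ultravox_realtime`, and the transport placeholders) and the function `to_speech_to_speech` are hypothetical, chosen only to illustrate the shape of the migration. Refer to the Pipecat and Ultravox documentation for the actual service classes and constructor arguments.

```python
from typing import Iterable, List

# Hypothetical stage names for a cascaded voice pipeline.
CASCADED: List[str] = ["transport_input", "stt", "llm", "tts", "transport_output"]


def to_speech_to_speech(
    stages: Iterable[str],
    replaced: Iterable[str] = ("stt", "llm", "tts"),
    replacement: str = "ultravox_realtime",
) -> List[str]:
    """Collapse the cascaded stages into one speech-to-speech stage.

    The replacement is inserted once, at the position of the first
    replaced stage; every other stage keeps its original order.
    """
    replaced_set = set(replaced)
    result: List[str] = []
    inserted = False
    for stage in stages:
        if stage in replaced_set:
            if not inserted:
                result.append(replacement)
                inserted = True
            continue  # drop the remaining cascaded stages
        result.append(stage)
    return result


if __name__ == "__main__":
    print(to_speech_to_speech(CASCADED))
```

Everything outside the replaced stages (here, the transport placeholders) stays where it was. That is the sense in which the integration is meant to slot into an existing application.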

Get started today

If you're already running voice agents in production, this is the upgrade path you've been waiting for. If you're just getting started with voice AI, there's never been a better time to build.

Check out this example to see how Ultravox works in Pipecat, then visit https://app.ultravox.ai to create an account and get your API key. No credit card required.