Jun 25, 2025

Meet Ultravox: Real-World Voice Intelligence, At Scale

First off, we’re excited to announce that Fixie.ai is now officially Ultravox AI, the voice-native AI platform built for the messy, high-stakes conversations enterprises have with real customers. We started Fixie.ai when there was no ChatGPT, agents weren’t yet a buzzword, and the average person had never heard the term “LLM”. Ultravox has been our full-time focus for the past couple of years, so we’re excited to fully step into our new name.

Alongside our new brand, we’re shipping a number of major improvements to the Ultravox platform today:

  • Ultravox v0.6: Our sixth-generation speech-native model, trained specifically to improve speech understanding in difficult and noisy conditions. The Ultravox model remains the leader in speech understanding.

  • New model offerings: In addition to Llama 3.3, we’ve trained two new models that are available today in the platform: Qwen3 (from Alibaba) and Gemma 3 (from Google). Qwen and Gemma are some of the best open-source models available, and we’re thrilled to be able to bring them to the platform. They’re easy to prompt, great at instruction following, and excel at tool use.

  • Unlimited Concurrency: We’re removing hard concurrency caps on all of our paid plans. We’ve spent the last few months designing an auto-scaling infrastructure model that kills the “$10 per line” tax you see in legacy pricing tables.

  • A new pricing ladder that lets teams graduate from free tinkering to enterprise-grade SLAs without guessing how many “concurrent calls” they’ll need next quarter.

Voice AI’s Frankenstack Problem

Walk into any voice AI demo and you'll see the same architecture: Speech-to-Text (ASR) → LLM → Text-to-Speech (TTS). It's the obvious approach. It's also fundamentally broken:

  • Latency piles up. The polite pause you hear between turns isn’t the bot “thinking.” It’s the pipeline waiting for every component to finish (see the sketch below). Your AI forgets the emotion in someone’s voice the moment it becomes text. And each component introduces a failure point: one service goes down, and your entire voice experience dies.

  • Scalability hits a wall. These various component providers weren’t designed for real-time, so as you try to scale, all of the warts and failure points emerge quickly.

  • Understanding is always diminished. Real conversation is not just about the words we say but how we say them. Human brains don’t first convert speech into text and then make sense of it. Getting to human-like conversation requires all of the context making its way to the model, something that is fundamentally unfixable in the component stack.

When conversations are short demos in perfect audio, you can hide these flaws. When you’re handling thousands of real customers across phone lines, Bluetooth headsets, and multilingual chaos, the Frankenstein stack breaks.
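
To make the “latency piles up” point concrete, here is a minimal sketch of the serial component pipeline. The per-stage timings are hypothetical placeholders, not measurements from any vendor:

```python
import time

# Hypothetical per-stage latencies for a component pipeline.
# Real numbers vary by vendor; the point is that they add up serially.

def asr(audio: bytes) -> str:
    time.sleep(0.30)  # speech -> text; tone and timing are discarded here
    return "transcribed text"

def llm(text: str) -> str:
    time.sleep(0.40)  # text -> text; the model reasons over words alone
    return "reply text"

def tts(text: str) -> bytes:
    time.sleep(0.25)  # text -> speech
    return b"synthesized audio"

start = time.time()
reply = tts(llm(asr(b"caller audio")))
print(f"end-to-end: {time.time() - start:.2f}s")  # ~0.95s before any network hops
```

Each stage must fully hand off to the next, so the best case is the sum of all three stages, and any one stage failing takes the whole turn down with it.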

A Native-Voice Architecture & Why We Bet on It Early

Ultravox cofounder and Chief Research Officer Zhongqiang (ZQ) Huang, whose PhD work blends cognitive linguistics and speech processing, has always pointed to the same insight: Humans don’t think in text, then speak. We reason inside a rich, audio-and-context soup and reply in <500ms.

Our question was simple: what if the model treated speech as a first-class input instead of a codec to translate away?

Instead of translating between voice and text, Ultravox makes voice a native language for AI.

Ultravox v0.6: Speech-to-Speech Intelligence

Speech-Native Models: Our new Ultravox v0.6 model processes speech directly. No translation layer. No intelligence loss. It understands not just your words, but your tone, your timing, your intent—just like humans do.

Contextual Understanding: v0.6 uses previous conversation context to improve its predictions. If you're in India talking about food delivery and mention "Zomato," it doesn't think you said "tomato" with an accent. It remembers this is a conversation about delivery apps and gets it right.

Real-World Robustness at +30% accuracy: Unlike models trained in quiet labs, Ultravox is built for the chaos of actual human conversation. Tested against the noises of a cafe — think coffee grinders, multiple overlapping speakers, background sirens — it handles the mess because that's where real conversations happen.

Sub-500ms Response Time: Because we're not translating between modalities, we can respond in under 500 milliseconds. That's fast enough to feel natural, even when someone interrupts or changes topics mid-sentence.

We still give you the textual transcript when you need it—think analytics, redaction, or CRM storage—but the core reasoning loop never loses fidelity.
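
As an illustration of that transcript hand-off, here is a hedged sketch of fetching a finished call’s transcript over REST. The endpoint path, auth header, and field names below are assumptions for illustration only; check the platform docs for the actual contract:

```python
import requests

API_KEY = "your-api-key"    # placeholder credential
CALL_ID = "your-call-id"    # placeholder call identifier
BASE = "https://api.ultravox.ai/api"  # assumed base URL

# Hypothetical endpoint: list the transcript messages for a finished call.
resp = requests.get(
    f"{BASE}/calls/{CALL_ID}/messages",
    headers={"X-API-Key": API_KEY},   # assumed auth header
    timeout=10,
)
resp.raise_for_status()

for msg in resp.json().get("results", []):  # assumed response shape
    # Feed these into analytics, PII redaction, or CRM storage.
    print(msg.get("role"), ":", msg.get("text"))
```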

Preview Variants: Gemma 3 & Qwen 3

Component stacks chase the next proprietary model upgrade. We built a framework to absorb new foundation models quickly and endow them with speech skills. Today we’re opening developer-preview endpoints for Google’s Gemma 3 and Alibaba’s Qwen 3.
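
In practice, swapping in a preview model should be a one-line change at call-creation time. The sketch below assumes a `model` field on the call-creation request; the endpoint and the model identifier strings are illustrative, not confirmed names:

```python
import requests

API_KEY = "your-api-key"  # placeholder credential

# Assumed call-creation request; the "model" values stand in for the
# Gemma 3 and Qwen 3 preview variants and are not confirmed identifiers.
resp = requests.post(
    "https://api.ultravox.ai/api/calls",  # assumed endpoint
    headers={"X-API-Key": API_KEY},       # assumed auth header
    json={
        "systemPrompt": "You are a helpful delivery-support agent.",
        "model": "ultravox-gemma-3",      # or "ultravox-qwen-3"
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json().get("joinUrl"))  # assumed field: the URL a client joins
```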

What This Architecture Enables: Unlimited Concurrency

Here's where it gets interesting. Because we own the entire stack—from the speech models to the GPU orchestration—we can do things that component-based platforms simply can't. We can treat capacity as an elastic pool, not a line item.

Put another way: Ultravox is designed from the ground up for scale. Whether it's one call per day or 500,000, we're ready to scale with you.

Why This Matters Now

Customers are tired of IVR trees. They expect an expert who answers like a person and never makes them repeat a 16-digit order number. Enterprises are tired of POCs. They need bots that survive weekend spikes, regional accents, and compliance audits.

By making voice native and scalability intrinsic, Ultravox turns “cool demo” tech into a production system that:

  1. Keeps full intelligence even when audio is messy.

  2. Responds at human cadence (< 500 ms) so people don’t hang up.

  3. Scales on demand without CFO-scary line charges.

That’s why we’re comfortable planting our flag today:

Ultravox is the first voice-AI stack built for real-world, revenue-bearing conversations — no Frankenstack required.

What's Next

Over the next few weeks we’ll be releasing a variety of new features, from state-of-the-art noise cancellation to new tools that help you write (and evaluate!) high-quality prompts designed to scale.

Get Started

Start building on Ultravox for free today at https://app.ultravox.ai, or request a demo and talk to someone from our sales team.