Mar 3, 2026
What we need to make voice AI fully agentic

Zach Koch
There’s been an explosion of Voice AI “agents” over the past couple of years, but the truth is that there’s very little that is agentic about them. Most Voice AI agents deployed today are closer to the classic IVR-style systems of the past than to the agentic systems of 2026 (though, admittedly, with much better TTS).
So, even though agentic use cases are well on their way to dominating the world of text-based LLMs–think of Claude Code’s meteoric rise–many production voice-based systems remain stuck in late 2024.
Two related reasons explain this status quo.
Model intelligence gains often come at the cost of increased reasoning time
The most popular models used for production voice agents today include GPT-4o (released in May 2024) as well as GPT-4.1 and Gemini 2.5 Flash (both released in April 2025), all of which reflect training techniques that are a year to nearly two years old. That might not sound like much in real terms, but it’s several generations behind the current state of the art.
However, as models have gotten smarter, inference times have increased–for today’s frontier models, responses can take several seconds. In a text-driven chat interaction, this less-than-instantaneous response time is unremarkable. But for voice agents, this added latency creates interactions that feel awkward, robotic, and stilted.
For voice agents powered by a component stack, ASR and TTS both contribute their own latency to the end-to-end pipeline. Older, legacy models perform considerably worse on reasoning, tool calling, and instruction following compared to the latest generation, but they offer one compelling advantage: faster inference. By using an older model, teams stretch the overall latency budget further, albeit at the cost of model intelligence.
We lack a great harness for real-time interactions
The second problem is that we lack good harnesses for voice AI. To build a functional agentic system, you need a harness–a set of specialized primitives that wraps around the underlying model to handle everything other than inference, such as memory and tool calling.
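To make the idea of a harness concrete, here is a minimal sketch of the inner loop such a system might run. Everything here is hypothetical–the `model` callable, the `tools` registry, and the message format all stand in for whatever a real stack provides:

```python
# Minimal sketch of an agentic harness loop (all names hypothetical;
# `model` and `tools` stand in for a real inference API and tool registry).

def run_turn(model, tools, history, user_utterance):
    """Run one conversational turn, looping while the model requests tools."""
    history.append({"role": "user", "content": user_utterance})
    while True:
        reply = model(history)  # single inference call
        if reply.get("tool_call"):
            name = reply["tool_call"]["name"]
            args = reply["tool_call"]["args"]
            result = tools[name](**args)  # executed outside the model
            history.append({"role": "tool", "name": name, "content": result})
            continue  # let the model incorporate the tool result
        history.append({"role": "assistant", "content": reply["text"]})
        return reply["text"]
```

The point is that the harness, not the model, owns the loop: tool execution and state updates happen outside inference, and this is exactly the machinery most voice stacks still lack.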
Having sacrificed model intelligence in order to keep latency under control, most voice agents need some alternative means of ensuring desired behaviors. Less-intelligent models often struggle to cope with ambiguity, so many approaches rely on a set of deterministic rules (usually defined in a node builder or similar system) to govern the conversation.
Deterministic guidance can help bridge the gap, improving instruction following behavior and generally keeping the model on track over the course of a conversation. But restricting the agent’s behavior to a narrow set of paths in this way often produces extremely unnatural conversational dynamics, and (ironically) can actually contribute to end-to-end latency.
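As a rough illustration of why this approach gets rigid, consider a toy version of a node-builder flow. The node names and intents here are invented for the example:

```python
# Toy illustration of a node-builder-style deterministic flow (all names
# hypothetical). Each node pins the agent to a fixed prompt and a fixed
# set of outgoing transitions; anything "off-script" has nowhere to go.

FLOW = {
    "greet":    {"prompt": "Greet the caller and ask for their order number.",
                 "next": {"has_order_number": "lookup",
                          "no_order_number": "escalate"}},
    "lookup":   {"prompt": "Read back the order status.",
                 "next": {"done": "goodbye"}},
    "escalate": {"prompt": "Transfer to a human agent.", "next": {}},
    "goodbye":  {"prompt": "Thank the caller and end the call.", "next": {}},
}

def step(node, classified_intent):
    """Advance the flow; unrecognized intents fall through to escalation."""
    transitions = FLOW[node]["next"]
    return transitions.get(classified_intent, "escalate")
```

Every reachable behavior has to be enumerated up front, so anything the designer didn’t anticipate falls through to a catch-all–which is where the unnatural conversational dynamics come from.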
Compare this node-based approach with modern agentic harnesses, which assume ambiguity and are designed to handle it elegantly. What’s unique about the voice AI space is the demand for speed: agentic voice experiences don’t just need smart models and great harnesses–they need a system that works in real time, sounds natural, and doesn’t have to wait on thinking tokens before responding.
The Foundations of Agentic Voice AI
So what do natural, agentic voice systems look like? They have three properties:
First, they’re fast. Speed is non-negotiable in agentic voice systems. If you’re not consistently under ~1s of end-to-end latency, you’re already too slow. If your agent is built using a component stack, your text LLM needs to consistently deliver a TTFT (time to first token) at or below ~500ms, to allow for the additional latency cost of ASR and TTS. Speech-to-speech systems, rather than component pipelines, are generally the best path to achieving the necessary speed. Ultravox, for example, is a speech-native system with an end-to-end latency of ~900ms.
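To see why ~500ms of TTFT is the practical ceiling for a component stack, here’s a back-of-envelope version of the budget. The ~1s total and ~500ms TTFT come from above; the ASR and TTS numbers are illustrative assumptions, not benchmarks:

```python
# Back-of-envelope latency budget for a component-stack voice agent.
# The ~1000ms total and ~500ms TTFT come from the text; the ASR and
# TTS figures are illustrative assumptions, not measurements.

BUDGET_MS = 1000  # target end-to-end latency

components_ms = {
    "asr_finalization": 200,  # assumed: time to finalize the transcript
    "llm_ttft": 500,          # time to first token from the text LLM
    "tts_first_audio": 250,   # assumed: time to first synthesized audio
}

total = sum(components_ms.values())
headroom = BUDGET_MS - total
print(f"total={total}ms, headroom={headroom}ms")  # → total=950ms, headroom=50ms
```

Under these (assumed) ASR and TTS costs, a 500ms TTFT leaves only ~50ms of slack in a one-second budget–any slower text model blows the budget, which is why speech-native systems that skip the pipeline have an inherent advantage.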
Second, they’re fluid. Agentic voice systems need to seamlessly call tools and manage the conversation state throughout a multi-turn interaction, without affecting speed or naturalness. Fluidity also means handling the ambiguity that arises in natural human communication–when the human speaker goes “off-script”, the agent needs to be able to adapt on the fly. This requires models that are exceptional at instruction following and tool calling, but also intelligent enough to respond gracefully to situations not explicitly described in the prompt. And realistically, if you’re not using 2026 models, you’re not going to get there.
And finally, agentic voice experiences need to be fluent. Users shouldn’t feel like they’re talking to a multi-faceted agentic system. Behind the scenes, there may be multiple models, threads, and other complex patterns making sense of the conversation state, but conversing with the model should feel as natural as talking to another human.
At Ultravox, we’ve designed our system from the beginning around these principles–fast, fluid, fluent. We have the fastest, smartest model available today for speech, and we’re designing the most effective harness for managing complicated, long-running agentic voice conversations. Over the next few months, we’ll be releasing a series of articles on the design patterns, primitives, and system architecture that we believe will empower teams to design and build truly agentic voice AI systems.
Let’s take Voice AI into 2026.

