Aug 27, 2025
Beyond Benchmark-Maxxing: Measuring Open Source Models as Real-World Agents
The months leading up to GPT-5's release witnessed a dizzying wave of new language models. This summer alone introduced Kimi-K2, Grok 4, GLM-4.5, Qwen Instruct, GPT-OSS, and GPT-5 — and the year is far from over.
Each technical report heralds another state-of-the-art model, but this rapid pace raises a critical question: how can we effectively evaluate these models for real-world applications?
Ultravox is an open-source speech language model built on existing open-weight language models. Unlike traditional systems that rely on a separate transcription step, Ultravox is trained to understand speech directly. This design makes it easy to integrate the latest LLM breakthroughs into our platform.
Though the pace of new releases excites both us and our customers, we've found that benchmark-driven excitement often fades once we evaluate these models under real-world conditions. To decide which models to offer our customers, we needed a more systematic way to measure real-world performance.
To accomplish this, we built a benchmark tailored to voice agents (though we suspect it’s useful for conversational agents more generally). Our goal is to measure how well a model can engage in rapid, natural dialogue while still following instructions, calling tools, and minimizing hallucinations. In this blog post, we'll share why we built VoiceAgentBench, what it measures, and how we interpret the results (if you just want to see the numbers, click here).
Which Benchmarks Should I Trust?
When Claude 3.7 Sonnet was released in February, many were underwhelmed by its seemingly modest benchmark improvements. Anthropic's blog post stated:
…in developing our reasoning models, we've optimized somewhat less for math and computer science competition problems, and instead shifted focus towards real-world tasks that better reflect how businesses actually use LLMs
The simultaneous release and subsequent popularity of Claude Code revealed the truth—benchmarks don't tell the whole story. While new benchmarks provide fresh perspectives on model capabilities, they also create targets for future models to aim for. Hitting these targets generates buzz, but real-world applications span a much broader landscape than these narrow measurement points.
Beyond concerns like data contamination, this framing—along with insights from experts in post-training(1)—suggests that benchmark performance is increasingly a choice made during training. As a result, benchmarks' diagnostic value diminishes over time, enabling a steady stream of state-of-the-art results that incrementally push benchmark numbers forward.
So what can we do to counter leaderboard-chasing and find meaningful signals? We see two effective approaches:
Compare performance across multiple benchmarks (we like Artificial Analysis for this)
Develop a custom benchmark for your own application
While we carefully monitor various benchmarks, we've discovered that none perfectly addresses our users' specific needs. A custom benchmark remains the most effective way to measure the capabilities that truly matter for voice agents.
Textual Evaluations for Voice Agents
Ultravox works by pairing a speech encoder and multimodal adapter with pretrained language models. We train our adapter using a teacher-student paradigm, so that the language model treats speech input in the same way as text, transferring the capabilities of unimodal LLMs into the audio domain. When choosing language models for Ultravox integration, text-domain performance gives us a preview of how a trained Ultravox model will perform in the audio domain.
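This post doesn't spell out the exact training recipe, but conceptually the adapter is distilled so that the LLM's next-token predictions over speech match its predictions over the equivalent text. A minimal sketch, assuming a KL-divergence distillation loss and placeholder tensors in place of real models:

```python
# Minimal sketch of a teacher-student distillation objective for the adapter.
# Assumes a KL-divergence loss; shapes, names, and random tensors here are
# placeholders, not Ultravox's actual implementation.
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 16, 32000

# Teacher: the frozen LLM conditioned on the text transcript of the utterance.
teacher_logits = torch.randn(batch, seq_len, vocab)

# Student: the same LLM conditioned on embeddings from the speech encoder and
# multimodal adapter (the adapter being the trainable piece).
student_logits = torch.randn(batch, seq_len, vocab, requires_grad=True)

# Train the adapter so the speech-conditioned next-token distribution matches
# the text-conditioned one.
loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.log_softmax(teacher_logits, dim=-1),
    log_target=True,
    reduction="batchmean",
)
loss.backward()
print(f"distillation loss: {loss.item():.4f}")
```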
For most agentic AIs, three capabilities matter most: tool calling, instruction-following, and producing hallucination-free responses. Benchmarks exist for each, but reported scores are often treated as absolute when in reality they vary greatly with context. A model might call the right tool every time in single-turn experiments, but fail when a tool call is required in the middle of a long conversation.
Voice agents face especially demanding conditions:
Multi-turn Conversations: Guiding dialogue toward goals across several turns, while tracking state and anticipating user behavior.
Long-Horizon Tool Calling: Decisions about which tools to call (and when) often span multiple turns, not just one.
Diverse, Real-World Scenarios: Handling varied tools, prompts, tasks, languages and user behaviors—including rare cases—without hallucinating.
If we base our evaluation on existing benchmarks, we want to make sure that they are measuring agentic capabilities under these conditions.
Do Existing Benchmarks Match These Conditions?
Of the many benchmarks that exist for agentic AI, two of the most widely reported are:
TauBench: Tests multi-turn, long-horizon assistance with only three manually constructed agents, each with multiple simulated users. Users drive the conversation, and tasks are intentionally complex, so even the top models score below 50% on the reliability-focused pass^4 metric (see the sketch below)—a reflection of the difficulty, not necessarily poor everyday performance.
Main limitation: the lack of agent diversity. With only three, TauBench can’t reflect the breadth of situations a voice agent will face. It’s excellent for stress-testing but tells us little about performance in the wide variety of scenarios where a voice agent should succeed.
BFCLv3: Evaluates multi-turn function calling with crowd-sourced tools in multiple languages. To control variability, it uses fixed “ground-truth” trajectories (rather than simulating users) and resets conversation history to that state before each model turn—reducing long-horizon challenges into a series of single-turn calls.
Main limitation: while it covers more agent diversity, resetting to the ground-truth state strips the benchmark of the reasoning nuance and complexity required for long-horizon tool calling.
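For readers unfamiliar with the metric, pass^k estimates the probability that a model succeeds on all k independent attempts at a task, averaged over tasks; it rewards reliability, not just best-case performance. A minimal sketch of one common way to estimate it (the trial counts below are hypothetical, not TauBench data):

```python
# Sketch of a pass^k style estimate: the probability that all k independent
# attempts at a task succeed, averaged over tasks. Hypothetical numbers only.
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(all k attempts succeed) from c successes in n trials."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Three tasks, each attempted n=8 times with c successes.
trials = [(8, 8), (8, 6), (8, 3)]
k = 4
score = sum(pass_hat_k(n, c, k) for n, c in trials) / len(trials)
print(f"pass^{k} = {score:.3f}")  # a single unreliable task drags the average down
```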
These are far from the only benchmarks in the field—but because they're so frequently reported, they shape much of the public narrative about model capability. Both constrain their setups to make results verifiable. In doing so, TauBench sacrifices agent diversity, and BFCLv3 sacrifices long-horizon realism. Neither captures tool use, hallucination resistance, or instruction-following in the unpredictable, multi-turn reality of voice agents.
That’s why we built our own evaluation framework and an internal benchmark—to test these capabilities under the same varied, real-world conditions our agents will actually face.
VoiceAgentBench: Scalable and Diverse Evaluations
To go beyond leaderboard chasing, we built VoiceAgentBench—a benchmark designed specifically to test models in the conditions that voice agents actually face: multi-turn conversations, long-horizon tool use, and diverse real-world scenarios.
Framework Design
Existing benchmarks rely on hand-labeled ground-truth calls, which limits scalability and diversity. We designed the simulation framework for VoiceAgentBench to overcome this:
AgentProfiles and UserProfiles: We simulate both sides of the conversation using LLMs, where an AgentProfile defines the agent prompt and tools, and a UserProfile defines the user prompt. Together they form a scenario.
ToolResponder: Instead of static databases, we simulate realistic tool outputs via structured JSON responses. This makes the framework scalable to arbitrary tools without hand-curation.
LLM-as-Judge: Following recent work [cite], we use LLMs to grade conversations. We craft three rubrics: tool calling, hallucination, and instruction-following. Each rubric breaks down into pass/fail criteria, giving us fine-grained insights. The overall score for a single scenario is the average pass rate across these three rubrics.
This setup removes the biggest bottleneck of manual annotation and enables us to test a much wider range of agents than TauBench or BFCLv3. It allows us to create AgentProfiles and UserProfiles for evaluation that match the full diversity we find in our call logs, and to keep updating them as voice agent use cases evolve.
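To make these pieces concrete, here is a minimal sketch of how a scenario and its grading could be wired together. The names mirror the descriptions above (AgentProfile, UserProfile, tool responses, rubric judging), but the signatures, the stub LLM, and the example rubrics are illustrative assumptions rather than the actual VoiceAgentBench code:

```python
# Illustrative sketch of the VoiceAgentBench building blocks described above.
# Everything here (signatures, stub LLM, rubrics) is an assumption for clarity,
# not the real implementation.
from dataclasses import dataclass
from typing import Callable
import json

@dataclass
class AgentProfile:
    system_prompt: str
    tools: list[dict]          # JSON-schema style tool definitions

@dataclass
class UserProfile:
    persona_prompt: str        # goals and behavior of the simulated user

@dataclass
class Scenario:
    agent: AgentProfile
    user: UserProfile

def tool_responder(llm: Callable[[str], str], tool_call: dict) -> dict:
    """Simulate a realistic tool output as structured JSON (no static database)."""
    reply = llm(f"Return a plausible JSON response for this tool call: {json.dumps(tool_call)}")
    return json.loads(reply)

def judge_rubric(llm: Callable[[str], str], transcript: list[dict], criteria: list[str]) -> float:
    """Grade one rubric: the fraction of pass/fail criteria the transcript passes."""
    verdicts = [
        llm(f"Criterion: {c}\nTranscript: {json.dumps(transcript)}\nAnswer PASS or FAIL.")
        for c in criteria
    ]
    return sum(v.strip().upper().startswith("PASS") for v in verdicts) / len(criteria)

def score_scenario(llm: Callable[[str], str], transcript: list[dict],
                   rubrics: dict[str, list[str]]) -> float:
    """Overall score for a scenario: average pass rate across the three rubrics."""
    return sum(judge_rubric(llm, transcript, c) for c in rubrics.values()) / len(rubrics)

# Hypothetical usage with a canned stub standing in for real LLM calls.
stub = lambda prompt: '{"status": "ok"}' if "tool call" in prompt else "PASS"
scenario = Scenario(
    agent=AgentProfile(system_prompt="You are a clinic scheduling agent.", tools=[]),
    user=UserProfile(persona_prompt="You want to book an appointment next week."),
)
print(tool_responder(stub, {"name": "book_appointment", "args": {}}))
print(score_scenario(stub, transcript=[{"role": "user", "content": "hi"}],
                     rubrics={"tool_calling": ["Called the right tool"],
                              "hallucination": ["No invented facts"],
                              "instruction_following": ["Followed the system prompt"]}))
```

In the full framework, the agent, simulated user, tool responder, and judge are each backed by LLM calls, as shown in Figure 1.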

Figure 1: The VoiceAgentBench framework design, consisting of scenario sampling (left), trajectory rollouts (center), and evaluations by LLM-as-judge models (right).
Results
From select partners in our logs, we sampled 36 recent AgentProfiles and paired each with two UserProfiles, yielding 72 scenarios. We fixed these scenarios as our VoiceAgentBench benchmark and scored multiple leading proprietary and open-source models on the scenarios.
We report the distribution of overall scores for leading reasoning and non-reasoning models in Figure 2, where we show surprising mediocrity from a few of the most talked-about open source models (Kimi K2, Qwen 3 235B Instruct, GPT-OSS-120B). These models do not demonstrate significant advances over our existing Ultravox lineup. Conversely, GLM-4.5 stands out as a competitive open-source alternative to proprietary models on real-world VoiceAgentBench scenarios.

Figure 2: Box plots showing the distribution of overall scores for each non-reasoning (left) and reasoning (right) model on VoiceAgentBench. The labeled numbers represent the median overall score for each model on the dataset, and the boxes represent the interquartile range (IQR), with whiskers to 1.5x the IQR and outliers circled.
These surprising results made us wonder—are we missing something? While the overall trends on VoiceAgentBench aligned with our intuition, some open-source models were not meeting our expectations. We compared the mean score for each model on VoiceAgentBench to its TauBench Airline score in Figure 3.
When comparing VoiceAgentBench scores against TauBench, we find a strong correlation (r=0.83), validating that our benchmark reflects real capability. At the same time, it's clear that narrow benchmarks miss some important dynamics. Models that excel on TauBench do not always achieve corresponding success on VoiceAgentBench, and open-source models in general appear to underperform in real-world scenarios. This gap underscores the importance of application-grounded evaluations—benchmarks tuned for comparability can miss how models behave in the wild.
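For reference, the correlation in Figure 3 is a standard Pearson's r over per-model mean scores. A minimal sketch with made-up numbers (not the actual benchmark results):

```python
# How a Pearson correlation like the r=0.83 in Figure 3 can be computed.
# The score pairs below are placeholders, not real VoiceAgentBench or TauBench numbers.
import numpy as np

voiceagentbench = np.array([0.62, 0.71, 0.55, 0.80, 0.68])   # mean score per model (hypothetical)
taubench_airline = np.array([0.38, 0.46, 0.30, 0.52, 0.49])  # pass^1 per model (hypothetical)

r = np.corrcoef(voiceagentbench, taubench_airline)[0, 1]
print(f"Pearson's r = {r:.2f}")
```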

Figure 3: Comparison between mean VoiceAgentBench scores and TauBench Airline accuracy (pass^1). Pearson's r=0.83. (*) indicates that reasoning levels are not controllable in the TauBench environment.
From these results, we can draw some high-level takeaways:
The best open weight models are catching up (but most seem optimized for leaderboard performance over real-world performance). GLM-4.5 nearly matches the top proprietary non-reasoning models (GPT-4o, GPT-4.1), showing how quickly open models are improving. But the picture shifts across benchmarks (Figure 3): open-source models that shine on narrow leaderboards like TauBench may not always perform better on VoiceAgentBench's real-world scenarios.
GPT-OSS is not suitable for real-time voice. Like everyone, we were excited by OpenAI's initial release of the GPT-OSS line of models. Unfortunately, we did not find the high-reasoning variants to be a step-change in capabilities, and the low-reasoning variants (still not clearly suitable for low-latency interactions) performed even worse on VoiceAgentBench.
Reasoning models excel—but are too slow. Latency is uniquely important in voice agent deployments, yet recent model advancements have been largely agnostic to this requirement. There is a need for additional investment and development to push non-reasoning models forward.
Deployment matters. Our Llama 3.3-based Ultravox model consistently outperforms vanilla Llama. Part of this comes from integration improvements (chat template, tool parser), but also from sampling bias: the AgentProfiles in the dataset come from Ultravox logs, where prompts were tuned for Ultravox’s template. This sampling bias favors Ultravox, but it highlights a broader lesson: prompt optimization can meaningfully raise performance, meaning most other models’ scores should be seen as lower bounds.
While some open-source models underperformed on VoiceAgentBench, their relative strength on TauBench highlights a different story: these models are highly capable but often optimized for leaderboard metrics rather than real-world reliability. This reflects the incentives shaping development—open providers race to win adoption through benchmark scores, while proprietary teams may prioritize robustness and user experience to drive retention.
At Ultravox, we see an opportunity to close that gap. VoiceAgentBench brings evaluation closer to the reality of user interactions—messy, diverse, and long-horizon. With our simulation framework, we can also shift the focus of model training from synthetic benchmarks to practical outcomes. For us, the most desirable result isn’t topping a leaderboard—it’s enabling models that perform reliably in real-world voice applications.
Looking Forward
Open-source models are improving at a remarkable pace. Just a year ago they trailed far behind proprietary ones, but today, open models like GLM-4.5 are approaching parity with their closed counterparts. Yet VoiceAgentBench shows that leaderboard gains don’t always carry over to long, messy, real-world conversations—where reliability matters most.
For Ultravox, this creates both a challenge and an opportunity. The challenge is that headline benchmarks can obscure what actually matters in production. The opportunity is that open-source models, precisely because they are flexible and adaptable, can be pushed further. Our low-latency speech stack makes it possible to deliver these models directly into customer applications, and VoiceAgentBench gives us the visibility to understand where they succeed, where they falter, and how to close the gap.
We’re building VoiceAgentBench not just for ourselves but for the developer community. Our simulation framework design makes our approach scalable to diverse agents and scenarios. That’s why we’ll be rolling out the framework behind VoiceAgentBench inside the Ultravox platform. Paired with Blocky for automated prompt refinement and future post-training advances, this will create a new kind of voice-agent development loop: one where evaluation, optimization, and deployment reinforce each other.
Benchmarks will always make headlines. But for us (and for our users) the real test is performance in live conversations. VoiceAgentBench is how we cut through the noise, focus on what counts, and help developers build the next generation of voice agents.