Sep 9, 2025
Context-aware, audio-native endpointing: How we built UltraVAD
We recently announced that we're open sourcing UltraVAD, the smart endpointing model that we use for running production traffic in Ultravox Realtime. UltraVAD is a multimodal model, meaning it uses a combination of transcripts and audio to evaluate whether or not a user has finished speaking.
We started work on UltraVAD by designing around two key principles:
Audio-native: the model should process direct audio input without reliance on a transcription step
Context-aware: the model needed to reference dialog history to better understand contextual meaning
In this article, we'll take a deeper dive into how the team built UltraVAD and what we learned from some of our earlier versions.
Lessons from v1: a supervised binary classifier
Our first version used a supervised learning approach: we fine-tuned an LLM to work as a binary classifier on conversational samples with binary labels. Each training sample consisted of a previous turn and the last user turn, along with a true/false end-of-turn label. We used a foundation model such as GPT-4o to classify these samples.
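To make that concrete, here's roughly what a v1 training sample looked like (the field names are illustrative, not our exact schema):

```python
# Illustrative v1 sample: conversational context plus a binary label.
# The label itself came from prompting a foundation model (e.g. GPT-4o).
v1_sample = {
    "previous_turn": "Agent: What's the best number to reach you at?",
    "last_user_turn": "My number is 408",
    "end_of_turn": False,  # true/false label used for supervised fine-tuning
}
```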
Unsurprisingly, we found the v1 approach struggled in a few key ways:
Context awareness. Numbers, lists, and partial utterances look like stops in isolation: “My number is 408…” vs “My area code is 408.” It’s hard to gather context-dependent samples without missing edge cases.
Dependence on an outside classifier. Our v1 leaned on a foundation model to classify samples, but we found that these foundation models generally don’t have good endpoint distributions for languages other than English.
Ambiguous boundaries. Assigning binary labels in the samples forces boundaries on turns that are ambiguous, which adds noise to the training data.
We ultimately realized that endpointing is difficult to capture with binary-labeled samples; it’s better modeled as an emergent distribution learned over a conversational corpus.
Improvements in v2: self-supervised next-token prediction task
In our second iteration, we tried a new approach: teaching an LLM to perform self-supervised training with an end-of-turn (EOT) token, then making this process audio-native with an Ultravox projector. The idea was to allow the model to use both textual context and audio, which would outperform models that used text or audio alone and make it easier to scale to new languages.

For this iteration, we started with an existing Llama 8B Instruct model and fine-tuned it on synthetically generated multi-turn conversational data with additional end-of-turn <eot> tokens placed after valid stopping points in the conversation.
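Here's an illustrative example of what that synthetic data might look like; the exact role tags and formatting we used may differ:

```python
# <eot> is inserted after each point where the user could plausibly have
# finished speaking, so the model learns a distribution over stopping points.
training_text = (
    "assistant: What's the best number to reach you at?\n"
    "user: Let me check.<eot> You can use my cell.<eot> It's 408 555 0199.<eot>"
)
```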
This process taught the LLM a distribution over the end-of-turn token, and we could then use that distribution along with a decision boundary for end-of-turn classification.
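As a rough sketch of that classification step (using Hugging Face Transformers; the model path and the 0.5 decision boundary below are placeholders, not our production values):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path to an <eot>-finetuned text model.
tokenizer = AutoTokenizer.from_pretrained("path/to/eot-finetuned-llm")
model = AutoModelForCausalLM.from_pretrained("path/to/eot-finetuned-llm")
eot_id = tokenizer.convert_tokens_to_ids("<eot>")

def end_of_turn_prob(conversation: str) -> float:
    """Probability that the next token after the user's last words is <eot>."""
    inputs = tokenizer(conversation, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits           # (1, seq_len, vocab_size)
    return logits[0, -1].softmax(dim=-1)[eot_id].item()

# The decision boundary turns the learned distribution into a classifier.
done = end_of_turn_prob("assistant: What's your number?\nuser: My area code is 408.") > 0.5
```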
With this approach we could apply self-supervised finetuning¹ on an unbounded corpus of conversational data, instead of supervised learning on carefully curated true/false pairs.
We also found that this approach extends well to new languages. In the binary classification approach, translating the sample could actually change the boundary, so we would have to rerun the end-of-turn classification on new language datasets.
In addition, foundation models have poor end-of-turn classification abilities for some languages (such as Spanish), which meant our v1 approach couldn't support those languages. With the v2 approach, we can train on any conversational corpus without relying on an external classification model to distill from. Translating the training corpus preserves the endpointing distributions across languages more faithfully than reclassifying with a foundation model.
For each additional language we wanted to support, we simply applied a translation transformation to our existing corpus and trained on it. Now, our endpointing model supports 26 languages, and can be easily extended to support more in the future.
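Here's a sketch of that translation transformation; the `translate` helper is hypothetical and stands in for whatever machine-translation system is used. The key point is that the <eot> markers are carried through translation rather than re-labeled by a classifier:

```python
def translate_turn(turn: str, target_lang: str) -> str:
    """Translate a training turn while preserving every <eot> position."""
    segments = turn.split("<eot>")
    translated = [
        translate(seg, target_lang) if seg.strip() else seg  # hypothetical MT helper
        for seg in segments
    ]
    return "<eot>".join(translated)
```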
Adding the Ultravox projector to make our model audio-native
After text-based finetuning, we integrated the Ultravox audio projector—imbuing our endpointing model with audio-native abilities.
Our training approach:
Initialize the projector using a pretrained Ultravox projector, already optimized for noisy, real-world speech.
Finetune on synthetic dialogue data, aligning the audio embeddings with our end-of-turn transformer.
Because the projector is pre-trained for robustness (handling background noise, varying mic quality, and overlapping speech), the endpointing model inherits these strengths, and delivers reliable turn-taking even in the same chaotic conditions that Ultravox is accustomed to.
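Conceptually, the assembled model looks something like the following simplified PyTorch sketch; the class and argument names are ours for illustration and don't mirror the actual Ultravox code:

```python
import torch
import torch.nn as nn

class AudioNativeEndpointer(nn.Module):
    """Audio encoder -> pretrained Ultravox projector -> <eot>-finetuned LLM."""

    def __init__(self, audio_encoder, projector, llm, eot_id):
        super().__init__()
        self.audio_encoder = audio_encoder  # turns waveforms into audio frames
        self.projector = projector          # maps frames into the LLM embedding space
        self.llm = llm                      # text model finetuned with <eot>
        self.eot_id = eot_id

    def forward(self, history_embeds, user_audio):
        # Text embeddings carry the conversation history; the last user turn
        # arrives as raw audio and is projected into the same space.
        audio_embeds = self.projector(self.audio_encoder(user_audio))
        inputs = torch.cat([history_embeds, audio_embeds], dim=1)
        logits = self.llm(inputs_embeds=inputs).logits
        return logits[:, -1].softmax(dim=-1)[:, self.eot_id]  # P(<eot>)
```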
Deploying UltraVAD in production
Now that we have the endpointing model, how do we actually use it in a realtime voice service? A naive approach would be to run the endpointing model after every audio packet that streams in. Not only is this expensive (we'd be running inference every 30 milliseconds), but we'd also take an additional latency hit if we only ran the LLM+TTS after getting a response from the endpointing model.
Instead, we use VAD as the first line of defense in our Ultravox Realtime platform. VAD models (e.g. Silero VAD) look at audio stream packets and give each chunk a “likelihood of human speech” score. After a specified threshold of “no human speech” packets is reached, we speculatively run a forward pass on UltraVAD with the last user audio turn, along with the textual conversation history.
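Here's a simplified sketch of that gating loop; the threshold and frame count are illustrative rather than the values we run in production:

```python
def gate_with_vad(audio_chunks, vad_score, run_ultravad,
                  speech_threshold=0.5, silence_frames_needed=8):
    """Consume ~30 ms audio chunks and speculatively trigger UltraVAD after
    enough consecutive low-speech chunks."""
    silence = 0
    for chunk in audio_chunks:
        if vad_score(chunk) < speech_threshold:   # e.g. a Silero VAD score
            silence += 1
            if silence == silence_frames_needed:
                run_ultravad()                    # last user audio + text history
        else:
            silence = 0                           # speech resumed; reset the counter
```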
In order to keep latency low, we also speculatively run our Ultravox model at the same time. Generally, we receive a reply from UltraVAD before the first packets from our TTS model have come back (latency here is around 500ms). Therefore, as long as UltraVAD's forward pass finishes before the first TTS audio packet arrives, we don't "pay" for that latency. Given this latency headroom, we chose to build a larger, more powerful model rather than a smaller one that sacrifices accuracy for speed.
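The speculative overlap looks roughly like this asyncio sketch; the coroutine names are placeholders, not the Ultravox Realtime API:

```python
import asyncio

async def on_probable_endpoint(run_ultravad, run_llm_and_tts, play_audio):
    """Start the LLM+TTS speculatively while UltraVAD confirms the endpoint."""
    eot_task = asyncio.create_task(run_ultravad())       # usually returns first
    reply_task = asyncio.create_task(run_llm_and_tts())  # first audio ~500 ms away
    if await eot_task:
        # Endpoint confirmed before TTS audio arrived: no extra latency paid.
        await play_audio(await reply_task)
    else:
        # The user wasn't done speaking; discard the speculative reply.
        reply_task.cancel()
```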

What's Next?
We've got more improvements for UltraVAD in the works, including better handling of paralinguistic cues in conversation as well as support for group conversations.
In the meantime, model weights for the current version of UltraVAD are open source and available on Hugging Face.
If you'd rather skip straight to the fun part and try it out yourself, you can also check out a live demo of the Ultravox assistant, or create a free Ultravox account and follow our Quickstart guide to start chatting with your own custom voice AI agent!
—
¹ We apply a cross-entropy loss on all tokens, as opposed to just the user response.
² Threshold: 0.1 for UltraVAD and 0.5 for SmartTurnV2