Aug 29, 2025

UltraVAD: The first context aware, audio native endpointing model

Today we're open-sourcing UltraVAD, our version of a smart endpointing model that we use for running production traffic on Ultravox Realtime.

UltraVAD is both audio-native (i.e., it does not rely on ASR) and leverages conversational context to make highly accurate predictions about when a user is done speaking. In our analysis, it's the most accurate VAD model for situations that require conversational context to make an accurate prediction.

Model weights: https://huggingface.co/fixie-ai/ultraVAD

Introduction: The Turn-taking Problem in Voice AI

Turn-taking is the subtle dance of knowing when to speak, and it's one of the harder problems in Voice AI. Humans do it effortlessly, leaning on a blend of conversational context, linguistic signals, and paralinguistic cues like pauses, intonation, and pitch. What feels instinctive to people becomes three engineering problems:

  1. Endpointing - detecting whether a user has finished speaking

  2. Interruption handling - detecting whether a user is attempting to interrupt or backchanneling (e.g. “uh-huh”, “yeah”)

  3. Multi-party turn-taking - handling multiple speaker channels

In this blog, we’ll mainly focus on the first of these: endpoint detection.

In the past year, most voice AI systems have graduated from simple silence-duration heuristics to neural endpointing models (more on the differences in the following section).

At Ultravox.ai, we frame endpointing as a next-token prediction task. By leveraging our existing Ultravox audio projector, we fuse dialog context, LLM semantics, and paralinguistic cues to estimate end-of-turn probability in real time — reliably deciding when a user is done speaking across 26 languages.

Below, we share the design choices, training lessons, and practical insights that got us there.

Existing approaches to endpointing

A. Voice Activity Detection (VAD)

The simplest approach streams audio packets through a VAD (e.g., Silero VAD), which gives each packet a “likelihood of human speech” score. After a threshold number of consecutive “no human speech” packets, the orchestration layer deems the turn over and triggers the LLM to respond.

The problem with this approach is that the duration of silence is the only signal. Real conversations rely on richer semantic and paralinguistic cues, so using silence alone leads to many premature cutoffs.
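The silence-duration heuristic above can be sketched in a few lines. This is a minimal illustration, not Silero VAD's actual API: we assume some VAD has already scored each audio chunk with a speech probability, and the thresholds are hypothetical.

```python
# Minimal sketch of silence-duration endpointing. Assumes an upstream VAD
# has scored each audio chunk; thresholds here are illustrative only.
SPEECH_THRESHOLD = 0.5      # below this, a chunk counts as silence
SILENCE_CHUNKS_TO_END = 10  # e.g. 10 x 30 ms chunks = 300 ms of silence

def detect_end_of_turn(speech_scores):
    """Return the index of the chunk where the turn is deemed over, or None."""
    silent_run = 0
    for i, score in enumerate(speech_scores):
        if score < SPEECH_THRESHOLD:
            silent_run += 1
            if silent_run >= SILENCE_CHUNKS_TO_END:
                return i
        else:
            silent_run = 0  # any speech resets the silence counter
    return None
```

Note that nothing in this loop knows *what* was said; a mid-sentence pause of 300 ms triggers exactly the same cutoff as a finished sentence.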

B. Text-native neural endpointing

These systems operate only on transcripts, waiting for ASR output before deciding if the user is done.

Problems:

  • No paralinguistic awareness. They miss cues like falling pitch or dragged-out prosody.

  • ASR fragility. A mis-transcription can completely change the meaning of the sentence and lead to inaccurate endpoint classification.

  • Inherent latency. You’re bottlenecked on ASR transcription latency and the reliability of your ASR vendor chain.

The result can feel unnatural at scale: long pauses, awkward interruptions, broken phrases. Read more here on why we think component systems are not the future of voice AI.

C. Audio-native neural endpointing

Some systems (e.g., Krisp, Pipecat) skip ASR and process raw audio directly, capturing paralinguistic cues. However, these models only consider the last user turn. Without semantic context, they struggle when contextual meaning determines whether a turn is complete:

  • Complete: “What’s your area code?” → “408.” (end of turn)

  • Likely continuing: “What’s your phone number?” → “408…” (not done)

Even humans labeling these without the prior dialog can’t know for sure; context is essential.

D. Text-Audio fusion

Multimodal models combine transcripts and audio. This is what UltraVAD is, and we chose this path because it can leverage both semantic context and paralinguistic signals.

How we built UltraVAD

We designed around two principles:

  • Audio-native (to capture paralinguistic cues and not be ASR dependent)

  • Context-aware (to use dialog history)

2.1 Lessons from V1: a supervised binary classifier

Our first version used a supervised learning approach: we fine-tuned an LLM to work as a binary classifier on conversational samples with binary labels. Each training sample consists of a previous turn and the last user turn, along with an end-of-turn true/false label. We used a foundation model like GPT-4o to label these samples.

*Previous turn:* “How’s the weather looking?”
*Current turn:* “It was warm today.” → **True**

*Previous turn:* “How’s the weather looking?”
*Current turn:* “It was, um—” → **False**

Where v1 struggled

  • Context dependent. Numbers, lists, and partial utterances look like stops in isolation: “My number is 408…” vs. “My area code is 408.” It’s hard to gather context-dependent samples without missing edge cases.

  • Dependence on an outside classifier. Labeling depends on the foundation model, and we noticed that foundation models don’t have good endpoint distributions for languages other than English.

  • Ambiguous boundaries. Binary labels in the training data force boundaries on turns that are ambiguous.

We ultimately realized that endpointing is difficult to capture with binary-labeled samples; it’s better modeled as a probability distribution learned over a conversational corpus.

2.2 UltraVAD: unsupervised next-token prediction task

TL;DR: We teach an LLM an end-of-turn (EOT) token distribution via unsupervised training, then make it audio-native with an Ultravox projector. This lets the model use both textual context and audio, outperforming text-only and audio-only endpointing while scaling easily to new languages.

Step 1: LLM training stage

We take an existing Llama-8b model and fine-tune it on multi-turn conversational data. This data is synthetically generated, with additional end-of-turn <eot> tokens placed after valid stopping points in the conversation. During this stage, the LLM learns a distribution over the end-of-turn token, and we use that distribution along with a decision boundary for end-of-turn classification.

<assistant> "How's the weather looking?" <user> "It was, um- warm today" <eot>

With this approach we could use unsupervised finetuning on an unbounded corpus of conversational data, instead of supervised learning on carefully curated true/false pairs.
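At inference time, the decision reduces to reading the probability the model assigns to <eot> as the next token and comparing it to a boundary. A minimal sketch, with illustrative logits and a hypothetical threshold (the real model's vocabulary, token id, and boundary differ):

```python
import math

# Hypothetical decision boundary; the production threshold is tuned.
EOT_DECISION_BOUNDARY = 0.35

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def is_end_of_turn(next_token_logits, eot_token_id):
    """True if P(<eot> | conversation so far) clears the decision boundary."""
    p_eot = softmax(next_token_logits)[eot_token_id]
    return p_eot >= EOT_DECISION_BOUNDARY
```

Because the boundary operates on a learned probability rather than a hard label, it can be tuned per deployment to trade premature cutoffs against sluggish responses.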

We also found that this approach extends well to new languages. In the binary classification approach, translating a sample could actually change the boundary, so we would have had to rerun end-of-turn classification on each new language dataset. In addition, for some languages (such as Spanish), foundation models have poor end-of-turn classification abilities, so our v1 approach couldn’t support them. With the new approach, we can train on any conversational corpus, without relying on an external classification model to distill from.

For each additional language we wanted to support, we simply applied a translation transformation to our existing corpus and trained on it. Now, our endpointing model supports 26 languages, and can be easily extended to support more in the future.

Step 2: Adding the Ultravox projector to make our model audio native

After text-based finetuning, we integrate the Ultravox audio projector—imbuing our endpointing model with audio-native abilities.

Our training approach:

  1. Initialize the projector using a pretrained Ultravox projector, already optimized for noisy, real-world speech.

  2. Finetune on synthetic dialogue data, aligning the audio embeddings with our end-of-turn transformer.

Because the projector is pre-trained for robustness (handling background noise, varying mic quality, and overlapping speech), the endpointing model inherits these strengths, and delivers reliable turn-taking even in the same chaotic conditions that Ultravox is accustomed to.
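Conceptually, the fused forward pass looks something like this. This is a shape-level sketch only: we stand in for the real Ultravox projector with a single linear map, and all names and dimensions are illustrative, not the actual architecture.

```python
import numpy as np

def fused_forward(text_embeds, audio_features, projector_weights):
    """Sketch of fusing dialog text with the last user turn's audio.

    text_embeds: (T, d) embeddings of the textual conversation history.
    audio_features: (A, f) speech-encoder features for the last user turn.
    projector_weights: (f, d) stand-in for the projector (the real one is
    a deeper network, not a single matrix).
    """
    # Project audio features into the LLM's embedding space...
    audio_embeds = audio_features @ projector_weights            # (A, d)
    # ...then append them to the text sequence the LLM conditions on.
    return np.concatenate([text_embeds, audio_embeds], axis=0)   # (T+A, d)
```

The end-of-turn transformer then predicts the next token (including <eot>) over this fused sequence, so both semantics and paralinguistics inform the decision.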

So far our model is audio-native, but it has yet to learn how to use paralinguistic cues. In the next phase of training, we are using single-turn samples to further train our projector to recognize intonation, pitch, and pauses. We believe this will improve performance on semantically ambiguous cases. This is an ongoing effort and an important update that we are excited to release in the coming months.

Evaluating performance

At the time of this writing, there are not many ways to evaluate neural endpointing models. Pipecat’s Smart-Turn V2 offers helpful single-turn datasets, but we saw a gap for multi-turn, context-dependent evaluation. So we’re releasing our own benchmark focused on contextual turn-taking.

On a held-out set of 400 context-dependent samples, we compare UltraVAD to Smart-Turn V2 (audio-native). We chose Smart-Turn V2 because it is the only other open-source, audio-native endpointing model. Both models use their default recommended thresholds; the results are below.


| Metric    | UltraVAD | Smart-Turn V2 |
| --------- | -------- | ------------- |
| Accuracy  | 77.5%    | 63.0%         |
| Precision | 69.6%    | 59.8%         |
| Recall    | 97.5%    | 79.0%         |
| F1-Score  | 81.3%    | 68.1%         |
| AUC       | 89.6%    | 70.0%         |
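For reference, the threshold-dependent metrics above reduce to standard confusion-matrix arithmetic over binary end-of-turn labels. A small self-contained sketch (the sample data in the test is illustrative, not our eval set):

```python
def endpoint_metrics(y_true, y_pred):
    """Accuracy/precision/recall/F1 where 1 = end of turn, 0 = still talking."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}
```

For endpointing, recall on true end-of-turn samples matters most: a missed endpoint means the agent never responds, while a false positive is a (recoverable) interruption.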

For consistency, we also evaluate UltraVAD on Smart-Turn V2’s single-turn datasets:


UltraVAD

SmartTurnV2




opheus-aggregate-train

93.7%

N/A

opheus-aggregate-test

N/A

94.3%

Note: We didn’t have access to Pipecat’s eval dataset, so we evaluated on their train dataset (three Orpheus datasets) and averaged the scores. For Smart-Turn V2, we took their reported evaluation on a held-out test set of the same synthetic datasets.

Comparing their reported score to our aggregate score on their training set, the results are within one percentage point.

Deploying UltraVAD in production

Now that we have the endpointing model, how do we actually use it in a realtime voice service? A naive approach would be to run the endpointing model on every audio packet that streams in, but not only is this expensive (we’d be running inference every 30 milliseconds), we’d also take an additional latency hit if we only ran the LLM+TTS after getting a response from the endpointing model.

In our Ultravox realtime platform, we use VAD as the first line of defense. Recall that VAD looks at audio stream chunks and gives each chunk a “human speech score.” Once our VAD detects a few chunks of silence, we speculatively run a forward pass on UltraVAD with the last user audio turn, along with the textual conversation history.

At the same time, we speculatively run our Ultravox model. By the time we get a reply from UltraVAD, the first packets from our TTS model have not yet arrived (latency here is around 500 ms). As long as UltraVAD’s forward pass stays under the time-to-first-audio-packet, we don’t “pay” for that latency. Given this headroom, we chose to build a larger, more powerful model rather than a smaller one that sacrifices accuracy for speed.
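The speculative pipeline can be sketched with asyncio. All the callables here are placeholders standing in for the real services (the VAD trigger, UltraVAD forward pass, LLM+TTS, and audio playback), not our production code:

```python
import asyncio

async def on_silence_detected(audio_turn, history,
                              run_ultravad, run_llm_tts, play_audio):
    """Once VAD sees silence: run endpointing and the reply concurrently."""
    # Kick off UltraVAD on the last user turn plus textual history...
    eot_task = asyncio.create_task(run_ultravad(audio_turn, history))
    # ...and speculatively start generating the reply in parallel.
    reply_task = asyncio.create_task(run_llm_tts(history))
    if await eot_task:                   # UltraVAD confirms end of turn
        await play_audio(await reply_task)
        return True
    reply_task.cancel()                  # user wasn't done; drop the reply
    return False
```

Because the reply generation starts before the endpointing verdict arrives, a confirmed end of turn pays only for whichever of the two tasks finishes last.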

Open Sourcing UltraVAD

Today we are also open-sourcing UltraVAD’s weights on Hugging Face! We will be continuously pushing updates to make this model better. If you are using our realtime stack at Ultravox.ai, this model is already turned on by default.

We also plan on improving our turn-taking model in the future to handle interruptions, backchannels, and group conversations - so stay tuned!