Sep 9, 2025

UltraVAD is now open source! Introducing the first context-aware, audio-native endpointing model

UltraVAD, the smart endpointing model we run in production for Ultravox Realtime, is now open source!

UltraVAD is audio native, meaning it doesn’t rely on automated speech recognition (ASR), and it leverages conversational context to make highly accurate predictions about when a user is done speaking. In our analysis, it is the most accurate VAD model in situations where conversational context is required for an accurate prediction.

For reference, we’ve shared our model weights on Hugging Face.

The turn-taking problem in Voice AI

At its core, turn-taking is a question of knowing when to speak (or when not to speak) in a conversation, and it's one of the harder problems in Voice AI. Most humans do this effortlessly, leaning on a blend of cultural and conversational context, linguistic signals, and paralinguistic cues like pauses, intonation, and pitch. But what feels instinctive to people becomes three engineering problems¹:

  1. Endpointing - detecting whether a user has finished speaking

  2. Interruption handling - detecting whether a user is attempting to interrupt the speaker or backchannel (verbal cues like “uh-huh” or “yeah” that indicate a listener’s engagement)

  3. Multi-party turn-taking - handling multiple speaker channels

In this blog, we’ll mainly focus on the first of these: endpoint detection.

In the past year most voice AI systems have graduated from simple silence-duration heuristics to neural endpointing models.

At Ultravox.ai, we frame endpointing as a next-token prediction task. By leveraging our existing Ultravox audio projector, we fuse audio, dialog context, and LLM semantics to estimate end-of-turn probability in real time—reliably deciding when a user is done speaking across 26 languages.
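To make that framing concrete, here is a minimal sketch of the decoding step (illustrative only, not our production code): the fused audio-and-dialog sequence is scored like any other token sequence, and the end-of-turn probability is simply the probability mass the model places on an end-of-turn marker token. The token name and the threshold below are assumptions for illustration.

```python
import torch

# Illustrative sketch: endpointing as next-token prediction. The end-of-turn
# token id and the 0.5 threshold are placeholders, not UltraVAD's actual values.
def end_of_turn_probability(next_token_logits: torch.Tensor, eot_token_id: int) -> float:
    """Return the probability that the next token is the end-of-turn marker,
    given the LLM's logits after consuming the fused audio + dialog context."""
    probs = torch.softmax(next_token_logits, dim=-1)
    return probs[eot_token_id].item()

# if end_of_turn_probability(logits, EOT_TOKEN_ID) > 0.5:
#     respond()  # the user is likely done speaking
```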

Existing approaches to endpointing

The simplest approach streams audio packets through a Voice Activity Detector (VAD) (e.g. Silero VAD) which assigns a “likelihood of human speech” score. After a threshold of “no human speech” packets is reached, the orchestration layer deems the speaker’s turn over and triggers the LLM to respond.
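A rough sketch of that baseline loop is below. `speech_probability` stands in for whatever VAD you use (for example, Silero VAD scores each chunk for human speech); the chunk size and thresholds are illustrative, not recommendations.

```python
# Minimal sketch of silence-duration endpointing (illustrative thresholds).
SPEECH_THRESHOLD = 0.5       # VAD score above this counts as human speech
SILENCE_CHUNKS_NEEDED = 25   # e.g. ~800 ms of 32 ms chunks

def detect_endpoint(audio_chunks, speech_probability):
    """speech_probability(chunk) -> float is a placeholder for a VAD such as
    Silero VAD that assigns a likelihood-of-human-speech score per chunk."""
    silent_chunks = 0
    for chunk in audio_chunks:
        if speech_probability(chunk) < SPEECH_THRESHOLD:
            silent_chunks += 1
            if silent_chunks >= SILENCE_CHUNKS_NEEDED:
                return True   # declare end of turn; orchestrator triggers the LLM
        else:
            silent_chunks = 0  # speech resumed; reset the silence counter
    return False
```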

The problem with this approach is that duration of silence is the only signal, which leads to a higher rate of premature cutoffs. Real conversations rely on richer semantic and contextual cues.

Text-native neural endpointing systems operate on transcripts, waiting for ASR output to decide whether the user has finished speaking. Because they rely on transcripts, these systems sacrifice paralinguistic awareness: cues like changes in pitch, intonation, or rhythm. Minor mis-transcriptions (think of homonyms like “break” and “brake”) can completely change the meaning of a sentence and lead to inaccurate endpoint classification. And there’s an inherent bottleneck introduced by ASR transcription latency, as well as the overall reliability of your ASR vendor chain. The result can feel unnatural at scale: long pauses, awkward interruptions, broken phrases.

Some systems (e.g., Krisp, Pipecat) use audio-native neural endpointing, skipping ASR and processing raw audio directly. This audio-native approach retains the kind of paralinguistic cues that text-native neural endpointing misses. However, these models only consider the most recent user turn, meaning they tend to struggle when contextual meaning determines whether a speaker’s turn is complete:

Prior dialog                | Response                   | Type
What's your area code?      | 408 (end of turn)          | Complete
What's your phone number?   | 408… (turn not completed)  | Likely continuing

In the above example, the prior dialog provides essential context; without it, even a human listener would struggle to know for sure whether the speaker is likely to continue.

Multimodal models combine transcripts and audio to leverage both semantic context and audio signals. Audio-native processing allows for direct audio input without a transcription step, while preserving the dialog history allows for better context awareness. After evaluating other options, we chose to move forward with this model type for UltraVAD.
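To illustrate the difference, here is a rough sketch of the kind of input a multimodal endpointing model sees (the field names are assumptions for illustration, not UltraVAD’s actual schema): prior turns travel as text, while the turn being judged stays as raw audio.

```python
# Illustrative input shape for context-aware, audio-native endpointing.
# Field names are placeholders; only the structure matters: text history
# plus the raw audio of the turn being judged (no ASR step).
endpointing_input = {
    "history": [
        {"role": "agent", "text": "What's your phone number?"},
    ],
    "current_turn_audio": "user_turn.wav",  # "408..." spoken, trailing off
}
# An audio-only endpointing model sees just "current_turn_audio"; a multimodal
# model also sees "history", which is what disambiguates the 408 example above.
```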

Evaluating performance

At the time of this writing, there are not many ways to evaluate neural endpointing models. Pipecat’s Smart-Turn V2 offers helpful single-turn datasets, but we saw a gap for multi-turn, context-dependent evaluation. So we’re releasing our own benchmark focused on contextual turn-taking.

For our benchmark, we synthetically generate context-dependent samples and ask a foundation model to label whether the last user turn is complete. We then hand-check these labels to make sure they are correct.

On a held-out set of 400 context-dependent samples, we compared UltraVAD to Smart-Turn V2 (audio-native), which we chose because it is the only other open source audio-native endpointing model. The default recommended thresholds³ were used for both models; here are the results.


Metric     | UltraVAD | SmartTurn V2
Accuracy   | 77.5%    | 63.0%
Precision  | 69.6%    | 59.8%
Recall     | 97.5%    | 79.0%
F1-Score   | 81.3%    | 68.1%
AUC        | 89.6%    | 70.0%

For consistency, we also evaluated UltraVAD on Smart-Turn V2’s single-turn datasets:


Dataset                 | UltraVAD | SmartTurn V2
orpheus-aggregate-test  | 93.7%    | 94.3%

Comparing their reported score to our aggregate eval score on their test set, the results are within one percentage point, while UltraVAD shows nearly a 20-point improvement in area under the ROC curve (AUC) on the context-dependent samples.
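For readers who want to run this style of evaluation themselves, here is a minimal sketch of the metric computation using scikit-learn; the threshold argument stands in for each model’s recommended default.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def score_endpointing(y_true, eot_probs, threshold=0.5):
    """y_true: 1 = turn complete, 0 = continuing; eot_probs: the model's
    end-of-turn probability per sample; threshold: the model's default."""
    y_pred = [int(p >= threshold) for p in eot_probs]
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
        "auc":       roc_auc_score(y_true, eot_probs),  # threshold-free
    }
```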

Open Sourcing UltraVAD

After building and testing several iterations of a multimodal model, we landed on a version that we felt confident including in our Ultravox Realtime platform. (For more details on that development process, see Context-aware, audio-native endpointing: How we built UltraVAD). If you’re using the Ultravox.ai realtime stack, this model is already turned on by default. 

We are also pleased to announce that an open source version of UltraVAD’s weights is now available on Hugging Face.
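As a rough sketch of getting started (the repository id below is a placeholder; check the model card on Hugging Face for the exact name and input format):

```python
# Sketch only: the repo id is a placeholder, and the exact input/output format
# is described on the UltraVAD model card, not here.
from transformers import pipeline

ultravad = pipeline(model="fixie-ai/ultraVAD", trust_remote_code=True)  # placeholder repo id
# Pass the dialog history plus the latest user audio as described in the model
# card, and read back the end-of-turn probability for your endpointing logic.
```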

Although the current model is audio native, we still need to improve its ability to use paralinguistic cues in conversation. Work is currently underway to train our projector to recognize intonation, pitch, and pauses, which will improve performance on semantically ambiguous cases. With that in mind, we will be continuously pushing updates to make this model better. 

We also plan to improve our turn-taking model to handle interruptions, backchannels, and group conversations, so stay tuned!

¹ We chose to target these three dynamics because they are the most tractable and impactful turn-taking problems right now, but there are other, more nuanced dynamics not covered here, e.g. exponential backoff when two parties start talking at the same time.