Ultravox: An open-weight alternative to GPT-4o Realtime
Oct 30, 2024
Today we’re introducing Ultravox, a family of models built on top of open-source foundation models and trained specifically to enable real-time conversations with AI. Unlike most voice AI systems, Ultravox does not rely on a separate automatic speech recognition (ASR) stage; instead, the model consumes speech directly in the form of embeddings. This is the first step toward truly natural, fluent conversations with AI. We encourage you to try our demo.
We’ve trained versions of Ultravox on Llama 3.1 8B & 70B, Mistral Nemo, and Gemma 2 27B. Model training code is available on GitHub, and the weights are on Hugging Face. To train your own version of Ultravox on another base model or different datasets, see this section of the README.
Ultravox shows speech understanding capabilities that are on par with proprietary solutions like OpenAI’s GPT-4o and markedly better than other open-source options. Our primary method of evaluation is zero-shot speech translation, measured by BLEU, as a proxy for general instruction-following capability (higher numbers are better):
How Ultravox Works
The majority of voice AI implementations today rely on pipelining together speech recognition (ASR), model inference, and text-to-speech (TTS). While this approach works for basic use cases, it fails to scale to more complex scenarios (background speakers, noisy environments, group conversations, etc.) or to dialogue that is less formulaic (i.e., conversations without very clear turn-taking).
The reason for this is relatively simple: pipeline systems suffer from high information loss.
Communication is not just about the words that we say, but how we say them and in what context. This includes emotion, intonation, tenor, and other paralinguistic signals. When speech goes through an ASR process, those additional signals are stripped away.
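To make the information loss concrete, here is a minimal sketch of a cascaded pipeline. The function names are placeholders standing in for the ASR, LLM, and TTS components, not a real API; the point is that everything downstream of the ASR step only ever sees a plain text string.

```python
# Placeholder cascaded pipeline: ASR -> LLM -> TTS.
# These stubs are illustrative only, not a real API.

def transcribe(audio: bytes) -> str:
    """ASR: collapses the waveform into text, discarding prosody, emotion, etc."""
    return "turn the lights off"            # placeholder transcription

def generate_reply(prompt: str) -> str:
    """LLM: reasons over text alone; it never sees how the words were spoken."""
    return "Okay, turning the lights off."

def synthesize(text: str) -> bytes:
    """TTS: renders the reply with a fixed voice, unaware of the user's tone."""
    return text.encode()                    # placeholder audio

def cascaded_turn(audio: bytes) -> bytes:
    text = transcribe(audio)                # paralinguistic signal is lost here
    reply = generate_reply(text)
    return synthesize(reply)
```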
Ultravox is designed to address these shortcomings by teaching the model to respond to speech directly. For the current version of Ultravox, this is done by training an adapter that projects into the same embedding space as the underlying LLM.
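As a rough illustration of the idea (not the actual Ultravox adapter architecture or training code; the layer sizes and names here are assumptions), such an adapter can be as simple as a small projection network that maps speech-encoder features into the LLM’s token-embedding space:

```python
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Projects audio-encoder features into the LLM's token-embedding space.

    A simplified sketch of the approach described above; see the Ultravox
    repo for the real architecture and training details.
    """

    def __init__(self, audio_dim: int, llm_dim: int, hidden_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # audio_features: (batch, num_frames, audio_dim) from a speech encoder
        # returns:        (batch, num_frames, llm_dim) "pseudo-token" embeddings
        return self.proj(audio_features)

# Illustrative shapes only: the projected frames are spliced into the text
# embedding sequence so the LLM attends to speech as it would to text tokens.
adapter = SpeechAdapter(audio_dim=1280, llm_dim=4096)
audio_features = torch.randn(1, 375, 1280)     # fake speech-encoder output
speech_embeds = adapter(audio_features)        # shape: (1, 375, 4096)
```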
For now, Ultravox continues to output text, but future versions of the model will emit speech tokens directly. We’ve chosen to focus on the speech understanding problem first, as we think it’s the biggest barrier to natural-feeling interactions.
To be clear, we are not at the end goal yet. Even though Ultravox’s speech understanding is on par with existing systems, there is still a lot of work to be done. We outline a roadmap below that shares our intended path towards fluency.
Ultravox Realtime
In addition to the core model work, we’re also opening access to Ultravox Realtime, a set of managed APIs that make building on top of Ultravox fast and easy. Compared with OpenAI Realtime, Ultravox delivers equivalent latency.
Ultravox Realtime has built-in support for voices, tool calling, telephony, and many of the other critical pieces necessary for building powerful voice agents. SDKs are available for most major platforms, and you can get started by signing up today. We’re including 30 minutes of free time to get started, after which it’s only $0.05/min. This makes it considerably cheaper than alternatives.
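For a rough sense of what a realtime integration looks like at the transport level, here is a hypothetical sketch using the standard Python websockets package. The URL, message framing, and flow below are placeholders rather than the actual Ultravox Realtime protocol; the SDKs and docs describe the real interface.

```python
# Hypothetical sketch: stream audio chunks to a realtime voice endpoint over
# a WebSocket. The URL and framing are placeholders, not the Ultravox API.
import asyncio
import websockets

async def stream_audio(join_url: str, chunks: list[bytes]) -> bytes:
    async with websockets.connect(join_url) as ws:
        for chunk in chunks:          # chunk: raw audio bytes from the mic
            await ws.send(chunk)      # send user audio upstream...
        return await ws.recv()        # ...and receive agent audio/events back

# asyncio.run(stream_audio("wss://example.invalid/call/abc123", pcm_chunks))
```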
Ultravox Realtime builds on top of all the work that we’ve already made open source through the Ultravox repo itself and the vLLM project.
Roadmap to Fluency
Our belief is simple: We think useful, productive, and accessible AGI will require models that can operate in the fast-paced, ambiguous world of natural human communication. Whether it’s support agents on the phone, AI employees joining critical meetings, or humanoid robots in the home – AI will never reach its full potential until it can cross the chasm into the “real” world.
The current generation of voice AIs, including proprietary offerings like GPT-4o Realtime, all fall short of this goal. They’re easily overwhelmed by multiple speakers. They fail to properly understand context, emotions, tenor, or other signals that drive the conversation. They still can’t reliably tell when a user is done speaking (voice activity detection, or VAD, remains the primary mechanism for turn detection).
This isn’t to discount progress. Current technology far, far exceeds what was possible with legacy systems like Siri, Alexa, and Google Assistant. But we still have a lot of work to do, and we wanted to share our Roadmap to Fluency, which outlines the major steps that we’ll be taking to achieve truly natural communication with AI.
Level 0: Simplex
This level describes scenarios where the AI is in "output only" mode (e.g., generating text or speech) or "input only" mode (e.g., transcribing speech). This is akin to how legacy systems like Siri and Alexa work.
Level 1: One-on-one half-duplex
AI alternates between listening and speaking. This is similar to turn-based conversation, where each participant completes their thought before the other responds. In AI terms, this means the system waits for the user to finish their input before processing it and generating a response. Once the AI begins responding, it completes its entire response before accepting new input. This is the mode in which most current chatbots and conversational AI systems operate. The model has no awareness of emotion, tenor, or other paralinguistic signals beyond what is conveyed semantically. Conversations are limited to one-on-one exchanges in simple environments, free of background speakers or excessive noise.
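A toy sketch of this strict turn-taking loop (the listen/respond/speak callables are placeholders, not any particular API):

```python
# Level 1, half-duplex: strictly alternating turns, no listening while speaking.

def half_duplex_session(listen, respond, speak) -> None:
    while True:
        user_turn = listen()          # block until the user finishes speaking
        if user_turn is None:         # e.g., the user hung up
            break
        reply = respond(user_turn)    # the model runs only after the turn ends
        speak(reply)                  # no new input is accepted until done

# Example with trivial stand-ins:
turns = iter(["hello", None])
half_duplex_session(lambda: next(turns), lambda t: f"You said: {t}", print)
```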
Level 2: One-on-one full-duplex
The AI processes user input (speech or text) while simultaneously generating and delivering its response. This mode enables more natural, flowing conversations with immediate reactions and adjustments based on user input. The AI could potentially modify its response in real-time, just as humans adjust their speech during conversation. The AI has only basic or crude awareness of emotion, tenor, etc. Conversations are still limited to one-on-one in simple environments, free of background speakers or excessive noise.
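A toy sketch of the difference, with listening and speaking running as concurrent tasks (purely conceptual; the function names and timings are placeholders):

```python
# Level 2, full-duplex: input keeps arriving even while the AI is responding,
# so the response can be revised or interrupted mid-stream.
import asyncio

async def capture_audio() -> bytes:
    """Placeholder microphone read: yields a frame every 20 ms."""
    await asyncio.sleep(0.02)
    return b"\x00" * 320

async def listener(inbound: asyncio.Queue) -> None:
    while True:
        # User audio is accepted even while the speaker task is mid-response,
        # which is what makes barge-in and mid-sentence adjustment possible.
        await inbound.put(await capture_audio())

async def speaker(inbound: asyncio.Queue) -> None:
    while True:
        frame = await inbound.get()
        # A real system would generate, revise, or cut short its response here
        # based on what keeps arriving on `inbound`.
        _ = frame

async def full_duplex_session(duration_s: float = 1.0) -> None:
    inbound: asyncio.Queue = asyncio.Queue()
    tasks = [asyncio.create_task(listener(inbound)),
             asyncio.create_task(speaker(inbound))]
    await asyncio.sleep(duration_s)   # both directions run at the same time
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)

# asyncio.run(full_duplex_session())
```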
Level 3: Group conversations in full-duplex
Similar to Level 2, except the model now has the ability to understand more complex environments. It can handle multiple speakers over the same audio channel and is able to effectively distinguish between them. Conversations feel natural and fluent, though perhaps still slightly awkward in moments where the model has failed to pick up on important social or contextual cues.
Level 4: Human-level understanding
Conversations with the AI are largely indistinguishable from conversations with other humans. They are fast, fluent, and emotionally aware. These AIs are able to effectively participate in everything from 1:1 conversations to group Zoom calls.
The Importance of Open
We’re big believers in the power of open source and open models. This is why we’re committed to always keeping our models open for others to leverage and build on top of. A world with truly open AI is a better world for all of us.
Join Us
If you’re excited by what we’re trying to build, come join us! We’re actively hiring for engineers and research scientists. More here.