[Diagram: Text (System Prompt) + Audio Input. Text path: Text Tokenizer, then Text Embedder. Audio path: Audio Encoder, then Audio Projector. Both paths feed the Embedding Merge, followed by Llama 3.1 and the Output.]

Model Architecture

We've extended Meta's Llama 3 model with a multimodal projector that converts audio directly into the high-dimensional embedding space used by Llama 3. Because audio never has to be transcribed to text first, this direct coupling allows Ultravox to respond much more quickly than systems that combine separate ASR and LLM components. In the future, this will also allow Ultravox to natively understand the paralinguistic cues of timing and emotion that are omnipresent in human speech.
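
The embedding merge described above can be illustrated with a minimal sketch. This is not Ultravox's actual code: the dimensions, the `AUDIO_TOKEN` placeholder id, the single linear projector, and the function names are all assumptions made for illustration, and random numbers stand in for real model weights and audio features.

```python
import random

# All sizes and ids below are hypothetical, chosen only to keep the sketch small.
HIDDEN = 8        # width of the LLM embedding space
AUDIO_DIM = 4     # width of one audio-encoder output frame
VOCAB = 16        # toy vocabulary size
AUDIO_TOKEN = -1  # placeholder id marking where audio is spliced in

rng = random.Random(0)


def rand_matrix(rows, cols):
    """Random values standing in for learned weights or encoder features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


text_embedding = rand_matrix(VOCAB, HIDDEN)  # stand-in for the text embedder
projector = rand_matrix(AUDIO_DIM, HIDDEN)   # stand-in for the audio projector


def project_audio(frames):
    """Map each audio frame (AUDIO_DIM values) into the HIDDEN-wide space."""
    return [
        [sum(frame[i] * projector[i][j] for i in range(AUDIO_DIM))
         for j in range(HIDDEN)]
        for frame in frames
    ]


def merge_embeddings(token_ids, audio_frames):
    """Replace the audio placeholder with projected audio embeddings."""
    projected = project_audio(audio_frames)
    merged = []
    for tok in token_ids:
        if tok == AUDIO_TOKEN:
            merged.extend(projected)            # audio enters as embeddings
        else:
            merged.append(text_embedding[tok])  # ordinary text lookup
    return merged


token_ids = [3, 7, AUDIO_TOKEN, 5]        # prompt tokens around the audio slot
audio_frames = rand_matrix(6, AUDIO_DIM)  # six fake encoder frames
sequence = merge_embeddings(token_ids, audio_frames)
print(len(sequence), len(sequence[0]))    # 9 8
```

The merged sequence has three text positions plus six audio positions, all of the same width, so a language model could consume it as one input without an intermediate transcript.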

One Size Doesn't Fit All

Llama

ULTRAVOX_Llama_8b

ULTRAVOX_Llama_70b

Mistral

ULTRAVOX_Mistral_8b

ULTRAVOX_Mistral_70b

More Soon!

We're working on even more sizes and extending model support. Check back soon for updates.

COMING SOON

FAQ

I plugged the thingy into the whatsit and it still won't work. What gives?

Will this end society?

Do I need to code?
