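[Figure: architecture diagram. Text (system prompt) and audio input pass through a text tokenizer and text embedder on one path and an audio encoder and audio projector on the other; the two paths meet in an embedding merge that feeds Llama 3.1, which produces the output. Diagram labels: ~300 M params, OD: 768; ~20 M params, OD: 4096; OD: 4096.]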

Model Architecture

We've extended Meta's Llama 3 model with a multimodal projector that converts audio directly into the high-dimensional embedding space used by Llama 3. This direct coupling allows Ultravox to respond much more quickly than systems that chain separate ASR and LLM components. In the future, it will also allow Ultravox to natively understand the paralinguistic cues of timing and emotion that are omnipresent in human speech.
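To make the projector-and-merge idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not Ultravox's actual implementation: the projector is modeled as a single linear map with random weights, the function names (`make_projector`, `project`, `merge_embeddings`) are hypothetical, and the dimensions are toy values (3 and 4) standing in for the much larger sizes labeled in the diagram.

```python
import random

def make_projector(in_dim, out_dim, seed=0):
    """Build a random in_dim x out_dim weight matrix (stand-in for trained weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]

def project(frames, weights):
    """Map each audio-encoder frame (length in_dim) to a vector of length out_dim."""
    out_dim = len(weights[0])
    return [
        [sum(x * weights[i][j] for i, x in enumerate(frame)) for j in range(out_dim)]
        for frame in frames
    ]

def merge_embeddings(text_embeds, audio_embeds, audio_position):
    """Splice the projected audio vectors into the text embedding sequence."""
    return text_embeds[:audio_position] + audio_embeds + text_embeds[audio_position:]

# Toy sizes (assumption): AUDIO_DIM plays the role of the audio encoder's output
# width, MODEL_DIM the role of the language model's embedding width.
AUDIO_DIM, MODEL_DIM = 3, 4

text_embeds = [[0.0] * MODEL_DIM for _ in range(5)]   # 5 embedded text tokens
audio_frames = [[1.0, 0.5, -0.5], [0.2, 0.1, 0.0]]    # 2 audio encoder frames

weights = make_projector(AUDIO_DIM, MODEL_DIM)
audio_embeds = project(audio_frames, weights)
merged = merge_embeddings(text_embeds, audio_embeds, audio_position=3)

print(len(merged), len(merged[0]))  # 7 4
```

Because the audio vectors land in the same space as the text embeddings, the language model can consume the merged sequence in one pass, with no intermediate transcript. That is the property the paragraph above credits for the lower latency.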

One Size Doesn't Fit All

Llama

Our latest model with function calling capabilities. Good for small, well-defined tasks.

Mistral

More Soon!

COMING SOON

FAQ

I plugged the thingy into the whatsit and it still won't work. What gives?

Will this end society?

Do I need to code?
