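[Figure: model architecture diagram. Text (system prompt) and audio input enter two paths: a Text Tokenizer followed by a Text Embedder, and an Audio Encoder followed by an Audio Projector. Both paths meet at an embedding merge that feeds Llama 3.1, which produces the output. Labels shown in the diagram: ~300 M params, OD 768; ~20 M params, OD 4096; OD 4096.]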
Model Architecture
We've extended Meta's Llama 3 model with a multimodal projector that converts audio directly into the high-dimensional embedding space used by Llama 3. Because audio never passes through an intermediate text transcript, Ultravox can respond much more quickly than systems that chain separate ASR and LLM components. In the future, this will also allow Ultravox to natively understand the paralinguistic cues of timing and emotion that are omnipresent in human speech.
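The projector-and-merge idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not Ultravox's actual implementation: the projector is modeled as a single linear layer, the function names `project_audio` and `merge_embeddings` are hypothetical, and the toy widths (4 and 8) stand in for the widths the diagram labels suggest (768 for the audio side and 4096 for the LLM side).

```python
import random

def project_audio(frames, weight, bias):
    """Linearly map each audio frame (length d_audio) to a vector of length d_llm.

    `weight` holds d_llm rows of length d_audio; `bias` has length d_llm.
    A single linear layer is an assumption; a real projector may be deeper.
    """
    return [
        [sum(x * w for x, w in zip(frame, row)) + b for row, b in zip(weight, bias)]
        for frame in frames
    ]

def merge_embeddings(text_emb, audio_emb, insert_at):
    """Splice the projected audio vectors into the text embedding sequence."""
    return text_emb[:insert_at] + audio_emb + text_emb[insert_at:]

random.seed(0)
D_AUDIO, D_LLM = 4, 8  # toy widths; the diagram labels suggest 768 and 4096

# Hypothetical projector parameters and inputs, filled with random values.
weight = [[random.gauss(0, 0.02) for _ in range(D_AUDIO)] for _ in range(D_LLM)]
bias = [0.0] * D_LLM
text_emb = [[random.gauss(0, 1) for _ in range(D_LLM)] for _ in range(3)]        # 3 prompt tokens
audio_frames = [[random.gauss(0, 1) for _ in range(D_AUDIO)] for _ in range(5)]  # 5 encoder frames

audio_emb = project_audio(audio_frames, weight, bias)
merged = merge_embeddings(text_emb, audio_emb, insert_at=2)
print(len(merged), len(merged[0]))  # 8 8
```

The point of the sketch is that, once projected, audio frames have the same width as text token embeddings, so the language model can consume one merged sequence without a transcription step in between.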
One Size Doesn't Fit All
Llama
Our latest model with function calling capabilities. Good for small, well-defined tasks.
Mistral
More Soon!
COMING SOON
© 2024 Fixie
hello@fixie.ai