Voice AI Glossary: Key Terms and Concepts Explained

A reference guide to common terms in voice AI, covering core speech technology concepts and the LLM and inference concepts most relevant to modern voice agent systems.

Audio tokens

In speech-native voice AI systems, audio tokens are the discrete units into which an audio signal is divided before being processed by a model. Rather than converting speech to text and reasoning over words, a speech-native model encodes audio as a sequence of tokens, each of which represents a short segment of the audio signal, and processes them directly. This is analogous to how text-based LLMs tokenize written language, but applied to sound rather than text.

The use of audio tokens is what allows speech-native models to reason over the full acoustic signal, including tone, pacing, and prosody, rather than a text approximation of it. The trade-off is that audio token sequences tend to be longer than their text equivalents for the same content, which has implications for context window usage and inference speed.

Automatic speech recognition (ASR)

Automatic speech recognition (ASR) is the technology that converts spoken audio into text. In a component pipeline voice AI system, ASR is the first processing stage: the speech captured from a user's microphone is passed to an ASR model, which produces a text transcript that is then passed to an LLM for inference.

ASR models vary significantly in latency, accuracy, and language support. Additionally, there's a key architectural distinction between two ASR approaches:

  • Batch ASR, in which the model waits for a complete utterance before producing a transcript.

  • Streaming ASR, which begins producing a partial transcript while the user is still speaking.

Streaming ASR is generally preferred in voice AI applications because it allows downstream processing to begin sooner, reducing overall response latency.

ASR accuracy is typically measured using word error rate (WER), which quantifies the percentage of words in a transcript that differ from the ground truth. In practice, ASR errors can propagate downstream, leading the LLM to reason over a malformed transcript — one of the structural limitations of the component pipeline approach.
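WER is the word-level edit distance between the transcript and the ground truth, divided by the number of words in the ground truth. A minimal sketch (whitespace tokenization stands in for the more careful text normalization real evaluations use):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 error / 6 words ≈ 0.167
```

Note that WER weights all words equally, so a 10% WER can mean anything from harmless filler-word misses to a garbled account number.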

Barge-in

Barge-in refers to a user speaking while a voice AI system is in the middle of producing a response. The terms "barge-in handling" and "interruption handling" generally refer to the same behavior; both describe how the AI agent responds to an unexpected user utterance.

Handling barge-in gracefully — stopping playback, processing the interruption, and responding appropriately — is a meaningful technical challenge and a significant factor in how natural a voice AI interaction feels.

A system with poor barge-in handling will continue speaking after the user has tried to interject, which feels unresponsive regardless of how fast its initial response latency is. The delay between the user beginning to speak and the system stopping playback and beginning to process the new input is sometimes referred to as barge-in latency. For applications where conversational naturalness matters, barge-in handling is as important a consideration as overall response latency.
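At its core, barge-in handling is an event-driven state change: when the VAD reports user speech during playback, the system stops output and records how long the stop took. A minimal sketch, with illustrative names (`PlaybackController`, `on_user_speech`) that are not a real API:

```python
import time

class PlaybackController:
    """Sketch of barge-in handling: stop playback when user speech is
    detected mid-output and record the barge-in latency."""

    def __init__(self):
        self.playing = False
        self.barge_in_latency = None

    def start_playback(self):
        self.playing = True

    def on_user_speech(self, detected_at: float):
        # Called by the VAD when it detects user speech.
        if self.playing:
            self.playing = False  # stop audio output immediately
            self.barge_in_latency = time.monotonic() - detected_at

ctrl = PlaybackController()
ctrl.start_playback()
ctrl.on_user_speech(detected_at=time.monotonic())
print(ctrl.playing)  # False: playback stopped on barge-in
```

A real implementation also has to flush queued audio buffers and discard in-flight TTS output, which is where much of the barge-in latency actually accumulates.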

Component pipeline

A component pipeline is an architectural approach to voice AI in which speech understanding and generation are handled by a sequence of discrete, specialized models: an ASR model converts the user's speech to text, an LLM processes that text and generates a response, and a TTS model converts the response back into spoken audio.

Each component in a pipeline is independently deployable and swappable, which makes the approach modular and relatively straightforward to debug. However, latency accumulates at every stage and every handoff between components, meaning a pipeline system's end-to-end response time is roughly the sum of its parts.

Component pipelines typically rely on text models for reasoning; the LLM at the center of a pipeline has no direct access to the original audio signal (and no meaningful way to decode it). Instead, the LLM reasons over a text transcript, which means paralinguistic information such as tone and prosody can be lost in transcription.

Component pipelines are the more established approach to voice AI and benefit from a mature ecosystem of components, tooling, and deployment patterns. See also: speech-native model.

Read more: Why speech-to-speech is the future for AI voice agents

Context window

The context window is the maximum amount of text, audio, or other input that a model can process in a single inference call. Everything the model can "see" at the time of inference must fit within its context window: the system prompt, the conversation history, any retrieved data, and the current user input.

In voice AI applications, context window size has practical implications for both capability and latency. A larger context window allows the model to maintain longer conversation histories and access more background information, but processing a larger context takes more time, increasing time to first token (TTFT).

As a multi-turn conversation progresses, the accumulated conversation history consumes more of the available context window, which can cause latency to increase gradually over the course of a long call if context is not actively managed. Similarly, a longer, more detailed system prompt might produce better conversation outcomes, but leaves less context window capacity available for storing conversation history and retrieved data.
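Actively managing context usually means evicting the oldest conversation turns once the total prompt approaches the window limit. A minimal sketch, where a whitespace word count stands in for a real tokenizer:

```python
def trim_history(system_prompt, history, new_input, max_tokens,
                 count_tokens=lambda s: len(s.split())):
    """Drop the oldest conversation turns until the full prompt fits the
    context window. count_tokens is a stand-in for a real tokenizer."""
    fixed = count_tokens(system_prompt) + count_tokens(new_input)
    kept = list(history)
    while kept and fixed + sum(count_tokens(t) for t in kept) > max_tokens:
        kept.pop(0)  # evict the oldest turn first
    return kept

history = ["a b c", "d e", "f"]
print(trim_history("s", history, "x", max_tokens=5))  # oldest turn evicted
```

More sophisticated strategies summarize evicted turns rather than dropping them, trading a small amount of inference time for retained context.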

Endpointing

Endpointing is the process of determining where a spoken utterance ends — identifying the moment at which a speaker has finished their turn and the system should begin processing their input. It is closely related to voice activity detection (VAD), and the two terms are sometimes used interchangeably, though endpointing more specifically refers to the detection of the end of a complete utterance rather than the presence or absence of speech in general.

Accurate endpointing is a prerequisite for low-latency voice AI. If the system endpoints too early, it may cut the user off mid-thought and respond to an incomplete input. If it endpoints too late, it adds unnecessary delay before processing begins, increasing perceived latency. The challenge is that natural speech includes pauses for breath, emphasis, or thinking, none of which should be interpreted as the end of a turn.
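The simplest endpointing strategy declares a turn finished after a fixed stretch of continuous silence following speech. A minimal sketch over per-frame VAD flags (frame size and threshold values are illustrative):

```python
def find_endpoint(vad_frames, frame_ms=20, silence_threshold_ms=400):
    """Given per-frame speech/no-speech flags from a VAD, return the index
    of the frame at which the turn is considered finished, or None if the
    turn is still in progress. A turn ends only after silence_threshold_ms
    of continuous silence following detected speech."""
    needed = silence_threshold_ms // frame_ms  # consecutive silent frames required
    silent_run = 0
    heard_speech = False
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            heard_speech = True
            silent_run = 0
        else:
            silent_run += 1
            if heard_speech and silent_run >= needed:
                return i
    return None

# 10 speech frames, then 25 silent frames: the turn ends 400ms into the silence.
print(find_endpoint([True] * 10 + [False] * 25))  # 29
```

The threshold directly encodes the trade-off described above: lowering it cuts response delay but turns breathing pauses into false endpoints.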

Inference latency

Inference latency is the time a model takes to process an input and begin producing output. In the context of voice AI, it most commonly refers to the processing done by the LLM: the time between receiving a prompt and returning the first token of a response. However, the term applies to any model in the system, including ASR and TTS models in a component pipeline.

Inference latency is influenced by a number of factors, including model size, context length, hardware, and infrastructure load. It is typically the largest single source of latency in a voice AI system, and the most variable: the same model can exhibit significantly different latency at different times, depending on how heavily loaded the underlying infrastructure is. Tracking latency at the p95 and p99 percentiles, not just the median, gives a more accurate picture of real-world performance.
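Percentile tracking is straightforward to compute from a log of per-call latencies. A minimal nearest-rank sketch, with illustrative sample values:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [420, 450, 430, 470, 440, 460, 900, 455, 445, 1200]
print(percentile(latencies_ms, 50))  # 450: the median looks healthy
print(percentile(latencies_ms, 95))  # 1200: the tail tells a different story
```

The gap between the two numbers is the point: a system with a comfortable median can still deliver a poor experience on the calls that land in the tail.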

Interruption handling

Interruption handling refers to a voice AI system's ability to detect and respond appropriately when a user speaks during the system's own output: determining, in context, whether to stop and yield the floor, redirect the conversation, or ask a clarifying question. It encompasses both the technical capability of detecting the interruption (see: barge-in) and the conversational logic that determines how the system should respond to it.

Effective interruption handling requires the system to stop its current output quickly, discard or deprioritize whatever it was in the process of saying, and generate a response that acknowledges the new input in a contextually appropriate way. Systems that handle interruptions poorly — either by ignoring them, responding with confusion, or repeating content the user has already cut off — degrade the conversational experience significantly, even when their baseline response latency is acceptable.

KV caching

KV caching (key-value caching) is an inference optimization technique that stores the intermediate computations for parts of a prompt that remain constant across multiple inference calls. Specifically, this method stores key and value matrices generated during the attention mechanism — the process by which the model determines how different parts of the input relate to each other — rather than recomputing these values from scratch on every call.

In voice AI applications, KV caching is particularly useful for caching long system prompts that remain unchanged across the turns of a conversation, as well as for caching earlier turns of a dialogue history. The trade-off is memory usage: storing key-value matrices for large contexts or long conversations requires significant GPU memory, which can become a constraint at scale. Techniques such as quantizing the KV cache and using grouped-query attention (GQA) are commonly applied to reduce its memory footprint.
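The prefix-reuse idea can be illustrated with a toy cache. The per-token computation below is a stand-in (a hash, not real attention math), and the class is illustrative rather than any library's API, but the shape is the same: on each call, find the longest prefix shared with the cached sequence and compute only the new tail:

```python
import hashlib

def compute_kv(token: str):
    # Stand-in for the per-token key/value computation in attention;
    # real systems cache large tensors, not strings.
    h = hashlib.sha256(token.encode()).hexdigest()
    return (h[:8], h[8:16])  # (key, value)

class KVCache:
    """Sketch: reuse cached key/value entries for an unchanged prompt
    prefix, computing them only for newly appended tokens."""
    def __init__(self):
        self.tokens = []
        self.kv = []
        self.computed = 0  # how many per-token computations we performed

    def prefill(self, tokens):
        # Find the longest prefix shared with the cached sequence.
        shared = 0
        while (shared < len(self.tokens) and shared < len(tokens)
               and self.tokens[shared] == tokens[shared]):
            shared += 1
        # Recompute only from the first divergent position onward.
        self.tokens = list(tokens)
        self.kv = self.kv[:shared] + [compute_kv(t) for t in tokens[shared:]]
        self.computed += len(tokens) - shared
        return self.kv

cache = KVCache()
system = ["you", "are", "a", "helpful", "voice", "agent"]
cache.prefill(system + ["hi"])           # computes 7 entries
cache.prefill(system + ["hi", "there"])  # computes only the 1 new entry
print(cache.computed)  # 8, not 15
```

In a multi-turn voice conversation, each turn appends to a stable prefix, which is exactly the access pattern this optimization rewards.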

Large language model (LLM)

A large language model (LLM) is a neural network trained on large volumes of text data to predict and generate language. In a component pipeline voice AI system, the LLM is the reasoning engine at the center of the interaction: it receives a text transcript of the user's speech, generates a response, and passes that response to a TTS model for synthesis. In speech-native systems, the model processes audio directly rather than text, which changes the nature of the input but not the fundamental role of the model as the reasoning engine of the system.

Modern LLMs vary significantly in size, capability, and inference speed. Larger models tend to produce higher-quality responses, particularly on tasks requiring nuanced instruction following, multi-step reasoning, or tool calling, but at the cost of higher inference latency. Smaller, distilled models are faster but less capable, a trade-off that teams building voice AI systems frequently need to navigate.

Latency

In voice AI, latency generally refers to the delay between a user finishing speaking and the system beginning to respond with audio. It is the primary factor determining whether a voice AI interaction feels natural or disjointed, and it is influenced by every layer of the system: the architectural approach, the models used, the infrastructure they run on, and the application-level design decisions made around them.

Human conversation operates within tight timing constraints: the gap between one speaker finishing and another beginning is typically in the range of 200–300ms. Voice AI systems cannot yet consistently match that window, but the closer they get, the more natural the interaction feels. As a general benchmark, time to first audio under 600ms is the threshold at which most users perceive a voice AI interaction as natural; above one second, the delay consistently disrupts conversational flow.

Latency in voice AI is best understood not as a single number but as a distribution, measured across turns within a call as well as across calls.

Read more: Understanding Latency in Voice AI Systems

Paralinguistics

Paralinguistics refers to the non-verbal elements of spoken communication: features of speech such as tone, pitch, pacing, rhythm, emphasis, and hesitation that convey meaning beyond the literal content of the words used. A speaker's paralinguistic signals can indicate emotion, confidence, uncertainty, or intent in ways that a text transcript does not meaningfully capture.

In the context of voice AI, paralinguistics is relevant primarily as a distinguishing factor between component pipeline and speech-native architectures. In a pipeline system, the LLM reasons over a text transcript and has no access to the original audio signal; paralinguistic information is lost in transcription. A speech-native model, by contrast, processes the audio directly, and can in principle incorporate paralinguistic signals into its understanding of what the user said and how they said it. This has potential implications for conversational naturalness, emotional responsiveness, and the handling of ambiguous or context-dependent inputs.

Prosodic bridging

Prosodic bridging is a technique in which a voice AI system produces a brief, natural-sounding filler response — such as "Let me check on that" or "Give me a moment to look that up" — while the underlying model is still processing the user's input. The goal is to reduce the silence the user experiences during LLM inference, providing language that fits the conversational context and signals that the system has heard the input and is working on a response.

Human operators use the same technique naturally: when retrieving information or shifting attention during a call, experienced agents fill the gap with brief verbal acknowledgments rather than silence. Used judiciously in a voice AI system, prosodic bridging can meaningfully improve the perceived responsiveness of an interaction without changing the underlying response time. Used too frequently or formulaically, it becomes noticeable and can erode user trust.
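The mechanism amounts to starting inference and the bridging phrase concurrently, so the filler plays while the model works. A minimal asyncio sketch, in which both the model call and the TTS playback are instantaneous stand-ins (names like `respond_with_bridge` are illustrative):

```python
import asyncio

async def slow_llm_inference(prompt: str) -> str:
    # Stand-in for a real model call that takes noticeably long.
    await asyncio.sleep(0.2)
    return f"Answer to: {prompt}"

async def speak(text: str, spoken: list):
    spoken.append(text)  # stand-in for streaming text to TTS playback

async def respond_with_bridge(prompt: str, spoken: list):
    # Kick off inference, then fill the wait with a bridging phrase.
    inference = asyncio.create_task(slow_llm_inference(prompt))
    await speak("Let me check on that.", spoken)  # plays during inference
    await speak(await inference, spoken)          # real answer when ready

spoken = []
asyncio.run(respond_with_bridge("order status", spoken))
print(spoken)  # bridge phrase first, then the model's answer
```

In production the bridge itself should be varied and context-appropriate, or it becomes the formulaic tic described above.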

Quantization

Quantization is a model optimization technique that reduces the numerical precision of a model's weights — for example, representing values in 8-bit integers (INT8) rather than 32-bit floating point (FP32). The result is a model with a smaller memory footprint that can run inference faster, at the cost of a modest reduction in output quality.

In voice AI applications, quantization is one of the primary tools for reducing inference latency without replacing the underlying model. It is applicable to LLMs, ASR models, and TTS models in a component pipeline, as well as to speech-native models. The degree of quality degradation experienced depends on the model, the quantization approach, and the task. For many voice AI use cases, the quality impact of INT8 quantization is acceptable, while more aggressive approaches such as INT4 may introduce more noticeable degradation.
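The core arithmetic of symmetric INT8 quantization is a single scale factor mapping floats into the integer range [-127, 127]. A minimal sketch on a toy weight list:

```python
def quantize_int8(weights):
    """Symmetric INT8 quantization: one scale factor maps the float range
    onto integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.82, -1.27, 0.003, 0.5]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)                      # small integers in [-127, 127]
print(max_err <= scale / 2)   # True: rounding error is bounded by half a step
```

The bounded rounding error is the "modest reduction in output quality" in numeric form; halving the bit width again (INT4) doubles the step size, which is why more aggressive quantization degrades quality faster.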

Speech-native model

A speech-native model is a model that processes and generates audio directly, without converting speech to text as an intermediate step. Unlike a component pipeline, in which an ASR model transcribes audio to text before passing it to an LLM, a speech-native model receives audio as its primary input and reasons over it in that form.

This architectural difference has meaningful consequences. Speech-native models have direct access to the full acoustic signal, including the paralinguistic features — tone, pacing, hesitation, prosody — that are lost when speech is transcribed to text. They also eliminate the ASR stage and the inter-component handoffs of a pipeline system, which removes structural sources of latency and reduces the likelihood of transcription errors propagating through the pipeline.

However, speech-native models tend to be large, and the infrastructure and optimization tooling around them is less mature than for pipeline components. Whether the structural latency advantage translates into faster real-world performance depends heavily on the quality of the underlying deployment.

Read more: Speech-to-Speech Voice Agents: Architecture, Benefits, and How They Work

Speculative decoding

Speculative decoding is an inference optimization technique that uses a smaller, faster draft model to propose candidate output tokens, which a larger target model then verifies in parallel. Because the target model can evaluate multiple candidate tokens simultaneously rather than generating them one at a time, the effective throughput increases, reducing the time required to generate a complete response.

In voice AI applications, speculative decoding is most relevant as a way to reduce LLM inference latency without switching to a smaller model outright. The quality of the output is determined by the target model, not the draft model, so the capability trade-off associated with model substitution is avoided. The technique is most effective when the draft model's predictions are frequently correct, which depends on the similarity between the draft and target models, as well as the nature of the task.
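The accept/reject loop can be shown with toy stand-ins for both models. One important simplification: the sketch verifies proposed tokens one at a time, whereas a real implementation scores the whole draft block in a single parallel forward pass, which is where the speedup comes from. The output, though, matches what the target model alone would produce:

```python
def draft_model(prefix):
    # Cheap stand-in: always proposes a block of four 'a' tokens.
    return ["a"] * 4

def target_model(prefix, position):
    # Authoritative stand-in: spells out a fixed phrase.
    answer = "banana"
    return answer[position] if position < len(answer) else None

def speculative_decode():
    output = []
    while (expected := target_model(output, len(output))) is not None:
        # Draft proposes a block; target verifies each token in order.
        for proposed in draft_model(output):
            expected = target_model(output, len(output))
            if expected is None or proposed != expected:
                break
            output.append(proposed)       # accepted: draft matched target
        else:
            continue                      # whole block accepted; draft again
        if expected is not None:
            output.append(expected)       # rejected: take target's token
    return "".join(output)

print(speculative_decode())  # "banana": identical to target-only decoding
```

The draft here agrees with the target half the time, so about half the tokens are accepted cheaply; the acceptance rate, and therefore the latency win, depends on how well the draft tracks the target.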

Streaming

In the context of voice AI, streaming refers to the practice of passing partial outputs between processing stages as they become available, rather than waiting for each stage to produce a complete output before proceeding. Streaming can be applied at the ASR, LLM, and TTS stages of a component pipeline, and it is one of the most impactful architectural choices available for reducing end-to-end latency.

In a streaming pipeline, ASR begins producing partial transcripts while the user is still speaking; the LLM can begin processing those partial transcripts before transcription is complete; and the TTS model can begin synthesizing audio before the LLM has finished generating its full response. Each of these overlaps reduces the cumulative delay, bringing the time to first audio significantly closer to the LLM's raw inference time. The trade-off at the ASR stage is that partial transcripts are more likely to contain errors, which the LLM must then reason over.
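The overlap described above maps naturally onto chained generators: each stage consumes the previous stage's output as it is produced, so the first audio chunk is available after one item has flowed through the whole pipeline rather than after every stage has fully finished. A minimal sketch with string stand-ins for the three stages:

```python
def asr_stream(audio_chunks):
    # Emits partial transcript words as audio arrives (streaming ASR stand-in).
    for chunk in audio_chunks:
        yield f"word_{chunk}"

def llm_stream(words):
    # Starts generating response tokens before the transcript is complete.
    for w in words:
        yield f"token_for_{w}"

def tts_stream(tokens):
    # Synthesizes audio per token instead of waiting for the full response.
    for t in tokens:
        yield f"audio({t})"

pipeline = tts_stream(llm_stream(asr_stream([1, 2, 3])))
print(next(pipeline))  # first audio is ready after a single chunk flows through
```

Real streaming stages run concurrently rather than in lazy lockstep, but the structural point holds: no stage blocks on its predecessor finishing.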

Time to first audio (TTFA)

Time to first audio (TTFA) is the elapsed time between the end of a user's speaking turn and the moment audio playback begins from the voice AI system. It is the latency metric most directly tied to the user's experience of a voice AI interaction, and the most reliable predictor of whether an interaction will feel natural or disjointed.

TTFA encompasses the cumulative delay of every processing stage between the end of the user's input and the start of the system's audio output. In a component pipeline, that includes VAD, ASR, LLM inference, and TTS; in a speech-native system, it reflects primarily the model's inference time and audio generation speed, although a VAD step may also be included.
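For a non-streaming pipeline, that cumulative delay is literally a sum. The figures below are illustrative only; real numbers vary widely by model and deployment:

```python
# Hypothetical per-stage latency budget (ms) for a non-streaming pipeline.
stages = {
    "vad_silence_wait": 300,   # waiting out the endpointing silence threshold
    "asr": 150,                # transcription of the full utterance
    "llm_ttft": 400,           # time to the LLM's first output token
    "tts_first_chunk": 120,    # synthesis of the first audio chunk
}

ttfa = sum(stages.values())
print(f"TTFA: {ttfa} ms")  # each stage's delay accumulates
```

Budgets like this make it clear why streaming matters: overlapping stages removes terms from the sum rather than shrinking them.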

As a general benchmark, TTFA under 600ms is the threshold at which most users perceive a voice AI interaction as natural.

Time to first token (TTFT)

Time to first token (TTFT) is the elapsed time between a model receiving its input and producing its first output token. In a component pipeline voice AI system, TTFT typically refers to the LLM specifically — the delay between the model receiving the ASR transcript and returning the first token of its response. In a speech-native system, the input is audio rather than text, but the concept is the same.

TTFT is a useful diagnostic metric for isolating model inference performance, but it is not directly observable by the end user. In a pipeline system, audio playback cannot begin until the TTS model has processed at least some of the LLM's output tokens; in a speech-native system, audio generation adds a further step. Time to first audio (TTFA) is therefore the more user-relevant latency metric, with TTFT serving as an important component of it.

Tool calling

Tool calling (sometimes referred to as function calling) refers to the ability of an LLM to invoke external functions or APIs mid-conversation — to retrieve data, update records, or trigger actions in other systems — and incorporate the results into its response.

In voice AI applications, tool calling is what allows a voice agent to do more than simply converse; tool calls allow an agent to look up a customer's account, check inventory, book an appointment, or process a payment, all within the flow of a spoken conversation.

In many systems, each tool call requires the model to pause generation, execute the external function, wait for a result, and then resume, which introduces added latency. In latency-sensitive voice AI applications, this can disrupt conversational flow, particularly if the tool call involves a slow backend system. Techniques such as non-blocking tool calls, in which the agent may continue the conversation while awaiting a response, are designed to mitigate this effect.
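The blocking loop typically looks like: the model emits a structured tool call, the application executes it, and the result is fed back for the next generation step. A minimal sketch; the JSON convention, the `check_inventory` tool, and the `run_turn` helper are all hypothetical, not any particular provider's API:

```python
import json

# Hypothetical tool registry; names and signatures are illustrative only.
TOOLS = {
    "check_inventory": lambda item: {"item": item, "in_stock": 3},
}

def run_turn(model_output: str) -> str:
    """If the model emitted a tool call (here: a JSON object with a 'tool'
    field), execute it and return the result for the model's next generation
    step; otherwise the output is the final spoken response."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output  # plain text: speak it
    if not isinstance(call, dict) or "tool" not in call:
        return model_output
    result = TOOLS[call["tool"]](**call["args"])   # execute the external function
    return f"TOOL_RESULT: {json.dumps(result)}"    # fed back into the model

print(run_turn('{"tool": "check_inventory", "args": {"item": "widget"}}'))
print(run_turn("We have 3 widgets in stock."))
```

The latency cost is visible in the structure: the conversation cannot advance past `run_turn` until the external function returns, which is what non-blocking tool calls are designed to avoid.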

Turn-taking

Turn-taking refers to the mechanism by which participants in a conversation alternate between speaking and listening. In human conversation, turn-taking is managed through a complex set of verbal and non-verbal cues — changes in pitch, trailing off, eye contact, pauses — that signal when a speaker is yielding the floor and when a listener is ready to take it.

In voice AI, replicating natural turn-taking is one of the core design challenges. The system must correctly identify when the user has finished speaking (see: endpointing, voice activity detection), respond within a timeframe that feels natural, and handle cases where the user speaks unexpectedly or interrupts the system's output (see: barge-in, interruption handling). Systems that manage turn-taking poorly — cutting users off, responding too slowly, or failing to handle interruptions — feel unnatural regardless of the quality of their underlying responses.

Voice activity detection (VAD)

Voice activity detection (VAD) is the process of determining whether audio input contains speech. In a voice AI system, VAD serves as the gatekeeper for the processing pipeline: it monitors the incoming audio stream and signals when a user has begun speaking and, crucially, when they have stopped — triggering the downstream processing that produces a response.

VAD tuning involves a fundamental trade-off. A VAD model configured to react quickly to silence will minimize the delay before processing begins, but risks cutting the user off mid-sentence when they pause for breath or emphasis. A more conservative configuration might wait for a longer period of silence before responding, reducing the risk of interrupting the user but adding a perceptible delay at the end of each turn.

There is also no single correct silence threshold to tune for: the duration of natural within-turn pauses varies across languages and cultures, meaning a VAD model calibrated on one language may misfire more frequently on others.

Read more: Introducing UltraVAD - the first context-aware, audio-native endpointing model

Voice AI / Voice agent

Voice AI refers broadly to artificial intelligence systems that can engage in spoken conversation with a human: listening to speech, reasoning about it, and responding in natural language audio.

A voice agent is a specific implementation of voice AI designed to accomplish tasks or handle interactions autonomously, typically in a defined domain such as customer service, healthcare, or logistics.

Modern voice AI systems are built on one of two architectural approaches: the component pipeline, which chains together ASR, LLM, and TTS models in sequence, or the speech-native approach, in which a model processes and generates audio directly. The choice between these approaches has significant implications for latency, conversational naturalness, and the optimization strategies available to the team building the system.

Voice AI is distinct from earlier generations of automated voice technology such as interactive voice response (IVR) systems and phone trees, which rely on fixed menus and scripted paths rather than open-ended natural language understanding.

Read more: What we need to make voice AI fully agentic