Voice agents | OpenAI API

Voice agents turn the same agent concepts into spoken, low-latency interactions. The key design choice is deciding whether the model should work directly with live audio or whether your application should explicitly chain speech-to-text, text reasoning, and text-to-speech.

Choose the right architecture

Architecture	Best for	Why
Speech-to-speech with live audio sessions	Natural, low-latency conversations	The model handles live audio input and output directly
Chained voice pipeline	Predictable workflows or extending an existing text agent	Your app keeps explicit control over transcription, text reasoning, and speech output

Voice workflows are an SDK-first surface. If you’re migrating a related Agent Builder project, see Migrate from Agent Builder for the current transition path.

Recommended starting points

The examples below are intentionally different architectures, not matching language tabs. The TypeScript and Python libraries expose different voice helpers today:

In TypeScript, the fastest path to a browser-based voice assistant is a RealtimeAgent and RealtimeSession.
In Python, the simplest path to extending an existing text agent into voice is a chained VoicePipeline.

Build a speech-to-speech voice agent

Use the live audio API path when the interaction should feel conversational and immediate. This is the best starting point for voice agents that need barge-in, low first-audio latency, natural turn taking, and realtime tool use.

The usual browser flow is:

Your application server creates an ephemeral client secret for the live audio session.
Your frontend creates a RealtimeSession.
The session connects over WebRTC in the browser or WebSocket on the server.
The agent handles audio turns, tools, interruptions, and handoffs inside that session.

Start a realtime voice session

typescript

1
2
3
4
5
6
7
8
9
10
11
12
13
14
import { RealtimeAgent, RealtimeSession } from "@openai/agents/realtime";

const agent = new RealtimeAgent({
  name: "Assistant",
  instructions: "You are a helpful voice assistant.",
});

const session = new RealtimeSession(agent, {
  model: "gpt-realtime-2",
});

await session.connect({
  apiKey: "ek_...(ephemeral key from your server)",
});

From there, attach tools, handoffs, and guardrails to the RealtimeAgent the same way you would attach them to a text agent. Keep audio transport concerns in the session layer, and keep business logic in the agent definition.

Start with the transport docs when you need lower-level control:

Build a chained voice workflow

Use the chained path when you want stronger control over intermediate text, existing text-agent reuse, or a simpler extension path from a non-voice workflow. In that design, your application explicitly manages:

speech-to-text
the agent workflow itself
text-to-speech

This is often the better fit for support flows, approval-heavy flows, or cases where you want durable transcripts and deterministic logic between each stage.

Run a chained voice pipeline

python

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
import asyncio
import numpy as np

from agents import Agent, function_tool
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline


@function_tool
def get_weather(city: str) -> str:
    """Get the weather for a given city."""
    return f"The weather in {city} is sunny."


agent = Agent(
    name="Assistant",
    instructions="You are a helpful voice assistant.",
    model="gpt-5.5",
    tools=[get_weather],
)


async def main() -> None:
    pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))
    audio_input = AudioInput(buffer=np.zeros(24000 * 3, dtype=np.int16))
    result = await pipeline.run(audio_input)
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            print("Received audio bytes", len(event.data))


if __name__ == "__main__":
    asyncio.run(main())

Use this path when each stage needs to be visible or replaceable. For example, you might store the transcript, run policy checks before the text agent responds, call internal systems, then generate speech only after the workflow reaches an approved answer.

Voice agents still use the same core agent building blocks

The voice surface changes the transport and audio loop, but the core workflow decisions are the same:

Use Using tools when the voice agent needs external capabilities.
Use Running agents when spoken workflows need streaming, continuation, or durable state.
Use Orchestration and handoffs when spoken workflows branch across specialists.
Use Guardrails and human review when spoken workflows need safety checks or approvals.
Use Integrations and observability when you need MCP-backed capabilities or want to inspect how the voice workflow behaved.

The practical rule is: choose the audio architecture first, then design the rest of the agent workflow the same way you would for text.

Next steps

Realtime and audio overview

Choose the right realtime or audio guide for your use case.

Managing conversations

Work with the Realtime session lifecycle and event model.

WebRTC connection

Connect browser and mobile audio directly to a Realtime session.

Realtime prompting guide

Tune reasoning, preambles, tools, entity capture, and voice behavior.

Suggested

Get started

Core concepts

Agents SDK

Tools

Run and scale

Evaluation

Realtime and audio

Specialized models

Going live

Legacy APIs

Resources

Getting Started

Using Codex

Configuration

Administration

Automation

Learn

Releases

Core Concepts

Plan

Build

Deploy

Conversion apps

Guides

Resources

Get started

Guides

File Upload

API

Measurement

Advertiser API

API Reference

Recent

Topics

Topics

Contribute

Categories

Topics

Programs

Events

Spaces

Choose the right architecture

Recommended starting points

Build a speech-to-speech voice agent

Build a chained voice workflow

Voice agents still use the same core agent building blocks

Next steps