Audio and speech

Understand audio modalities, streaming, latency, and speech concepts.

Audio models can understand spoken input, generate spoken output, or do both in the same interaction. This guide explains the vocabulary used across OpenAI’s audio docs. When you’re ready to choose an implementation path, start with the Realtime and audio overview.

Audio modalities

An audio application combines one or more of these modalities:

| Modality | Meaning | Common use cases |
| --- | --- | --- |
| Audio input | The model receives sound from a user or app. | Voice agents, transcription, translation. |
| Audio output | The model or API returns spoken audio. | Voice agents, text to speech, spoken responses. |
| Text transcript | Speech becomes text. | Captions, call analysis, search, records. |
| Text prompt | Text controls what the model says or does. | Speech generation, scripted voice flows, prompts. |

Common speech tasks

Speech to text converts speech into text. Use it for captions, notes, transcripts, analytics, search, and accessibility. Transcription can be request-based for files or streaming for live audio.
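For the captions use case, a sketch of turning transcript segments into SRT subtitles. The segment shape below mirrors the start/end/text fields that verbose transcription responses typically carry, but treat the field names and values as assumptions for illustration:

```javascript
// Hypothetical transcript segments, shaped like the start/end/text fields
// that verbose transcription output typically includes (an assumption here).
const segments = [
  { start: 0.0, end: 2.5, text: "Welcome to the show." },
  { start: 2.5, end: 5.0, text: "Today we talk about audio models." },
];

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(seconds) {
  const ms = Math.round(seconds * 1000);
  const h = String(Math.floor(ms / 3600000)).padStart(2, "0");
  const m = String(Math.floor((ms % 3600000) / 60000)).padStart(2, "0");
  const s = String(Math.floor((ms % 60000) / 1000)).padStart(2, "0");
  const frac = String(ms % 1000).padStart(3, "0");
  return `${h}:${m}:${s},${frac}`;
}

// Build the SRT document: index, time range, text, blank line between cues.
const srt = segments
  .map((seg, i) => `${i + 1}\n${srtTime(seg.start)} --> ${srtTime(seg.end)}\n${seg.text}\n`)
  .join("\n");

console.log(srt);
```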

Text to speech converts text into spoken audio. Use it for narration, assistants, accessibility, and generated voice responses. Speech generation can stream audio back as the model produces it.
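Streaming playback or saving means consuming audio a chunk at a time. The sketch below simulates chunked delivery with a local async generator (a stand-in, not the actual API surface) and concatenates the chunks the way a client would before writing a file:

```javascript
// Stand-in for a streaming speech response: real chunks would arrive over
// the network; here a local async generator simulates them.
async function* audioChunks() {
  yield Buffer.from([0x52, 0x49, 0x46, 0x46]); // "RIFF"
  yield Buffer.from([0x00, 0x01, 0x02, 0x03]);
  yield Buffer.from([0x04, 0x05, 0x06, 0x07]);
}

// Collect chunks as they stream in, then join into one buffer.
const parts = [];
for await (const chunk of audioChunks()) {
  parts.push(chunk); // in a real app you might also start playback here
}
const audio = Buffer.concat(parts);

console.log(audio.length); // 12 bytes in this simulated stream
```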

Speech to speech lets a model listen, reason, and speak in one low-latency session. Use it for conversational voice agents when the assistant needs to respond, call tools, or maintain session state.

Speech translation listens to speech in one language and returns translated speech or transcript output in another language. Use a dedicated realtime translation session when translation should happen continuously as audio arrives.

Streaming and latency

Streaming means the client and service exchange partial input or output while the interaction is still active. Streaming is useful when users expect immediate feedback, such as live captions, calls, voice agents, and translation.
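The partial-event pattern looks like this in practice. The event names below ("delta", "completed") are illustrative stand-ins, not the actual realtime event schema:

```javascript
// Simulated realtime event stream. Real sessions emit many event types;
// these names ("delta", "completed") are illustrative assumptions.
async function* transcriptEvents() {
  yield { type: "delta", text: "Turn " };
  yield { type: "delta", text: "left at " };
  yield { type: "delta", text: "the light." };
  yield { type: "completed" };
}

// Render partial text as it arrives, so the user sees immediate feedback.
let transcript = "";
for await (const event of transcriptEvents()) {
  if (event.type === "delta") {
    transcript += event.text; // update the live caption incrementally
  } else if (event.type === "completed") {
    console.log(transcript); // finalize once the turn is done
  }
}
```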

Lower latency requires a realtime connection, more careful audio handling, and a session model that can emit partial events. Request-based APIs are simpler for file uploads and non-interactive work, but they don’t support the same live interaction patterns.

Request-based APIs and realtime sessions

OpenAI supports two broad audio architectures:

| Architecture | Use when | Examples |
| --- | --- | --- |
| Request-based audio APIs | You have a file, a text input, or a bounded request. | Speech to text, text to speech. |
| Realtime sessions | Audio is live and the app needs low-latency events. | Voice agents, translation, transcription. |
| Multimodal chat completions | You are extending an existing chat flow with audio. | Audio input or output. |

For build-path guidance, see the Realtime and audio overview.

Add audio to your existing application

Models such as gpt-realtime and gpt-audio are natively multimodal: they accept both audio and text as input and can produce both as output.

For live browser speech-to-speech interactions, start with a realtime session in the JavaScript SDK:

Start a realtime voice session
```javascript
import { RealtimeAgent, RealtimeSession } from "@openai/agents/realtime";

const agent = new RealtimeAgent({
  name: "Assistant",
  instructions: "You are a helpful voice assistant.",
});

const session = new RealtimeSession(agent, {
  model: "gpt-realtime",
});

await session.connect({
  apiKey: "ek_...(ephemeral key from your server)",
});
```

This example uses JavaScript because browser voice agents connect with WebRTC from the client. For Python voice workflows, use the Voice agents guide, which covers chained voice pipelines.

If you already have a text-based LLM application with the Chat Completions endpoint, you may want to add audio capabilities. For example, if your chat application supports text input, you can add audio input and output: include audio in the modalities array and use an audio model, like gpt-audio.

The Responses API docs currently describe text and image inputs with text outputs. For this audio-chat pattern, use Chat Completions with an audio-capable model.

Create a human-like audio response to a prompt
```javascript
import { writeFileSync } from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Generate an audio response to the given prompt
const response = await openai.chat.completions.create({
  model: "gpt-audio",
  modalities: ["text", "audio"],
  audio: { voice: "alloy", format: "wav" },
  messages: [
    {
      role: "user",
      content: "Is a golden retriever a good family dog?",
    },
  ],
  store: true,
});

// Inspect returned data
console.log(response.choices[0]);

// Write the base64-encoded audio data to a file
writeFileSync(
  "dog.wav",
  Buffer.from(response.choices[0].message.audio.data, "base64")
);
```