Translation client events

These are events that the OpenAI Realtime Translation WebSocket server will accept from the client.

session.update

Send this event to update the translation session configuration. Translation sessions support updates to audio.output.language, audio.input.transcription, and audio.input.noise_reduction.

Translation session fields to update. The session type and model are set at creation and cannot be changed with session.update.

audio: optional object { input, output }

Configuration for translation input and output audio.

input: optional object { noise_reduction, transcription }
noise_reduction: optional object { type }

Optional input noise reduction. Set to null to disable it.

Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.

One of the following:
"near_field"
"far_field"
transcription: optional object { model }

Optional source-language transcription. When configured, the server emits session.input_transcript.delta events. Translation itself still runs from the input audio stream.

model: string

The transcription model to use for source transcript deltas.

output: optional object { language }
language: optional string

Target language for translated output audio and transcript deltas.

type: "session.update"

The event type, must be session.update.

event_id: optional string

Optional client-generated ID used to identify this event.

Maximum length: 512.
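A minimal sketch of sending this event from Python with the websockets library. The endpoint URL, auth header, and the whisper-1 transcription model name are illustrative assumptions, not values defined by this reference.

```python
import asyncio
import json

import websockets

# Hypothetical endpoint and credentials -- substitute the real values
# for your translation session.
URL = "wss://example.com/v1/realtime/translation"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

async def update_session() -> None:
    # "additional_headers" is the keyword in websockets >= 13;
    # older releases call it "extra_headers".
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        # Only the updatable fields are sent; the session type and
        # model are fixed at creation and omitted here.
        await ws.send(json.dumps({
            "type": "session.update",
            "event_id": "evt_update_001",  # optional, at most 512 chars
            "audio": {
                "input": {
                    "noise_reduction": {"type": "near_field"},
                    "transcription": {"model": "whisper-1"},  # assumed model name
                },
                "output": {"language": "fr"},  # target language for output
            },
        }))

asyncio.run(update_session())
```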

session.input_audio_buffer.append

Send this event to append audio bytes to the translation session input audio buffer.

WebSocket translation sessions accept base64-encoded 24 kHz PCM16 mono little-endian raw audio bytes. Other audio formats are rejected with a validation error, because lower-quality audio materially degrades translation quality.

Translation consumes 200 ms engine frames. For best realtime behavior, append audio in 200 ms chunks. If a chunk is shorter, the server buffers it until it has enough audio for one frame. If a chunk is longer, the server splits it into 200 ms frames and enqueues them back-to-back.

Keep appending silence while the session is active. If a client stops sending audio and later resumes, the model treats the resumed audio as contiguous in time with the previous audio rather than as following a real-world pause.

audio: string

Base64-encoded 24 kHz PCM16 mono audio bytes.

type: "session.input_audio_buffer.append"

The event type, must be session.input_audio_buffer.append.

event_id: optional string

Optional client-generated ID used to identify this event.

Maximum length: 512.
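A sketch of the chunking described above, assuming ws is an already-open connection such as the one in the previous example. The 9,600-byte frame size follows from the format: 24,000 samples/s × 2 bytes/sample × 0.2 s.

```python
import base64
import json

# 24 kHz PCM16 mono is 48,000 bytes/s, so one 200 ms engine frame
# is 9,600 bytes of raw audio.
FRAME_BYTES = 24_000 * 2 * 200 // 1000

SILENCE_FRAME = b"\x00" * FRAME_BYTES  # one frame of PCM16 silence

async def append_pcm(ws, pcm: bytes) -> None:
    """Append raw PCM16 bytes as a series of 200 ms chunks."""
    for offset in range(0, len(pcm), FRAME_BYTES):
        chunk = pcm[offset : offset + FRAME_BYTES]
        # A short final chunk is fine: the server buffers it until a
        # full frame of audio has arrived.
        await ws.send(json.dumps({
            "type": "session.input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))

async def keep_alive(ws) -> None:
    """Send one frame of silence, so real-world pauses stay pauses in model time."""
    await append_pcm(ws, SILENCE_FRAME)
```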

session.close

Gracefully close the realtime translation session. The server flushes pending input audio and emits any remaining translated output before closing the session.

type: "session.close"

The event type, must be session.close.

event_id: optional string

Optional client-generated ID used to identify this event.

Maximum length: 512.
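A sketch of a graceful shutdown, again assuming an open ws connection: send session.close, then keep reading so that any output the server flushes is not lost.

```python
import json

async def close_session(ws) -> None:
    # Ask the server to flush pending input audio and emit any
    # remaining translated output before it closes the session.
    await ws.send(json.dumps({
        "type": "session.close",
        "event_id": "evt_close_001",  # optional
    }))
    # Drain remaining server events; iteration ends when the server
    # closes the WebSocket connection cleanly.
    async for message in ws:
        event = json.loads(message)
        print("received:", event.get("type"))
```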