Transcription Sessions
Models
TranscriptionSessionCreateResponse object { client_secret, input_audio_format, input_audio_transcription, modalities, turn_detection } A new Realtime transcription session configuration.
When a session is created on the server via REST API, the session object
also contains an ephemeral key. Default TTL for keys is 10 minutes. This
property is not present when a session is updated via the WebSocket API.
client_secret: object { expires_at, value } Ephemeral key returned by the API. Only present when the session is
created on the server via REST API.
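Because the ephemeral key expires, a client may want to check it before connecting. The following is a minimal sketch, not part of the API: it assumes `expires_at` is a Unix timestamp in seconds (the text above does not state the unit), and the helper name `key_is_usable`, the `safety_margin_s` parameter, and the sample key value are hypothetical.

```python
import time
from typing import Optional

# Default key lifetime stated in the reference text above (10 minutes).
DEFAULT_TTL_SECONDS = 10 * 60


def key_is_usable(session: dict,
                  now: Optional[float] = None,
                  safety_margin_s: float = 30.0) -> bool:
    """Return True if the session carries an ephemeral key that has not expired.

    Assumption: client_secret["expires_at"] is a Unix timestamp in seconds.
    """
    secret = session.get("client_secret")
    if secret is None:
        # Sessions updated over the WebSocket API carry no client_secret.
        return False
    current = time.time() if now is None else now
    # Treat keys that are about to expire as unusable.
    return current + safety_margin_s < secret["expires_at"]


# Hypothetical session payload; the key value is a placeholder.
example_session = {
    "client_secret": {"value": "ek_example", "expires_at": 1_000_000 + DEFAULT_TTL_SECONDS}
}
```

With this sample, `key_is_usable(example_session, now=1_000_000)` is true, while the same call with `now=1_000_590` is false because fewer than 30 seconds remain.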
input_audio_format: The format of input audio. Options are pcm16, g711_ulaw, or g711_alaw.
input_audio_transcription: optional object { language, model, prompt } Configuration of the transcription model.
model: optional string or one of the listed model names. The model used for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper.
modalities: optional array of "text" or "audio". The set of modalities the model can respond with. To disable audio,
set this to ["text"].
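The enumerated options above lend themselves to a client-side sanity check. The sketch below only mirrors the values listed in this reference. The function name `validate_session` is hypothetical, and since `model` also accepts an arbitrary string, an unlisted model is reported as a note rather than treated as a hard error.

```python
# Values copied from the field descriptions above.
INPUT_AUDIO_FORMATS = {"pcm16", "g711_ulaw", "g711_alaw"}
TRANSCRIPTION_MODELS = {
    "whisper-1",
    "gpt-4o-mini-transcribe",
    "gpt-4o-mini-transcribe-2025-12-15",
    "gpt-4o-transcribe",
    "gpt-4o-transcribe-diarize",
    "gpt-realtime-whisper",
}
MODALITIES = {"text", "audio"}


def validate_session(session: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    fmt = session.get("input_audio_format")
    if fmt is not None and fmt not in INPUT_AUDIO_FORMATS:
        problems.append(f"unknown input_audio_format: {fmt!r}")

    # input_audio_transcription and its model are both optional.
    transcription = session.get("input_audio_transcription") or {}
    model = transcription.get("model")
    if model is not None and model not in TRANSCRIPTION_MODELS:
        problems.append(f"model not among the currently listed options: {model!r}")

    for modality in session.get("modalities") or []:
        if modality not in MODALITIES:
            problems.append(f"unknown modality: {modality!r}")

    return problems
```

For example, `validate_session({"input_audio_format": "mp3", "modalities": ["video"]})` reports two problems, while a session using `pcm16`, `whisper-1`, and `["text"]` reports none.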
turn_detection: optional object { prefix_padding_ms, silence_duration_ms, threshold, type } Configuration for turn detection. Can be set to null to turn off. Server
VAD means that the model will detect the start and end of speech based on
audio volume and respond at the end of user speech.
prefix_padding_ms: Amount of audio to include before the VAD-detected speech (in milliseconds). Defaults to 300 ms.
silence_duration_ms: Duration of silence to detect speech stop (in milliseconds). Defaults to 500 ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
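To make the interaction between `prefix_padding_ms` and `silence_duration_ms` concrete, here is a toy volume-based detector. It is an illustration only and not the server's actual VAD algorithm: the function `detect_turns`, the per-frame `levels` input, the `frame_ms` parameter, and the `threshold` default of 0.5 are all assumptions made for the sketch.

```python
def detect_turns(levels, frame_ms=100, threshold=0.5,
                 prefix_padding_ms=300, silence_duration_ms=500):
    """Return (start_ms, end_ms) speech segments from per-frame volume levels.

    A turn starts at the first frame at or above `threshold`, moved back by
    `prefix_padding_ms`, and ends once `silence_duration_ms` of quiet frames
    have accumulated after the last loud frame.
    """
    turns = []
    start_ms = None   # start of the turn in progress, if any
    silence_ms = 0    # quiet time accumulated since the last loud frame
    for i, level in enumerate(levels):
        t = i * frame_ms
        if level >= threshold:
            if start_ms is None:
                start_ms = max(0, t - prefix_padding_ms)
            silence_ms = 0
        elif start_ms is not None:
            silence_ms += frame_ms
            if silence_ms >= silence_duration_ms:
                # The turn ended where the trailing silence began.
                turns.append((start_ms, t + frame_ms - silence_ms))
                start_ms = None
                silence_ms = 0
    if start_ms is not None:
        # Audio ended while a turn was still open.
        turns.append((start_ms, len(levels) * frame_ms - silence_ms))
    return turns


# 500 ms quiet, 1000 ms speech, 400 ms pause, 300 ms speech, 600 ms quiet.
levels = [0.0] * 5 + [1.0] * 10 + [0.0] * 4 + [1.0] * 3 + [0.0] * 6

one_turn = detect_turns(levels, silence_duration_ms=500)   # [(200, 2200)]
two_turns = detect_turns(levels, silence_duration_ms=300)  # [(200, 1500), (1600, 2200)]
```

With the 500 ms default, the 400 ms pause is bridged and the speaker gets a single turn. Lowering the value to 300 ms splits the same audio into two turns, which is the "jump in on short pauses" behaviour described above.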