You can use the Realtime API for transcription-only use cases, with input from either a microphone or a file. For example, you can use it to generate subtitles or transcripts in real time. In transcription-only mode, the model does not generate responses.
If you want the model to produce responses, you can use the Realtime API in speech-to-speech conversation mode.
Realtime transcription sessions
To use the Realtime API for transcription, you need to create a transcription session, connecting via WebSockets or WebRTC.
Unlike regular Realtime API sessions for conversations, transcription sessions typically don't contain responses from the model.
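For reference, here is a minimal connection sketch in Python using the third-party websocket-client package. The endpoint URL and the `?intent=transcription` query parameter are assumptions based on this pattern, so verify them against the API reference.

```python
import json
import os

import websocket  # third-party: pip install websocket-client

# Assumed endpoint for transcription sessions; confirm the exact URL
# and query parameters in the API reference.
ws = websocket.create_connection(
    "wss://api.openai.com/v1/realtime?intent=transcription",
    header=[f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}"],
)

# Configure the session as a transcription session right after connecting.
ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "type": "transcription",
        "audio": {
            "input": {
                "format": {"type": "audio/pcm", "rate": 24000},
                "transcription": {"model": "gpt-4o-transcribe", "language": "en"},
            }
        },
    },
}))
```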
The transcription session object uses the same base session shape, but it always has a type of "transcription":
```json
{
"object": "realtime.session",
"type": "transcription",
"id": "session_abc123",
"audio": {
"input": {
"format": {
"type": "audio/pcm",
"rate": 24000
},
"noise_reduction": {
"type": "near_field"
},
"transcription": {
"model": "gpt-4o-transcribe",
"prompt": "",
"language": "en"
},
"turn_detection": {
"type": "server_vad",
"threshold": 0.5,
"prefix_padding_ms": 300,
"silence_duration_ms": 500
}
}
},
"include": ["item.input_audio_transcription.logprobs"]
}
```

Session fields
- `type`: Always `transcription` for realtime transcription sessions.
- `audio.input.format`: Input encoding for audio that you append to the buffer. Supported types are:
  - `audio/pcm` (24 kHz mono PCM; only a `rate` of `24000` is supported)
  - `audio/pcmu` (G.711 μ-law)
  - `audio/pcma` (G.711 A-law)
- `audio.input.noise_reduction`: Optional noise reduction that runs before VAD and turn detection. Use `{ "type": "near_field" }`, `{ "type": "far_field" }`, or `null` to disable.
- `audio.input.transcription`: Optional asynchronous transcription of input audio. Supply:
  - `model`: One of `whisper-1`, `gpt-4o-transcribe-latest`, `gpt-4o-mini-transcribe`, or `gpt-4o-transcribe`.
  - `language`: ISO-639-1 code such as `en`.
  - `prompt`: Prompt text or keyword list (model-dependent) that guides the transcription output.
- `audio.input.turn_detection`: Optional automatic voice activity detection (VAD). Set to `null` to manage turn boundaries manually. For `server_vad`, you can tune `threshold`, `prefix_padding_ms`, `silence_duration_ms`, `interrupt_response`, `create_response`, and `idle_timeout_ms`. For `semantic_vad`, configure `eagerness`, `interrupt_response`, and `create_response` (see the sketch below).
- `include`: Optional list of additional fields to stream back on events (for example `item.input_audio_transcription.logprobs`).
You can find more information about the transcription session object in the API reference.
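Because turn detection is part of the session, you can switch modes at runtime with a `session.update` event. For example, this sketch switches the session to semantic VAD; the `eagerness` value is illustrative:

```json
{
  "type": "session.update",
  "session": {
    "audio": {
      "input": {
        "turn_detection": {
          "type": "semantic_vad",
          "eagerness": "medium"
        }
      }
    }
  }
}
```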
Handling transcriptions
When using the Realtime API for transcription, listen for the `conversation.item.input_audio_transcription.delta` and `conversation.item.input_audio_transcription.completed` events.
For `whisper-1`, the `delta` event contains the full turn transcript, the same as the `completed` event. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the `delta` event contains incremental transcripts as they are streamed out from the model.
Here is an example transcription delta event:
```json
{
"event_id": "event_2122",
"type": "conversation.item.input_audio_transcription.delta",
"item_id": "item_003",
"content_index": 0,
"delta": "Hello,"
}
```

Here is an example transcription completion event:
```json
{
"event_id": "event_2122",
"type": "conversation.item.input_audio_transcription.completed",
"item_id": "item_003",
"content_index": 0,
"transcript": "Hello, how are you?"
}
```

Note that ordering between `completed` events from different speech turns is not guaranteed. Use `item_id` to match these events to the corresponding `input_audio_buffer.committed` events, and use the `previous_item_id` field on `input_audio_buffer.committed` to reconstruct the ordering.
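Here is a minimal receive-loop sketch, reusing the `ws` connection from the earlier sketch, that prints deltas as they stream in and records the ordering hints from `committed` events:

```python
import json

transcripts = {}  # item_id -> final transcript text
order = {}        # item_id -> previous_item_id (None for the first turn)

while True:
    event = json.loads(ws.recv())
    etype = event["type"]
    if etype == "input_audio_buffer.committed":
        # Committed events carry the ordering hint between speech turns.
        order[event["item_id"]] = event.get("previous_item_id")
    elif etype == "conversation.item.input_audio_transcription.delta":
        # Incremental text for the gpt-4o-transcribe models; whisper-1
        # sends the whole turn in a single delta instead.
        print(event["delta"], end="", flush=True)
    elif etype == "conversation.item.input_audio_transcription.completed":
        transcripts[event["item_id"]] = event["transcript"]
        print()  # end the printed line for this turn
```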
To send audio data to the transcription session, use the `input_audio_buffer.append` event.
You have two options:
- Use a streaming microphone input
- Stream data from a WAV file, as shown in the sketch below
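Here is a minimal sketch of the second option, assuming a local file named speech.wav that already matches the session's input format (24 kHz, mono, 16-bit PCM):

```python
import base64
import json
import wave

# The file is assumed to already be 24 kHz mono 16-bit PCM;
# resample it first if it isn't.
with wave.open("speech.wav", "rb") as f:
    while True:
        chunk = f.readframes(4800)  # ~200 ms of audio at 24 kHz
        if not chunk:
            break
        ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))
```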
Voice activity detection
The Realtime API supports automatic voice activity detection (VAD). Enabled by default, VAD controls when the input audio buffer is committed, and therefore when transcription begins.
Read more about configuring VAD in our Voice Activity Detection guide.
You can also disable VAD by setting the `audio.input.turn_detection` property to `null` and control when to commit the input audio on your end.
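In that case, after appending audio with `input_audio_buffer.append`, you end the turn yourself. A minimal sketch, reusing the `ws` connection from earlier:

```python
import json

# With "turn_detection": null, the server never commits the buffer on
# its own; send a commit event when you decide the turn has ended.
ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
```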
Additional configurations
Noise reduction
Use the `audio.input.noise_reduction` property to configure how to handle noise reduction in the audio stream.
- `{ "type": "near_field" }`: Use near-field noise reduction (default).
- `{ "type": "far_field" }`: Use far-field noise reduction.
- `null`: Disable noise reduction.
Using logprobs
You can use the `include` property to include logprobs in the transcription events, using `item.input_audio_transcription.logprobs`.
These logprobs can be used to calculate a confidence score for the transcription.
```json
{
"type": "session.update",
"session": {
"audio": {
"input": {
"format": {
"type": "audio/pcm",
"rate": 24000
},
"transcription": {
"model": "gpt-4o-transcribe"
},
"turn_detection": {
"type": "server_vad",
"threshold": 0.5,
"prefix_padding_ms": 300,
"silence_duration_ms": 500
}
}
},
"include": ["item.input_audio_transcription.logprobs"]
}
}
```
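One common approach is to average the per-token probabilities of a completed turn. A minimal sketch follows; the exact shape of the logprobs payload on the completed event (here, a `logprobs` list whose entries carry a `logprob` field) is an assumption to verify against the API reference:

```python
import math

def transcript_confidence(event: dict) -> float:
    """Rough confidence: average token probability for one completed turn.

    Assumes the completed-transcription event carries a `logprobs` list
    (requested via the `include` session property) whose entries each
    have a `logprob` field.
    """
    logprobs = event.get("logprobs") or []
    if not logprobs:
        return 0.0
    return sum(math.exp(entry["logprob"]) for entry in logprobs) / len(logprobs)
```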