Server events
These are events emitted from the OpenAI Realtime WebSocket server to the client.
Returned when an error occurs, which could be a client problem or a server problem. Most errors are recoverable and the session will stay open, we recommend to implementors to monitor and log error messages by default.
Sent by the server when an Item is added to the default Conversation. This can happen in several cases:
- When the client sends a
conversation.item.createevent. - When the input audio buffer is committed. In this case the item will be a user message containing the audio from the buffer.
- When the model is generating a Response. In this case the
conversation.item.addedevent will be sent when the model starts generating a specific Item, and thus it will not yet have any content (andstatuswill bein_progress).
The event will include the full content of the Item (except when model is generating a Response) except for audio data, which can be retrieved separately with a conversation.item.retrieve event if necessary.
Returned when a conversation item is finalized.
The event will include the full content of the Item except for audio data, which can be retrieved separately with a conversation.item.retrieve event if needed.
This event is the output of audio transcription for user audio written to the user audio buffer. Transcription begins when the input audio buffer is committed by the client or server (when VAD is enabled). Transcription runs asynchronously with Response creation, so this event may come before or after the Response events.
Realtime API models accept audio natively, and thus input transcription is a separate process run on a separate ASR (Automatic Speech Recognition) model. The transcript may diverge somewhat from the model's interpretation, and should be treated as a rough guide.
Returned when an input audio transcription segment is identified for an item.
Returned when input audio transcription is configured, and a transcription
request for a user message failed. These events are separate from other
error events so that the client can identify the related Item.
Returned when an earlier assistant audio message item is truncated by the
client with a conversation.item.truncate event. This event is used to
synchronize the server's understanding of the audio with the client's playback.
This action will truncate the audio and remove the server-side text transcript to ensure there is no text in the context that hasn't been heard by the user.
Returned when an item in the conversation is deleted by the client with a
conversation.item.delete event. This event is used to synchronize the
server's understanding of the conversation history with the client's view.
Returned when an input audio buffer is committed, either by the client or
automatically in server VAD mode. The item_id property is the ID of the user
message item that will be created, thus a conversation.item.created event
will also be sent to the client.
SIP Only: Returned when an DTMF event is received. A DTMF event is a message that
represents a telephone keypad press (0–9, *, #, A–D). The event property
is the keypad that the user press. The received_at is the UTC Unix Timestamp
that the server received the event.
Sent by the server when in server_vad mode to indicate that speech has been
detected in the audio buffer. This can happen any time audio is added to the
buffer (unless speech is already detected). The client may want to use this
event to interrupt audio playback or provide visual feedback to the user.
The client should expect to receive a input_audio_buffer.speech_stopped event
when speech stops. The item_id property is the ID of the user message item
that will be created when speech stops and will also be included in the
input_audio_buffer.speech_stopped event (unless the client manually commits
the audio buffer during VAD activation).
Returned in server_vad mode when the server detects the end of speech in
the audio buffer. The server will also send an conversation.item.created
event with the user message item that is created from the audio buffer.
Returned when the Server VAD timeout is triggered for the input audio buffer. This is configured
with idle_timeout_ms in the turn_detection settings of the session, and it indicates that
there hasn't been any speech detected for the configured duration.
The audio_start_ms and audio_end_ms fields indicate the segment of audio after the last
model response up to the triggering time, as an offset from the beginning of audio written
to the input audio buffer. This means it demarcates the segment of audio that was silent and
the difference between the start and end values will roughly match the configured timeout.
The empty audio will be committed to the conversation as an input_audio item (there will be a
input_audio_buffer.committed event) and a model response will be generated. There may be speech
that didn't trigger VAD but is still detected by the model, so the model may respond with
something relevant to the conversation or a prompt to continue speaking.
Returned when a Response is done streaming. Always emitted, no matter the
final state. The Response object included in the response.done event will
include all output Items in the Response but will omit the raw audio data.
Clients should check the status field of the Response to determine if it was successful
(completed) or if there was another outcome: cancelled, failed, or incomplete.
A response will contain all output items that were generated during the response, excluding any audio content.
Returned when a new Item is created during Response generation.
Returned when an Item is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
Returned when a new content part is added to an assistant message item during response generation.
Returned when a content part is done streaming in an assistant message item. Also emitted when a Response is interrupted, incomplete, or cancelled.
Returned when the text value of an "output_text" content part is updated.
Returned when the text value of an "output_text" content part is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
Returned when the model-generated transcription of audio output is updated.
Returned when the model-generated transcription of audio output is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
Returned when the model-generated audio is updated.
Returned when the model-generated audio is done. Also emitted when a Response is interrupted, incomplete, or cancelled.
Returned when the model-generated function call arguments are updated.
Returned when the model-generated function call arguments are done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
Returned when MCP tool call arguments are updated during response generation.
Returned when MCP tool call arguments are finalized during response generation.
Returned when an MCP tool call has started and is in progress.
Returned when an MCP tool call has completed successfully.
Emitted at the beginning of a Response to indicate the updated rate limits. When a Response is created some tokens will be "reserved" for the output tokens, the rate limits shown here reflect that reservation, which is then adjusted accordingly once the Response is completed.
Returned when a conversation item is created. There are several scenarios that produce this event:
- The server is generating a Response, which if successful will produce
either one or two Items, which will be of type
message(roleassistant) or typefunction_call. - The input audio buffer has been committed, either by the client or the
server (in
server_vadmode). The server will take the content of the input audio buffer and add it to a new user message Item. - The client has sent a
conversation.item.createevent to add a new Item to the Conversation.