The Responses API supports a WebSocket mode for long-running, tool-call-heavy workflows. In this mode, you keep a persistent connection to /v1/responses and continue each turn by sending only new input items plus previous_response_id.
WebSocket mode is compatible with both Zero Data Retention (ZDR) and store=false.
Why use WebSocket mode
WebSocket mode is most useful when a workflow involves many model-tool round trips (for example, agentic coding or orchestration loops with repeated tool calls).
Because the connection stays open and each turn sends only incremental input, WebSocket mode reduces per-turn continuation overhead and improves end-to-end latency across long chains. For rollouts with 20+ tool calls, we have seen up to roughly 40% faster end-to-end execution.
Connect and create responses
In WebSocket mode, start each turn by sending a response.create event from the client. The payload mirrors the normal Responses create body, except that transport-specific fields like stream and background are not used.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
from websocket import create_connection
import json
import os
ws = create_connection(
"wss://api.openai.com/v1/responses",
header=[
f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}",
],
)
ws.send(
json.dumps(
{
"type": "response.create",
"model": "gpt-5.2",
"store": False,
"input": [
{
"type": "message",
"role": "user",
"content": [{"type": "input_text", "text": "Find fizz_buzz()"}],
}
],
"tools": [],
}
)
)Clients can optionally warm up request state by sending response.create with generate: false. This is useful when you already know the tools, instructions, and/or custom messages you plan to send with an upcoming turn. generate: false does not return a model output, but prepares request state so the next generated turn can start faster. The warmup request returns a response ID that you can chain from with previous_response_id, including on later turns in a response chain. The next section explains how to continue a session using previous_response_id and incremental inputs.
Continue with incremental inputs
To continue a run, send another response.create with:
previous_response_idset to the prior response ID.inputcontaining only new items (for example, tool outputs and the next user message).
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
ws.send(
json.dumps(
{
"type": "response.create",
"model": "gpt-5.2",
"store": False,
"previous_response_id": "resp_123",
"input": [
{
"type": "function_call_output",
"call_id": "call_123",
"output": "tool result",
},
{
"type": "message",
"role": "user",
"content": [{"type": "input_text", "text": "Now optimize it."}],
},
],
"tools": [],
}
)
)How continuation works
WebSocket mode uses the same previous_response_id chaining semantics as HTTP mode, but it adds a lower-latency continuation path on the active socket.
On an active WebSocket connection, the service keeps one previous-response state in a connection-local in-memory cache (the most recent response). Continuing from that most recent response is fast because the service can reuse connection-local state. Because the previous-response state is retained only in memory and is not written to disk, you can use WebSocket mode in a way that is compatible with store=false and Zero Data Retention (ZDR).
If a previous_response_id is not in the in-memory cache, behavior depends on whether you store responses:
- With
store=true, the service may hydrate older response IDs from persisted state when available. Continuation can still work, but it usually loses the in-memory latency benefit. - With
store=false(including ZDR), there is no persisted fallback. If the ID is uncached, the request returnsprevious_response_not_found.
If a turn fails (4xx or 5xx), the service evicts the referenced previous_response_id from the connection-local cache. This prevents reusing stale cached state for that failed continuation.
Compaction and creating new responses
If you are using compaction, there are two different continuation patterns:
Server-side compaction (context_management)
When you enable server-side compaction (context_management with compact_threshold), compaction happens during normal /responses generation. In WebSocket mode, you continue the same way you normally do: send the next response.create with the latest previous_response_id and only new input items.
Standalone /responses/compact
The standalone /responses/compact endpoint returns a new compacted input window, not a response ID. After compaction, create a new response on your WebSocket connection using the compacted window as input (plus the next user/tool items).
Start a new chain by omitting previous_response_id or setting it to null. Pass the compacted output as-is; do not prune the returned window.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
# Compact your current window (HTTP call)
compacted = client.responses.compact(
model="gpt-5.2",
input=long_input_items_array,
)
# Start a new response on the WebSocket using the compacted window
ws.send(
json.dumps(
{
"type": "response.create",
"model": "gpt-5.2",
"store": False,
"input": [
*compacted.output,
{
"type": "message",
"role": "user",
"content": [{"type": "input_text", "text": "Continue from here."}],
},
],
"tools": [],
}
)
)Connection behavior and limits
- Server events and ordering match the existing Responses streaming event model.
- A single WebSocket connection can receive multiple
response.createmessages, but it runs them sequentially (one in-flight response at a time). - No multiplexing support today. Use multiple connections if you need parallel runs.
- Connection duration is limited to 60 minutes. Reconnect when the limit is reached.
Reconnect and recover
When a connection closes (or hits the 60-minute limit), open a new WebSocket connection and continue with one of these patterns:
- If your prior response is persisted (
store=true) and you have a valid response ID, continue withprevious_response_idand new input items. - If you cannot continue the chain (for example,
store=false/ZDR orprevious_response_not_found), start a new response by settingprevious_response_idtonull(or omitting it) and send the full input context for the next turn. - If you compacted context with
/responses/compact, use the returned compacted window as the baseinputfor that new response, then append the latest user/tool items.
Errors to handle
previous_response_not_found
1
2
3
4
5
6
7
8
9
{
"type": "error",
"status": 400,
"error": {
"code": "previous_response_not_found",
"message": "Previous response with id 'resp_abc' not found.",
"param": "previous_response_id"
}
}websocket_connection_limit_reached
1
2
3
4
5
6
7
8
9
{
"type": "error",
"error": {
"type": "invalid_request_error",
"code": "websocket_connection_limit_reached",
"message": "Responses websocket connection limit reached (60 minutes). Create a new websocket connection to continue."
},
"status": 400
}