API deployment checklist

A focused guide to commonly underused, high-value design choices that can make a real difference in deployment quality, speed, cost, and reliability.

Contents and expected impact:

  • Use the Responses API: Quality, cost, latency, reliability
  • Set up reasoning.effort: Quality, cost, latency
  • Set up text.verbosity: Quality, cost, latency
  • Set up the assistant phase parameter: Quality, cost
  • Use tool_search: Cost, latency
  • Leverage built-in tools: Quality
  • Leverage compaction: Cost
  • Use prompt_cache_key: Latency, cost
  • Use reasoning.encrypted_content: Quality, latency
  • Use background=True: Resumability
  • Use WebSocket mode: Latency

Use the Responses API

Always start with the Responses API. It is OpenAI’s flagship API and the best place to access the newest model behavior, built-in tools, stateful workflows, and agent features.

Set up reasoning.effort

Use reasoning.effort to decide how much thinking the model should do before it answers.

For gpt-5.5, the supported values are none, low, medium, high, and xhigh. The default is medium. Lower effort is faster and uses fewer reasoning tokens. Higher effort gives the model more time for planning, debugging, synthesis, and multi-step tradeoffs. The right value depends on the task, not just the model.

Use low when the job is mostly extraction, routing, classification, or a simple rewrite. Use medium or high when the model needs to diagnose a problem, compare options, write a plan, or reason through code. Reserve xhigh for cases where your evals show the extra latency is worth it.

Tune reasoning effort for the task
from openai import OpenAI

client = OpenAI()

prompt = """
Our CI job started failing after a dependency bump.

Error:
TypeError: Timeout.__init__() got an unexpected keyword argument 'connect'

Identify the likeliest root cause and the smallest safe fix.
"""

response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "high"},
    input=prompt,
)

print(response.output_text)
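
The task-to-effort guidance above can be sketched as a small routing helper. This is only an illustration, not part of the OpenAI SDK: the task labels and the names `EFFORT_BY_TASK`, `pick_effort`, and `build_request` are hypothetical, and the mapping is a starting point to adjust against your own evals.

```python
# Hypothetical task labels mapped to reasoning.effort values, following the
# guidance above. Tune this mapping after running your own evals.
EFFORT_BY_TASK = {
    "extraction": "low",
    "routing": "low",
    "classification": "low",
    "simple_rewrite": "low",
    "comparison": "medium",
    "planning": "medium",
    "diagnosis": "high",
    "code_reasoning": "high",
}

DEFAULT_EFFORT = "medium"  # The documented default for gpt-5.5.


def pick_effort(task_kind: str) -> str:
    """Return a reasoning effort for a task label, falling back to the default."""
    return EFFORT_BY_TASK.get(task_kind, DEFAULT_EFFORT)


def build_request(task_kind: str, prompt: str, model: str = "gpt-5.5") -> dict:
    """Build keyword arguments intended for client.responses.create(**kwargs)."""
    return {
        "model": model,
        "reasoning": {"effort": pick_effort(task_kind)},
        "input": prompt,
    }
```

A call site could then pass the result straight through, for example `client.responses.create(**build_request("classification", ticket_text))`, where `ticket_text` is whatever string your app supplies.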

Set up text.verbosity

text.verbosity is the main lever for balancing brevity against completeness. Use lower verbosity when the product needs a quick, compact answer, and higher verbosity when the response needs richer explanation, clearer structure, or complete context. Lower verbosity means fewer output tokens, so the model generates less and returns output faster.

For coding, medium and high tend to produce longer, more organized output with clearer structure. low keeps the answer tighter and more minimal.

Set lower verbosity for compact output
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.5",
    text={"verbosity": "low"},
    input="""
    Summarize this incident for the next on-call engineer.
    - checkout latency spiked from 220 ms to 4.8 s
    - only us-east-1 was affected
    - rollback is complete
    - likely trigger: cache stampede after deploy
    """,
)

print(response.output_text)

Set up the assistant phase parameter

phase is a label on assistant messages in the conversation history. It tells the model whether a prior assistant message was intermediate working commentary or the final answer. Use phase: "commentary" for progress updates, pre-tool-call notes, and other in-between messages. Use phase: "final_answer" for the completed response.

The assistant might say something like:

Assistant commentary message
{
  "role": "assistant",
  "phase": "commentary",
  "content": "I'm checking the logs and comparing them to the last successful deploy."
}

That is not the answer. It is a progress note. Later, the assistant might say:

Assistant final answer message
{
  "role": "assistant",
  "phase": "final_answer",
  "content": "The deploy failed because the migration referenced a column that does not exist in production."
}

This is useful in long-running or tool-heavy workflows where the assistant may produce visible progress updates before it finishes. When you send that history back to the model, preserve phase on assistant messages so the model can tell which messages are progress updates and which message is the final result.

Preserve and resend phase on assistant messages in follow-up requests for newer models, gpt-5.3-codex and later. Doing so helps prevent early stopping, so the agent keeps running until it reaches the final answer.
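
A minimal sketch of carrying phase forward between turns follows. It assumes your app stores history items as plain dictionaries shaped like the two messages above; `extend_history` and `latest_final_answer` are hypothetical helper names, not SDK functions.

```python
import copy


def extend_history(history: list[dict], output_items: list[dict], next_user_text: str) -> list[dict]:
    """Return a new history: prior items, the model's items verbatim, then the new user turn."""
    new_history = list(history)
    # Deep-copy so later edits never mutate the stored items. Every field,
    # including `phase`, is carried forward unchanged.
    new_history.extend(copy.deepcopy(output_items))
    new_history.append({"role": "user", "content": next_user_text})
    return new_history


def latest_final_answer(history: list[dict]):
    """Return the content of the most recent final_answer message, or None."""
    for item in reversed(history):
        if item.get("role") == "assistant" and item.get("phase") == "final_answer":
            return item["content"]
    return None
```

The point of the design is that nothing in the helper rebuilds assistant messages field by field, which is the usual way phase gets dropped by accident.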

Use tool_search

Instead of loading the full tool catalog into every request, add {"type": "tool_search"} and mark expensive tool definitions with defer_loading: true. The model can then load the subset it needs at runtime. At request start, the model only sees the search tool name and description. If the model decides it needs a deferred tool, it runs tool search, and only then are the deferred tool definitions loaded into context. Only then will the model call them. This saves tokens and preserves cache performance.

There are two modes:

  • Hosted tool search is the simpler option. Use it when you already know which tools could be available for the request.
  • Client-executed tool search is for cases where your app has to decide which tools are available, for example based on the user’s tenant, project, permissions, or internal registry.

Start with hosted tool search unless your app really needs to control discovery itself.

Group your tools by user intent. Use namespaces or MCP servers when you can. It is easier for the model to choose between a few clear groups than a long flat list of functions. We recommend keeping each namespace under about 10 functions for optimal token efficiency and model performance.

Keep namespace descriptions short and discriminative. Put the detailed instructions inside the deferred tool definitions. Avoid making one giant namespace for everything.

Use hosted tool search with deferred tools
from openai import OpenAI

client = OpenAI()

billing_lookup_invoice = {
    "type": "function",
    "name": "billing.lookup_invoice",
    "description": "Look up invoice state, taxes, credits, and payment attempts.",
    "parameters": {
        "type": "object",
        "properties": {
            "invoice_id": {"type": "string"},
        },
        "required": ["invoice_id"],
        "additionalProperties": False,
    },
    "strict": True,
    "defer_loading": True,
}

crm_get_account = {
    "type": "function",
    "name": "crm.get_account",
    "description": "Fetch account owner, plan, health, and payment history.",
    "parameters": {
        "type": "object",
        "properties": {
            "account_id": {"type": "string"},
        },
        "required": ["account_id"],
        "additionalProperties": False,
    },
    "strict": True,
    "defer_loading": True,
}

response = client.responses.create(
    model="gpt-5.5",
    input=(
        "Find the right billing tool and explain why invoice INV-1043 still "
        "shows overdue after a payment yesterday."
    ),
    tools=[
        {"type": "tool_search"},
        billing_lookup_invoice,
        crm_get_account,
    ],
)

print(response.output_text)
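
The namespace sizing guidance above can be checked mechanically before a deploy. The sketch below assumes a "namespace.function" naming convention like the one in the sample (an assumption made for illustration, not an API feature); `group_by_namespace` and `oversized_namespaces` are hypothetical helpers.

```python
from collections import defaultdict

MAX_FUNCTIONS_PER_NAMESPACE = 10  # Soft limit taken from the guidance above.


def group_by_namespace(tools: list[dict]) -> dict[str, list[str]]:
    """Group tool names by the text before the first dot in each name."""
    groups = defaultdict(list)
    for tool in tools:
        # A name without a dot becomes its own single-tool namespace.
        namespace, _, _ = tool["name"].partition(".")
        groups[namespace].append(tool["name"])
    return dict(groups)


def oversized_namespaces(tools: list[dict], limit: int = MAX_FUNCTIONS_PER_NAMESPACE) -> list[str]:
    """Return namespaces holding `limit` or more functions, sorted by name."""
    groups = group_by_namespace(tools)
    return sorted(ns for ns, names in groups.items() if len(names) >= limit)
```

Running `oversized_namespaces` over the full catalog in CI is one way to catch a namespace that has grown into a long flat list.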

Leverage built-in tools

Built-in tools are the API’s native capabilities. Instead of building every tool yourself, you can give the model access to tools that already work inside the Responses API. The model can then decide when to use them.

OpenAI keeps adding more native tools, so start with built-in tools when they fit your workflow. Build custom tools when native options do not cover the task. Current built-in tools and related tool options include:

  • Web search: Search the web for up-to-date information
  • File search: Search uploaded files or vector stores
  • Code interpreter: Run Python for analysis, math, charts, and file processing
  • Shell: Run shell commands in a hosted container or your own runtime
  • Computer use: Operate a UI through screenshots, clicks, typing, and scrolling
  • Image generation: Generate or edit images
  • MCP/connectors: Connect the model to external services and tools
  • Skills: Attach reusable instruction bundles and workflow files
  • Apply patch: Make structured code edits

There is also a model-quality reason to prefer them. Built-in tools are in-distribution for our post-training, meaning that the models are trained and evaluated around these tool shapes, behaviors, and outputs. As a result, OpenAI models tend to select built-in tools more accurately, execute them more cleanly, and fail less often than they do with custom tools they have not seen before.

Leverage compaction

Compaction is a context engineering tool: it decides what information the model carries forward across many turns. In long-running agents, the problem is not just, “Will I hit the context limit?” It is that old messages, tool logs, retries, and stale details crowd out the state the model needs.

Compaction gives you a controlled way to reduce context size while preserving the state needed for subsequent turns. After a meaningful milestone, like finishing a debugging phase or narrowing a root cause, you can compact the prior window and continue from the compacted output. This keeps the model sharp because the next turn is built around the important state, not every intermediate reasoning step, failed command, and obsolete branch.

There are two ways to leverage compaction:

  • Let the server handle it: if you use previous_response_id, turn on context_management with a compact_threshold. The server will automatically compact the conversation when it gets too large. You keep sending only the newest user message.
  • Do it yourself: if you manage the full input array yourself, call client.responses.compact(). It gives back a smaller context window. Use that returned output directly in the next responses.create() call.

Do not edit the compacted output. It is not a human summary, but the machine state that helps the model continue. Pass it forward as-is, then add the next user message.

Continue from compacted response state
from openai import OpenAI

client = OpenAI()

# Full window collected from a long debugging session:
# user messages, assistant outputs, tool calls, and tool outputs.
# session_items is a placeholder for the list your app has accumulated.
long_window = session_items

compacted = client.responses.compact(
    model="gpt-5.5",
    input=long_window,
)

next_response = client.responses.create(
    model="gpt-5.5",
    store=False,
    input=[
        *compacted.output,  # Use compact output as-is.
        {
            "type": "message",
            "role": "user",
            "content": (
                "We found the bad cache invalidation path. Write the fix plan "
                "and the verification checklist."
            ),
        },
    ],
)

print(next_response.output_text)
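
When you manage the input array yourself, you also need a rule for when to compact. One possible trigger is sketched below. It assumes items are JSON-serializable dictionaries and uses a rough four-characters-per-token estimate, which is a heuristic and not the model's tokenizer; `estimate_tokens`, `should_compact`, and both limits are hypothetical.

```python
import json


def estimate_tokens(items: list[dict]) -> int:
    """Rough size estimate: about 4 characters of serialized JSON per token."""
    return len(json.dumps(items, default=str)) // 4


def should_compact(items: list[dict], soft_limit: int, hard_limit: int, milestone_reached: bool) -> bool:
    """Compact at a milestone once past soft_limit, or always once past hard_limit."""
    size = estimate_tokens(items)
    if size >= hard_limit:
        return True
    return milestone_reached and size >= soft_limit
```

The two limits reflect the guidance above: prefer compacting right after a meaningful milestone, and keep the hard limit only as a backstop so the window never grows unbounded.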

Use prompt_cache_key

Prompt caching automatically reduces latency and cost when requests reuse the same long prefix. For high-volume workflows, set prompt_cache_key consistently for requests that share the same stable prefix.

The cache key is combined with the prompt prefix hash, so it helps route similar requests to the same cache without changing the model input. Keep the key stable for genuinely shared prefixes, and choose a granularity that avoids sending too much traffic to one prefix-key pair. If one prefix and prompt_cache_key combination exceeds about 15 requests per minute, requests may overflow to additional machines and reduce cache effectiveness.

Route related requests to the same prompt cache
from openai import OpenAI

client = OpenAI()

instructions = """
You are the support agent for Acme.
Follow the Acme support policy and escalation rubric.
Use the same tone, safety rules, and tool plan for each ticket.
"""

response = client.responses.create(
    model="gpt-5.5",
    prompt_cache_key="tenant-acme-support-agent",
    instructions=instructions,
    input="Summarize the current escalation for the on-call lead.",
)

print(response.output_text)
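
If one prefix-key pair would exceed the roughly 15 requests per minute mentioned above, one option is to split traffic across a few stable sub-keys. The sketch below shards by a hash of a user ID. That granularity is an assumption chosen for illustration, and `shard_count` and `prompt_cache_key_for` are hypothetical helpers.

```python
import hashlib
import math

MAX_RPM_PER_KEY = 15  # Approximate per-key rate from the guidance above.


def shard_count(expected_rpm: float) -> int:
    """Number of sub-keys needed to keep each one at or under the per-key rate."""
    return max(1, math.ceil(expected_rpm / MAX_RPM_PER_KEY))


def prompt_cache_key_for(base_key: str, user_id: str, expected_rpm: float) -> str:
    """Return a stable cache key, sharded by user only when traffic requires it."""
    shards = shard_count(expected_rpm)
    if shards == 1:
        return base_key
    # hashlib gives the same bucket across processes and restarts, unlike the
    # built-in hash(), which is randomized per process for strings.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % shards
    return f"{base_key}-{shard}"
```

Because the same user always maps to the same sub-key, follow-up requests from that user keep landing on the same cache.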

Use reasoning.encrypted_content

Always round-trip reasoning items. This helps the model by allowing it to work from its prior reasoning. If your Zero Data Retention (ZDR) requirements do not allow storing response data, this is where reasoning.encrypted_content is important. reasoning.encrypted_content gives you a stateless handoff.

Add reasoning.encrypted_content to include, and reasoning items in the response output will include encrypted reasoning content that can be passed back into the next request. Your app does not need to understand that value. It just keeps the reasoning item exactly as returned and sends it back during the next turn, so the model can use it to continue the workflow.

Pass encrypted reasoning between stateless turns
from openai import OpenAI

client = OpenAI()

first = client.responses.create(
    model="gpt-5.5",
    store=False,
    reasoning={"effort": "medium"},
    include=["reasoning.encrypted_content"],
    input="Investigate why invoice INV-1043 has mismatched tax totals.",
)

second = client.responses.create(
    model="gpt-5.5",
    store=False,
    reasoning={"effort": "medium"},
    include=["reasoning.encrypted_content"],
    input=[
        *first.output,
        {
            "role": "user",
            "content": "Now write the customer-facing explanation in plain English.",
        },
    ],
)

print(second.output_text)
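
In a stateless setup it is easy to forget the include entry and silently send reasoning items that carry no usable content. A guard along these lines can catch that before the next request. It assumes output items have been converted to plain dictionaries; `missing_encrypted_reasoning` and `stateless_next_input` are hypothetical helpers.

```python
def missing_encrypted_reasoning(output_items: list[dict]) -> list[str]:
    """Return IDs of reasoning items that lack encrypted_content."""
    return [
        item.get("id", "<no id>")
        for item in output_items
        if item.get("type") == "reasoning" and not item.get("encrypted_content")
    ]


def stateless_next_input(output_items: list[dict], user_text: str) -> list[dict]:
    """Build the next turn's input, keeping every prior item exactly as returned."""
    missing = missing_encrypted_reasoning(output_items)
    if missing:
        raise ValueError(f"Reasoning items without encrypted_content: {missing}")
    return [*output_items, {"role": "user", "content": user_text}]
```

The helper never inspects or rewrites the encrypted value itself; it only checks that the value is present and passes the item through untouched.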

Use background=True

Use background=True for requests that may take a long time. Instead of keeping the client connection open, the API starts a job and returns an ID. Your app can poll that job until it finishes, fails, or is canceled. Use it for large analyses, long tool runs, or work that needs status and retry behavior.

background=True requires store=True.

Run and poll a background response
from openai import OpenAI
import time

client = OpenAI()

job = client.responses.create(
    model="gpt-5.5",
    background=True,
    store=True,
    input="Analyze this large log bundle and cluster the primary failure modes.",
    # log_bundle_file_id is a placeholder for the ID of a file you uploaded earlier.
    tools=[
        {
            "type": "code_interpreter",
            "container": {
                "type": "auto",
                "file_ids": [log_bundle_file_id],
            },
        }
    ],
)

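while job.status in {"queued", "in_progress"}:
    time.sleep(2)  # Poll every 2 seconds.
    job = client.responses.retrieve(job.id)

# The job can also end as failed or cancelled, so check before reading output.
if job.status == "completed":
    print(job.output_text)
else:
    print(f"Job ended with status: {job.status}")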
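
For production use, a polling loop usually also needs a deadline and some backoff. A minimal sketch under those assumptions follows. `poll_until_done` is a hypothetical helper, and the `retrieve` argument is any callable that returns an object with a `status` attribute, such as `client.responses.retrieve`.

```python
import time

PENDING_STATUSES = {"queued", "in_progress"}


def poll_until_done(
    retrieve,
    response_id: str,
    interval: float = 2.0,
    max_interval: float = 30.0,
    timeout: float = 600.0,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Poll until the job leaves the pending states, doubling the delay each time."""
    deadline = clock() + timeout
    delay = interval
    job = retrieve(response_id)
    while job.status in PENDING_STATUSES:
        if clock() >= deadline:
            raise TimeoutError(f"Response {response_id} still {job.status} after {timeout}s")
        sleep(delay)
        delay = min(delay * 2, max_interval)  # Exponential backoff with a cap.
        job = retrieve(response_id)
    return job
```

The caller still has to inspect the returned status, since a job that stopped polling may have completed, failed, or been canceled. Injecting `sleep` and `clock` keeps the helper easy to test without real waiting.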

You can combine it with stream=True for progress events, but the first event may take longer than a normal request.

From the UI perspective, background mode indicates, “This is running; here is the status; the result will appear here when it’s ready.”

Note: background=True is not compatible with Zero Data Retention.

Use WebSocket mode

WebSocket mode is built for long-running, tool-call-heavy workflows where you keep a persistent connection open and continue by sending only new input items plus previous_response_id. For rollouts with 20 or more tool calls, this approach is roughly 40% faster end-to-end.

How this works: The first message looks like a normal Responses request: model, instructions, tools, and user input. The server streams events back. If the model asks for a tool, your app runs the tool. Then, instead of sending a new HTTP request, you send another response.create event on the same socket, with previous_response_id set to the prior response's ID and the new item as input. That is where the latency win comes from. In plain HTTP, every follow-up is a fresh request. In WebSocket mode, the connection stays open and the most recent response state stays warm in memory on that connection. When the next turn continues from that response, the backend has to do less setup work.

If your workflow is one request, one answer, then keep HTTP. If your workflow behaves like a long-running agent, try WebSocket mode.

A single WebSocket connection handles one in-flight response at a time, so parallel work needs multiple connections. Connections currently top out at 60 minutes. Continuation uses the same previous_response_id semantics as HTTP mode, with a connection-local cache for the most recent response.

Note: WebSocket mode works with ZDR because your data is not stored to disk, only stored in memory.

The Python sample uses websocket-client (pip install websocket-client). An equivalent JavaScript client can use ws (npm install ws).

Start a Responses API WebSocket session
from openai import OpenAI
from websocket import create_connection
import json

client = OpenAI()

ws = create_connection(
    "wss://api.openai.com/v1/responses",
    header=[f"Authorization: Bearer {client.api_key}"],
)

# Same request body you would send to client.responses.create(...).
ws.send(
    json.dumps(
        {
            "type": "response.create",
            "model": "gpt-5.5",
            "store": False,
            "input": [
                {
                    "type": "message",
                    "role": "user",
                    "content": [
                        {
                            "type": "input_text",
                            "text": (
                                "Find the flaky test in this run, call the tools "
                                "you need, and keep going until you can explain "
                                "the root cause."
                            ),
                        }
                    ],
                }
            ],
            # test_log_tool and code_search_tool are placeholders for your own
            # function tool definitions.
            "tools": [test_log_tool, code_search_tool],
        }
    )
)

first_event = json.loads(ws.recv())
print(first_event["type"])
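
The follow-up step described above, sending a tool result back on the same socket, could look roughly like this. The payload layout is an assumption that mirrors the first message plus previous_response_id, with the tool result carried as a function_call_output item. `continuation_event` is a hypothetical helper, so check the field names against the current API reference before relying on it.

```python
import json


def continuation_event(
    previous_response_id: str,
    call_id: str,
    tool_output: str,
    model: str = "gpt-5.5",
    tools=None,
) -> dict:
    """Build a response.create event that continues from a prior response."""
    return {
        "type": "response.create",
        "model": model,
        "store": False,
        # Continue from the response that requested the tool call.
        "previous_response_id": previous_response_id,
        # Send only the new item, not the whole conversation.
        "input": [
            {
                "type": "function_call_output",
                "call_id": call_id,
                "output": tool_output,
            }
        ],
        "tools": tools if tools is not None else [],
    }


def encode_event(event: dict) -> str:
    """Serialize an event for ws.send(...)."""
    return json.dumps(event)
```

With the socket from the sample above, the send would be `ws.send(encode_event(continuation_event(response_id, call_id, result)))`, where the three arguments come from the events your app has already received.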

Final takeaway

The Responses API is the foundation for building smarter, more capable OpenAI applications. Its real advantage is that it lets developers move from one-off prompts to durable, tool-using, context-aware workflows that adapt to the complexity of the task. Apply the items in this checklist, and measure their effect with your own evals, to improve quality, cost, latency, and reliability in real deployments.