Using realtime models

Prompt, reason, and tune realtime voice models.

gpt-realtime-2 is our state-of-the-art reasoning voice model for low-latency speech-to-speech applications. It can think before it speaks, follow instructions more reliably, use a larger context window, and call tools with greater precision than earlier realtime models.

To take advantage of these gains, design prompts with more intent. Define the assistant’s responsibilities, decision points, tool-calling behavior, and guardrails clearly: what it should do, when it should do it, and what it should avoid.

Start simple. Do not over-prompt upfront. Begin with a minimal prompt, run evaluations, then add instructions only for behaviors that fail in testing.

Choose a model

| Model | Use when | Prompting focus |
| --- | --- | --- |
| gpt-realtime-2 | You need the strongest realtime reasoning, tool use, and instruction following. | Tune reasoning effort, preambles, tool policies, exact entity capture, and long-session state. |
| gpt-realtime-1.5 | You need a fast, reliable non-reasoning speech-to-speech model. | Follow the core realtime prompt structure and test for latency-sensitive behavior. |

Realtime 2.0 Prompting Guide

Use gpt-realtime-2 when the voice agent needs stronger reasoning, tool selection, exact entity handling, or long-session state. Start with reasoning.effort: "low", test default preamble behavior, and define clear confirmation boundaries before write actions.

What changed in Realtime 2

Prompt Realtime 2 as a reasoning voice agent, not as a basic voice bot.

| Change | What it means for prompts |
| --- | --- |
| Reasoning | Allow the model to reason internally for complex tasks before speaking or calling tools. Use preambles to avoid awkward silence or unnecessary filler. |
| Prompt precision matters more | Replace broad guidance like “be helpful” with clear trigger, action, and exception rules: when to act, what to do, and when not to do it. |
| Instruction conflicts are more costly | Remove overlapping always, never, only, and must rules unless they are truly required. Define priority when rules compete. |
| Tool behavior is more steerable | Specify when the assistant should act immediately, ask for missing information, confirm high-precision details, retry after failure, or escalate. |
| Preambles are first-class behavior | The model may speak brief updates before longer reasoning or tool-use flows. Steer when preambles should appear, how short they should be, and when to skip them. |
| Expanded context window | gpt-realtime-2 expands the realtime context window from 32k to 128k tokens, making it better suited for long sessions and larger system prompts. |

Preambles aren’t hidden chain-of-thought. They’re short spoken updates such as “I’ll check that order now.” Don’t ask the model to reveal private reasoning.

Use short, labeled sections. The model should be able to find the relevant instructions quickly.

# Role and Objective

# Personality and Tone

# Language

# Reasoning

# Message Channels

# Preambles

# Verbosity

# Tools

# Unclear Audio

# Entity Capture

# Long Context Behavior

# Escalation

Not every use case needs every section. Add the sections that are relevant for your product.
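As a rough illustration of picking only the relevant sections, the sketch below assembles a prompt from the section names listed above, in the order shown. The helper name `assemble_prompt` and the sample section text are hypothetical, not part of any SDK.

```python
# Section labels copied from the list above, in the order they appear.
SECTION_ORDER = [
    "Role and Objective", "Personality and Tone", "Language", "Reasoning",
    "Message Channels", "Preambles", "Verbosity", "Tools", "Unclear Audio",
    "Entity Capture", "Long Context Behavior", "Escalation",
]


def assemble_prompt(sections: dict) -> str:
    """Join the supplied sections into one prompt, skipping unused ones."""
    unknown = set(sections) - set(SECTION_ORDER)
    if unknown:
        raise ValueError(f"unknown section(s): {sorted(unknown)}")
    blocks = [
        f"# {name}\n\n{sections[name].strip()}"
        for name in SECTION_ORDER
        if name in sections
    ]
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical section text, for illustration only.
    print(assemble_prompt({
        "Tools": "Use only the tools in the current tool list.",
        "Role and Objective": "You are a support agent for order questions.",
    }))
```

Keeping the order fixed makes prompt diffs easier to review when sections are added after a failed evaluation.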

Set reasoning effort

gpt-realtime-2 can trade latency for deeper reasoning. Use the lowest reasoning level that still gives the assistant enough intelligence for the workflow.

Start with low for most production voice agents. Tune up or down based on task complexity, latency tolerance, and failure cost.

| Effort | Use when | Example |
| --- | --- | --- |
| minimal | Lowest latency matters most and the task is simple. | Smart-home commands, timers, simple calendar checks. |
| low | You need responsiveness plus basic reasoning. | Customer support, order lookup, simple policy questions. |
| medium | The assistant must reason through multi-step tasks. | Technical support, diagnostics, complex routing. |
| high | Deeper reasoning materially improves success. | High-precision workflows, escalation decisions, tasks with constraints. |
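A minimal sketch of setting the effort level from application code follows. It assumes the setting is sent in a `session.update` event at `session.reasoning.effort`, inferred from the `reasoning.effort` name used in this guide; confirm the exact field location in the API reference. The helper name `build_session_update` is hypothetical.

```python
import json

# Effort levels named in the table above, lowest latency first.
EFFORT_LEVELS = ("minimal", "low", "medium", "high")


def build_session_update(effort: str = "low") -> dict:
    """Build a session.update-style payload carrying the reasoning effort.

    Assumption: the field path session.reasoning.effort is illustrative.
    """
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    return {
        "type": "session.update",
        "session": {"reasoning": {"effort": effort}},
    }


if __name__ == "__main__":
    print(json.dumps(build_session_update("low"), indent=2))
```

Defaulting to `low` mirrors the starting point recommended above; raise it only for workflows where evaluations show a benefit.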

Beyond the API setting, steer the model on when and how much to reason.

## Reasoning

- For direct answers, simple lookups, and short confirmations, respond quickly and do not reason.
- For multi-step tasks, tool decisions, troubleshooting, or escalation, reason before acting.
- Do not perform extended reasoning when the user's audio is unclear; ask for clarification instead.

Use preambles intentionally

Preambles are short spoken updates that keep a voice agent feeling responsive while it reasons, looks something up, or calls a tool. Used well, they reassure the user that the assistant is working. Used poorly, they become filler and increase perceived latency.

gpt-realtime-2 generates preambles by default. Start by testing the default behavior. If it does not match your product experience, tune it explicitly.

[Figure: Preamble generation and playback timeline]

## Preambles

Use short preambles only when they help the user understand that work is happening.

### When to use a preamble

Use a preamble when:

- you are about to call a tool that may take noticeable time;
- you need to reason through a multi-step request;
- you are checking records, availability, account state, or policy details;
- you are preparing an escalation or handoff;
- silence would make the assistant feel unresponsive.

When a preamble is needed, output it immediately before substantive reasoning or tool use.

### When not to use a preamble

Do not use a preamble when:

- the answer is direct and can be given immediately;
- the user is only confirming, correcting, or declining something;
- the audio is unclear and you need clarification;
- the latest audio is silence, background noise, hold music, TV audio, or side conversation;
- the tool call is lightweight and the user would not benefit from an update.

### Preamble style

When using a preamble:

- keep it natural, calm, and concise;
- vary the wording across turns;
- describe the action, not the internal reasoning;
- avoid filler.

Avoid phrases like:

- "Let me think..."
- "Hmm..."
- "One moment while I process that..."
- "I am now going to access the tool..."

### Preamble length

Default to one short sentence.

Use at most two short sentences, unless the user needs an explanation before a high-impact action.

### Prefer

- "I'll check that order now."
- "I'll look up your appointment details."
- "I'll verify that before we make any changes."
- "I'll check the policy and then give you the next step."
- "I'll pull that up so we can make sure it's the right account."

### Avoid

- "Let me think about that for a second."
- "Please wait while I process your request."
- "I'm going to use my tools now."
- "Interesting question. I will reason through this carefully."

Control response length

gpt-realtime-2 follows length guidance best when the prompt specifies how much detail to give for each task type. Instead of telling the model to “be concise,” define what concise means in context: direct answers, tool results, troubleshooting, comparisons, and escalations may each need different response lengths.

## Verbosity

- Direct answers: Use 1-2 short sentences.
- Clarifying questions: Ask one question at a time.
- Tool results: Summarize the result first, then give only the next useful action.
- Product or option comparisons: Include key differences, tradeoffs, and who each option fits.
- Troubleshooting: Give one step at a time unless the user asks for the full procedure.
- Escalations: Briefly explain why escalation is needed and what will happen next.

Example:

User: Which plan should I choose?

Assistant: If you want the lowest cost, choose Basic. If you need team permissions and shared billing, choose Pro. If compliance review or admin controls matter, choose Enterprise.

Design tool behavior

gpt-realtime-2 is stronger at tool calling, but tool behavior still depends on prompt and tool-spec design. If the prompt does not define when to act, ask, confirm, or recover, the assistant may call tools too early, ask unnecessary questions, or repeat failed calls.

Set tool-call eagerness

High eagerness works well for read-only, low-risk actions. Low eagerness is better when tools modify data, trigger external effects, or depend on exact identifiers.

| Tool type | Default behavior |
| --- | --- |
| Read-only, low-risk lookup | Call when intent and required fields are clear. |
| Read-only with exact identifier | Confirm the identifier before lookup. |
| User-visible communication | Draft or summarize before sending. |
| Account changes | Confirm before calling. |
| Purchases, cancellations, payments | Confirm amount, target, and consequence before calling. |
| Irreversible or high-impact actions | Confirm explicitly and offer escalation when appropriate. |

Use this balanced default when you have a mix of read and write actions. Tailor it based on your use case.
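Prompt rules can be backed by an application-side guard. The sketch below encodes the table above as a lookup and blocks a tool call until the required confirmation is present. The category names and the `may_call_tool` helper are hypothetical; this is one possible safeguard, not a built-in API feature.

```python
from enum import Enum


class ToolRisk(Enum):
    """Hypothetical categories mirroring the rows of the table above."""
    READ_ONLY = "read_only"
    READ_EXACT_ID = "read_exact_id"
    COMMUNICATION = "communication"
    ACCOUNT_CHANGE = "account_change"
    PAYMENT = "payment"
    IRREVERSIBLE = "irreversible"


# Only low-risk, read-only lookups may run without user confirmation.
NEEDS_CONFIRMATION = {risk: risk is not ToolRisk.READ_ONLY for risk in ToolRisk}


def may_call_tool(risk: ToolRisk, fields_complete: bool,
                  user_confirmed: bool = False) -> bool:
    """Return True only when the call satisfies the default policy."""
    if not fields_complete:
        return False  # Ask for the missing field first.
    if NEEDS_CONFIRMATION[risk]:
        return user_confirmed
    return True


if __name__ == "__main__":
    print(may_call_tool(ToolRisk.READ_ONLY, fields_complete=True))
    print(may_call_tool(ToolRisk.PAYMENT, fields_complete=True))
```

A guard like this catches the case where the model skips a confirmation step despite the prompt, which matters most for payments and irreversible actions.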

## Tools

Use only the tools explicitly provided in the current tool list. Do not invent, assume, simulate, or rename tools.

For read-only tools:

- Call the tool when the user's intent is clear and all required fields are available.
- Do not ask for confirmation unless the lookup depends on a high-precision identifier or there is meaningful risk of using the wrong record.
- Ask a clarification question only if a required field is missing, ambiguous, or conflicting.

For write tools or external actions:

- Summarize the intended action before calling the tool.
- Include the key consequence, such as what will be changed, sent, canceled, ordered, or charged.
- Ask for confirmation.
- Do not call the tool until the user clearly confirms.

For exact identifiers:

- Treat order IDs, tracking numbers, account numbers, confirmation codes, phone numbers, and email addresses as high precision.
- Normalize only when the field type is clear.
- Confirm the final value before account-specific lookups, validation, or write actions.

After tool calls:

- Only say an action was completed after the tool call succeeds.
- If the tool fails, explain the failure briefly, avoid raw errors, and give the user a clear next step.

High-risk example:

User: Charge my card for the remaining balance.

Bad:

Assistant: I’ve charged your card.

Good:

Assistant: To confirm, you want me to charge the card on file $248.16 for the remaining balance. Should I proceed?

Recover from tool failures

Tool failures are part of the conversation. A good recovery should explain what happened and give the user a clear next step.

Do not treat every failure the same. Recovery behavior should depend on the tool type, failure mode, and user impact. Some failures should be handled silently with a retry. Others require asking the user to clarify, correct an identifier, confirm a new action, or choose an alternate path.

## Tool Failures

If a tool call fails:

1. Briefly explain what failed in user-friendly language.
2. Do not blame the user or expose raw tool errors.
3. If the failure may be due to an exact identifier, read back the value used and ask the user to correct it.
4. If the failure may be temporary, offer to retry once.
5. If the same failure happens repeatedly, offer an alternate path or escalation.

Do not repeatedly call the same tool with the same arguments after failure.

Do not ask for a different identifier until you have first checked whether the captured value was correct.

Bad:

Assistant: Something went wrong.

Good:

Assistant: I couldn’t find a match for O R D dash 3 1 2 5 B 2 3. Did I get any part of that wrong?
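On the application side, recovery rules like these could be sketched as a wrapper that retries a temporary failure once and otherwise reports which recovery path applies. `ToolFailure`, `call_with_recovery`, and the `next_step` labels are hypothetical names chosen for illustration.

```python
from typing import Any, Callable


class ToolFailure(Exception):
    """Hypothetical error type raised by a tool implementation."""

    def __init__(self, message: str, temporary: bool = False):
        super().__init__(message)
        self.temporary = temporary


def call_with_recovery(tool: Callable[..., Any], args: dict,
                       max_retries: int = 1) -> dict:
    """Call a tool, retrying temporary failures at most max_retries times."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return {"ok": True, "result": tool(**args), "attempts": attempts}
        except ToolFailure as err:
            if err.temporary and attempts <= max_retries:
                continue  # Silent retry for a transient failure.
            # Repeated temporary failure: offer another path or escalate.
            # Other failures: read the identifier back before asking for a new one.
            next_step = "escalate" if err.temporary else "read_back_identifier"
            return {"ok": False, "next_step": next_step, "attempts": attempts}


if __name__ == "__main__":
    def lookup(order_id: str) -> str:
        raise ToolFailure("no match", temporary=False)

    print(call_with_recovery(lookup, {"order_id": "ORD-3125B23"}))
```

The wrapper never repeats a call after a non-temporary failure, so the model is pushed toward verifying the captured value instead of looping on the same arguments.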

Keep tool availability synchronized

Realtime models are eager to help. If the prompt mentions a tool that is not actually available, or if the tool list does not match the prompt, the model may invent a tool name or pretend it completed the action.

For example, if the prompt references lookup_order, but the provided tool is named search_orders, the model may call the wrong name or simulate the action.

## Tool Availability

Use only the tools that are explicitly provided in the current tool list.

Do not invent, assume, or simulate tools. If a tool is mentioned in the instructions but is not present in the tool list, treat it as unavailable.

If the user requests an action that requires an unavailable tool:

1. Do not pretend to complete the action.
2. Briefly explain that the tool is not available.
3. Offer the closest supported next step.

Only say an action was completed after the relevant tool call succeeds.
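A simple pre-deployment check can catch the mismatch described above. This sketch assumes tool names appear in the prompt wrapped in backticks and use snake_case; both conventions are assumptions, and any backticked word that is not a tool will also be flagged. The helper name `find_missing_tools` is hypothetical.

```python
import re


def find_missing_tools(prompt: str, tools: list) -> set:
    """Return backticked names in the prompt that are not in the tool list."""
    referenced = set(re.findall(r"`([a-z][a-z0-9_]*)`", prompt))
    available = {tool["name"] for tool in tools}
    return referenced - available


if __name__ == "__main__":
    prompt = "Call `lookup_order` when the user asks about an order."
    tools = [{"name": "search_orders"}]
    print(find_missing_tools(prompt, tools))
```

Running a check like this whenever the prompt or tool list changes keeps the two in sync.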

Use the prompt audit meta prompt in the appendix to review production prompts for contradictions, missing tools, and brittle instructions.

Handle silence and background audio

Voice agents tend to respond by default. In production, they often hear audio that should not receive a spoken response, such as silence, background noise, hold music, TV audio, or side conversations.

Use a no-op wait tool when the assistant should stay quiet and keep listening. The tool gives the model a valid non-speaking action instead of making it say things like “I’m here” or “I didn’t catch that.”

Tool design:

{
  "name": "wait_for_user",
  "description": "Call this when the latest audio does not need a spoken response, such as silence, background noise, hold music, TV audio, side conversation, or speech not addressed to the assistant. This tool helps end the turn without a spoken reply.",
  "parameters": {
    "type": "object",
    "properties": {},
    "required": []
  }
}

Pair it with prompt instructions:

## Handling Silence and Background Noise

If the latest audio is silence, background noise, hold music, TV audio, side conversation, or speech not addressed to you, call `wait_for_user`.

Do not respond conversationally after calling this tool.

Do not say "I'm here," "I didn't catch that," "Take your time," or "Let me know when you're ready."

Resume normal responses only when the user clearly addresses you or asks for help.

Use this for non-addressed audio, not for unclear user requests. If the user is clearly speaking to the assistant but the content is unintelligible, ask for clarification instead.
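On the client, one way to keep the assistant quiet is to acknowledge the `wait_for_user` call without requesting a new response. The sketch below builds the events a client might send after a tool call. It assumes the usual function-calling flow (a `conversation.item.create` event with a `function_call_output` item, followed by `response.create`); whether skipping `response.create` is the right approach for your integration is an assumption to verify. The helper name is hypothetical.

```python
import json
from typing import Optional

# Tools that should end the turn without a spoken reply.
NO_OP_TOOLS = {"wait_for_user"}


def events_after_tool_call(name: str, call_id: str,
                           result: Optional[dict] = None) -> list:
    """Return the client events to send once a tool call has been handled."""
    output_event = {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": call_id,
            "output": json.dumps(result or {}),
        },
    }
    if name in NO_OP_TOOLS:
        # Record the empty result but do not ask the model to speak.
        return [output_event]
    return [output_event, {"type": "response.create"}]


if __name__ == "__main__":
    for event in events_after_tool_call("wait_for_user", "call_123"):
        print(json.dumps(event))
```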

Use message channels deliberately

gpt-realtime-2 can produce user-visible intermediate messages in the commentary channel and final user-facing responses in the final channel. Use channel-specific instructions when the behavior depends on where it appears.

| Channel | User-visible? | Used for |
| --- | --- | --- |
| commentary | Yes | Preambles and tool calls. |
| final | Yes | Final user-facing message. |

For example, tool calls happen in the commentary channel. If you want the assistant to say something before, during, or after tool use, specify that behavior in relation to the commentary channel.

Before calling tools in the commentary channel, briefly tell the user what you are doing.

gpt-realtime-2 can emit multiple response phases in a single turn. In API output, this distinction is represented by the response.done event, which includes a phase value that indicates whether the content is commentary or the final answer.

You can use this field to handle each phase differently in your application. For example, commentary can be played or displayed as a short intermediate update, while final_answer can be reserved for the assistant’s completed response.

response.output[0].phase: "commentary"
response.output[1].phase: "final_answer"
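As a sketch of handling each phase differently, the function below groups output items by their `phase` value. It assumes the response object is available as a plain dictionary with an `output` list, matching the field paths shown above; the helper name `split_by_phase` and the `id` values in the usage example are hypothetical.

```python
def split_by_phase(response: dict) -> dict:
    """Group output items into commentary and final_answer buckets."""
    grouped = {"commentary": [], "final_answer": []}
    for item in response.get("output", []):
        phase = item.get("phase")
        if phase in grouped:
            grouped[phase].append(item)
    return grouped


if __name__ == "__main__":
    response = {
        "output": [
            {"id": "item_1", "phase": "commentary"},
            {"id": "item_2", "phase": "final_answer"},
        ]
    }
    grouped = split_by_phase(response)
    print(len(grouped["commentary"]), len(grouped["final_answer"]))
```

Items without a recognized phase are ignored here; an application may prefer to log them instead.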

Handle unclear audio

The model should only act on audio it can understand with confidence. If the audio is unclear, the model should ask a brief clarification question instead of guessing.

Do not let the model infer missing words, call tools, capture entities, generate preambles, or spend hidden reasoning time trying to reconstruct what the user may have said.

## Unclear Audio

- Only respond to clear audio or text.
- If the user's audio is not clear, ask for clarification using a short English phrase such as "Sorry, could you repeat that clearly?"
- Don't use the same unclear-audio clarification wording twice in a row; rephrase if you need to ask again.
- Treat audio as unclear if it is ambiguous, noisy, silent, unintelligible, partially cut off, or if you are unsure of the exact words the user said.
- Do not guess what the user meant from unclear audio.
- Do not reason when the audio is unclear.
- Do not provide a preamble or call tools in the commentary channel when the audio is unclear.

Example:

User audio: “Check order three one-” [cut off]

Bad:

Assistant: I’ll check order 31 now.

Good:

Assistant: I heard only part of the order number. Could you repeat it digit by digit?

Capture exact entities

Many realtime workflows depend on exact values: order IDs, tracking numbers, email addresses, confirmation codes, account numbers, claim numbers, ticket IDs, support references, and phone numbers.

Voice makes this hard. Users speak quickly, group numbers in different ways, spell partial values, use filler, correct themselves mid-turn, or pronounce characters that sound alike. One wrong digit can fail a lookup or retrieve the wrong account.

Capture entities conservatively. Collect one value at a time, normalize only what is clear, confirm high-precision values before tool calls, and make every correction recoverable.

Collect one entity at a time

When a workflow needs multiple values, collect them one at a time. This prevents fields from blending together, especially in voice conversations.

## Entity Collection Order

Collect required values one at a time.

- Ask for only the next missing value.
- Do not ask for multiple values in the same turn.
- Before asking, check whether the value was already provided earlier in the conversation or the session.
- If a possible value already exists, confirm it with the user before using it.

Example:

"I see tracking number ABC-54321 from earlier. Should I use that one, or do you have a different tracking number?"

Do not call tools until the current value has been collected, validated, and confirmed.

Handle spelled-out characters

Use this when users spell IDs, codes, names, or email addresses one character at a time. The spoken form is input, not the final value.

## Spelled-Out Characters

When a user dictates an ID, code, or email character by character, treat the spoken sequence as one compact value. Preserve explicitly spoken separators like dash, dot, underscore, slash, or plus; otherwise do not add spaces or separators.

Examples:

- "A B C one two three" -> "ABC123"
- "B C dash nine eight seven" -> "BC-987"
- "J O H N at example dot com" -> "john@example.com"

Do not insert spaces between spelled-out characters unless the user explicitly says the value contains spaces.
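If the application validates values after the model captures them, the same rules could be sketched in code. This is an illustrative normalizer, not how the model works internally; the word lists are assumptions and cover only the separators and digits named above. Lowercasing values that contain "@" is also an assumption based on the email example.

```python
# Spoken separator words mapped to the characters they stand for.
SEPARATORS = {"dash": "-", "dot": ".", "underscore": "_",
              "slash": "/", "plus": "+", "at": "@"}
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}


def normalize_spelled(spoken: str) -> str:
    """Collapse a character-by-character dictation into one compact value."""
    parts = []
    for token in spoken.split():
        key = token.lower()
        if key in SEPARATORS:
            parts.append(SEPARATORS[key])
        elif key in DIGITS:
            parts.append(DIGITS[key])
        else:
            parts.append(token)  # A single letter or a whole word.
    value = "".join(parts)
    # Assumption: email addresses are stored in lowercase.
    return value.lower() if "@" in value else value


if __name__ == "__main__":
    print(normalize_spelled("B C dash nine eight seven"))
```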

Normalize spoken numbers carefully

For numeric identifiers, users may say digits individually, group them, or use natural number phrases. If the field expects one continuous numeric value, convert clear numeric speech into digits.
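The conversion could be sketched as follows for application-side validation. The parser is a simplified assumption: it reads phrases containing "hundred" or "thousand" as one cardinal number and otherwise concatenates each spoken chunk. It always picks a single reading, so genuinely ambiguous input still needs confirmation with the user.

```python
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
         "fourteen": 14, "fifteen": 15, "sixteen": 16, "seventeen": 17,
         "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}


def _cardinal(tokens: list) -> str:
    """Parse a phrase such as 'nine thousand nine hundred eleven'."""
    total = current = 0
    for tok in tokens:
        if tok in UNITS:
            current += UNITS[tok]
        elif tok in TEENS:
            current += TEENS[tok]
        elif tok in TENS:
            current += TENS[tok]
        elif tok == "hundred":
            current = max(current, 1) * 100
        elif tok == "thousand":
            total += max(current, 1) * 1000
            current = 0
        else:
            raise ValueError(f"not a number word: {tok!r}")
    return str(total + current)


def spoken_to_digits(spoken: str) -> str:
    """Convert clear numeric speech into a continuous digit string."""
    tokens = spoken.lower().replace("-", " ").split()
    if "hundred" in tokens or "thousand" in tokens:
        return _cardinal(tokens)
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok in TENS and nxt in UNITS and nxt != "zero":
            out.append(str(TENS[tok] + UNITS[nxt]))  # "twenty three" -> 23
            i += 2
            continue
        if tok in TENS:
            out.append(str(TENS[tok]))
        elif tok in TEENS:
            out.append(str(TEENS[tok]))
        elif tok in UNITS:
            out.append(str(UNITS[tok]))
        else:
            raise ValueError(f"not a number word: {tok!r}")
        i += 1
    return "".join(out)


if __name__ == "__main__":
    print(spoken_to_digits("ninety nine eleven"))
```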

## Spoken Number Handling

Convert spoken numbers into digits when collecting numeric identifiers.

Examples:

- "one two three four" -> "1234"
- "one twenty three" -> "123"
- "one nineteen" -> "119"
- "ninety nine eleven" -> "9911"
- "nine thousand nine hundred eleven" -> "9911"

If multiple interpretations are plausible, ask the user to clarify before using the value.

Example:

"I heard either 119 or 1-19. Could you repeat the number digit by digit?"

Confirm exact identifiers before tool calls

Order IDs, tracking numbers, account numbers, claim numbers, confirmation codes, and similar identifiers are high-precision fields. Confirm them before using them in a tool call.

For numeric identifiers, read the value back digit by digit. Reading the value as a full number can hide errors.

Example:

Assistant: Just to confirm, I heard 8… 3… 5… 2… 1. Is that right?

If the user corrects one character or digit, repeat the full corrected value before calling the tool.

Example:

Assistant: Got it. I have 8… 3… 5… 7… 1. Is that correct?

## Exact Identifier Confirmation

Before calling tools with high-precision identifiers:

- Confirm the final normalized value with the user.
- Read numeric identifiers back digit by digit.
- Do not use guessed, partial, or ambiguous values.
- If the user corrects the value, repeat the full corrected value before calling the tool.
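When the application generates the read-back text itself, for example in a confirmation prompt injected into the session, a small formatter keeps it consistent. This is a sketch; the `read_back` helper and its symbol names are hypothetical.

```python
# Spoken names for separators that may appear in identifiers.
SYMBOL_NAMES = {"-": "dash", ".": "dot", "_": "underscore",
                "/": "slash", "+": "plus", "@": "at"}


def read_back(value: str, pause: str = " ") -> str:
    """Spell an identifier one character at a time, naming separators."""
    return pause.join(SYMBOL_NAMES.get(ch, ch) for ch in value)


if __name__ == "__main__":
    print(read_back("83521", pause="… "))
    print(read_back("ORD-3125B23"))
```

After a correction, call the formatter again on the full corrected value so the user hears the whole identifier, not only the changed digit.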

Confirm emails character by character

Email addresses are high-precision values. Dots, dashes, underscores, repeated letters, and similar-sounding names can cause account lookup failures or send messages to the wrong address.

Ask the user to spell the email address:

Assistant: Could you spell the email address character by character so I can make sure I have it exactly right?

When reading it back, confirm the exact final address:

Assistant: Just to confirm, that is c-h-e-n at example dot com, right?
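The read-back format in this example can be produced with a small helper when the application builds the confirmation text. This is a sketch under the assumption that the local part is spelled character by character and the domain is spoken as words; `read_back_email` is a hypothetical name.

```python
# Spoken names for symbols that may appear before the "@".
SYMBOL_NAMES = {".": "dot", "-": "dash", "_": "underscore", "+": "plus"}


def read_back_email(email: str) -> str:
    """Format an email address for a spoken confirmation."""
    local, at, domain = email.rpartition("@")
    if not at or not local or not domain:
        raise ValueError(f"not an email address: {email!r}")
    spelled_local = "-".join(SYMBOL_NAMES.get(ch, ch) for ch in local)
    spoken_domain = domain.replace(".", " dot ")
    return f"{spelled_local} at {spoken_domain}"


if __name__ == "__main__":
    print(read_back_email("chen@example.com"))
```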

## Email Confirmation

Email addresses must be captured exactly.

If the user says the email naturally without spelling it out, ask them to repeat it character by character.

Example:

"Could you spell the email address character by character so I can make sure I have it exactly right?"

When reading an email back, confirm the exact final email address.

Example:

"Just to confirm, that is c-h-e-n at example dot com, right?"

[Figure: Entity collection workflow]

Avoid literal instruction traps

gpt-realtime-2 follows instructions more literally than earlier realtime models. Prompts that worked well on older models may need tuning.

Use precise language. The model may prioritize the exact wording of an instruction over the broader behavior you intended. Broad or rigid rules can dominate the assistant’s behavior in surprising ways, especially when multiple rules overlap.

Be careful with constraint words such as must, only, never, and always. Use them when the behavior is truly required, not as general emphasis. Overusing hard constraints can make the assistant rigid, overly cautious, or unable to handle reasonable exceptions.

Prefer precise scope:

For write actions that modify user data, ask for confirmation before calling the tool.

Avoid broad scope:

Always ask for confirmation before doing anything.

The broad version may cause unnecessary confirmations before harmless read-only lookups, such as checking order status, retrieving availability, or reading account information.

[Figure: Literal interpretation example]

General prompting recommendations:

  • Prefer explicit instructions over implied intent.
  • Avoid unnecessary constraint words unless behavior truly must be rigid.
  • Minimize contradictory guidance.
  • Be cautious with layered or competing priority instructions.
  • Test prompts incrementally. Small wording changes can have large behavioral effects.
  • When migrating from earlier realtime models, expect some prompts to require restructuring for best results.

Control language and accent separately

Language and accent should be controlled separately.

A user’s accent is not the same as their intended language. A user may speak English with a Hindi, Spanish, French, or Mandarin accent and still expect English responses.

Avoid broad language instructions such as:

Mirror the user.
Respond naturally in the user's language.
Switch languages when appropriate.
Sound local.
Adapt to the user's accent.

These are too broad. The model may interpret accent, filler words, backchannels, or isolated foreign words as a reason to switch languages.

English language policy

## Language

English is the default response language.

- Do not infer language from accent alone.
- Ignore short filler sounds, backchannels, and isolated foreign words for language detection.
- Only switch languages if the user explicitly asks or provides a substantive utterance in another language.
- If language confidence is low, ask a short clarification instead of guessing.
- Keep preambles, spoken bridges, tool-related messages, and final answers in the same language.
- Accent adaptation must not change the response language.

Multilingual policy

## Language

Default to English unless the user clearly uses another language.

Switch languages only when:

- the user explicitly asks to use another language;
- the user provides a substantive utterance in another language. A substantive utterance means the user gives a complete request, question, or correction in another language, not just a greeting, name, address, filler word, or borrowed phrase.

Do not switch languages based on:

- accent;
- pronunciation;
- filler words;
- short backchannels;
- names;
- addresses;
- isolated foreign words.

If uncertain, ask:

"Would you like me to continue in English or [LANGUAGE]?"

Accent control

gpt-realtime-2 can follow accent instructions more strongly, but vague accent prompts can cause drift or unintended language switching.

Accent-control prompts work best when they specify:

  • the target accent;
  • which characteristics should remain stable;
  • the intended pacing, stress, and prosody;
  • whether accent adaptation should affect language choice.

Instead of:

Sound Australian.

Use:

## Accent

Speak English with a light Australian accent.

- Keep the accent stable from the first word to the last.
- Use natural Australian vowel shaping, but keep speech easy to understand.
- Do not exaggerate the accent.
- Do not change response language based on the user's accent.

Custom voices

Use Custom Voices when standard voices cannot reliably meet brand, accent, or character requirements.

Prompting can steer accent, pacing, and delivery, but it cannot fully replace voice design. For use cases that require consistent branded voice identity or accent fidelity, consider Custom Voices.

Custom Voices are available only to approved customers. Contact your account team for access.

Maintain state in long sessions

gpt-realtime-2 expands the realtime context window from 32k to 128k tokens, making it better suited for long sessions. In a dense two-way conversation, 128k tokens corresponds to roughly 1-2 hours of raw audio context. The exact duration varies with tool use, internal reasoning, injected records, and other session details.

For long-context use cases, gpt-realtime-2 performs best when it can tell what information is current, what is background, and what should be ignored if sources conflict. Do not rely on the model to infer source priority from a raw transcript or large context dump. Use structure.

Use a structured pattern when starting a session with a large amount of context, such as retrieved records, prior conversation history, policies, summaries, account notes, or background documents.
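One possible shape for such a pattern is sketched below: the context is split into labeled sections whose titles state the source priority. The section titles, the `build_session_context` helper, and the sample records are all hypothetical; adapt the labels to your own data sources.

```python
def build_session_context(current: list, background: list,
                          superseded: list) -> str:
    """Render session context with explicit source priority labels."""
    sections = [
        ("Current State (highest priority)", current),
        ("Background (use only if consistent with current state)", background),
        ("Superseded (ignore if it conflicts with anything above)", superseded),
    ]
    lines = ["# Session Context"]
    for title, items in sections:
        if not items:
            continue  # Leave out empty sections to save tokens.
        lines.append(f"## {title}")
        lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    print(build_session_context(
        current=["Order ABC-54321 is delayed."],
        background=["Customer prefers email updates."],
        superseded=["Earlier note: order ABC-54321 is on time."],
    ))
```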

Migrate from earlier realtime models

When migrating from earlier realtime models, treat the prompt as a behavior surface, not just text to port.

  1. Use Codex or a strong reasoning model to restructure the prompt around the latest Realtime prompting guidance. Include a link to this prompting guide to ground the migration in best practices.
  2. Set reasoning effort to low instead of the default. Increase only for workflows that require deeper planning.
  3. Audit tool names, parameters, enums, JSON schemas, and other settings to make sure they match the expected implementation.
  4. Remove stale examples. Add short examples for happy paths, ambiguity, interruptions, tool calls, and fallback behavior.
  5. Compare representative conversations before and after migration. Check for regressions against an existing eval and document intentional behavior changes.
  6. Run a final consistency pass. Confirm the prompt clearly separates hard requirements, defaults, tool rules, safety rules, and fallback behavior.
  7. Run evals, inspect representative failures, and iterate on the prompt until the target behaviors are reliable.

Next steps