Skip to content
Primary navigation

Evals

Manage and run evals in the OpenAI platform.

resource openai_eval

required Expand Collapse
data_source_config: Attributes

The configuration for the data source used for the evaluation runs. Dictates the schema of the data used in the evaluation.

item_schema?: Map[JSON]

The json schema for each row in the data source.

type?: String

The type of data source. Always custom.

include_sample_schema?: Bool

Whether the eval should expect you to populate the sample namespace (ie, by generating responses off of your data source)

metadata?: Map[JSON]

Metadata filters for the logs data source.

testing_criteria: List[Attributes]

A list of graders for all eval runs in this group. Graders can reference variables in the data source using double curly braces notation, like {{item.variable_name}}. To reference the model’s output, use the sample namespace (ie, {{sample.output_text}}).

input?: List[Attributes]

A list of chat messages forming the prompt or context. May include variable references to the item namespace, ie {{item.name}}.

content: String

The content of the message.

role: String

The role of the message (e.g. “system”, “assistant”, “user”).

type?: String

The type of the message input. Always message.

labels?: List[String]

The labels to classify to each item in the evaluation.

model?: String

The model to use for the evaluation. Must support structured outputs.

name: String

The name of the grader.

passing_labels?: List[String]

The labels that indicate a passing result. Must be a subset of labels.

type?: String

The object type, which is always label_model.

operation?: String

The string check operation to perform. One of eq, ne, like, or ilike.

reference?: String

The reference text. This may include template strings.

evaluation_metric?: String

The evaluation metric to use. One of cosine, fuzzy_match, bleu, gleu, meteor, rouge_1, rouge_2, rouge_3, rouge_4, rouge_5, or rouge_l.

pass_threshold?: Float64

The threshold for the score.

source?: String

The source code of the python script.

image_tag?: String

The image tag to use for the python script.

range?: List[Float64]

The range of the score. Defaults to [0, 1].

sampling_params?: Attributes

The sampling parameters for the model.

max_completions_tokens?: Int64

The maximum number of tokens the grader model may generate in its response.

reasoning_effort?: String

Constrains effort on reasoning for reasoning models. Currently supported values are none, minimal, low, medium, high, and xhigh. Reducing reasoning effort can result in faster responses and fewer tokens used on reasoning in a response.

  • gpt-5.1 defaults to none, which does not perform reasoning. The supported reasoning values for gpt-5.1 are none, low, medium, and high. Tool calls are supported for all reasoning values in gpt-5.1.
  • All models before gpt-5.1 default to medium reasoning effort, and do not support none.
  • The gpt-5-pro model defaults to (and only supports) high reasoning effort.
  • xhigh is supported for all models after gpt-5.1-codex-max.
seed?: Int64

A seed value to initialize the randomness, during sampling.

temperature?: Float64

A higher temperature increases randomness in the outputs.

top_p?: Float64

An alternative to temperature for nucleus sampling; 1.0 includes all tokens.

optional Expand Collapse
name?: String

The name of the evaluation.

metadata?: Map[String]

Set of 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.

Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.

computed Expand Collapse
id: String

Unique identifier for the evaluation.

created_at: Int64

The Unix timestamp (in seconds) for when the eval was created.

object: String

The object type.

data openai_eval

optional Expand Collapse
eval_id?: String
find_one_by?: Attributes
order?: String

Sort order for evals by timestamp. Use asc for ascending order or desc for descending order.

order_by?: String

Evals can be ordered by creation time or last updated time. Use created_at for creation time or updated_at for last updated time.

computed Expand Collapse
id: String
created_at: Int64

The Unix timestamp (in seconds) for when the eval was created.

name: String

The name of the evaluation.

object: String

The object type.

metadata: Map[String]

Set of 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.

Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.

data_source_config: Attributes

Configuration of data sources used in runs of the evaluation.

schema: Map[JSON]

The json schema for the run data source items. Learn how to build JSON schemas here.

type: String

The type of data source. Always custom.

metadata: Map[String]

Set of 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.

Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.

testing_criteria: List[Attributes]

A list of testing criteria.

input: List[Attributes]
content: String

Inputs to the model - can contain template strings. Supports text, output text, input images, and input audio, either as a single item or an array of items.

role: String

The role of the message input. One of user, assistant, system, or developer.

type: String

The type of the message input. Always message.

labels: List[String]

The labels to assign to each item in the evaluation.

model: String

The model to use for the evaluation. Must support structured outputs.

name: String

The name of the grader.

passing_labels: List[String]

The labels that indicate a passing result. Must be a subset of labels.

type: String

The object type, which is always label_model.

operation: String

The string check operation to perform. One of eq, ne, like, or ilike.

reference: String

The reference text. This may include template strings.

evaluation_metric: String

The evaluation metric to use. One of cosine, fuzzy_match, bleu, gleu, meteor, rouge_1, rouge_2, rouge_3, rouge_4, rouge_5, or rouge_l.

pass_threshold: Float64

The threshold for the score.

source: String

The source code of the python script.

image_tag: String

The image tag to use for the python script.

range: List[Float64]

The range of the score. Defaults to [0, 1].

sampling_params: Attributes

The sampling parameters for the model.

max_completions_tokens: Int64

The maximum number of tokens the grader model may generate in its response.

reasoning_effort: String

Constrains effort on reasoning for reasoning models. Currently supported values are none, minimal, low, medium, high, and xhigh. Reducing reasoning effort can result in faster responses and fewer tokens used on reasoning in a response.

  • gpt-5.1 defaults to none, which does not perform reasoning. The supported reasoning values for gpt-5.1 are none, low, medium, and high. Tool calls are supported for all reasoning values in gpt-5.1.
  • All models before gpt-5.1 default to medium reasoning effort, and do not support none.
  • The gpt-5-pro model defaults to (and only supports) high reasoning effort.
  • xhigh is supported for all models after gpt-5.1-codex-max.
seed: Int64

A seed value to initialize the randomness, during sampling.

temperature: Float64

A higher temperature increases randomness in the outputs.

top_p: Float64

An alternative to temperature for nucleus sampling; 1.0 includes all tokens.

data openai_evals

optional Expand Collapse
order?: String

Sort order for evals by timestamp. Use asc for ascending order or desc for descending order.

order_by?: String

Evals can be ordered by creation time or last updated time. Use created_at for creation time or updated_at for last updated time.

max_items?: Int64

Max items to fetch, default: 1000

computed Expand Collapse
items: List[Attributes]

The items returned by the data source

id: String

Unique identifier for the evaluation.

created_at: Int64

The Unix timestamp (in seconds) for when the eval was created.

data_source_config: Attributes

Configuration of data sources used in runs of the evaluation.

schema: Map[JSON]

The json schema for the run data source items. Learn how to build JSON schemas here.

type: String

The type of data source. Always custom.

metadata: Map[String]

Set of 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.

Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.

metadata: Map[String]

Set of 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.

Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.

name: String

The name of the evaluation.

object: String

The object type.

testing_criteria: List[Attributes]

A list of testing criteria.

input: List[Attributes]
content: String

Inputs to the model - can contain template strings. Supports text, output text, input images, and input audio, either as a single item or an array of items.

role: String

The role of the message input. One of user, assistant, system, or developer.

type: String

The type of the message input. Always message.

labels: List[String]

The labels to assign to each item in the evaluation.

model: String

The model to use for the evaluation. Must support structured outputs.

name: String

The name of the grader.

passing_labels: List[String]

The labels that indicate a passing result. Must be a subset of labels.

type: String

The object type, which is always label_model.

operation: String

The string check operation to perform. One of eq, ne, like, or ilike.

reference: String

The reference text. This may include template strings.

evaluation_metric: String

The evaluation metric to use. One of cosine, fuzzy_match, bleu, gleu, meteor, rouge_1, rouge_2, rouge_3, rouge_4, rouge_5, or rouge_l.

pass_threshold: Float64

The threshold for the score.

source: String

The source code of the python script.

image_tag: String

The image tag to use for the python script.

range: List[Float64]

The range of the score. Defaults to [0, 1].

sampling_params: Attributes

The sampling parameters for the model.

max_completions_tokens: Int64

The maximum number of tokens the grader model may generate in its response.

reasoning_effort: String

Constrains effort on reasoning for reasoning models. Currently supported values are none, minimal, low, medium, high, and xhigh. Reducing reasoning effort can result in faster responses and fewer tokens used on reasoning in a response.

  • gpt-5.1 defaults to none, which does not perform reasoning. The supported reasoning values for gpt-5.1 are none, low, medium, and high. Tool calls are supported for all reasoning values in gpt-5.1.
  • All models before gpt-5.1 default to medium reasoning effort, and do not support none.
  • The gpt-5-pro model defaults to (and only supports) high reasoning effort.
  • xhigh is supported for all models after gpt-5.1-codex-max.
seed: Int64

A seed value to initialize the randomness, during sampling.

temperature: Float64

A higher temperature increases randomness in the outputs.

top_p: Float64

An alternative to temperature for nucleus sampling; 1.0 includes all tokens.

EvalsRuns

Manage and run evals in the OpenAI platform.

resource openai_eval_run

required Expand Collapse
eval_id: String
data_source: Attributes

Details about the run’s data source.

source: Attributes

Determines what populates the item namespace in the data source.

content?: List[Attributes]

The content of the jsonl file.

item: Map[JSON]
sample?: Map[JSON]
type?: String

The type of jsonl source. Always file_content.

id?: String

The identifier of the file.

created_after?: Int64

An optional Unix timestamp to filter items created after this time.

created_before?: Int64

An optional Unix timestamp to filter items created before this time.

limit?: Int64

An optional maximum number of items to return.

metadata?: Map[String]

Set of 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.

Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.

model?: String

An optional model to filter by (e.g., ‘gpt-4o’).

reasoning_effort?: String

Constrains effort on reasoning for reasoning models. Currently supported values are none, minimal, low, medium, high, and xhigh. Reducing reasoning effort can result in faster responses and fewer tokens used on reasoning in a response.

  • gpt-5.1 defaults to none, which does not perform reasoning. The supported reasoning values for gpt-5.1 are none, low, medium, and high. Tool calls are supported for all reasoning values in gpt-5.1.
  • All models before gpt-5.1 default to medium reasoning effort, and do not support none.
  • The gpt-5-pro model defaults to (and only supports) high reasoning effort.
  • xhigh is supported for all models after gpt-5.1-codex-max.
temperature?: Float64

Sampling temperature. This is a query parameter used to select responses.

tools?: List[String]

List of tool names. This is a query parameter used to select responses.

top_p?: Float64

Nucleus sampling parameter. This is a query parameter used to select responses.

users?: List[String]

List of user identifiers. This is a query parameter used to select responses.

type?: String

The type of data source. Always jsonl.

input_messages?: Attributes

Used when sampling from a model. Dictates the structure of the messages passed into the model. Can either be a reference to a prebuilt trajectory (ie, item.input_trajectory), or a template with variable references to the item namespace.

template?: List[Attributes]

A list of chat messages forming the prompt or context. May include variable references to the item namespace, ie {{item.name}}.

content: String

Text, image, or audio input to the model, used to generate a response. Can also contain previous assistant responses.

role: String

The role of the message input. One of user, assistant, system, or developer.

phase?: String

Labels an assistant message as intermediate commentary (commentary) or the final answer (final_answer). For models like gpt-5.3-codex and beyond, when sending follow-up requests, preserve and resend phase on all assistant messages — dropping it can degrade performance. Not used for user messages.

type?: String

The type of the message input. Always message.

type: String

The type of input messages. Always template.

item_reference?: String

A reference to a variable in the item namespace. Ie, “item.input_trajectory”

model?: String

The name of the model to use for generating completions (e.g. “o3-mini”).

sampling_params?: Attributes
max_completion_tokens?: Int64

The maximum number of tokens in the generated output.

reasoning_effort?: String

Constrains effort on reasoning for reasoning models. Currently supported values are none, minimal, low, medium, high, and xhigh. Reducing reasoning effort can result in faster responses and fewer tokens used on reasoning in a response.

  • gpt-5.1 defaults to none, which does not perform reasoning. The supported reasoning values for gpt-5.1 are none, low, medium, and high. Tool calls are supported for all reasoning values in gpt-5.1.
  • All models before gpt-5.1 default to medium reasoning effort, and do not support none.
  • The gpt-5-pro model defaults to (and only supports) high reasoning effort.
  • xhigh is supported for all models after gpt-5.1-codex-max.
response_format?: Attributes

An object specifying the format that the model must output.

Setting to { "type": "json_schema", "json_schema": {...} } enables Structured Outputs which ensures the model will match your supplied JSON schema. Learn more in the Structured Outputs guide.

Setting to { "type": "json_object" } enables the older JSON mode, which ensures the message the model generates is valid JSON. Using json_schema is preferred for models that support it.

type: String

The type of response format being defined. Always text.

json_schema?: Attributes

Structured Outputs configuration options, including a JSON Schema.

name: String

The name of the response format. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64.

description?: String

A description of what the response format is for, used by the model to determine how to respond in the format.

schema?: Map[JSON]

The schema for the response format, described as a JSON Schema object. Learn how to build JSON schemas here.

strict?: Bool

Whether to enable strict schema adherence when generating the output. If set to true, the model will always follow the exact schema defined in the schema field. Only a subset of JSON Schema is supported when strict is true. To learn more, read the Structured Outputs guide.

seed?: Int64

A seed value to initialize the randomness, during sampling.

temperature?: Float64

A higher temperature increases randomness in the outputs.

tools?: List[Attributes]

A list of tools the model may call. Currently, only functions are supported as a tool. Use this to provide a list of functions the model may generate JSON inputs for. A max of 128 functions are supported.

function?: Attributes
name: String

The name of the function to be called. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64.

description?: String

A description of what the function does, used by the model to choose when and how to call the function.

parameters?: Map[JSON]

The parameters the functions accepts, described as a JSON Schema object. See the guide for examples, and the JSON Schema reference for documentation about the format.

Omitting parameters defines a function with an empty parameter list.

strict?: Bool

Whether to enable strict schema adherence when generating the function call. If set to true, the model will follow the exact schema defined in the parameters field. Only a subset of JSON Schema is supported when strict is true. Learn more about Structured Outputs in the function calling guide.

type?: String

The type of the tool. Currently, only function is supported.

name?: String

The name of the function to call.

parameters?: Map[JSON]

A JSON schema object describing the parameters of the function.

strict?: Bool

Whether to enforce strict parameter validation. Default true.

defer_loading?: Bool

Whether this function is deferred and loaded via tool search.

description?: String

A description of the function. Used by the model to determine whether or not to call the function.

vector_store_ids?: List[String]

The IDs of the vector stores to search.

filters?: Attributes

A filter to apply.

key?: String

The key to compare against the value.

type?: String

Specifies the comparison operator: eq, ne, gt, gte, lt, lte, in, nin.

  • eq: equals
  • ne: not equal
  • gt: greater than
  • gte: greater than or equal
  • lt: less than
  • lte: less than or equal
  • in: in
  • nin: not in
value?: String

The value to compare against the attribute key; supports string, number, or boolean types.

filters?: List[Attributes]

Array of filters to combine. Items can be ComparisonFilter or CompoundFilter.

key: String

The key to compare against the value.

type?: String

Specifies the comparison operator: eq, ne, gt, gte, lt, lte, in, nin.

  • eq: equals
  • ne: not equal
  • gt: greater than
  • gte: greater than or equal
  • lt: less than
  • lte: less than or equal
  • in: in
  • nin: not in
value: String

The value to compare against the attribute key; supports string, number, or boolean types.

allowed_domains?: List[String]

Allowed domains for the search. If not provided, all domains are allowed. Subdomains of the provided domains are allowed as well.

Example: ["pubmed.ncbi.nlm.nih.gov"]

max_num_results?: Int64

The maximum number of results to return. This number should be between 1 and 50 inclusive.

ranking_options?: Attributes

Ranking options for search.

ranker?: String

The ranker to use for the file search.

score_threshold?: Float64

The score threshold for the file search, a number between 0 and 1. Numbers closer to 1 will attempt to return only the most relevant results, but may return fewer results.

display_height?: Int64

The height of the computer display.

display_width?: Int64

The width of the computer display.

environment?: String

The type of computer environment to control.

search_context_size?: String

High level guidance for the amount of context window space to use for the search. One of low, medium, or high. medium is the default.

user_location?: Attributes

The approximate location of the user.

city?: String

Free text input for the city of the user, e.g. San Francisco.

country?: String

The two-letter ISO country code of the user, e.g. US.

region?: String

Free text input for the region of the user, e.g. California.

timezone?: String

The IANA timezone of the user, e.g. America/Los_Angeles.

type?: String

The type of location approximation. Always approximate.

server_label?: String

A label for this MCP server, used to identify it in tool calls.

allowed_tools?: List[String]

List of allowed tool names or a filter object.

authorization?: String

An OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.

connector_id?: String

Identifier for service connectors, like those available in ChatGPT. One of server_url or connector_id must be provided. Learn more about service connectors here.

Currently supported connector_id values are:

  • Dropbox: connector_dropbox
  • Gmail: connector_gmail
  • Google Calendar: connector_googlecalendar
  • Google Drive: connector_googledrive
  • Microsoft Teams: connector_microsoftteams
  • Outlook Calendar: connector_outlookcalendar
  • Outlook Email: connector_outlookemail
  • SharePoint: connector_sharepoint
headers?: Map[String]

Optional HTTP headers to send to the MCP server. Use for authentication or other purposes.

require_approval?: Attributes

Specify which of the MCP server’s tools require approval.

always?: Attributes

A filter object to specify which tools are allowed.

read_only?: Bool

Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.

tool_names?: List[String]

List of allowed tool names.

never?: Attributes

A filter object to specify which tools are allowed.

read_only?: Bool

Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.

tool_names?: List[String]

List of allowed tool names.

server_description?: String

Optional description of the MCP server, used to provide more context.

server_url?: String

The URL for the MCP server. One of server_url or connector_id must be provided.

container?: String

The code interpreter container. Can be a container ID or an object that specifies uploaded file IDs to make available to your code, along with an optional memory_limit setting.

action?: String

Whether to generate a new image or edit an existing image. Default: auto.

background?: String

Background type for the generated image. One of transparent, opaque, or auto. Default: auto.

input_fidelity?: String

Control how much effort the model will exert to match the style and features, especially facial features, of input images. This parameter is only supported for gpt-image-1 and gpt-image-1.5 and later models, unsupported for gpt-image-1-mini. Supports high and low. Defaults to low.

input_image_mask?: Attributes

Optional mask for inpainting. Contains image_url (string, optional) and file_id (string, optional).

file_id?: String

File ID for the mask image.

image_url?: String

Base64-encoded mask image.

model?: String

The image generation model to use. Default: gpt-image-1.

moderation?: String

Moderation level for the generated image. Default: auto.

output_compression?: Int64

Compression level for the output image. Default: 100.

output_format?: String

The output format of the generated image. One of png, webp, or jpeg. Default: png.

partial_images?: Int64

Number of partial images to generate in streaming mode, from 0 (default value) to 3.

quality?: String

The quality of the generated image. One of low, medium, high, or auto. Default: auto.

size?: String

The size of the generated image. One of 1024x1024, 1024x1536, 1536x1024, or auto. Default: auto.

format?: Attributes

The input format for the custom tool. Default is unconstrained text.

type?: String

Unconstrained text format. Always text.

definition?: String

The grammar definition.

syntax?: String

The syntax of the grammar definition. One of lark or regex.

tools?: List[Attributes]

The function/custom tools available inside this namespace.

name: String
type?: String
defer_loading?: Bool

Whether this function should be deferred and discovered via tool search.

description?: String
parameters?: JSON
strict?: Bool
format?: Attributes

The input format for the custom tool. Default is unconstrained text.

type?: String

Unconstrained text format. Always text.

definition?: String

The grammar definition.

syntax?: String

The syntax of the grammar definition. One of lark or regex.

execution?: String

Whether tool search is executed by the server or by the client.

search_content_types?: List[String]
top_p?: Float64

An alternative to temperature for nucleus sampling; 1.0 includes all tokens.

text?: Attributes

Configuration options for a text response from the model. Can be plain text or structured JSON data. Learn more:

format?: Attributes

An object specifying the format that the model must output.

Configuring { "type": "json_schema" } enables Structured Outputs, which ensures the model will match your supplied JSON schema. Learn more in the Structured Outputs guide.

The default format is { "type": "text" } with no additional options.

Not recommended for gpt-4o and newer models:

Setting to { "type": "json_object" } enables the older JSON mode, which ensures the message the model generates is valid JSON. Using json_schema is preferred for models that support it.

type: String

The type of response format being defined. Always text.

name?: String

The name of the response format. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64.

schema?: Map[JSON]

The schema for the response format, described as a JSON Schema object. Learn how to build JSON schemas here.

description?: String

A description of what the response format is for, used by the model to determine how to respond in the format.

strict?: Bool

Whether to enable strict schema adherence when generating the output. If set to true, the model will always follow the exact schema defined in the schema field. Only a subset of JSON Schema is supported when strict is true. To learn more, read the Structured Outputs guide.

optional Expand Collapse
name?: String

The name of the run.

metadata?: Map[String]

Set of 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.

Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.

computed Expand Collapse
id: String

Unique identifier for the evaluation run.

created_at: Int64

Unix timestamp (in seconds) when the evaluation run was created.

model: String

The model that is evaluated, if applicable.

object: String

The type of the object. Always “eval.run”.

report_url: String

The URL to the rendered evaluation run report on the UI dashboard.

status: String

The status of the evaluation run.

error: Attributes

An object representing an error response from the Eval API.

code: String

The error code.

message: String

The error message.

per_model_usage: List[Attributes]

Usage statistics for each model during the evaluation run.

cached_tokens: Int64

The number of tokens retrieved from cache.

completion_tokens: Int64

The number of completion tokens generated.

invocation_count: Int64

The number of invocations.

model_name: String

The name of the model.

prompt_tokens: Int64

The number of prompt tokens used.

total_tokens: Int64

The total number of tokens used.

per_testing_criteria_results: List[Attributes]

Results per testing criteria applied during the evaluation run.

failed: Int64

Number of tests failed for this criteria.

passed: Int64

Number of tests passed for this criteria.

testing_criteria: String

A description of the testing criteria.

result_counts: Attributes

Counters summarizing the outcomes of the evaluation run.

errored: Int64

Number of output items that resulted in an error.

failed: Int64

Number of output items that failed to pass the evaluation.

passed: Int64

Number of output items that passed the evaluation.

total: Int64

Total number of executed output items.

data openai_eval_run

required Expand Collapse
eval_id: String
optional Expand Collapse
run_id?: String
find_one_by?: Attributes
order?: String

Sort order for runs by timestamp. Use asc for ascending order or desc for descending order. Defaults to asc.

status?: String

Filter runs by status. One of queued | in_progress | failed | completed | canceled.

computed Expand Collapse
id: String
created_at: Int64

Unix timestamp (in seconds) when the evaluation run was created.

model: String

The model that is evaluated, if applicable.

name: String

The name of the evaluation run.

object: String

The type of the object. Always “eval.run”.

report_url: String

The URL to the rendered evaluation run report on the UI dashboard.

status: String

The status of the evaluation run.

metadata: Map[String]

Set of 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.

Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.

data_source: Attributes

Information about the run’s data source.

source: Attributes

Determines what populates the item namespace in the data source.

content: List[Attributes]

The content of the jsonl file.

item: Map[JSON]
sample: Map[JSON]
type: String

The type of jsonl source. Always file_content.

id: String

The identifier of the file.

created_after: Int64

An optional Unix timestamp to filter items created after this time.

created_before: Int64

An optional Unix timestamp to filter items created before this time.

limit: Int64

An optional maximum number of items to return.

metadata: Map[String]

Set of 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.

Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.

model: String

An optional model to filter by (e.g., ‘gpt-4o’).

reasoning_effort: String

Constrains effort on reasoning for reasoning models. Currently supported values are none, minimal, low, medium, high, and xhigh. Reducing reasoning effort can result in faster responses and fewer tokens used on reasoning in a response.

  • gpt-5.1 defaults to none, which does not perform reasoning. The supported reasoning values for gpt-5.1 are none, low, medium, and high. Tool calls are supported for all reasoning values in gpt-5.1.
  • All models before gpt-5.1 default to medium reasoning effort, and do not support none.
  • The gpt-5-pro model defaults to (and only supports) high reasoning effort.
  • xhigh is supported for all models after gpt-5.1-codex-max.
temperature: Float64

Sampling temperature. This is a query parameter used to select responses.

tools: List[String]

List of tool names. This is a query parameter used to select responses.

top_p: Float64

Nucleus sampling parameter. This is a query parameter used to select responses.

users: List[String]

List of user identifiers. This is a query parameter used to select responses.

type: String

The type of data source. Always jsonl.

input_messages: Attributes

Used when sampling from a model. Dictates the structure of the messages passed into the model. Can either be a reference to a prebuilt trajectory (ie, item.input_trajectory), or a template with variable references to the item namespace.

template: List[Attributes]

A list of chat messages forming the prompt or context. May include variable references to the item namespace, ie {{item.name}}.

content: String

Text, image, or audio input to the model, used to generate a response. Can also contain previous assistant responses.

role: String

The role of the message input. One of user, assistant, system, or developer.

phase: String

Labels an assistant message as intermediate commentary (commentary) or the final answer (final_answer). For models like gpt-5.3-codex and beyond, when sending follow-up requests, preserve and resend phase on all assistant messages — dropping it can degrade performance. Not used for user messages.

type: String

The type of the message input. Always message.

type: String

The type of input messages. Always template.

item_reference: String

A reference to a variable in the item namespace. Ie, “item.input_trajectory”

model: String

The name of the model to use for generating completions (e.g. “o3-mini”).

sampling_params: Attributes
max_completion_tokens: Int64

The maximum number of tokens in the generated output.

reasoning_effort: String

Constrains effort on reasoning for reasoning models. Currently supported values are none, minimal, low, medium, high, and xhigh. Reducing reasoning effort can result in faster responses and fewer tokens used on reasoning in a response.

  • gpt-5.1 defaults to none, which does not perform reasoning. The supported reasoning values for gpt-5.1 are none, low, medium, and high. Tool calls are supported for all reasoning values in gpt-5.1.
  • All models before gpt-5.1 default to medium reasoning effort, and do not support none.
  • The gpt-5-pro model defaults to (and only supports) high reasoning effort.
  • xhigh is supported for all models after gpt-5.1-codex-max.
response_format: Attributes

An object specifying the format that the model must output.

Setting to { "type": "json_schema", "json_schema": {...} } enables Structured Outputs which ensures the model will match your supplied JSON schema. Learn more in the Structured Outputs guide.

Setting to { "type": "json_object" } enables the older JSON mode, which ensures the message the model generates is valid JSON. Using json_schema is preferred for models that support it.

type: String

The type of response format being defined. Always text.

json_schema: Attributes

Structured Outputs configuration options, including a JSON Schema.

name: String

The name of the response format. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64.

description: String

A description of what the response format is for, used by the model to determine how to respond in the format.

schema: Map[JSON]

The schema for the response format, described as a JSON Schema object. Learn how to build JSON schemas here.

strict: Bool

Whether to enable strict schema adherence when generating the output. If set to true, the model will always follow the exact schema defined in the schema field. Only a subset of JSON Schema is supported when strict is true. To learn more, read the Structured Outputs guide.

seed: Int64

A seed value to initialize the randomness, during sampling.

temperature: Float64

A higher temperature increases randomness in the outputs.

tools: List[Attributes]

A list of tools the model may call. Currently, only functions are supported as a tool. Use this to provide a list of functions the model may generate JSON inputs for. A max of 128 functions are supported.

function: Attributes
name: String

The name of the function to be called. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64.

description: String

A description of what the function does, used by the model to choose when and how to call the function.

parameters: Map[JSON]

The parameters the functions accepts, described as a JSON Schema object. See the guide for examples, and the JSON Schema reference for documentation about the format.

Omitting parameters defines a function with an empty parameter list.

strict: Bool

Whether to enable strict schema adherence when generating the function call. If set to true, the model will follow the exact schema defined in the parameters field. Only a subset of JSON Schema is supported when strict is true. Learn more about Structured Outputs in the function calling guide.

type: String

The type of the tool. Currently, only function is supported.

name: String

The name of the function to call.

parameters: Map[JSON]

A JSON schema object describing the parameters of the function.

strict: Bool

Whether to enforce strict parameter validation. Default true.

defer_loading: Bool

Whether this function is deferred and loaded via tool search.

description: String

A description of the function. Used by the model to determine whether or not to call the function.

vector_store_ids: List[String]

The IDs of the vector stores to search.

filters: Attributes

A filter to apply.

key: String

The key to compare against the value.

type: String

Specifies the comparison operator: eq, ne, gt, gte, lt, lte, in, nin.

  • eq: equals
  • ne: not equal
  • gt: greater than
  • gte: greater than or equal
  • lt: less than
  • lte: less than or equal
  • in: in
  • nin: not in
value: String

The value to compare against the attribute key; supports string, number, or boolean types.

filters: List[Attributes]

Array of filters to combine. Items can be ComparisonFilter or CompoundFilter.

key: String

The key to compare against the value.

type: String

Specifies the comparison operator: eq, ne, gt, gte, lt, lte, in, nin.

  • eq: equals
  • ne: not equal
  • gt: greater than
  • gte: greater than or equal
  • lt: less than
  • lte: less than or equal
  • in: in
  • nin: not in
value: String

The value to compare against the attribute key; supports string, number, or boolean types.

allowed_domains: List[String]

Allowed domains for the search. If not provided, all domains are allowed. Subdomains of the provided domains are allowed as well.

Example: ["pubmed.ncbi.nlm.nih.gov"]

max_num_results: Int64

The maximum number of results to return. This number should be between 1 and 50 inclusive.

ranking_options: Attributes

Ranking options for search.

ranker: String

The ranker to use for the file search.

score_threshold: Float64

The score threshold for the file search, a number between 0 and 1. Numbers closer to 1 will attempt to return only the most relevant results, but may return fewer results.

display_height: Int64

The height of the computer display.

display_width: Int64

The width of the computer display.

environment: String

The type of computer environment to control.

search_context_size: String

High level guidance for the amount of context window space to use for the search. One of low, medium, or high. medium is the default.

user_location: Attributes

The approximate location of the user.

city: String

Free text input for the city of the user, e.g. San Francisco.

country: String

The two-letter ISO country code of the user, e.g. US.

region: String

Free text input for the region of the user, e.g. California.

timezone: String

The IANA timezone of the user, e.g. America/Los_Angeles.

type: String

The type of location approximation. Always approximate.

server_label: String

A label for this MCP server, used to identify it in tool calls.

allowed_tools: List[String]

List of allowed tool names or a filter object.

authorization: String

An OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.

connector_id: String

Identifier for service connectors, like those available in ChatGPT. One of server_url or connector_id must be provided. Learn more about service connectors here.

Currently supported connector_id values are:

  • Dropbox: connector_dropbox
  • Gmail: connector_gmail
  • Google Calendar: connector_googlecalendar
  • Google Drive: connector_googledrive
  • Microsoft Teams: connector_microsoftteams
  • Outlook Calendar: connector_outlookcalendar
  • Outlook Email: connector_outlookemail
  • SharePoint: connector_sharepoint
headers: Map[String]

Optional HTTP headers to send to the MCP server. Use for authentication or other purposes.

require_approval: Attributes

Specify which of the MCP server’s tools require approval.

always: Attributes

A filter object to specify which tools are allowed.

read_only: Bool

Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.

tool_names: List[String]

List of allowed tool names.

never: Attributes

A filter object to specify which tools are allowed.

read_only: Bool

Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.

tool_names: List[String]

List of allowed tool names.

server_description: String

Optional description of the MCP server, used to provide more context.

server_url: String

The URL for the MCP server. One of server_url or connector_id must be provided.

container: String

The code interpreter container. Can be a container ID or an object that specifies uploaded file IDs to make available to your code, along with an optional memory_limit setting.

action: String

Whether to generate a new image or edit an existing image. Default: auto.

background: String

Background type for the generated image. One of transparent, opaque, or auto. Default: auto.

input_fidelity: String

Control how much effort the model will exert to match the style and features, especially facial features, of input images. This parameter is only supported for gpt-image-1 and gpt-image-1.5 and later models, unsupported for gpt-image-1-mini. Supports high and low. Defaults to low.

input_image_mask: Attributes

Optional mask for inpainting. Contains image_url (string, optional) and file_id (string, optional).

file_id: String

File ID for the mask image.

image_url: String

Base64-encoded mask image.

model: String

The image generation model to use. Default: gpt-image-1.

moderation: String

Moderation level for the generated image. Default: auto.

output_compression: Int64

Compression level for the output image. Default: 100.

output_format: String

The output format of the generated image. One of png, webp, or jpeg. Default: png.

partial_images: Int64

Number of partial images to generate in streaming mode, from 0 (default value) to 3.

quality: String

The quality of the generated image. One of low, medium, high, or auto. Default: auto.

size: String

The size of the generated image. One of 1024x1024, 1024x1536, 1536x1024, or auto. Default: auto.

format: Attributes

The input format for the custom tool. Default is unconstrained text.

type: String

Unconstrained text format. Always text.

definition: String

The grammar definition.

syntax: String

The syntax of the grammar definition. One of lark or regex.

tools: List[Attributes]

The function/custom tools available inside this namespace.

name: String
type: String
defer_loading: Bool

Whether this function should be deferred and discovered via tool search.

description: String
parameters: JSON
strict: Bool
format: Attributes

The input format for the custom tool. Default is unconstrained text.

type: String

Unconstrained text format. Always text.

definition: String

The grammar definition.

syntax: String

The syntax of the grammar definition. One of lark or regex.

execution: String

Whether tool search is executed by the server or by the client.

search_content_types: List[String]
top_p: Float64

An alternative to temperature for nucleus sampling; 1.0 includes all tokens.

text: Attributes

Configuration options for a text response from the model. Can be plain text or structured JSON data. Learn more:

format: Attributes

An object specifying the format that the model must output.

Configuring { "type": "json_schema" } enables Structured Outputs, which ensures the model will match your supplied JSON schema. Learn more in the Structured Outputs guide.

The default format is { "type": "text" } with no additional options.

Not recommended for gpt-4o and newer models:

Setting to { "type": "json_object" } enables the older JSON mode, which ensures the message the model generates is valid JSON. Using json_schema is preferred for models that support it.

type: String

The type of response format being defined. Always text.

name: String

The name of the response format. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64.

schema: Map[JSON]

The schema for the response format, described as a JSON Schema object. Learn how to build JSON schemas here.

description: String

A description of what the response format is for, used by the model to determine how to respond in the format.

strict: Bool

Whether to enable strict schema adherence when generating the output. If set to true, the model will always follow the exact schema defined in the schema field. Only a subset of JSON Schema is supported when strict is true. To learn more, read the Structured Outputs guide.

error: Attributes

An object representing an error response from the Eval API.

code: String

The error code.

message: String

The error message.

per_model_usage: List[Attributes]

Usage statistics for each model during the evaluation run.

cached_tokens: Int64

The number of tokens retrieved from cache.

completion_tokens: Int64

The number of completion tokens generated.

invocation_count: Int64

The number of invocations.

model_name: String

The name of the model.

prompt_tokens: Int64

The number of prompt tokens used.

total_tokens: Int64

The total number of tokens used.

per_testing_criteria_results: List[Attributes]

Results per testing criteria applied during the evaluation run.

failed: Int64

Number of tests failed for this criteria.

passed: Int64

Number of tests passed for this criteria.

testing_criteria: String

A description of the testing criteria.

result_counts: Attributes

Counters summarizing the outcomes of the evaluation run.

errored: Int64

Number of output items that resulted in an error.

failed: Int64

Number of output items that failed to pass the evaluation.

passed: Int64

Number of output items that passed the evaluation.

total: Int64

Total number of executed output items.

data openai_eval_runs

required Expand Collapse
eval_id: String
optional Expand Collapse
status?: String

Filter runs by status. One of queued | in_progress | failed | completed | canceled.

order?: String

Sort order for runs by timestamp. Use asc for ascending order or desc for descending order. Defaults to asc.

max_items?: Int64

Max items to fetch, default: 1000

computed Expand Collapse
items: List[Attributes]

The items returned by the data source

id: String

Unique identifier for the evaluation run.

created_at: Int64

Unix timestamp (in seconds) when the evaluation run was created.

data_source: Attributes

Information about the run’s data source.

source: Attributes

Determines what populates the item namespace in the data source.

content: List[Attributes]

The content of the jsonl file.

item: Map[JSON]
sample: Map[JSON]
type: String

The type of jsonl source. Always file_content.

id: String

The identifier of the file.

created_after: Int64

An optional Unix timestamp to filter items created after this time.

created_before: Int64

An optional Unix timestamp to filter items created before this time.

limit: Int64

An optional maximum number of items to return.

metadata: Map[String]

Set of 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.

Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.

model: String

An optional model to filter by (e.g., ‘gpt-4o’).

reasoning_effort: String

Constrains effort on reasoning for reasoning models. Currently supported values are none, minimal, low, medium, high, and xhigh. Reducing reasoning effort can result in faster responses and fewer tokens used on reasoning in a response.

  • gpt-5.1 defaults to none, which does not perform reasoning. The supported reasoning values for gpt-5.1 are none, low, medium, and high. Tool calls are supported for all reasoning values in gpt-5.1.
  • All models before gpt-5.1 default to medium reasoning effort, and do not support none.
  • The gpt-5-pro model defaults to (and only supports) high reasoning effort.
  • xhigh is supported for all models after gpt-5.1-codex-max.
temperature: Float64

Sampling temperature. This is a query parameter used to select responses.

tools: List[String]

List of tool names. This is a query parameter used to select responses.

top_p: Float64

Nucleus sampling parameter. This is a query parameter used to select responses.

users: List[String]

List of user identifiers. This is a query parameter used to select responses.

type: String

The type of data source. Always jsonl.

input_messages: Attributes

Used when sampling from a model. Dictates the structure of the messages passed into the model. Can either be a reference to a prebuilt trajectory (ie, item.input_trajectory), or a template with variable references to the item namespace.

template: List[Attributes]

A list of chat messages forming the prompt or context. May include variable references to the item namespace, ie {{item.name}}.

content: String

Text, image, or audio input to the model, used to generate a response. Can also contain previous assistant responses.

role: String

The role of the message input. One of user, assistant, system, or developer.

phase: String

Labels an assistant message as intermediate commentary (commentary) or the final answer (final_answer). For models like gpt-5.3-codex and beyond, when sending follow-up requests, preserve and resend phase on all assistant messages — dropping it can degrade performance. Not used for user messages.

type: String

The type of the message input. Always message.

type: String

The type of input messages. Always template.

item_reference: String

A reference to a variable in the item namespace. Ie, “item.input_trajectory”

model: String

The name of the model to use for generating completions (e.g. “o3-mini”).

sampling_params: Attributes
max_completion_tokens: Int64

The maximum number of tokens in the generated output.

reasoning_effort: String

Constrains effort on reasoning for reasoning models. Currently supported values are none, minimal, low, medium, high, and xhigh. Reducing reasoning effort can result in faster responses and fewer tokens used on reasoning in a response.

  • gpt-5.1 defaults to none, which does not perform reasoning. The supported reasoning values for gpt-5.1 are none, low, medium, and high. Tool calls are supported for all reasoning values in gpt-5.1.
  • All models before gpt-5.1 default to medium reasoning effort, and do not support none.
  • The gpt-5-pro model defaults to (and only supports) high reasoning effort.
  • xhigh is supported for all models after gpt-5.1-codex-max.
response_format: Attributes

An object specifying the format that the model must output.

Setting to { "type": "json_schema", "json_schema": {...} } enables Structured Outputs which ensures the model will match your supplied JSON schema. Learn more in the Structured Outputs guide.

Setting to { "type": "json_object" } enables the older JSON mode, which ensures the message the model generates is valid JSON. Using json_schema is preferred for models that support it.

type: String

The type of response format being defined. Always text.

json_schema: Attributes

Structured Outputs configuration options, including a JSON Schema.

name: String

The name of the response format. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64.

description: String

A description of what the response format is for, used by the model to determine how to respond in the format.

schema: Map[JSON]

The schema for the response format, described as a JSON Schema object. Learn how to build JSON schemas here.

strict: Bool

Whether to enable strict schema adherence when generating the output. If set to true, the model will always follow the exact schema defined in the schema field. Only a subset of JSON Schema is supported when strict is true. To learn more, read the Structured Outputs guide.

seed: Int64

A seed value to initialize the randomness, during sampling.

temperature: Float64

A higher temperature increases randomness in the outputs.

tools: List[Attributes]

A list of tools the model may call. Currently, only functions are supported as a tool. Use this to provide a list of functions the model may generate JSON inputs for. A max of 128 functions are supported.

function: Attributes
name: String

The name of the function to be called. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64.

description: String

A description of what the function does, used by the model to choose when and how to call the function.

parameters: Map[JSON]

The parameters the functions accepts, described as a JSON Schema object. See the guide for examples, and the JSON Schema reference for documentation about the format.

Omitting parameters defines a function with an empty parameter list.

strict: Bool

Whether to enable strict schema adherence when generating the function call. If set to true, the model will follow the exact schema defined in the parameters field. Only a subset of JSON Schema is supported when strict is true. Learn more about Structured Outputs in the function calling guide.

type: String

The type of the tool. Currently, only function is supported.

name: String

The name of the function to call.

parameters: Map[JSON]

A JSON schema object describing the parameters of the function.

strict: Bool

Whether to enforce strict parameter validation. Default true.

defer_loading: Bool

Whether this function is deferred and loaded via tool search.

description: String

A description of the function. Used by the model to determine whether or not to call the function.

vector_store_ids: List[String]

The IDs of the vector stores to search.

filters: Attributes

A filter to apply.

key: String

The key to compare against the value.

type: String

Specifies the comparison operator: eq, ne, gt, gte, lt, lte, in, nin.

  • eq: equals
  • ne: not equal
  • gt: greater than
  • gte: greater than or equal
  • lt: less than
  • lte: less than or equal
  • in: in
  • nin: not in
value: String

The value to compare against the attribute key; supports string, number, or boolean types.

filters: List[Attributes]

Array of filters to combine. Items can be ComparisonFilter or CompoundFilter.

key: String

The key to compare against the value.

type: String

Specifies the comparison operator: eq, ne, gt, gte, lt, lte, in, nin.

  • eq: equals
  • ne: not equal
  • gt: greater than
  • gte: greater than or equal
  • lt: less than
  • lte: less than or equal
  • in: in
  • nin: not in
value: String

The value to compare against the attribute key; supports string, number, or boolean types.

allowed_domains: List[String]

Allowed domains for the search. If not provided, all domains are allowed. Subdomains of the provided domains are allowed as well.

Example: ["pubmed.ncbi.nlm.nih.gov"]

max_num_results: Int64

The maximum number of results to return. This number should be between 1 and 50 inclusive.

ranking_options: Attributes

Ranking options for search.

ranker: String

The ranker to use for the file search.

score_threshold: Float64

The score threshold for the file search, a number between 0 and 1. Numbers closer to 1 will attempt to return only the most relevant results, but may return fewer results.

display_height: Int64

The height of the computer display.

display_width: Int64

The width of the computer display.

environment: String

The type of computer environment to control.

search_context_size: String

High level guidance for the amount of context window space to use for the search. One of low, medium, or high. medium is the default.

user_location: Attributes

The approximate location of the user.

city: String

Free text input for the city of the user, e.g. San Francisco.

country: String

The two-letter ISO country code of the user, e.g. US.

region: String

Free text input for the region of the user, e.g. California.

timezone: String

The IANA timezone of the user, e.g. America/Los_Angeles.

type: String

The type of location approximation. Always approximate.

server_label: String

A label for this MCP server, used to identify it in tool calls.

allowed_tools: List[String]

List of allowed tool names or a filter object.

authorization: String

An OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.

connector_id: String

Identifier for service connectors, like those available in ChatGPT. One of server_url or connector_id must be provided. Learn more about service connectors here.

Currently supported connector_id values are:

  • Dropbox: connector_dropbox
  • Gmail: connector_gmail
  • Google Calendar: connector_googlecalendar
  • Google Drive: connector_googledrive
  • Microsoft Teams: connector_microsoftteams
  • Outlook Calendar: connector_outlookcalendar
  • Outlook Email: connector_outlookemail
  • SharePoint: connector_sharepoint
headers: Map[String]

Optional HTTP headers to send to the MCP server. Use for authentication or other purposes.

require_approval: Attributes

Specify which of the MCP server’s tools require approval.

always: Attributes

A filter object to specify which tools are allowed.

read_only: Bool

Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.

tool_names: List[String]

List of allowed tool names.

never: Attributes

A filter object to specify which tools are allowed.

read_only: Bool

Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.

tool_names: List[String]

List of allowed tool names.

server_description: String

Optional description of the MCP server, used to provide more context.

server_url: String

The URL for the MCP server. One of server_url or connector_id must be provided.

container: String

The code interpreter container. Can be a container ID or an object that specifies uploaded file IDs to make available to your code, along with an optional memory_limit setting.

action: String

Whether to generate a new image or edit an existing image. Default: auto.

background: String

Background type for the generated image. One of transparent, opaque, or auto. Default: auto.

input_fidelity: String

Control how much effort the model will exert to match the style and features, especially facial features, of input images. This parameter is only supported for gpt-image-1 and gpt-image-1.5 and later models, unsupported for gpt-image-1-mini. Supports high and low. Defaults to low.

input_image_mask: Attributes

Optional mask for inpainting. Contains image_url (string, optional) and file_id (string, optional).

file_id: String

File ID for the mask image.

image_url: String

Base64-encoded mask image.

model: String

The image generation model to use. Default: gpt-image-1.

moderation: String

Moderation level for the generated image. Default: auto.

output_compression: Int64

Compression level for the output image. Default: 100.

output_format: String

The output format of the generated image. One of png, webp, or jpeg. Default: png.

partial_images: Int64

Number of partial images to generate in streaming mode, from 0 (default value) to 3.

quality: String

The quality of the generated image. One of low, medium, high, or auto. Default: auto.

size: String

The size of the generated image. One of 1024x1024, 1024x1536, 1536x1024, or auto. Default: auto.

format: Attributes

The input format for the custom tool. Default is unconstrained text.

type: String

Unconstrained text format. Always text.

definition: String

The grammar definition.

syntax: String

The syntax of the grammar definition. One of lark or regex.

tools: List[Attributes]

The function/custom tools available inside this namespace.

name: String
type: String
defer_loading: Bool

Whether this function should be deferred and discovered via tool search.

description: String
parameters: JSON
strict: Bool
format: Attributes

The input format for the custom tool. Default is unconstrained text.

type: String

Unconstrained text format. Always text.

definition: String

The grammar definition.

syntax: String

The syntax of the grammar definition. One of lark or regex.

execution: String

Whether tool search is executed by the server or by the client.

search_content_types: List[String]
top_p: Float64

An alternative to temperature for nucleus sampling; 1.0 includes all tokens.

text: Attributes

Configuration options for a text response from the model. Can be plain text or structured JSON data. Learn more:

format: Attributes

An object specifying the format that the model must output.

Configuring { "type": "json_schema" } enables Structured Outputs, which ensures the model will match your supplied JSON schema. Learn more in the Structured Outputs guide.

The default format is { "type": "text" } with no additional options.

Not recommended for gpt-4o and newer models:

Setting to { "type": "json_object" } enables the older JSON mode, which ensures the message the model generates is valid JSON. Using json_schema is preferred for models that support it.

type: String

The type of response format being defined. Always text.

name: String

The name of the response format. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64.

schema: Map[JSON]

The schema for the response format, described as a JSON Schema object. Learn how to build JSON schemas here.

description: String

A description of what the response format is for, used by the model to determine how to respond in the format.

strict: Bool

Whether to enable strict schema adherence when generating the output. If set to true, the model will always follow the exact schema defined in the schema field. Only a subset of JSON Schema is supported when strict is true. To learn more, read the Structured Outputs guide.

error: Attributes

An object representing an error response from the Eval API.

code: String

The error code.

message: String

The error message.

eval_id: String

The identifier of the associated evaluation.

metadata: Map[String]

Set of 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.

Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.

model: String

The model that is evaluated, if applicable.

name: String

The name of the evaluation run.

object: String

The type of the object. Always “eval.run”.

per_model_usage: List[Attributes]

Usage statistics for each model during the evaluation run.

cached_tokens: Int64

The number of tokens retrieved from cache.

completion_tokens: Int64

The number of completion tokens generated.

invocation_count: Int64

The number of invocations.

model_name: String

The name of the model.

prompt_tokens: Int64

The number of prompt tokens used.

total_tokens: Int64

The total number of tokens used.

per_testing_criteria_results: List[Attributes]

Results per testing criteria applied during the evaluation run.

failed: Int64

Number of tests failed for this criteria.

passed: Int64

Number of tests passed for this criteria.

testing_criteria: String

A description of the testing criteria.

report_url: String

The URL to the rendered evaluation run report on the UI dashboard.

result_counts: Attributes

Counters summarizing the outcomes of the evaluation run.

errored: Int64

Number of output items that resulted in an error.

failed: Int64

Number of output items that failed to pass the evaluation.

passed: Int64

Number of output items that passed the evaluation.

total: Int64

Total number of executed output items.

status: String

The status of the evaluation run.

EvalsRunsOutput Items

Manage and run evals in the OpenAI platform.

data openai_eval_run_output_item

required Expand Collapse
eval_id: String
output_item_id: String
run_id: String
computed Expand Collapse
created_at: Int64

Unix timestamp (in seconds) when the evaluation run was created.

datasource_item_id: Int64

The identifier for the data source item.

id: String

Unique identifier for the evaluation run output item.

object: String

The type of the object. Always “eval.run.output_item”.

status: String

The status of the evaluation run.

datasource_item: Map[JSON]

Details of the input data source item.

results: List[Attributes]

A list of grader results for this output item.

name: String

The name of the grader.

passed: Bool

Whether the grader considered the output a pass.

score: Float64

The numeric score produced by the grader.

sample: Map[JSON]

Optional sample or intermediate data produced by the grader.

type: String

The grader type (for example, “string-check-grader”).

sample: Attributes

A sample containing the input and output of the evaluation run.

error: Attributes

An object representing an error response from the Eval API.

code: String

The error code.

message: String

The error message.

finish_reason: String

The reason why the sample generation was finished.

input: List[Attributes]

An array of input messages.

content: String

The content of the message.

role: String

The role of the message sender (e.g., system, user, developer).

max_completion_tokens: Int64

The maximum number of tokens allowed for completion.

model: String

The model used for generating the sample.

output: List[Attributes]

An array of output messages.

content: String

The content of the message.

role: String

The role of the message (e.g. “system”, “assistant”, “user”).

seed: Int64

The seed used for generating the sample.

temperature: Float64

The sampling temperature used.

top_p: Float64

The top_p value used for sampling.

usage: Attributes

Token usage details for the sample.

cached_tokens: Int64

The number of tokens retrieved from cache.

completion_tokens: Int64

The number of completion tokens generated.

prompt_tokens: Int64

The number of prompt tokens used.

total_tokens: Int64

The total number of tokens used.

data openai_eval_run_output_items

required Expand Collapse
eval_id: String
run_id: String
optional Expand Collapse
status?: String

Filter output items by status. Use failed to filter by failed output items or pass to filter by passed output items.

order?: String

Sort order for output items by timestamp. Use asc for ascending order or desc for descending order. Defaults to asc.

max_items?: Int64

Max items to fetch, default: 1000

computed Expand Collapse
items: List[Attributes]

The items returned by the data source

id: String

Unique identifier for the evaluation run output item.

created_at: Int64

Unix timestamp (in seconds) when the evaluation run was created.

datasource_item: Map[JSON]

Details of the input data source item.

datasource_item_id: Int64

The identifier for the data source item.

eval_id: String

The identifier of the evaluation group.

object: String

The type of the object. Always “eval.run.output_item”.

results: List[Attributes]

A list of grader results for this output item.

name: String

The name of the grader.

passed: Bool

Whether the grader considered the output a pass.

score: Float64

The numeric score produced by the grader.

sample: Map[JSON]

Optional sample or intermediate data produced by the grader.

type: String

The grader type (for example, “string-check-grader”).

run_id: String

The identifier of the evaluation run associated with this output item.

sample: Attributes

A sample containing the input and output of the evaluation run.

error: Attributes

An object representing an error response from the Eval API.

code: String

The error code.

message: String

The error message.

finish_reason: String

The reason why the sample generation was finished.

input: List[Attributes]

An array of input messages.

content: String

The content of the message.

role: String

The role of the message sender (e.g., system, user, developer).

max_completion_tokens: Int64

The maximum number of tokens allowed for completion.

model: String

The model used for generating the sample.

output: List[Attributes]

An array of output messages.

content: String

The content of the message.

role: String

The role of the message (e.g. “system”, “assistant”, “user”).

seed: Int64

The seed used for generating the sample.

temperature: Float64

The sampling temperature used.

top_p: Float64

The top_p value used for sampling.

usage: Attributes

Token usage details for the sample.

cached_tokens: Int64

The number of tokens retrieved from cache.

completion_tokens: Int64

The number of completion tokens generated.

prompt_tokens: Int64

The number of prompt tokens used.

total_tokens: Int64

The total number of tokens used.

status: String

The status of the evaluation run.