Agent evals

Measure agent quality with reproducible evaluations.

The OpenAI Platform offers a suite of evaluation tools to help you ensure your agents perform consistently and accurately.

To identify errors at the workflow level, we recommend our trace grading functionality.

For an easy way to build and iterate on your evals, we recommend exploring Datasets.

If you need advanced features, such as evaluating against external models, interacting with your eval runs via the API, or running evaluations at larger scale, consider using Evals instead.
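As a minimal sketch of the programmatic route, the snippet below defines an eval with the OpenAI Python SDK: a custom data source schema (each item pairs a question with a reference answer) and a simple string-check grader that passes when the model's output exactly matches the reference. The eval name, item fields, and grader are illustrative placeholders, not a prescribed setup; the API call itself only runs when an API key is present.

```python
import os

# Custom data source: each dataset item carries a "question" and a
# reference "answer"; include_sample_schema lets graders reference the
# model's sampled output via {{ sample.output_text }}.
data_source_config = {
    "type": "custom",
    "item_schema": {
        "type": "object",
        "properties": {
            "question": {"type": "string"},
            "answer": {"type": "string"},
        },
        "required": ["question", "answer"],
    },
    "include_sample_schema": True,
}

# A simple string-check grader: pass when the sampled output exactly
# matches the reference answer stored on the item.
testing_criteria = [
    {
        "type": "string_check",
        "name": "exact_match",  # placeholder grader name
        "input": "{{ sample.output_text }}",
        "operation": "eq",
        "reference": "{{ item.answer }}",
    }
]

# Only create the eval when credentials are available.
if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()
    ev = client.evals.create(
        name="agent-answer-accuracy",  # placeholder eval name
        data_source_config=data_source_config,
        testing_criteria=testing_criteria,
    )
    print(ev.id)
```

From here, runs against the eval are started separately (for example via the eval runs endpoint), so the same grading configuration can be reused across models and prompt versions.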

Next steps

For more inspiration, visit the OpenAI Cookbook, which contains example code and links to third-party resources, or learn more about our tools for evals:

Getting started with evals: Datasets

Operate a flywheel of continuous improvement using evaluations.

Working with evals

Evaluate against external models, interact with evals via API, and more.

Prompt optimizer

Use your dataset to automatically improve your prompts.

Cookbook: Building resilient prompts with evals
