Agent evals

Measure agent quality with reproducible evaluations.

The OpenAI Platform offers a suite of evaluation tools to help you ensure your agents perform consistently and accurately.

To identify errors at the workflow level, we recommend our trace grading functionality.

For an easy way to build and iterate on your evals, we recommend exploring Datasets.

If you need advanced features, such as evaluating against external models, interacting with your eval runs via the API, or running evaluations at larger scale, consider using Evals instead.
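As a minimal sketch of the programmatic route, the snippet below defines an eval with the OpenAI Python SDK: a custom data source schema (each item pairs a question with a reference answer) and a simple string-check grader that passes when the model's output exactly matches the reference. The eval name, item fields, and grader are illustrative placeholders, not a prescribed setup; the API call itself only runs when an API key is present.

```python
import os

# Custom data source: each dataset item carries a "question" and a
# reference "answer"; include_sample_schema lets graders reference the
# model's sampled output via {{ sample.output_text }}.
data_source_config = {
    "type": "custom",
    "item_schema": {
        "type": "object",
        "properties": {
            "question": {"type": "string"},
            "answer": {"type": "string"},
        },
        "required": ["question", "answer"],
    },
    "include_sample_schema": True,
}

# A simple string-check grader: pass when the sampled output exactly
# matches the reference answer stored on the item.
testing_criteria = [
    {
        "type": "string_check",
        "name": "exact_match",  # placeholder grader name
        "input": "{{ sample.output_text }}",
        "operation": "eq",
        "reference": "{{ item.answer }}",
    }
]

# Only create the eval when credentials are available.
if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()
    ev = client.evals.create(
        name="agent-answer-accuracy",  # placeholder eval name
        data_source_config=data_source_config,
        testing_criteria=testing_criteria,
    )
    print(ev.id)
```

From here, runs against the eval are started separately (for example via the eval runs endpoint), so the same grading configuration can be reused across models and prompt versions.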

Next steps

For more inspiration, visit the OpenAI Cookbook, which contains example code and links to third-party resources, or learn more about our tools for evals:

Getting started with evals: Datasets

Operate a flywheel of continuous improvement using evaluations.

Working with evals

Evaluate against external models, interact with evals via API, and more.

Prompt optimizer

Use your dataset to automatically improve your prompts.

Cookbook: Building resilient prompts with evals
