The OpenAI Platform offers a suite of evaluation tools to help you ensure your agents perform consistently and accurately.
For identifying errors at the workflow level, we recommend our trace grading functionality.
For an easy way to build and iterate on your evals, we recommend exploring Datasets.
If you need advanced features such as evaluation against external models, want to interact with your eval runs via API, or want to run evaluations on a larger scale, consider using Evals instead.
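As a rough sketch of what API-driven evaluation can look like, the snippet below builds an eval definition in the shape the Evals API expects: a data source config describing the dataset items, plus a list of testing criteria (graders). The eval name, item schema fields (`input`, `expected`), and grader settings are all placeholders, not values from this guide.

```python
# Placeholder eval definition. Top-level fields follow the Evals API's
# create-eval shape; the item schema and grader values are assumptions
# chosen for illustration.
eval_config = {
    "name": "agent-response-quality",
    "data_source_config": {
        # "custom" lets you declare your own dataset item schema.
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {
                "input": {"type": "string"},     # prompt sent to the agent
                "expected": {"type": "string"},  # reference answer
            },
            "required": ["input"],
        },
    },
    "testing_criteria": [
        {
            # A string-check grader comparing the model's output text
            # against the reference answer from each dataset item.
            "type": "string_check",
            "name": "exact-match",
            "input": "{{ sample.output_text }}",
            "reference": "{{ item.expected }}",
            "operation": "eq",
        }
    ],
}

# With an API key configured, the same payload could be sent through
# the OpenAI Python SDK (not executed here):
# from openai import OpenAI
# client = OpenAI()
# ev = client.evals.create(**eval_config)
```

Keeping the definition as plain data like this makes it easy to version-control eval configurations alongside the agent code they test.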
Next steps
For more inspiration, visit the OpenAI Cookbook, which contains example code and links to third-party resources, or learn more about our tools for evals:
Operate a flywheel of continuous improvement using evaluations.
Evaluate against external models, interact with evals via API, and more.
Use your dataset to automatically improve your prompts.