Codex use case
Add evals to your AI application
Use Codex to turn expected behavior into a Promptfoo eval suite.
Ask Codex to inspect your AI application, identify the behavior you want to evaluate, and add a runnable Promptfoo eval suite.
Best for
- AI applications that already have prompts, model calls, tools, retrieval, agents, or product requirements but no repeatable eval suite.
- Teams preparing a model, prompt, retrieval, or agent change and wanting regression tests before the pull request merges.
- Quality reviews where repeated manual checks should become committed eval cases.
Skills & Plugins
| Skill | Why use it |
|---|---|
| Promptfoo | Plugin that includes `$promptfoo-evals` and `$promptfoo-provider-setup` for creating, connecting, running, and QAing eval suites. |
Introduction
When you are building an AI application, or making changes to an existing one, you want to make sure it behaves as expected. Evals are a way to systematically test a set of scenarios and catch regressions before they ship.
You can use Promptfoo to run evals on your AI application, and Codex to help you create and maintain the evals.
How to use
Use Codex with the Promptfoo plugin's `$promptfoo-evals` skill to turn one AI app behavior into a repeatable eval suite. When the app does not already have a working Promptfoo target, `$promptfoo-provider-setup` helps connect the suite to the application path you want to test.
Codex can inspect the app, propose high-signal cases, add the Promptfoo config and test data, run the suite locally, and give you a command to keep using.
This use case works best when the behavior is concrete: support answer quality, retrieval grounding, classifier labels, tool calls, JSON shape, business rules, or prompt and model migration confidence.
A strong first pass should be reviewable code and test data: a `promptfooconfig.yaml` (or equivalent config), a small `evals/` directory, test cases, any target adapter needed to call the app, and a local command such as `npm run evals`.
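As a sketch of what that config could look like (the description, prompt template, and file paths below are illustrative, not a drop-in file):

```yaml
# evals/promptfooconfig.yaml — illustrative sketch; adapt names to your app
description: Support answer quality
prompts:
  - "Answer the customer question: {{question}}"
providers:
  # Point at a custom adapter only if a built-in provider cannot call the app
  - file://providers/provider.js
tests: file://tests/cases.yaml
```

The key point is that the provider targets the application path users hit, not the raw model.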
Choose what to evaluate
Start with one user-visible promise. Avoid asking Codex to evaluate the entire AI system in one pass. A smaller suite is easier to trust, review, and keep running.
Good first targets include:
- Correctness: classification, extraction, summarization, routing, or transformation.
- Grounding: answers that should stay tied to retrieved documents or cited sources.
- Tool use: choosing the right tool, passing valid arguments, and handling tool errors.
- Format or business rules: JSON schemas, field names, business-rule limits, or UI-facing copy contracts.
- Prompt or model migration: making sure a new prompt, model, system message, or retrieval setting does not break important cases.
Start from product requirements, bug reports, support escalations, or sanitized examples your team is comfortable committing to the repo.
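Seed cases drawn from those sources might be sketched like this (the questions, expected values, and rubric below are made-up examples, not real product rules):

```yaml
# evals/tests/cases.yaml — illustrative seed cases
- vars:
    question: "What is your refund window?"
  assert:
    - type: contains
      value: "30 days"        # stand-in for a real business rule
- vars:
    question: "Do you ship internationally?"
  assert:
    - type: llm-rubric
      value: "Answers clearly and mentions any regional limits"
```

Deterministic assertions such as `contains` are cheap to run; rubric-based grading costs model calls, so reserve it for cases that need judgment.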
Ask for an eval plan
Codex should inspect before it edits. Ask for a plan first so you can catch a wrong target or weak test cases before any files are added.
Review the plan before implementation. It should name the app path or endpoint Promptfoo will call, the first seed cases, the assertions, the files Codex will create, the local command, and any required secrets or services. If the plan tests the raw model instead of the application path users actually hit, ask Codex whether that is intentional.
Implement, run, and iterate
Once the plan is correct, ask Codex to implement it. The first implementation should be boring: config, cases, fixtures, a target adapter if needed, a command, and proof that the command ran.
A small app-backed suite might look like this:
```
evals/
  promptfooconfig.yaml
  tests/
    cases.yaml
  providers/
    provider.js   # only if the built-in provider cannot call the app directly
```
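When a custom adapter is needed, Promptfoo's custom JavaScript provider convention is a class with an `id()` method and an async `callApi` that returns `{ output }`. The sketch below is an illustration under assumptions: `answerQuestion` is a hypothetical stand-in for your application's real entry point (retrieval, prompt assembly, and model call).

```javascript
// evals/providers/provider.js — hedged sketch of a target adapter.
// `answerQuestion` is a stand-in; a real adapter would import and call
// your application's own answer path here.
async function answerQuestion(question) {
  return `Stub answer for: ${question}`;
}

class AppProvider {
  id() {
    // Name this provider is referenced by in promptfooconfig.yaml
    return "app-provider";
  }

  async callApi(prompt, context) {
    const output = await answerQuestion(prompt);
    // Promptfoo expects an object with an `output` field
    return { output };
  }
}

module.exports = AppProvider;
```

Keeping the adapter thin, a single call into the app, makes eval failures point at the application rather than the harness.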
Run the suite before changing behavior. The baseline tells you whether the app already fails the cases, whether the assertions need tuning, or whether the target adapter is wrong. Tune assertions when they are too brittle or vague, but keep real product failures visible.
After the first run, use the suite to compare app changes before they ship. Add new cases whenever a bug, launch requirement, or product review shows behavior you want to keep stable. Once the local command is stable, ask Codex to add it to CI or your release checklist.
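As one hedged sketch of the CI step, a GitHub Actions workflow (the workflow name, paths, and secret name below are assumptions for illustration) could run the suite on every pull request:

```yaml
# .github/workflows/evals.yml — illustrative sketch
name: evals
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npx promptfoo@latest eval -c evals/promptfooconfig.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

A nonzero exit from the eval command fails the check, which is what turns the suite into a regression gate.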
Related use cases
Upgrade your API integration
Use Codex to update your existing OpenAI API integration to the latest recommended models...
Create a CLI Codex can use
Ask Codex to create a composable CLI it can run from any folder, combine with repo scripts...
Follow a goal
Use `/goal` when a task needs Codex to keep working across turns toward a verifiable...