Reinforcement fine-tuning use cases

Learn use cases and best practices for reinforcement fine-tuning.

Reinforcement fine-tuning (RFT) provides a way to improve your model’s performance at specific tasks. The task must be clear and have verifiable answers.

When to use reinforcement fine-tuning

Agentic workflows must make decisions that are both correct and verifiable. RFT helps by letting you define explicit rubrics and use code-based or LLM-based graders to measure functional success, factual accuracy, or policy compliance.

Across early users, three clear use cases have emerged:

  1. Turn instructions into working code: Convert open-ended prompts into structured code, configs, or templates that must pass deterministic tests.
  2. Pull facts into a clean format: Extract verifiable facts and summaries from messy, unstructured text and return JSON-structured or other schema-based outputs.
  3. Apply complex rules correctly: Make fine-grained label or policy decisions when the information provided is nuanced, large in quantity, hierarchical, or high-stakes.

Ready to use reinforcement fine-tuning? Skip to the guide →

1. Turn instructions into working code

In this use case, models reason over hidden domain constraints to produce structured outputs like code, queries, or infrastructure templates. Outputs must satisfy multiple correctness conditions, and success is usually deterministically graded: the artifact either compiles, passes tests, or meets an explicit schema.
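
To make deterministic grading concrete, here is a minimal sketch of a code-based grader for a structured output. The schema and field names are invented for illustration and aren't drawn from any of the companies below; the point is simply that the artifact either satisfies every check or it doesn't.

```python
import json

# Illustrative deterministic grader: the artifact either satisfies the schema
# or it doesn't. The required fields below are invented for this sketch.
REQUIRED_FIELDS = {"resource", "region", "replicas"}

def grade_config(model_output: str) -> float:
    """Return 1.0 if the output is valid JSON with the required fields and a
    positive replica count, otherwise 0.0."""
    try:
        config = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0
    if not REQUIRED_FIELDS.issubset(config):
        return 0.0
    if not isinstance(config.get("replicas"), int) or config["replicas"] < 1:
        return 0.0
    return 1.0

print(grade_config('{"resource": "vm", "region": "us-east", "replicas": 3}'))  # 1.0
print(grade_config("not json at all"))                                         # 0.0
```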

Wiring verification IPs for semiconductor design

Company: ChipStack is building the next generation of AI-powered tools for chip design and verification, aimed at significantly reducing the time and cost of developing and validating complex semiconductor chips.

Problem to solve: One task that’s challenging and time-consuming for humans is binding design interfaces to verification IPs (pre-created verification components that, when properly applied, can significantly enhance quality and coverage of verification). There are many verification IPs, and each can contain dozens to hundreds of signals that may be mapped. Someone must understand this domain well in order to apply the verification IP correctly.

Objective: To train OpenAI reasoning models to do this instead, ChipStack prepared a dataset of fewer than 50 samples, then ran several RFT variations. For the final evaluation report, they ran the evaluation set three times against each model and variation—o1-mini base and fine-tuned, o3-mini base and fine-tuned—and averaged the results per sample, then overall.

Production-ready API snippets that compile and pass AST checks

Company: Runloop is a platform for deploying AI-powered coding agents into production, with public and custom benchmarking capabilities to refine their performance.

Problem to solve: Runloop wanted to improve model performance at using large, complex third-party APIs, such as the Stripe API, without a human in the loop. If they could train a model to use the Stripe API reliably, Runloop could turn economically impactful business cases into working code.

Objective: Their goal was to teach the model to master usage of the Stripe API, including writing complete code snippets for arbitrary user requests by adapting information from existing integration guides, merging information from multiple guides, or inferring information not explicitly stated in them. They used RFT with two primary rewards:

  1. Reward the model for outputting the answer in a Markdown format that aligns with expectations of how a “dynamic” integration guide should look.
  2. Reward the model for producing “correct” code snippets by validating the outputted code via AST Grep (see the sketch below). This lets them confirm the model is making the correct Stripe SDK calls, with the correct parameters, and in some cases even in the correct order.
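
Runloop's production graders aren't public, and they use AST Grep rather than Python tooling, but the sketch below shows the general idea with Python's built-in ast module: reward a snippet only if it actually makes the expected SDK call. The target call (stripe.PaymentIntent.create) is an assumed example.

```python
import ast

# Rough sketch of an AST-based correctness check; the dotted call name we look
# for is an assumed example, not Runloop's actual grading target.

def calls_function(code: str, dotted_name: str) -> bool:
    """Return True if `code` contains a call like stripe.PaymentIntent.create(...)."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        parts, target = [], node.func
        while isinstance(target, ast.Attribute):   # unwind a.b.c into its parts
            parts.append(target.attr)
            target = target.value
        if isinstance(target, ast.Name):
            parts.append(target.id)
        if ".".join(reversed(parts)) == dotted_name:
            return True
    return False

snippet = "import stripe\nintent = stripe.PaymentIntent.create(amount=2000, currency='usd')"
print(calls_function(snippet, "stripe.PaymentIntent.create"))  # True
```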

Correct handling of conflicts and dupes in a schedule manager

Company: Milo helps busy parents manage chaotic family schedules by converting messy inputs—like text convos with to-dos, school newsletter PDFs, weekly reminders, sports schedule emails—into reliable calendar and list actions.

Problem to solve: Prompting base GPT-4o and supervised fine-tuning (SFT) both fell short of the trust thresholds the product requires.

Objective: Milo used RFT to handle structured tasks like event vs. list classification, recurrence-rule generation, accurate updates and deletes, conflict detection, and strict output formatting. They defined a grader that checked whether generated item objects were complete, categorized correctly, and correctly flagged as duplicates or calendar conflicts.
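
As a rough illustration of what such a grader might look like, the sketch below scores a generated calendar item for completeness, correct classification, and correct duplicate/conflict flags. The field names and weights are assumptions, not Milo's actual schema, and the partial credit keeps the score smooth rather than pass/fail.

```python
import json

# Hypothetical rubric-style grader for a generated calendar item; the field
# names and weights are assumptions, not Milo's actual schema.
REQUIRED_KEYS = {"title", "type", "start_time"}

def grade_item(model_output: str, expected: dict) -> float:
    try:
        item = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0
    score = 0.0
    if REQUIRED_KEYS.issubset(item):                              # completeness
        score += 0.4
    if item.get("type") == expected["type"]:                      # event vs. list classification
        score += 0.3
    if item.get("is_duplicate") == expected["is_duplicate"]:      # duplicate detection
        score += 0.15
    if item.get("has_conflict") == expected["has_conflict"]:      # calendar conflict detection
        score += 0.15
    return score

output = '{"title": "Soccer practice", "type": "event", "start_time": "2024-05-01T16:00", "is_duplicate": false, "has_conflict": false}'
print(round(grade_item(output, {"type": "event", "is_duplicate": False, "has_conflict": False}), 2))  # 1.0
```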

2. Pull facts into a clean format

This use case involves pulling verifiable facts or entities from unstructured inputs into clearly defined schemas (e.g., JSON objects, condition codes, medical codes, legal citations, or financial metrics).

Successful extraction tasks typically benefit from precise, continuous grading methodologies—like span-level F1 scores, fuzzy text-matching metrics, or numeric accuracy checks—that evaluate how accurately the extracted information aligns with ground truth. Define explicit success criteria and detailed rubrics so the model can achieve reliable, repeatable improvements.
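
For instance, a span-level F1 grader compares the set of extracted entities against ground truth and rewards partial overlap. The sketch below uses exact (label, text) matching to keep the example short; real pipelines often add fuzzy text matching on top, and the sample entities are invented.

```python
# Minimal sketch of a span-level F1 grader: predicted and gold entities are
# compared as sets of (label, text) pairs.

def span_f1(predicted: set[tuple[str, str]], gold: set[tuple[str, str]]) -> float:
    if not predicted and not gold:
        return 1.0
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {("diagnosis", "type 2 diabetes"), ("medication", "metformin")}
pred = {("diagnosis", "type 2 diabetes"), ("medication", "insulin")}
print(span_f1(pred, gold))  # 0.5
```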

Assigning ICD-10 medical codes

Company: Ambience is an AI platform that eliminates administrative burden for clinicians and ensures accurate, compliant documentation across 100+ specialties, helping physicians focus on patient care while increasing documentation quality and reducing compliance risk for health systems.

Problem to solve: ICD-10 coding is one of the most intricate administrative tasks in medicine. After every patient encounter, clinicians must map each diagnosis to one of ~70,000 codes—navigating payor-specific rules on specificity, site-of-care, and mutually exclusive pairings. Errors can trigger audits and fines that stretch into nine figures.

Objective: Using reinforcement fine-tuning on OpenAI frontier models, Ambience wanted to train a reasoning system that listens to the visit audio, pulls in relevant EHR context, and recommends ICD-10 codes with accuracy exceeding expert clinicians.

Verbatim citations from long, dense legal documents

Company: Harvey is building AI that legal teams trust—and that trust hinges on retrieving precisely the right evidence from sprawling corpora of contracts, statutes, and case law. Legal professionals aren’t satisfied with models that merely generate plausible-sounding summaries or paraphrased answers. They demand verifiable citations—passages that can be traced directly back to source documents.

Problem to solve: Harvey’s clients use its models to triage litigation risk, construct legal arguments, and support due diligence for legal professionals—all tasks where a single missed or misquoted sentence can flip an outcome. Models must be able to parse long, dense legal documents and extract only the portions that matter. In practice, these inputs are often messy and inconsistent: some claims are vague, while others hinge on rare legal doctrines buried deep in boilerplate.

Objective: The task requires the model to interpret nuanced legal claims, navigate long-form documents, and select on-point support with verbatim excerpts.

3. Apply complex rules correctly

These tasks typically involve subtle distinctions that demand clear classification guidelines. Successful framing requires explicit and hierarchical labeling schemes defined through consensus by domain experts. Without consistent agreement, grading signals become noisy, weakening RFT's effectiveness.
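
One way to make such a scheme machine-gradable is to encode the hierarchy explicitly and give partial credit when the model lands in the right branch but the wrong leaf. The taxonomy below is invented for illustration, not taken from any of the companies in this section.

```python
# Illustrative hierarchical label scheme; the taxonomy is invented for this
# sketch. Partial credit for the correct parent category keeps the score
# smooth instead of pass/fail.
TAXONOMY = {
    "spam": ["phishing", "bulk_promotion"],
    "harassment": ["targeted_insult", "threat"],
}

def parent_of(label: str) -> str | None:
    for parent, children in TAXONOMY.items():
        if label == parent or label in children:
            return parent
    return None

def grade_label(predicted: str, gold: str) -> float:
    if predicted == gold:
        return 1.0
    if parent_of(gold) is not None and parent_of(predicted) == parent_of(gold):
        return 0.5  # right branch of the hierarchy, wrong leaf
    return 0.0

print(grade_label("phishing", "bulk_promotion"))  # 0.5
```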

Expert-level reasoning in tax analysis

Company: Accordance is building a platform for tax, audit, and CPA teams.

Problem to solve: Taxation is a highly complex domain, requiring deep reasoning across nuanced fact patterns and intricate regulations. It is also a field that is constantly changing.

Objective: Accordance wanted a high-trust system that could handle sophisticated tax scenarios while maintaining accuracy. Unlike traditional hardcoded software, their data extraction tool needs to adapt as the tax landscape evolves.

Enforcement of nuanced content moderation policies

Company: SafetyKit is a risk and compliance platform that helps organizations make decisions across complex content moderation workflows.

Problem to solve: Content moderation systems must handle large volumes of content and apply intricate policy logic that requires multistep reasoning. Because of the volume of data and the subtle distinctions in labeling, these tasks can be difficult for general-purpose models.

Objective: SafetyKit aimed to replace multiple nodes in their most complex workflows with a single reasoning agent using a reinforcement fine-tuned model. The goal is to reduce SafetyKit’s time-to-market for novel policy enforcements even in challenging, nuanced domains.

Legal AI skills for document review, comparison, and summarization

Company: Thomson Reuters is an AI and technology company empowering professionals with trusted content and workflow automation.

Problem to solve: Legal professionals must read through large amounts of content before making any decisions. Thomson Reuters’ CoCounsel product is designed to help these experts move faster by providing an AI assistant with content and industry knowledge. The models that power this tool must understand complex legal rules.

Objective: Thomson Reuters aimed to create a reinforcement fine-tuned model that excels at legal AI skills. They conducted preliminary evaluations of RFT to see if they could achieve model performance improvements, using specialized datasets from three highly used CoCounsel Legal AI skills for legal professionals:

  1. Review documents: Generates detailed answers to questions asked against contracts, transcripts, and other legal documents
  2. Compare documents: Highlights substantive differences between two or more different contracts or documents
  3. Summarize: Summarizes the most important information within one or more documents to enable rapid legal review

Evals are the foundation

Before implementing RFT, we strongly recommend creating and running an eval for the task you intend to fine-tune on. If the model you intend to fine-tune scores at either the absolute minimum or absolute maximum possible score, then RFT won’t be useful to you.

RFT works by reinforcing better answers to provided prompts. If we can’t distinguish the quality of different answers (i.e., if they all receive the minimum or maximum possible score), then there’s no training signal to learn from. However, if your eval scores somewhere in the range between the minimum and maximum possible scores, there’s enough data to work with.
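
A quick sanity check along these lines, assuming per-sample eval scores normalized to [0, 1], might look like the sketch below.

```python
# Illustrative pre-flight check on eval results, assuming per-sample scores
# normalized so 0.0 is the minimum and 1.0 is the maximum.

def has_training_signal(scores: list[float], min_score: float = 0.0, max_score: float = 1.0) -> bool:
    """Return False if every answer is pinned at the same extreme, because then
    the grader cannot distinguish better answers from worse ones."""
    return not (
        all(s == min_score for s in scores) or all(s == max_score for s in scores)
    )

print(has_training_signal([0.0, 0.0, 0.0]))  # False: nothing to reinforce
print(has_training_signal([1.0, 1.0, 1.0]))  # False: already at the ceiling
print(has_training_signal([0.2, 0.7, 1.0]))  # True: headroom and a usable signal
```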

An effective eval reveals opportunities where human experts consistently agree but current frontier models struggle, presenting a valuable gap for RFT to close. Get started with evals.

How to get better results from RFT

To see improvements in your fine-tuned model, there are two main places to revisit and refine: making sure your task is well defined, and making your grading scheme more robust.

Reframe or clarify your task

Good tasks give the model a fair chance to learn and let you quantify improvements.

  • Start with a task the model can already solve occasionally. RFT works by sampling many answers, keeping what looks best, and nudging the model toward those answers. If the model never gets the answer correct today, it cannot improve.
  • Make sure each answer can be graded. A grader must read an answer and produce a score without a person in the loop. We support multiple grader types, including custom Python graders and LLM judges. If you can’t write code to judge the answer with an available grader, RFT is not the right tool.
  • Remove doubt about the “right” answer. If two careful people often disagree on the solution, the task is too fuzzy. Rewrite the prompt, add context, or split the task into clearer parts until domain experts agree.
  • Limit lucky guesses. If the task is multiple choice with one obvious best pick, the model can win by chance. Add more classes, ask for short open‑ended text, or tweak the format so guessing is costly.

Strengthen your grader

Clear, robust grading schemes are essential for RFT.

  • Produce a smooth score, not a pass/fail stamp. A score that shifts gradually as answers improve provides a better training signal.
  • Guard against reward hacking. This happens when the model finds a shortcut that earns high scores without real skill.
  • Avoid skewed data. Datasets in which one label shows up most of the time invite the model to guess that label. Balance the set or up‑weight rare cases so the model must think.
  • Use an LLM judge when code falls short. For rich, open‑ended answers, have a separate OpenAI model grade your fine-tuned model’s answers. Make sure you:
    • Evaluate the judge: Run multiple candidate responses and correct answers through your LLM judge to ensure the grade returned is stable and aligned with your preferences.
    • Provide few-shot examples: Include great, fair, and poor answers in the prompt to improve the grader’s effectiveness.

Learn more about grader types.
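
To make the advice above concrete, here is a minimal sketch of a smooth-scoring grader built on difflib's SequenceMatcher. A production setup might instead use one of the platform's built-in grader types or an LLM judge; the reference answer below is purely illustrative.

```python
from difflib import SequenceMatcher

# Sketch of a smooth-scoring grader: rather than pass/fail, score how close the
# model's answer is to a reference answer, so partial progress is rewarded.

def smooth_grade(model_output: str, reference: str) -> float:
    """Return a similarity ratio in [0, 1]; higher means closer to the reference."""
    return SequenceMatcher(
        None, model_output.strip().lower(), reference.strip().lower()
    ).ratio()

reference = "The refund should be issued with stripe.Refund.create using the original charge ID."
print(smooth_grade("Issue the refund with stripe.Refund.create on the original charge ID.", reference))  # high score
print(smooth_grade("I am not sure.", reference))                                                         # low score
```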

Other resources

For more inspiration, visit the OpenAI Cookbook, which contains example code and links to third-party resources, or learn more about our models and reasoning capabilities: