September 23, 2024
4 mins

Training Language Models to Self-Correct via Reinforcement Learning

Google DeepMind introduces SCoRe, a technique that teaches LLMs to self-correct at inference time, producing better outputs and fixing their own mistakes without human supervision.
Paper Link

Key Takeaways

  • Intrinsic self-correction is hard: Large language models (LLMs) often struggle to identify and fix their own mistakes without external feedback.
  • Supervised Fine-Tuning (SFT) falls short: Traditional SFT methods, even on self-generated data, are insufficient for teaching robust self-correction, often leading to minimal edits or amplifying existing biases.
  • Reinforcement Learning (RL) offers a solution: SCoRe, a novel multi-turn RL approach, demonstrates significant improvements in intrinsic self-correction by learning directly from the model's own error-correction attempts.
  • SCoRe's secret sauce: A two-stage training process, involving a specialized initialization phase and reward shaping, helps SCoRe overcome the limitations of SFT and achieve state-of-the-art self-correction performance.
  • Broader implications: SCoRe suggests that teaching LLMs complex algorithmic behaviors like self-correction may require going beyond conventional SFT and single-turn RL paradigms.

Introduction

Large Language Models (LLMs) have revolutionized numerous fields, including reasoning and scientific domains like mathematics and coding. However, a crucial aspect for LLMs to truly excel in these areas is their ability to self-correct, meaning they can identify and rectify their errors without relying on external input or feedback. This is especially important in scenarios where obtaining ground truth or feedback is impractical or impossible.

This paper goes deep into the challenge of intrinsic self-correction, where LLMs are expected to revise their initial responses and arrive at a better final output based solely on their own internal knowledge and reasoning.

From the paper:

This sort of self-correction capability has been shown by several recent works to be severely lacking in current LLMs, especially in the absence of external input (also referred to as intrinsic self-correction).

The paper aims to develop a method that can effectively teach LLMs to self-correct, leading to the following advantages:

  • Improved accuracy: By identifying and fixing mistakes, the overall accuracy of LLMs in reasoning and problem-solving tasks can be significantly enhanced.
  • Enhanced autonomy: Self-correction enables LLMs to operate more independently, reducing the reliance on external feedback or human intervention.
  • Better algorithm implementation: Self-correction is a stepping stone towards teaching LLMs to implement complex algorithms that require them to adapt and improve their responses iteratively.

Background

The research on self-correcting LLMs draws upon several existing areas of research:

Prompting for intrinsic self-correction

Previous studies have explored the use of prompting techniques to encourage LLMs to self-correct. However, these attempts have largely been unsuccessful, with self-correction often degrading performance or leading to minimal improvements. The paper explains:

"These experimental studies are at odds with prior work ... and largely stem from mismatched assumptions on the setting."

Fine-tuning for intrinsic self-correction

Another line of research focuses on fine-tuning LLMs specifically for self-correction. These methods often rely on either oracle feedback (e.g., human annotations or stronger models) or involve complex pipelines with multiple models, which are not ideal for real-world deployment.

Multi-turn RL for LLMs

The paper leverages the advancements in multi-turn Reinforcement Learning (RL) for LLMs, which allows models to learn from sequential interactions and optimize their actions over multiple turns. This framework is well-suited for self-correction as it involves a sequence of attempts to arrive at a better final response.

Self-correction with external feedback

Several studies have investigated self-correction in settings where external feedback is available, such as code generation with unit test results. However, the paper emphasizes the more challenging setting of intrinsic self-correction, where no external input is provided.

Preliminaries and Problem Setup

Goal

The paper aims to develop a method for training LLMs to improve their predictions entirely by learning from self-generated data, specifically in the intrinsic self-correction setting.

Formalism

The paper formalizes the problem using a multi-turn setting, where an LLM policy π<sub>θ</sub> takes as input the problem x, previous attempts at solving it (ŷ<sub>1:l</sub>), and auxiliary instructions (p<sub>1:l</sub>), such as "find a mistake and improve the response." The goal is to maximize the correctness reward of the final response (ŷ<sub>l+1</sub>) compared to the oracle response (y<sup>*</sup>).
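
To make this setup concrete, here is a minimal sketch of a two-attempt rollout under the formalism above; `policy`, `is_correct`, and the correction instruction string are illustrative stand-ins, not the paper's implementation.

```python
# Illustrative two-turn rollout for intrinsic self-correction.
# `policy` stands in for pi_theta; `is_correct` compares a response
# against the oracle answer y* and returns a 0/1 correctness reward.

CORRECTION_INSTRUCTION = (
    "There might be an error in the solution above. "
    "Find the mistake, if any, and improve the response."
)

def two_turn_rollout(policy, problem, oracle_answer, is_correct):
    # Attempt 1: the policy answers the problem directly.
    y1 = policy.generate(prompt=problem)
    r1 = is_correct(y1, oracle_answer)

    # Attempt 2: the policy sees the problem, its own first attempt,
    # and an auxiliary instruction asking it to self-correct.
    y2 = policy.generate(prompt=problem + "\n" + y1 + "\n" + CORRECTION_INSTRUCTION)
    r2 = is_correct(y2, oracle_answer)

    # Training maximizes the correctness reward of the final response y2.
    return (y1, r1), (y2, r2)
```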

Base RL Approach

The paper uses on-policy policy gradient methods for RL, which are commonly used for fine-tuning LLMs in single-turn settings with human feedback.
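
As a rough illustration of what an on-policy policy-gradient objective looks like (not the paper's exact implementation), a REINFORCE-style surrogate loss over sampled responses could be written as follows, assuming per-sample sequence log-probabilities and scalar rewards are already available as tensors:

```python
import torch

def reinforce_loss(seq_logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE surrogate: minimizing this pushes up the log-probability
    of responses with above-baseline reward. Both inputs are shaped [batch]."""
    baseline = rewards.mean()                 # simple variance-reduction baseline
    advantages = rewards - baseline
    return -(advantages.detach() * seq_logprobs).mean()
```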

Metrics

The paper uses several metrics to evaluate self-correction performance (a small computation sketch follows the list), including:

  • Accuracy@t1/t2: Model accuracy at the first and second attempts.
  • Δ(t1, t2): Net improvement in accuracy between attempts.
  • Δ<sup>i→c</sup>(t1, t2): Fraction of problems that become correct from incorrect.
  • Δ<sup>c→i</sup>(t1, t2): Fraction of problems that become incorrect from correct.
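
A minimal sketch of computing these metrics from per-problem correctness flags at the two attempts; storing the evaluation results as boolean lists is an assumption for illustration.

```python
def self_correction_metrics(correct_t1, correct_t2):
    """correct_t1 / correct_t2: lists of booleans, one entry per problem,
    indicating whether the attempt-1 / attempt-2 answer was correct."""
    n = len(correct_t1)
    acc_t1 = sum(correct_t1) / n
    acc_t2 = sum(correct_t2) / n
    # Fraction of problems flipped incorrect -> correct, and correct -> incorrect.
    i_to_c = sum((not c1) and c2 for c1, c2 in zip(correct_t1, correct_t2)) / n
    c_to_i = sum(c1 and (not c2) for c1, c2 in zip(correct_t1, correct_t2)) / n
    return {
        "accuracy@t1": acc_t1,
        "accuracy@t2": acc_t2,
        "delta(t1,t2)": acc_t2 - acc_t1,  # net improvement
        "delta_i->c": i_to_c,
        "delta_c->i": c_to_i,
    }
```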

Supervised Fine-Tuning on Self-Generated Data is Insufficient for Self-Correction

This section explores whether Supervised Fine-Tuning (SFT) on self-generated data can be an effective approach for training self-correction capabilities.

Analysis Setup: Methods and Dataset Construction

The paper examines two SFT approaches (a rough sketch of how each dataset could be constructed follows the list):

  1. STaR: This method filters self-correction trajectories to keep only those that successfully revise incorrect responses and then fine-tunes the model on this filtered dataset.
  2. Pair-SFT: This method constructs a dataset by pairing incorrect responses with correct ones from the base model's self-generated data and then fine-tunes a single model on this dataset.
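
Here is a hypothetical sketch of both dataset constructions from self-generated two-attempt data; the trajectory and record formats are assumptions, not the paper's data pipeline.

```python
def build_star_dataset(trajectories):
    """STaR-style filtering: keep only self-correction trajectories whose
    second attempt successfully revises an incorrect first attempt."""
    return [t for t in trajectories if (not t["r1"]) and t["r2"]]

def build_pair_sft_dataset(first_attempts, correct_responses):
    """Pair-SFT-style pairing: couple each incorrect first attempt with an
    independently sampled correct response to the same problem."""
    pairs = []
    for attempt in first_attempts:
        if attempt["r1"]:
            continue  # only incorrect first attempts need a repair target
        fix = correct_responses.get(attempt["problem"])
        if fix is not None:
            pairs.append({"problem": attempt["problem"],
                          "incorrect": attempt["response"],
                          "correct": fix})
    return pairs
```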

Empirical Findings

The results show that while SFT improves upon the base model's self-correction behavior, it still falls short of achieving a positive self-correction rate (Δ(t1, t2)). Both STaR and Pair-SFT tend to produce worse second attempts compared to their first attempts.

The paper further analyzes the limitations of SFT:

  • STaR: It tends to latch onto a single mode of correction, often making only minor edits to the initial response.
  • Pair-SFT: It suffers from distribution shift, meaning the correction strategy learned on the training data doesn't generalize well to the model's own distribution of initial responses.

Diving deeper: analyzing self-correction behavior

By analyzing edit distance ratios between first and second attempts, the paper confirms that SFT models are overly conservative in their edits, often making no changes at all.
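
As a small illustration of that analysis, an edit distance ratio between two attempts can be approximated with standard-library tooling; `difflib`'s similarity score is used here as a proxy rather than an exact Levenshtein distance.

```python
from difflib import SequenceMatcher

def edit_ratio(first_attempt: str, second_attempt: str) -> float:
    """Proxy for the edit distance ratio between two attempts:
    0.0 means the second attempt is identical to the first,
    values near 1.0 mean it was almost entirely rewritten."""
    similarity = SequenceMatcher(None, first_attempt, second_attempt).ratio()
    return 1.0 - similarity

# An overly conservative self-corrector produces ratios clustered near 0.
```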

Takeaways: Insufficiency of SFT

The key takeaways from this section are:

  • STaR: Fails to learn a diverse set of correction behaviors, often leading to minimal edits.
  • Pair-SFT: Suffers from distribution shift, hindering its ability to generalize to new problems.

SCoRe: Self-Correction via Multi-Turn Reinforcement Learning

SCoRe is a novel multi-turn RL approach designed to overcome the limitations of SFT and achieve effective self-correction.

Key Challenges

While multi-turn RL addresses the distribution shift issue, it faces two key challenges:

Mode collapse

Base model initializations often exhibit a highly skewed distribution over edit distances, making them prone to collapse into a single mode of behavior (e.g., making no edits).

Learning the right strategy

Even with a better initialization, the RL training process needs to be guided towards learning a generalizable self-correction strategy rather than simply optimizing for the best first-attempt response.

Method Overview

SCoRe addresses these challenges using a two-stage approach:

Stage I: Training a Model Initialization to Prevent Collapse

This stage focuses on training a model initialization that is less susceptible to collapse in subsequent RL. Instead of using SFT, SCoRe fine-tunes the base model to produce high-reward revisions at the second attempt while constraining the first-attempt response distribution to be close to the base model. This ensures that the model learns to make meaningful edits without drastically deviating from its initial responses.
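
A highly simplified sketch of what a Stage I objective along these lines could look like, assuming per-sample sequence log-probabilities under both the trained policy and a frozen copy of the base model are available; the tensor shapes and the sample-based KL estimate are assumptions, not the paper's exact loss.

```python
import torch

def stage1_loss(logp_attempt2: torch.Tensor,      # [batch] log-prob of 2nd attempt under policy
                reward_attempt2: torch.Tensor,    # [batch] correctness reward of 2nd attempt
                logp_attempt1: torch.Tensor,      # [batch] log-prob of 1st attempt under policy
                logp_attempt1_base: torch.Tensor, # [batch] same 1st attempt under frozen base model
                beta: float = 0.1) -> torch.Tensor:
    # REINFORCE-style term: reward high-quality second-attempt revisions.
    rl_term = -(reward_attempt2.detach() * logp_attempt2).mean()
    # Sample-based KL estimate E_pi[log pi - log pi_base] that keeps the
    # first-attempt distribution close to the base model.
    kl_term = (logp_attempt1 - logp_attempt1_base.detach()).mean()
    return rl_term + beta * kl_term
```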

Stage II: Multi-Turn RL with Reward Shaping

This stage utilizes multi-turn RL to optimize reward at both attempts. To encourage self-correction, SCoRe employs reward shaping by adding a bonus to the second attempt's reward for transitions that flip the correctness of the response. This biases the model towards learning a strategy that actively improves upon the initial response.
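
One way to realize that bonus is a progress term proportional to the change in correctness between attempts; this sketch is an illustration of the idea, with `alpha` as an assumed shaping coefficient.

```python
def shaped_second_attempt_reward(r1: float, r2: float, alpha: float = 1.0) -> float:
    """Adds a progress bonus on top of the second-attempt reward:
    flipping incorrect -> correct earns extra reward, flipping
    correct -> incorrect is penalized, and coasting earns nothing extra."""
    return r2 + alpha * (r2 - r1)

# Example: r1 = 0, r2 = 1 (mistake fixed)        -> reward 1 + alpha
#          r1 = 1, r2 = 0 (correct answer broken) -> reward -alpha
```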

Why does naive multi-turn RL fail?

The paper explains that naive multi-turn RL often fails to learn self-correction because there are two equally good solutions on the training data:

  1. Learning to improve from the first to the second attempt (desired).
  2. Producing the best first-attempt response followed by no correction (undesired).

Overparameterized LLMs may not necessarily learn the desired strategy unless the "direct" strategy of optimizing the first attempt appears less viable. SCoRe's two-stage approach and reward shaping address this issue by biasing the learning process towards the desired self-correction strategy.

Experimental Evaluation

This section evaluates the performance of SCoRe on benchmark reasoning tasks and compares it with prior approaches and baselines.

Tasks

The experiments focus on math and coding tasks:

  • MATH: A dataset of mathematical problem-solving questions.
  • MBPP & HumanEval: Datasets for evaluating code generation capabilities.

Evaluation protocol and metrics

The evaluation uses two sequential attempts at each problem to measure self-correction accuracy, reported with the metrics defined above.

Prior approaches and comparisons

SCoRe is compared with:

  • Self-Refine: A prompting-based approach.
  • Pair-SFT & Multi-turn STaR: Fine-tuning-based approaches.

Benchmark Results

MATH:

  • SCoRe achieves significantly stronger performance on both direct and self-correction accuracies compared to all baselines.
  • It attains the first significantly positive self-correction gain (Δ(t1, t2)).
  • It improves the rate at which it fixes incorrect answers while reducing the proportion of correct answers it changes.

Code Generation:

  • SCoRe demonstrates strong offline repair performance on MBPP-R and generalizes well to HumanEval.
  • It achieves a substantial improvement in intrinsic self-correction delta compared to the base model.
  • Pair-SFT, while effective on MBPP-R, degrades the base model in the self-correction setting, highlighting the importance of on-policy sampling.

Inference-Compute Scaling with Self-Correction

The paper demonstrates that SCoRe can be effectively combined with inference-time compute scaling strategies like self-consistency decoding (majority voting). It shows that using sequential sampling with self-correction is more compute-efficient than parallel sampling alone.
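
To illustrate how sequential self-correction can plug into self-consistency decoding, here is a hypothetical sketch; `generate`, `revise`, and `extract_answer` are stand-ins for whatever sampling and answer-parsing utilities a real pipeline would provide.

```python
from collections import Counter

def correct_then_vote(generate, revise, extract_answer, problem, k=8):
    """Sample k independent first attempts, let the model revise each one,
    then majority-vote over the revised final answers (self-consistency)."""
    final_answers = []
    for _ in range(k):
        first = generate(problem)         # independent attempt 1
        second = revise(problem, first)   # intrinsic self-correction pass
        final_answers.append(extract_answer(second))
    # Majority vote over the k revised answers.
    return Counter(final_answers).most_common(1)[0][0]
```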

Ablation Studies: Understanding the Impact of SCoRe Components

The ablation studies on MATH demonstrate the importance of various components of SCoRe:

  • Multi-turn training is crucial for achieving positive Δ(t1, t2).
  • Stage I plays a significant role in preventing collapse and improving accuracy@t2.
  • Reward shaping is essential for guiding the RL process towards self-correction.
  • On-policy RL (REINFORCE) is more effective than STaR in the self-correction setting.

Qualitative Analysis of SCoRe

The qualitative analysis shows that SCoRe can refine its responses in various ways, including rewriting entire solutions, revising incorrect parts, and even demonstrating a bias towards showing more steps in computations to increase the probability of correctness.

Business Implications

SCoRe's ability to improve intrinsic self-correction capabilities in LLMs has significant implications for various business applications:

Enhanced customer support

LLMs equipped with self-correction can provide more accurate and reliable responses in customer support interactions, reducing the need for human intervention.

Automated code generation and repair

SCoRe's strong performance on code generation and repair tasks opens doors for automating software development processes, improving efficiency and reducing costs.

More robust and trustworthy AI assistants

Self-correction capabilities enhance the trustworthiness and reliability of AI assistants in various domains, including education, healthcare, and finance.

Reduced reliance on human feedback

SCoRe's ability to learn from self-generated data can significantly reduce the need for expensive and time-consuming human feedback in LLM training.

Conclusion

The paper introduces SCoRe, a novel multi-turn RL approach that achieves significant improvements in intrinsic self-correction in LLMs. By carefully designing a two-stage training process with reward shaping, SCoRe overcomes the limitations of traditional SFT methods and enables LLMs to learn a generalizable self-correction strategy.

The results on benchmark reasoning tasks demonstrate the effectiveness of SCoRe in enhancing both direct and self-correction accuracies. The ablation studies further highlight the importance of various components of SCoRe in contributing to its success.

Why does SCoRe work? Why don't models learn self-correction at original training time?

SCoRe's success stems from addressing two key challenges that hinder self-correction in LLMs:

  1. Distribution shift: SCoRe uses on-policy RL to train on self-generated data, ensuring that the learned correction strategy aligns with the model's own distribution of initial responses. This contrasts with SFT approaches that suffer from distribution mismatch when trained on offline datasets.
  2. Learning the right strategy: SCoRe's two-stage training and reward shaping encourage the model to learn a generalizable self-correction strategy rather than simply optimizing for the best first-attempt response. This prevents the model from collapsing into a non-correcting mode of behavior.

Models don't learn self-correction during original training because the training objectives typically focus on generating a single correct response for a given input. There's no explicit incentive for the model to identify and fix its mistakes. Additionally, the massive datasets used for training may not contain sufficient examples of error correction to enable the model to learn this behavior effectively.

SCoRe, by explicitly focusing on multi-turn self-correction and employing tailored training strategies, provides the necessary guidance for LLMs to develop this crucial capability.
