Large Language Models (LLMs) have transformed numerous fields, including reasoning-heavy scientific domains such as mathematics and coding. However, for LLMs to truly excel in these areas, they need the ability to self-correct: to identify and rectify their own errors without relying on external input or feedback. This is especially important in scenarios where obtaining ground truth or external feedback is impractical or impossible.
This paper tackles the challenge of intrinsic self-correction, where LLMs are expected to revise their initial responses into a better final output based solely on their internal knowledge and reasoning.
From the paper:
This sort of self-correction capability has been shown by several recent works to be severely lacking in current LLMs, especially in the absence of external input (also referred to as intrinsic self-correction).
The paper aims to develop a method that can effectively teach LLMs to self-correct using only self-generated training data, without oracle feedback or human annotations.
The research on self-correcting LLMs draws upon several existing lines of work:
Previous studies have explored the use of prompting techniques to encourage LLMs to self-correct. However, these attempts have largely been unsuccessful, with self-correction often degrading performance or leading to minimal improvements. The paper explains:
"These experimental studies are at odds with prior work ... and largely stem from mismatched assumptions on the setting."
Another line of research focuses on fine-tuning LLMs specifically for self-correction. These methods often rely on either oracle feedback (e.g., human annotations or stronger models) or involve complex pipelines with multiple models, which are not ideal for real-world deployment.
The paper leverages the advancements in multi-turn Reinforcement Learning (RL) for LLMs, which allows models to learn from sequential interactions and optimize their actions over multiple turns. This framework is well-suited for self-correction as it involves a sequence of attempts to arrive at a better final response.
Several studies have investigated self-correction in settings where external feedback is available, such as code generation with unit test results. However, the paper emphasizes the more challenging setting of intrinsic self-correction, where no external input is provided.
The paper aims to develop a method for training LLMs to improve their predictions entirely by learning from self-generated data, specifically in the intrinsic self-correction setting.
The paper formalizes the problem using a multi-turn setting, where an LLM policy π<sub>θ</sub> takes as input the problem x, previous attempts at solving it (ŷ<sub>1:l</sub>), and auxiliary instructions (p<sub>1:l</sub>), such as "find a mistake and improve the response." The goal is to maximize the correctness reward of the final response (ŷ<sub>l+1</sub>) compared to the oracle response (y<sup>*</sup>).
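In rough notation, the objective described above can be written as follows. This is a hedged rendering based on the description in this summary; the paper's exact objective may include additional terms or regularization.

```latex
% Multi-turn self-correction objective (sketch): maximize the expected
% correctness reward of the final attempt, which conditions on the
% problem x, the earlier attempts \hat{y}_{1:l}, and the instructions p_{1:l}.
\max_{\pi_\theta} \;
\mathbb{E}_{\,x,\, y^{*},\;\hat{y}_{l+1} \,\sim\, \pi_\theta(\cdot \mid x,\, \hat{y}_{1:l},\, p_{1:l})}
\Big[\, r\big(\hat{y}_{l+1},\, y^{*}\big) \Big]
```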
The paper uses on-policy policy gradient methods for RL, which are commonly used for fine-tuning LLMs in single-turn settings with human feedback.
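To give a flavor of such methods, here is a bare-bones REINFORCE-style loss for one sampled response. It is only a minimal sketch of an on-policy policy-gradient update (practical pipelines add baselines and KL regularization) and is not the paper's actual implementation.

```python
import torch

def reinforce_loss(token_logprobs: torch.Tensor, reward: float) -> torch.Tensor:
    """Bare-bones on-policy policy-gradient (REINFORCE) loss for one sample.

    token_logprobs: log-probabilities of the sampled response tokens under
    the current policy (shape: [num_tokens]).
    reward: scalar correctness reward for the whole response.
    Minimizing this loss increases the likelihood of high-reward responses.
    """
    return -(reward * token_logprobs.sum())
```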
The paper uses several metrics to evaluate self-correction performance, including the accuracy of the first and second attempts and the net improvement in accuracy between them, denoted Δ(t1, t2).
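A minimal sketch of how such metrics can be computed from per-problem correctness flags; the helper and metric names below are illustrative, not the paper's code.

```python
def self_correction_metrics(first_correct, second_correct):
    """Compute simple self-correction metrics from per-problem booleans.

    first_correct[i] / second_correct[i] indicate whether the first /
    second attempt on problem i was judged correct.
    """
    n = len(first_correct)
    acc_t1 = sum(first_correct) / n  # first-attempt accuracy
    acc_t2 = sum(second_correct) / n  # second-attempt accuracy
    delta = acc_t2 - acc_t1          # Δ(t1, t2): net gain from self-correction
    # Fractions of problems whose correctness flips between attempts
    incorrect_to_correct = sum((not a) and b for a, b in zip(first_correct, second_correct)) / n
    correct_to_incorrect = sum(a and (not b) for a, b in zip(first_correct, second_correct)) / n
    return {
        "accuracy@t1": acc_t1,
        "accuracy@t2": acc_t2,
        "delta(t1,t2)": delta,
        "incorrect->correct": incorrect_to_correct,
        "correct->incorrect": correct_to_incorrect,
    }
```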
This section explores whether Supervised Fine-Tuning (SFT) on self-generated data can be an effective approach for training self-correction capabilities.
The paper examines two SFT approaches: STaR and Pair-SFT.
The results show that while SFT improves upon the base model's self-correction behavior, it still falls short of achieving a positive self-correction rate (Δ(t1, t2)). Both STaR and Pair-SFT tend to produce worse second attempts compared to their first attempts.
The paper further analyzes the limitations of SFT:
By analyzing edit distance ratios between first and second attempts, the paper confirms that SFT models are overly conservative in their edits, often making no changes at all.
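As a rough illustration, an edit-distance-style ratio between two attempts can be estimated as below. This uses Python's difflib as a stand-in and is not necessarily the exact measure used in the paper.

```python
import difflib

def edit_ratio(first_attempt: str, second_attempt: str) -> float:
    """How much the second attempt changes the first (0.0 = identical).

    Uses difflib's similarity ratio as a cheap proxy; a value near 0
    corresponds to the "no edits" behavior, a value near 1 to a full rewrite.
    """
    similarity = difflib.SequenceMatcher(None, first_attempt, second_attempt).ratio()
    return 1.0 - similarity
```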
The key takeaways from this section are that SFT on self-generated data improves over the base model but still fails to achieve a positive self-correction rate Δ(t1, t2), and that the resulting models are overly conservative, often leaving their first attempts unchanged.
SCoRe (Self-Correction via Reinforcement Learning) is a novel multi-turn RL approach designed to overcome the limitations of SFT and achieve effective self-correction.
While multi-turn RL addresses the distribution shift issue, it faces two key challenges:
Mode collapse
Base model initializations often exhibit a highly skewed distribution over edit distances, making them prone to collapse into a single mode of behavior (e.g., making no edits).
Learning the right strategy
Even with a better initialization, the RL training process needs to be guided towards learning a generalizable self-correction strategy rather than simply optimizing for the best first-attempt response.
SCoRe addresses these challenges using a two-stage approach:
Stage I: Training a Model Initialization to Prevent Collapse
This stage focuses on training a model initialization that is less susceptible to collapse in subsequent RL. Instead of using SFT, SCoRe fine-tunes the base model to produce high-reward revisions at the second attempt while constraining the first-attempt response distribution to be close to the base model. This ensures that the model learns to make meaningful edits without drastically deviating from its initial responses.
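In rough notation, the Stage I objective described above looks something like the sketch below, where β is an assumed trade-off coefficient and π<sub>ref</sub> denotes the frozen base model; the paper's exact formulation may differ.

```latex
% Stage I (sketch): maximize second-attempt reward while keeping the
% first-attempt distribution close to the base model via a KL penalty.
% \beta and \pi_{\mathrm{ref}} are notation assumed here for the trade-off
% coefficient and the frozen base model.
\max_{\pi_\theta} \;
\mathbb{E}\Big[\, r\big(\hat{y}_2,\, y^{*}\big)
 \;-\; \beta \, D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big]
```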
Stage II: Multi-Turn RL with Reward Shaping
This stage utilizes multi-turn RL to optimize reward at both attempts. To encourage self-correction, SCoRe employs reward shaping by adding a bonus to the second attempt's reward for transitions that flip the correctness of the response. This biases the model towards learning a strategy that actively improves upon the initial response.
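A minimal sketch of such a shaped reward, assuming binary correctness rewards and a hypothetical bonus coefficient alpha (not a value from the paper):

```python
def shaped_second_attempt_reward(r1: float, r2: float, alpha: float = 1.0) -> float:
    """Shaped reward for the second attempt (illustrative sketch).

    r1, r2: binary correctness rewards (0 or 1) for the first and second
    attempts. The bonus term rewards flipping an incorrect answer to a
    correct one and penalizes the reverse; alpha is a hypothetical
    scaling coefficient.
    """
    bonus = alpha * (r2 - r1)  # +alpha for 0 -> 1 flips, -alpha for 1 -> 0, 0 if unchanged
    return r2 + bonus
```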
The paper explains that naive multi-turn RL often fails to learn self-correction because, on the training data, two solutions look equally good: producing the best possible response on the first attempt and leaving it essentially unchanged, or genuinely improving the response across attempts.
Overparameterized LLMs may not necessarily learn the desired strategy unless the "direct" strategy of optimizing the first attempt appears less viable. SCoRe's two-stage approach and reward shaping address this issue by biasing the learning process towards the desired self-correction strategy.
This section evaluates the performance of SCoRe on benchmark reasoning tasks and compares it with prior approaches and baselines.
The experiments focus on math reasoning (the MATH benchmark) and code-generation tasks.
The evaluation measures self-correction accuracy over two sequential attempts at each problem, using the metrics described earlier.
SCoRe is compared with prior approaches and baselines, including the SFT-based methods discussed above (STaR and Pair-SFT).
MATH:
Code Generation:
The paper demonstrates that SCoRe can be effectively combined with inference-time compute scaling strategies like self-consistency decoding (majority voting). It shows that using sequential sampling with self-correction is more compute-efficient than parallel sampling alone.
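As an illustration of the sequential variant, the sketch below samples k first attempts, revises each one, and majority-votes over the revised answers. The helpers generate, self_correct, and extract_answer are hypothetical, not APIs from the paper.

```python
from collections import Counter

def sequential_self_consistency(problem: str, generate, self_correct, extract_answer, k: int = 8) -> str:
    """Self-consistency over self-corrected samples (illustrative sketch).

    For the same sampling budget as 2k parallel samples, this spends the
    compute on k first attempts plus k revisions, then majority-votes
    over the final answers of the revised attempts.
    """
    finals = []
    for _ in range(k):
        first = generate(problem)              # sample a first attempt
        second = self_correct(problem, first)  # revise it with a self-correction instruction
        finals.append(extract_answer(second))  # keep only the final answer string
    return Counter(finals).most_common(1)[0][0]  # majority-vote winner
```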
The ablation studies on MATH demonstrate the importance of the various components of SCoRe, including the two-stage structure and the reward shaping.
The qualitative analysis shows that SCoRe can refine its responses in various ways, including rewriting entire solutions, revising incorrect parts, and even demonstrating a bias towards showing more steps in computations to increase the probability of correctness.
SCoRe's ability to improve intrinsic self-correction capabilities in LLMs has significant implications for various business applications:
LLMs equipped with self-correction can provide more accurate and reliable responses in customer support interactions, reducing the need for human intervention.
SCoRe's strong performance on code generation and repair tasks opens doors for automating software development processes, improving efficiency and reducing costs.
Self-correction capabilities enhance the trustworthiness and reliability of AI assistants in various domains, including education, healthcare, and finance.
SCoRe's ability to learn from self-generated data can significantly reduce the need for expensive and time-consuming human feedback in LLM training.
The paper introduces SCoRe, a novel multi-turn RL approach that achieves significant improvements in intrinsic self-correction in LLMs. By carefully designing a two-stage training process with reward shaping, SCoRe overcomes the limitations of traditional SFT methods and enables LLMs to learn a generalizable self-correction strategy.
The results on benchmark reasoning tasks demonstrate the effectiveness of SCoRe in enhancing both direct and self-correction accuracies. The ablation studies further highlight the importance of various components of SCoRe in contributing to its success.
SCoRe's success stems from addressing two key challenges that hinder self-correction in LLMs: the mismatch between the data a model is trained on and the distribution of its own responses, and the tendency to collapse into a single mode of behavior, such as making only minor or no edits.
Models don't learn self-correction during original training because the training objectives typically focus on generating a single correct response for a given input. There's no explicit incentive for the model to identify and fix its mistakes. Additionally, the massive datasets used for training may not contain sufficient examples of error correction to enable the model to learn this behavior effectively.
SCoRe, by explicitly focusing on multi-turn self-correction and employing tailored training strategies, provides the necessary guidance for LLMs to develop this crucial capability.