January 21, 2025
8 mins

Offline Reinforcement Learning for LLM Multi-Step Reasoning

Explore OREO, a novel offline RL algorithm for enhancing LLM multi-step reasoning. Learn how it outperforms DPO with soft Bellman optimization and fine-grained credit assignment.
Paper Link

Key Takeaways

  • OREO is an offline RL method designed for LLM multi-step reasoning, addressing the limitations of DPO.
  • It jointly learns a policy model and a value function using the soft Bellman Equation, reducing reliance on pairwise preference data.
  • OREO enables fine-grained credit assignment, critical for tasks with sparse rewards and multi-step dependencies.
  • Empirically, OREO outperforms existing offline learning methods on math reasoning (GSM8K, MATH) and embodied agent control (ALFWorld) tasks.
  • The learned value function can be leveraged for test-time search, further improving performance.

Introduction

Large Language Models (LLMs) have shown remarkable capabilities in complex tasks requiring multi-step reasoning, such as mathematical problem-solving and embodied agent control. Reinforcement learning (RL) offers a promising avenue for self-improvement in LLMs, but many RL algorithms require costly online data collection. Offline RL methods, like Direct Preference Optimization (DPO), provide a more practical approach by utilizing pre-existing datasets. However, DPO has limitations in multi-step reasoning tasks, including the need for pairwise preference data and ineffective credit assignment due to uniform token treatment.

Relevant Background Work

Reinforcement learning for LLMs has become a standard part of post-training: RLHF methods built on PPO are widely adopted, and alternatives such as rejection sampling and preference-based methods have gained traction in the LLM literature. Maximum-entropy RL and path consistency learning (PCL), which OREO builds on, also play a key role in this line of work.

LLM reasoning has been enhanced through methods such as chain-of-thought prompting and supervised fine-tuning. Rejection sampling, STaR, and related methods improve reasoning when human-annotated trajectories are unavailable. While RL algorithms are increasingly used to improve LLM reasoning, directly applying DPO has not been entirely successful, in part because it requires collecting pairwise preference data.

Preliminaries

MDP for LLM Reasoning

The paper defines a Markov Decision Process (MDP) for LLM reasoning in which, at each time step, the action is the generation of a new token. The state is the token sequence produced so far (including the prompt), the transition function deterministically appends the newly generated token to the state, and the reward is typically non-zero only at the terminal step, indicating task completion.
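
Written out, this formulation looks roughly as follows (a sketch consistent with the description above; the notation is illustrative rather than the paper's exact symbols):

```latex
% State: the prompt x plus the tokens generated so far; action: the next token.
s_t = (x, y_1, \dots, y_t), \qquad a_t = y_{t+1}
% Transition: deterministic concatenation of the chosen token.
s_{t+1} = s_t \,\Vert\, a_t = (x, y_1, \dots, y_t, y_{t+1})
% Reward: sparse, non-zero only at the terminal step (e.g. whether the final answer is correct).
r(s_t, a_t) =
\begin{cases}
  R(x, y_{1:T}) & \text{if } t = T - 1,\\
  0             & \text{otherwise.}
\end{cases}
```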

Soft Bellman Equation

Based on entropy-regularized reinforcement learning, the paper defines a value function that quantifies the expected KL-regularized reward of a policy from any given state. Theorem 1 states that the optimal policy and its value function satisfy the soft Bellman equation. Theorem 2 states that if a policy and a state value function satisfy a consistency property derived from the soft Bellman equation, then they are the optimal policy and the optimal value function.
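
In the KL-regularized setting the paper builds on, these statements take roughly the following form (a sketch; β is the regularization coefficient, π_ref the reference policy, and transitions are deterministic as defined above):

```latex
% KL-regularized value of a policy from state s_t.
V^{\pi}(s_t) = \mathbb{E}_{\pi}\Big[\sum_{i \ge t} \Big( r(s_i, a_i)
              - \beta \log \tfrac{\pi(a_i \mid s_i)}{\pi_{\mathrm{ref}}(a_i \mid s_i)} \Big)\Big]
% Theorem 1 (soft Bellman equation): the optimal value function satisfies
V^{*}(s_t) = \beta \log \sum_{a} \pi_{\mathrm{ref}}(a \mid s_t)
             \exp\!\Big( \tfrac{r(s_t, a) + V^{*}(s_{t+1})}{\beta} \Big)
% Theorem 2 (consistency): if, for all (s_t, a_t),
V(s_t) - V(s_{t+1}) = r(s_t, a_t) - \beta \log \tfrac{\pi(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
% holds, then \pi and V are the optimal policy and the optimal value function.
```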

Connection to DPO

The paper shows how DPO can be derived from this formulation under additional assumptions: telescoping the consistency condition across time steps and plugging the result into a Bradley-Terry preference model recovers the DPO loss. However, the paper notes that DPO brings two challenges for multi-step reasoning: the need for pairwise data and the lack of fine-grained credit assignment.
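
Sketching the derivation: summing the consistency condition over all steps of a response (with the terminal value taken as zero) expresses the total reward through the policy's log-ratio, and applying a Bradley-Terry model to a preferred/dispreferred pair cancels the shared V(s_0) term, leaving the familiar DPO objective:

```latex
% Telescoping the consistency condition over a full response y given prompt x:
\sum_{t} r(s_t, a_t) = V(s_0) + \beta \sum_{t} \log \tfrac{\pi(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
                     = V(s_0) + \beta \log \tfrac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
% Bradley-Terry over a preferred response y^w and a dispreferred response y^l
% (V(s_0) depends only on the prompt and cancels), giving the DPO loss:
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}\Big[ \log \sigma\!\Big(
      \beta \log \tfrac{\pi_{\theta}(y^{w} \mid x)}{\pi_{\mathrm{ref}}(y^{w} \mid x)}
    - \beta \log \tfrac{\pi_{\theta}(y^{l} \mid x)}{\pi_{\mathrm{ref}}(y^{l} \mid x)} \Big)\Big]
```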

OREO: Offline Reasoning Optimization

Learning Objective

OREO jointly trains a policy model and a value function on the telescoped version of the soft Bellman equation, an approach closely related to PCL. An MSE loss is adopted for the value network, while the policy objective is derived from the same consistency term using the value function.
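
A minimal PyTorch-style sketch of this idea is shown below. It is not the paper's exact implementation, but it illustrates the structure: with a sparse terminal reward R, the telescoped consistency says V(s_t) should match R minus β times the cumulative future log-ratio, and the value network and policy are each trained on this residual with the other component detached.

```python
import torch

def oreo_losses(logp_policy, logp_ref, values, reward, beta=0.1):
    """Sketch of a joint policy/value objective built on the telescoped soft Bellman
    consistency (illustrative; not the paper's exact loss).

    logp_policy: (T,) log-probs of the taken actions under the current policy (requires grad)
    logp_ref:    (T,) log-probs under the frozen reference policy
    values:      (T,) value estimates V(s_t) from a separate value network (requires grad)
    reward:      scalar terminal reward for the trajectory (e.g. 1.0 if the answer is correct)
    """
    log_ratio = logp_policy - logp_ref
    # Cumulative future log-ratio  sum_{i >= t} log(pi / pi_ref), via a reversed cumsum.
    future_log_ratio = torch.flip(torch.cumsum(torch.flip(log_ratio, dims=[0]), dim=0), dims=[0])
    # Telescoped consistency target: V(s_t) should equal R - beta * future_log_ratio[t].
    target = reward - beta * future_log_ratio
    # Value loss: MSE regression of V(s_t) onto the target, holding the policy fixed.
    value_loss = (values - target.detach()).pow(2).mean()
    # Policy loss: shrink the same residual, holding the value estimates fixed.
    policy_loss = (values.detach() - target).pow(2).mean()
    return policy_loss, value_loss
```

In practice, the per-step residuals can be aggregated at the token, step, or response level, which is exactly what the loss variants described next distinguish.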

Loss Variants

The paper presents two variants: step-level OREO and response-level OREO. Step-level OREO treats an entire reasoning step as a single action, while response-level OREO mimics DPO by considering only the initial state in the policy objective.
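
Concretely (illustrative notation): in the step-level variant, the log-probability of an action aggregates all tokens within one reasoning step, while the response-level variant keeps only the term at the initial state, so the whole response is effectively treated as a single action.

```latex
% Step-level: an action a_k is a whole reasoning step; its log-probability sums over its tokens.
\log \pi_{\theta}(a_k \mid s_k) = \sum_{t \,\in\, \text{step } k} \log \pi_{\theta}(y_t \mid s_t)
% Response-level: only the t = 0 residual is used, i.e. minimize
\Big( V(s_0) - R + \beta \log \tfrac{\pi_{\theta}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \Big)^{2}
```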

Iterative OREO

Like other offline LLM fine-tuning methods, OREO can be applied iteratively: after each round of training, the updated policy model generates new data, which is then used for further training.

Test-Time Search with Value Function

Because the value function estimates expected future reward, it can be used to guide search at test time. The paper uses value-guided step-level beam search for math reasoning and selects the best of K candidate actions for embodied agent control, as sketched below.
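
Here is a hedged sketch of value-guided step-level beam search; `propose_steps`, `value_fn`, and `is_terminal` are hypothetical callables standing in for sampling candidate reasoning steps from the policy, scoring partial solutions with the learned value function, and detecting completed solutions.

```python
from typing import Callable, List

def value_guided_beam_search(
    prompt: str,
    propose_steps: Callable[[str, int], List[str]],  # hypothetical: sample candidate next steps
    value_fn: Callable[[str], float],                # hypothetical: learned value estimate V(s)
    is_terminal: Callable[[str], bool],              # hypothetical: detects a completed solution
    beam_size: int = 7,
    max_steps: int = 20,
) -> str:
    """Hedged sketch of step-level beam search guided by a learned value function."""
    beams = [prompt]
    for _ in range(max_steps):
        candidates = []
        for state in beams:
            if is_terminal(state):
                # Completed solutions stay in the candidate pool unchanged.
                candidates.append(state)
                continue
            # Extend each partial solution by one candidate reasoning step.
            for step in propose_steps(state, beam_size):
                candidates.append(state + step)
        # Keep the beam_size (partial) solutions with the highest estimated future reward.
        beams = sorted(candidates, key=value_fn, reverse=True)[:beam_size]
        if all(is_terminal(b) for b in beams):
            break
    return max(beams, key=value_fn)
```

The best-of-K selection used for embodied agent control corresponds to the simpler case of keeping only the single highest-value candidate action at each step.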

Experiments

Datasets and Evaluation Metrics

The method is evaluated on GSM8K and MATH for math reasoning and on ALFWorld for embodied agent control. GSM8K is a dataset of grade school math problems, while MATH is a dataset of competition-level math problems. ALFWorld provides interactive TextWorld environments for household tasks.

Base Models

The base models are Qwen2.5-Math-1.5B and DeepSeekMath-7B-Instruct for the math reasoning tasks, and MiniCPM-2B-dpo-bf16 for the embodied agent task.

Baseline Methods

The method is compared to supervised fine-tuning (SFT) and three other baselines: rejection sampling, DPO, and KTO.

Results

Main Results

On mathematical reasoning, OREO consistently outperforms all baselines on both GSM8K and MATH and across model families. For Qwen2.5-Math-1.5B, OREO achieves a 5.2% relative improvement over SFT on GSM8K and a 10.5% relative improvement on MATH. For DeepSeekMath-7B, the relative gains are 3.6% on GSM8K and 5.1% on MATH. On the embodied control task (ALFWorld), OREO outperforms all baselines, especially in unseen settings (a 17.7% relative improvement over the baseline).

Iterative OREO

OREO shows steady, consistent accuracy gains over multiple iterations, while the baselines saturate. This is likely due to OREO's ability to learn from failed trajectories and incorporate them into training.

Implicit vs Explicit Value Functions

OREO benefits from explicitly parameterizing a separate value function. The paper presents case studies comparing this explicit value function with the implicit value function derived from the policy model. The explicit value function distinguishes correct from incorrect reasoning steps more reliably, especially in more challenging scenarios, and the advantage function built on it is also better at identifying the correct reasoning step.

Test-Time Search with Value Functions

Using the explicit value function in test-time search (e.g. beam search) leads to significant performance gains over greedy decoding: beam search with B = 7 gives an 11.4% improvement on GSM8K and a 17.9% improvement on MATH. Similarly, on ALFWorld, success rates improve as K increases when selecting the best of K actions with the value function.

Business and Research Implications

The potential of OREO and similar methods goes beyond the immediate task of enhancing language model reasoning. While the paper demonstrates significant improvements on benchmark datasets, the implications for the wider business world could be immense. Consider the automation of complex processes in which LLMs analyze nuanced problems and devise step-by-step solutions: financial analysis, logistics, or any other area that demands critical thinking and multi-step planning.

For researchers, this work points to a new direction: training techniques that more accurately approximate and leverage a value function. The gap between the policy and value networks is another area that could yield significant research insights, and further refinement of the method could make LLM fine-tuning considerably more computationally efficient.

Conclusion

OREO is an offline RL algorithm that enhances LLM multi-step reasoning and embodied agent control. It leverages the soft Bellman equation from entropy-regularized RL, training an explicit value function together with the LLM policy. This removes DPO's need for pairwise data and enables fine-grained credit assignment. The value function can also be used in test-time search to further improve results. OREO consistently outperforms previous offline RLHF methods such as DPO.

What Makes This Work?

OREO's key insight is how it addresses the limitations of DPO in multi-step reasoning tasks: by directly optimizing the soft Bellman equation, it obtains a more granular credit assignment mechanism. This lets the algorithm attribute the final outcome to specific actions, which is crucial when only the terminal state carries a reward. Unlike DPO, which optimizes over entire trajectories, OREO can capture the contribution of each individual step and use it during learning. It also removes the requirement for pairwise preference data, which is hard to collect in complex reasoning tasks, making the method more practical and efficient for real-world applications. Finally, having an explicit value function to guide search at test time, without complex heuristics or extensive data engineering, is another key reason the method works so well.
