Large Language Models (LLMs) have shown remarkable capabilities in complex tasks requiring multi-step reasoning, such as mathematical problem-solving and embodied agent control. Reinforcement learning (RL) offers a promising avenue for self-improvement in LLMs, but many RL algorithms require costly online data collection. Offline RL methods, like Direct Preference Optimization (DPO), provide a more practical approach by utilizing pre-existing datasets. However, DPO has limitations in multi-step reasoning tasks, including the need for pairwise preference data and ineffective credit assignment due to uniform token treatment.
Reinforcement Learning for LLMs has become a standard part of post-training. RLHF methods built on PPO have been widely adopted, and alternatives such as rejection sampling and preference-based RL have gained traction in the LLM literature. Maximum-entropy RL and path consistency learning (PCL) also play a key role in this line of work.
LLM Reasoning has been enhanced through methods such as chain-of-thought prompting and supervised fine-tuning. Rejection sampling, STaR, and related methods have improved reasoning when human-annotated trajectories are unavailable. While RL algorithms are increasingly used to improve LLM reasoning, direct application of DPO has not been entirely successful, partly because it requires collecting pairwise preference data.
The paper defines a Markov Decision Process (MDP) for LLM reasoning in which the state is the token sequence generated so far (including the prompt), the action at each time step is the next token, and the transition function deterministically appends the chosen token to the state. The reward function is typically non-zero only at the terminal step, indicating whether the task was completed successfully.
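As a concrete illustration, here is a minimal sketch of such a token-level MDP; the class and function names, and the external verifier used for the terminal reward, are assumptions for illustration rather than the paper's code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class State:
    tokens: List[int]  # prompt tokens followed by the tokens generated so far


def transition(state: State, action: int) -> State:
    # Deterministic transition: the next state is simply the current
    # token sequence with the newly generated token appended.
    return State(tokens=state.tokens + [action])


def reward(state: State, action: int, is_terminal: bool,
           verifier: Callable[[List[int]], bool]) -> float:
    # Sparse reward: zero everywhere except the terminal step, where a
    # (hypothetical) verifier checks whether the task was completed.
    if not is_terminal:
        return 0.0
    return 1.0 if verifier(state.tokens + [action]) else 0.0
```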
Based on entropy-regularized reinforcement learning, the paper defines a value function that quantifies the expected KL-regularized return of a policy from any given state. Theorem 1 states that the optimal policy and its value function satisfy the soft Bellman equation. Theorem 2 states the converse: if a policy and a state value function satisfy a consistency property derived from the soft Bellman equation, then they are the optimal policy and optimal value function.
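In reconstructed notation (the symbols below follow the standard KL-regularized RL setup and are assumptions about the paper's exact formulation), with reference policy \(\pi_{\mathrm{ref}}\) and regularization coefficient \(\beta\), the value function and the one-step consistency read

\[
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{t \ge 0} \Big( r(s_t, a_t) - \beta \log \frac{\pi(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)} \Big) \,\middle|\, s_0 = s \right],
\]
\[
V^{*}(s_t) \;=\; r(s_t, a_t) \;-\; \beta \log \frac{\pi^{*}(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)} \;+\; V^{*}(s_{t+1}),
\]

where the second identity holds for every action because the transitions are deterministic.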
The paper then shows how DPO can be derived from this formulation under additional assumptions: telescoping the consistency condition across time steps and plugging the result into a Bradley-Terry preference model recovers the DPO loss. However, the paper notes that DPO poses two challenges for multi-step reasoning: the need for pairwise data and the lack of fine-grained credit assignment.
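For reference, the telescoping step can be made explicit (with the same reconstructed notation as above). Summing the one-step consistency over a full response y to prompt x cancels the intermediate values; with a sparse terminal reward, the total reward equals \(V^{*}(x) + \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\), and since \(V^{*}(x)\) depends only on the prompt it drops out of the Bradley-Terry comparison between a preferred response \(y_w\) and a dispreferred one \(y_l\), yielding the familiar DPO loss

\[
\mathcal{L}_{\mathrm{DPO}}
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
\log \sigma\!\Big(
\beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\Big)\right].
\]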
OREO instead jointly trains a policy and a value function to satisfy the telescoped soft Bellman equation, an approach closely related to PCL. An MSE loss is adopted for the value network, while the policy objective is derived from the same consistency condition using the value estimates.
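A minimal PyTorch-style sketch of what such a joint objective could look like is shown below. It assumes a PCL-style residual between the value estimate and the KL-adjusted return-to-go; the function name, the exact stop-gradient placement, and the use of a squared residual for the policy term are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def oreo_style_losses(values, logp_policy, logp_ref, final_reward, beta=0.1):
    """Sketch of consistency-based value and policy losses for one trajectory.

    values:       (T+1,) tensor, V_phi(s_0..s_T) from the value network
    logp_policy:  (T,) tensor, log pi_theta(a_t | s_t)
    logp_ref:     (T,) tensor, log pi_ref(a_t | s_t) from a frozen reference model
    final_reward: scalar sparse reward at the terminal step (e.g. 1.0 if correct)
    """
    # Telescoped consistency target: V*(s_t) = R - beta * sum_{i>=t} log(pi/pi_ref),
    # since the sparse reward contributes the same total R from every step onward.
    log_ratio = logp_policy - logp_ref
    tail_ratio = torch.flip(torch.cumsum(torch.flip(log_ratio, dims=[0]), dim=0), dims=[0])
    target = final_reward - beta * tail_ratio

    # Value loss: MSE between V_phi(s_t) and the target, with the policy term
    # detached so this term only updates the value network.
    value_loss = F.mse_loss(values[:-1], target.detach())

    # Policy loss: drive the same residual to zero, this time detaching the
    # value estimates so gradients flow only into the policy.
    residual = values[:-1].detach() - target
    policy_loss = (residual ** 2).mean()

    return value_loss, policy_loss
```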
The paper presents two variants: step-level OREO and response-level OREO. Step-level OREO treats an entire reasoning step as a single action, while response-level OREO mimics DPO by applying the action objective only at the initial state.
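Schematically, and again with reconstructed (assumed) notation: in step-level OREO the action \(a_t\) is an entire reasoning step, so \(\log \pi_\theta(a_t \mid s_t) = \sum_{k \in \text{step } t} \log \pi_\theta(y_k \mid y_{<k}, x)\) and the consistency residual is enforced at every step boundary, whereas in response-level OREO only the \(t = 0\) term is kept, so the policy objective involves \(\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\) for the whole response, the same trajectory-level quantity that DPO optimizes.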
Offline fine-tuning can also be applied iteratively: in each round, the updated policy model generates new trajectories, which are added to the dataset and used for further training.
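A schematic outer loop might look like the sketch below; the helper names (`generate`, `verifier`, `train_fn`), the sampling budget, and the number of rounds are hypothetical placeholders.

```python
def iterative_offline_training(policy, value_fn, prompts, verifier, train_fn,
                               num_iterations=3, samples_per_prompt=8):
    """Hypothetical outer loop for iterative offline fine-tuning.

    Each round, the current policy samples new trajectories, correctness is
    checked by a verifier, and the offline objective is re-optimized on the
    enlarged dataset (failed trajectories are kept as well).
    """
    dataset = []
    for _ in range(num_iterations):
        for prompt in prompts:
            for _ in range(samples_per_prompt):
                trajectory = policy.generate(prompt)
                reward = 1.0 if verifier(prompt, trajectory) else 0.0
                dataset.append((prompt, trajectory, reward))
        # Re-run the offline RL objective (e.g. the OREO-style losses) on all data.
        policy, value_fn = train_fn(policy, value_fn, dataset)
    return policy, value_fn
```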
The value function estimates future reward and can therefore be used to guide search at test time. The method uses step-level beam search for math reasoning and selects the best of K sampled actions for embodied agent control.
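The sketch below illustrates the general idea of value-guided, step-level beam search; `sample_step`, `is_finished`, and the specific beam sizes are hypothetical placeholders rather than the paper's interface.

```python
def step_level_beam_search(policy, value_fn, prompt, beam_size=7,
                           candidates_per_beam=4, max_steps=10):
    """Hypothetical value-guided step-level beam search.

    At each step, every partial solution proposes several candidate reasoning
    steps; the learned value function scores the resulting states and only the
    top `beam_size` partial solutions are kept.
    """
    beams = [prompt]
    for _ in range(max_steps):
        candidates = []
        for partial in beams:
            if policy.is_finished(partial):
                candidates.append(partial)
                continue
            for _ in range(candidates_per_beam):
                step = policy.sample_step(partial)  # one reasoning step, not one token
                candidates.append(partial + step)
        # Rank partial solutions by the value network's estimate of future reward.
        candidates.sort(key=value_fn, reverse=True)
        beams = candidates[:beam_size]
        if all(policy.is_finished(b) for b in beams):
            break
    return beams[0]
```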
The method is evaluated on GSM8K and MATH for math reasoning and on ALFWorld for embodied agent control. GSM8K is a dataset of grade school math problems, while MATH is a dataset of competition-level math problems. ALFWorld provides interactive TextWorld environments for household tasks.
The base models are Qwen2.5-Math-1.5B and DeepSeekMath-7B-Instruct for the math reasoning tasks, and MiniCPM-2B-dpo-bf16 for the embodied agent task.
The method is compared against supervised fine-tuning (SFT) and three other baselines: rejection sampling, DPO, and KTO.
On mathematical reasoning, OREO consistently outperforms all baselines across both GSM8K and MATH and across model families. For Qwen2.5-Math-1.5B, OREO achieves a 5.2% relative improvement over SFT on GSM8K and a 10.5% improvement on MATH. For DeepSeekMath-7B, it delivers meaningful gains as well, with relative improvements of 3.6% on GSM8K and 5.1% on MATH. On the embodied control task (ALFWorld), OREO outperforms all baselines, especially in unseen environments (a 17.7% relative improvement over the baseline).
OREO demonstrates steady, consistent accuracy gains over multiple iterations, while the baselines saturate. This is likely because OREO can learn from failed trajectories rather than discarding them.
OREO benefits from explicitly parameterizing a separate value function. The paper presents case studies comparing this explicit value function with the implicit value function derived from the policy model. The explicit value function better distinguishes correct from incorrect reasoning steps, especially in more challenging problems, and the advantage function computed from it is also better at identifying the correct reasoning step.
Using the explicit value function for test-time search (e.g., beam search) leads to significant performance gains over greedy decoding. Beam search with B = 7 gives an 11.4% improvement on GSM8K and a 17.9% improvement on MATH. Similarly, on ALFWorld, success rates improve as K increases when selecting the best of K actions using the value function.
The potential of OREO and similar methods goes beyond the immediate task of enhancing language model reasoning. While the paper demonstrates significant improvements on benchmark datasets, the implications for the wider business world could be immense: consider the automation of complex processes in which LLMs analyze nuanced problems and devise step-by-step solutions, whether in financial analysis, logistics, or any other area that demands critical thinking and complex planning.
For researchers, this work points to a new direction: training techniques that more accurately approximate and leverage a value function. The gap between the policy and value networks is another area that could yield significant research insights, and further refinement of this method could make LLM fine-tuning considerably more computationally efficient.
OREO is an offline RL algorithm that enhances LLM reasoning and embodied agent performance. It builds on soft Q-learning, training an explicit value function together with the LLM policy. This removes DPO's need for pairwise data and enables fine-grained credit assignment, and the value function can also be used in test-time search for better results. OREO consistently outperforms previous offline RLHF methods such as DPO.
OREO's key insight lies in how it addresses DPO's limitations in multi-step reasoning tasks. By directly optimizing the soft Bellman equation, it obtains a more granular credit-assignment mechanism: the algorithm can attribute an outcome to the specific actions that produced it, which is crucial when only the final result is observed. Unlike methods such as DPO that optimize over entire trajectories, OREO can therefore capture the contribution of each step and incorporate it into learning. It also removes the requirement for pairwise preference data, which is typically hard to collect for complex reasoning tasks, making the method more practical and efficient for real-world applications. Finally, having an explicit value function to guide search at test time, without complex heuristics or extensive data engineering, is another key reason the method works so well.