Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

AI-generated keywords: Process Reward Models

AI-generated Key Points

  • Authors explore the use of process reward models (PRMs) to enhance reasoning in large language models
  • PRMs offer feedback at each step of a multi-step reasoning trace, improving credit assignment compared to outcome reward models (ORMs)
  • Designing process rewards effectively involves measuring progress by evaluating the change in likelihood of producing a correct response before and after taking a step
  • Progress should be evaluated under a prover policy distinct from the base policy, enhancing exploration during test-time search and online RL
  • Process advantage verifiers (PAVs) are trained to predict progress under such provers; test-time search against PAVs is over 8% more accurate and 1.5 to 5 times more compute-efficient than search against ORMs
  • Online RL with dense rewards from PAVs achieves a 5 to 6 times gain in sample efficiency and an accuracy gain of over 6% compared to ORMs

Authors: Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, Aviral Kumar

License: CC BY 4.0

Abstract: A promising approach for improving reasoning in large language models is to use process reward models (PRMs). PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs) that only provide feedback at the final step. However, collecting dense, per-step human labels is not scalable, and training PRMs from automatically-labeled data has thus far led to limited gains. To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: "How should we design process rewards?". Our key insight is that, to be effective, the process reward for a step should measure progress: a change in the likelihood of producing a correct response in the future, before and after taking the step, corresponding to the notion of step-level advantages in RL. Crucially, this progress should be measured under a prover policy distinct from the base policy. We theoretically characterize the set of good provers and our results show that optimizing process rewards from such provers improves exploration during test-time search and online RL. In fact, our characterization shows that weak prover policies can substantially improve a stronger base policy, which we also observe empirically. We validate our claims by training process advantage verifiers (PAVs) to predict progress under such provers, and show that compared to ORMs, test-time search against PAVs is $>8\%$ more accurate, and $1.5-5\times$ more compute-efficient. Online RL with dense rewards from PAVs enables one of the first results with $5-6\times$ gain in sample efficiency, and $>6\%$ gain in accuracy, over ORMs.
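The notion of progress described in the abstract can be written as a step-level advantage. The block below is a sketch in standard RL notation, not a formula quoted from the abstract: the symbols $\mu$ (prover policy), $s_h$ (the question plus the steps taken so far), and $a_h$ (the candidate step) are assumptions introduced here for illustration.

```latex
% Assumed notation: \mu is the prover policy, s_h the prefix before step h,
% a_h the candidate step, and "correct" the event that the final answer is right.
\begin{align*}
Q^{\mu}(s_h, a_h) &= \Pr\big[\text{correct} \mid s_h, a_h,\ \text{then continue with } \mu\big] \\
V^{\mu}(s_h)      &= \mathbb{E}_{a \sim \mu(\cdot \mid s_h)}\big[Q^{\mu}(s_h, a)\big] \\
A^{\mu}(s_h, a_h) &= Q^{\mu}(s_h, a_h) - V^{\mu}(s_h)
\end{align*}
```

Under this reading, a step earns a positive process reward when the prover is more likely to reach a correct answer after the step than before it, and a negative reward when the step lowers that likelihood.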

Submitted to arXiv on 10 Oct. 2024


Results of the summarizing process for the arXiv paper: 2410.08146v1

In their paper titled "Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning," authors Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar explore the use of process reward models (PRMs) to improve reasoning in large language models. PRMs give feedback at each step of a multi-step reasoning trace, which can improve credit assignment compared to outcome reward models (ORMs) that only give feedback at the final step. However, collecting dense per-step human labels does not scale, and training PRMs from automatically labeled data has so far led to limited gains.

The authors therefore ask how process rewards should be designed when the goal is to improve a base policy, either by running search against a PRM or by using the PRM as a dense reward for reinforcement learning (RL). Their key insight is that the process reward for a step should measure progress: the change in the likelihood of eventually producing a correct response, before and after taking the step. This corresponds to the notion of step-level advantages in RL. Importantly, progress should be measured under a prover policy that is distinct from the base policy. The authors theoretically characterize the set of good provers and show that optimizing process rewards from such provers improves exploration during test-time search and online RL. Their characterization also shows that weak prover policies can substantially improve a stronger base policy, which they observe empirically as well.

To validate these claims, they train process advantage verifiers (PAVs) to predict progress under such provers. Compared to ORMs, test-time search against PAVs is over 8% more accurate and 1.5 to 5 times more compute-efficient. Online RL with dense rewards from PAVs yields a 5 to 6 times gain in sample efficiency and an accuracy gain of over 6% over ORMs, which the authors describe as one of the first results of this kind.

Overall, this research highlights the importance of designing process rewards around progress and shows that automated process verifiers such as PAVs can improve both test-time search and online RL.
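To make the idea of "progress under a prover" concrete, here is a minimal Python sketch that estimates it by Monte Carlo rollouts. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy prover, and the choice of Monte Carlo estimation are all hypothetical, and the paper trains PAVs to predict this quantity rather than estimating it at search time.

```python
import random
from typing import Callable, Sequence

# Hypothetical interface: a rollout takes a prefix of reasoning steps and a
# random generator, lets the prover finish the solution, and reports whether
# the final answer was correct.
Rollout = Callable[[Sequence[str], random.Random], bool]


def estimate_success_rate(prefix: Sequence[str], rollout: Rollout,
                          n_samples: int, rng: random.Random) -> float:
    """Monte Carlo estimate of the prover's chance of a correct final answer."""
    wins = sum(rollout(prefix, rng) for _ in range(n_samples))
    return wins / n_samples


def step_progress(prefix: Sequence[str], step: str, rollout: Rollout,
                  n_samples: int = 2000, seed: int = 0) -> float:
    """Progress of `step`: prover success rate after the step minus before it."""
    rng = random.Random(seed)
    before = estimate_success_rate(list(prefix), rollout, n_samples, rng)
    after = estimate_success_rate(list(prefix) + [step], rollout, n_samples, rng)
    return after - before


def toy_prover_rollout(prefix: Sequence[str], rng: random.Random) -> bool:
    """Toy stand-in for a prover (assumption): each "good" step raises the
    chance of success by 0.3 and each "bad" step lowers it by 0.1."""
    p = 0.2 + 0.3 * list(prefix).count("good") - 0.1 * list(prefix).count("bad")
    p = min(1.0, max(0.0, p))
    return rng.random() < p


if __name__ == "__main__":
    print("progress of a good step:", step_progress([], "good", toy_prover_rollout))
    print("progress of a bad step:", step_progress([], "bad", toy_prover_rollout))
```

In this toy setting a helpful step receives a positive score and a harmful step a negative one, which is the per-step signal the summary describes. A real prover would be a language model policy, and each rollout would require sampling and grading a full completion.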
Created on 11 Oct. 2024

