Training Language Models to Self-Correct via Reinforcement Learning

AI-generated keywords: Large Language Models, Self-Correction, Reinforcement Learning, SCoRe, Gemini 1.0 Pro

AI-generated Key Points

  • Self-correction is a highly desirable capability, yet it has consistently been found to be largely ineffective in modern Large Language Models (LLMs)
  • Existing methods for training self-correction often require multiple models or supervision from a more advanced model
  • SCoRe is a multi-turn online reinforcement learning approach that enhances LLMs' self-correction ability using self-generated data
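  • Variants of supervised fine-tuning (SFT) on offline, model-generated correction traces are insufficient: they either suffer from a distribution mismatch with the model's own responses or favor a single mode of correction that is often ineffective at test time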
  • SCoRe trains on the model's own distribution of self-generated correction traces and implements appropriate regularization to guide the learning process
  • SCoRe first runs a phase of RL on the base model to obtain a policy initialization that is less susceptible to collapse, then uses a reward bonus to amplify self-correction during training
  • When applied to Gemini 1.0 Pro and 1.5 Flash models, SCoRe improved the base models' self-correction by 15.6% and 9.1%, respectively, on the MATH and HumanEval benchmarks
  • Online RL on entirely self-generated data can train self-correction without multiple models or a more capable teacher, reaching state-of-the-art self-correction performance

Authors: Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust

License: CC BY 4.0

Abstract: Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.
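The abstract mentions a reward bonus that amplifies self-correction but does not give its form. As a minimal sketch, assuming a two-attempt episode with a scalar correctness reward per attempt, one plausible shaping adds a term proportional to the change in reward between the first and second attempt. The names `TwoTurnEpisode`, `shaped_second_turn_reward`, and the weight `alpha` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class TwoTurnEpisode:
    """One prompt answered twice; rewards are assumed to be 0.0 or 1.0."""
    first_reward: float   # correctness reward of the first attempt
    second_reward: float  # correctness reward of the revised attempt


def shaped_second_turn_reward(ep: TwoTurnEpisode, alpha: float = 1.0) -> float:
    """Second-attempt reward plus a hypothetical progress bonus.

    The bonus is positive when the revision fixes a wrong answer,
    negative when it breaks a correct one, and zero when nothing changes.
    This form is an assumption made for illustration only.
    """
    bonus = alpha * (ep.second_reward - ep.first_reward)
    return ep.second_reward + bonus


if __name__ == "__main__":
    cases = {
        "wrong -> right": TwoTurnEpisode(0.0, 1.0),
        "right -> right": TwoTurnEpisode(1.0, 1.0),
        "right -> wrong": TwoTurnEpisode(1.0, 0.0),
        "wrong -> wrong": TwoTurnEpisode(0.0, 0.0),
    }
    for name, ep in cases.items():
        print(name, shaped_second_turn_reward(ep, alpha=2.0))
```

Under this assumed shaping, a policy that merely repeats a correct first answer earns less than one that turns a wrong answer into a correct one, which is one way a bonus could steer training toward actual correction rather than toward fitting high-reward responses.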

Submitted to arXiv on 19 Sep. 2024


Results of the summarizing process for the arXiv paper: 2409.12917v1

Self-correction is a highly desirable capability for Large Language Models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing methods for training it either require multiple models or depend on supervision from a more capable model. To address this, the authors develop SCoRe, a multi-turn online reinforcement learning (RL) approach that improves an LLM's self-correction ability using entirely self-generated data.

The authors first show that variants of supervised fine-tuning (SFT) on offline, model-generated correction traces are insufficient for instilling self-correction. SFT either suffers from a distribution mismatch between the training data and the model's own responses, or implicitly prefers a single mode of correction behavior that is often ineffective at test time. SCoRe addresses both problems by training under the model's own distribution of self-generated correction traces and by regularizing the learning process so that the model learns a self-correction strategy that works at test time, rather than simply fitting high-reward responses for a given prompt. The regularization has two parts: a first phase of RL on the base model produces a policy initialization that is less susceptible to collapse, and a reward bonus then amplifies self-correction during training.

Applied to Gemini 1.0 Pro and 1.5 Flash, SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1%, respectively, on the MATH and HumanEval benchmarks. The results indicate that RL on self-generated data can train self-correction without multiple models or a more capable teacher.
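The summary reports self-correction gains of 15.6% and 9.1% without stating how the gain is measured. A minimal sketch, assuming the gain is read as second-attempt accuracy minus first-attempt accuracy over the same set of problems, is given below. The function name `self_correction_stats` and the example data are hypothetical; the numbers are illustrative and are not results from the paper.

```python
from typing import Dict, Sequence


def self_correction_stats(
    first_correct: Sequence[bool], second_correct: Sequence[bool]
) -> Dict[str, float]:
    """Summarize two-attempt results under an assumed definition of gain.

    first_correct[i] / second_correct[i] say whether attempt 1 / attempt 2
    on problem i was correct. All values returned are fractions of problems.
    """
    if not first_correct or len(first_correct) != len(second_correct):
        raise ValueError("need two non-empty sequences of equal length")
    n = len(first_correct)
    pairs = list(zip(first_correct, second_correct))
    acc_first = sum(a for a, _ in pairs) / n
    acc_second = sum(b for _, b in pairs) / n
    fixed = sum((not a) and b for a, b in pairs) / n    # wrong -> right
    broken = sum(a and (not b) for a, b in pairs) / n   # right -> wrong
    return {
        "acc_first": acc_first,
        "acc_second": acc_second,
        "gain": acc_second - acc_first,  # equals fixed - broken
        "fixed": fixed,
        "broken": broken,
    }


if __name__ == "__main__":
    # Made-up outcomes for four problems, for illustration only.
    stats = self_correction_stats(
        [True, False, False, True], [True, True, False, False]
    )
    for key, value in stats.items():
        print(f"{key}: {value:.2f}")
```

Splitting the gain into fixed and broken fractions shows whether a model's revisions repair wrong answers or mostly disturb answers that were already correct, which a single accuracy figure would hide.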
Created on 22 Sep. 2024

