SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

AI-generated keywords: Generalization, Memorization, Supervised Fine-Tuning, Reinforcement Learning, Foundation Models

AI-generated Key Points

  • Study by Tianzhe Chu et al. compares effects of supervised fine-tuning (SFT) and reinforcement learning (RL) on model generalization capabilities
  • Focus on text-based rule variants and visual variants to explore impact of post-training techniques on generalization and memorization
  • Introduces GeneralPoints, an arithmetic reasoning card game, and adopts V-IRL, a real-world navigation environment, for assessment
  • RL excels at generalizing across rule-based textual and visual variants with outcome-based reward system
  • SFT tends to memorize training data, struggles with out-of-distribution scenarios
  • RL enhances model's visual recognition capabilities, leading to improved generalization in visual domain
  • SFT crucial for effective RL training as it stabilizes model's output format for subsequent improvements
  • RL acquires generalizable knowledge in complex multi-modal tasks, highlighting adaptability to diverse scenarios

Authors: Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, Yi Ma

Website at https://tianzhechu.com/SFTvsRL
License: CC BY 4.0

Abstract: Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.
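To make the idea of an outcome-based reward on a rule variant more concrete, here is a minimal Python sketch of how a verifier for an arithmetic card game might score a model's answer. It is not the authors' implementation. The target number 24, the two face-card rules, and the names `RULES`, `card_value`, and `outcome_reward` are all assumptions introduced for illustration; the page above only says that GeneralPoints is an arithmetic reasoning card game with text-based rule variants.

```python
import ast
from collections import Counter

# Hypothetical rule variants (assumption): how face cards map to numbers.
RULES = {
    "faces_as_10": {"J": 10, "Q": 10, "K": 10},
    "faces_as_11_12_13": {"J": 11, "Q": 12, "K": 13},
}


def card_value(card, rule):
    """Map a card label such as 'A', '7', or 'K' to its number under a rule."""
    if card == "A":
        return 1
    if card in RULES[rule]:
        return RULES[rule][card]
    return int(card)


def _numbers(node):
    """Collect integer literals; reject anything except + - * / arithmetic."""
    if isinstance(node, ast.Expression):
        return _numbers(node.body)
    if isinstance(node, ast.BinOp) and isinstance(
        node.op, (ast.Add, ast.Sub, ast.Mult, ast.Div)
    ):
        return _numbers(node.left) + _numbers(node.right)
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return [node.value]
    raise ValueError("unsupported expression")


def outcome_reward(cards, equation, rule, target=24):
    """Return 1.0 only if the equation uses each card once and hits the target.

    Only the final outcome is scored; intermediate reasoning is ignored.
    The default target of 24 is an assumption.
    """
    try:
        tree = ast.parse(equation, mode="eval")
        used = _numbers(tree)  # also validates that the expression is safe
        value = eval(compile(tree, "<equation>", "eval"), {"__builtins__": {}})
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0
    expected = Counter(card_value(c, rule) for c in cards)
    if Counter(used) != expected:
        return 0.0
    return 1.0 if abs(value - target) < 1e-9 else 0.0


if __name__ == "__main__":
    hand = ["K", "3", "2", "A"]
    print(outcome_reward(hand, "13*2-3+1", "faces_as_11_12_13"))
    print(outcome_reward(hand, "13*2-3+1", "faces_as_10"))
    print(outcome_reward(hand, "10*2+3+1", "faces_as_10"))
```

In this sketch the same hand needs a different equation under each rule: `13*2-3+1` scores 1.0 when K counts as 13 but 0.0 when K counts as 10, where `10*2+3+1` is the valid answer. An answer pattern memorized under one rule therefore earns no reward under the other, which is the kind of out-of-distribution check the abstract describes.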

Submitted to arXiv on 28 Jan. 2025

Results of the summarizing process for the arXiv paper: 2501.17161v1

In the study "SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training," Tianzhe Chu et al. investigate the effects of supervised fine-tuning (SFT) and reinforcement learning (RL) on model generalization capabilities. The researchers focus on text-based rule variants and visual variants to explore how these post-training techniques impact generalization and memorization. To assess their impact, they introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment. The results show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. In contrast, SFT tends to memorize training data and struggles with out-of-distribution scenarios. Further analysis reveals that RL enhances the model's underlying visual recognition capabilities, leading to improved generalization in the visual domain. While RL shows superior performance in generalization, SFT remains crucial for effective RL training because it stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. This study highlights RL's ability to acquire generalizable knowledge in complex multi-modal tasks and sheds light on how SFT and RL affect foundation models' adaptability to diverse scenarios.
Created on 12 Feb. 2025
