IPO: Your Language Model is Secretly a Preference Classifier
AI-generated Key Points
- Implicit Preference Optimization (IPO) is a novel approach in reinforcement learning from human feedback (RLHF)
- IPO aims to reduce reliance on external feedback or reward models by using generative LLMs as preference classifiers
- IPO achieved comparable performance to state-of-the-art reward models in obtaining preferences, as shown in a comprehensive evaluation using RewardBench
- IPO outperformed the self-rewarding approach, especially in smaller models, demonstrating robustness and consistency across various tasks and model sizes
- Instruction-tuned models were effective as preference classifiers, with Qwen being a top performer among code-specific models
- Overall, IPO offers a promising alternative approach to RLHF with high performance levels across different tasks and model sizes
Authors: Shivank Garg, Ayush Singh, Shweta Singh, Paras Chopra
Abstract: Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models (LLMs) with human preferences. While it enables LLMs to achieve human-level alignment, it often incurs significant computational and financial costs due to its reliance on training external reward models or human-labeled preferences. In this work, we propose Implicit Preference Optimization (IPO), an alternative approach that leverages generative LLMs as preference classifiers, thereby reducing the dependence on external human feedback or reward models to obtain preferences. We conduct a comprehensive evaluation on the preference classification ability of LLMs using RewardBench, assessing models across different sizes, architectures, and training levels to validate our hypothesis. Furthermore, we investigate the self-improvement capabilities of LLMs by generating multiple responses for a given instruction and employing the model itself as a preference classifier for Direct Preference Optimization (DPO)-based training. Our findings demonstrate that models trained through IPO achieve performance comparable to those utilizing state-of-the-art reward models for obtaining preferences.
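The self-improvement loop described in the abstract (generate several responses, let the model itself judge them, train with DPO) can be illustrated with a minimal sketch. The abstract does not say how the classifier's judgment is turned into a number, so the `Scorer` type below is a hypothetical stand-in (for example, the probability the model assigns to answering "Yes" when asked whether a response is good). Picking the highest- and lowest-scored responses as the pair is likewise an assumption, not necessarily the authors' procedure. The `dpo_loss` function is the standard per-example DPO objective, not something specific to this paper.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical scorer: maps (instruction, response) to a preference score.
# How the score is obtained from the LLM is an assumption; the abstract
# only says the model acts as a preference classifier.
Scorer = Callable[[str, str], float]


def build_preference_pair(
    instruction: str, responses: List[str], score: Scorer
) -> Tuple[str, str]:
    """Return (chosen, rejected): the highest- and lowest-scored responses."""
    if len(responses) < 2:
        raise ValueError("need at least two responses to form a pair")
    ranked = sorted(responses, key=lambda r: score(instruction, r))
    return ranked[-1], ranked[0]


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard per-example DPO loss: -log(sigmoid(beta * margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Toy usage with a lookup table in place of a real model.
toy_scores = {"answer A": 0.2, "answer B": 0.9, "answer C": 0.5}
chosen, rejected = build_preference_pair(
    "some instruction", list(toy_scores), lambda _, r: toy_scores[r]
)
```

In a real pipeline the sequence log-probabilities passed to `dpo_loss` would come from the policy being trained and a frozen reference copy; here they are plain floats so the sketch stays self-contained.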