Model-based evaluation is crucial for successful model development. It serves both as a reward model for training and as a substitute for human evaluation. Traditionally, training an evaluator involves collecting a large number of human preference judgments over model responses. This is costly, and the data goes stale as models improve. To improve evaluators without relying on human annotations, this work introduces an approach that uses synthetic training data exclusively. The process begins with unlabeled instructions and applies an iterative self-improvement scheme that generates contrasting model outputs. An LLM-as-a-Judge is then trained to produce reasoning traces and final judgments over these outputs, and training is repeated at each iteration using the improved predictions.
Without any labeled preference data, the Self-Taught Evaluator improves a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (or 88.7 with majority vote) on RewardBench. This surpasses commonly used LLM judges such as GPT-4 and matches the performance of top-performing reward models trained with labeled examples.
The study also examines the effect of combining synthetic preference data with human-labeled preference data. When synthetic preferences generated from WildChat prompts are merged with the human-labeled HelpSteer2 dataset, the models perform strongly across different mixing weights. Finally, an analysis of instruction complexity compares the length distribution of the curated training set with that of the full dataset before instruction selection. The curated dataset contains more complex instructions involving logical reasoning and science, while the full dataset leans towards relationships and entertainment content.
In conclusion, this scalable approach builds a generalist evaluator for LLM outputs from synthetic preferences alone, without relying on human annotation. Iterative training over synthetic preferences substantially improves the accuracy of a strong seed LLM (Llama3-70B-Instruct), setting a new benchmark among generative LLM-as-a-Judge methods.
- Model-based evaluation is crucial for successful model development
- Traditionally, training evaluators involves collecting human preference judgments, which is costly and yields data that goes stale as models improve
- A new approach that uses synthetic training data exclusively improves evaluators without relying on human annotations
- The Self-Taught Evaluator improves substantially without any labeled preference data
- Combining synthetic preference data with human-labeled preference data gives strong performance across different mixing weights
- An analysis of instruction complexity shows how the curated dataset differs from the full dataset before instruction selection
- The approach is scalable and builds a generalist evaluator for LLM outputs from synthetic preferences, without relying on human annotation
Summary
- Model-based evaluation is important for making models better.
- Traditional ways of training evaluators involve asking people which responses they prefer, which is expensive and gives information that goes out of date.
- A new method uses model-generated training data to improve evaluators without needing human input.
- The Self-Taught Evaluator makes a judge model more accurate without using preference data labeled by people.
- Mixing model-generated and human-labeled preference data also makes evaluators perform well.
Definitions
- Model-based evaluation: Using a model, rather than a person, to judge how good another model's outputs are.
- Synthetic training data: Data generated by a model and used for training in place of data collected from people.
- Preference judgments: Choices made by people about which of several responses they prefer.
- Labeled preference data: Pairs of responses that humans have marked to show which one is better.
Model-based evaluation has become an essential part of successful model development. It serves both as a reward model for training and as a substitute for human evaluation, so its accuracy directly affects the quality of the models built with it. However, the traditional way of training evaluators involves collecting large amounts of human preference judgments over model responses, which is costly and produces data that goes stale as models continue to improve. To address this, researchers have introduced an approach that uses synthetic training data exclusively.
The paper, "Self-Taught Evaluators", explores synthetic preferences as a way to improve evaluators without relying on human annotations. The process begins with unlabeled instructions and applies an iterative self-improvement scheme that generates contrasting model outputs. An LLM-as-a-Judge is then trained to produce reasoning traces and final judgments over these outputs. Training is repeated at each iteration using the improved predictions.
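The loop described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: `make_pair`, `judge`, and `finetune` are hypothetical callables standing in for the response generator, the current LLM-as-a-Judge, and a training step. The filtering rule (keep a sampled judgment only when its verdict agrees with the label known by construction) and the `samples_per_pair` parameter are also assumptions about how improved predictions could be turned into training data.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Example:
    instruction: str
    response_a: str
    response_b: str
    better: str      # "A" or "B": synthetic label, known by construction
    reasoning: str   # reasoning trace produced by the judge


# Hypothetical interfaces (assumptions, not from the paper):
PairMaker = Callable[[str], Tuple[str, str]]      # instruction -> (good, bad)
Judge = Callable[[str, str, str], Tuple[str, str]]  # -> (reasoning, verdict)
Finetune = Callable[[Judge, List[Example]], Judge]


def build_training_set(
    instructions: Sequence[str],
    make_pair: PairMaker,
    judge: Judge,
    samples_per_pair: int = 4,
    seed: int = 0,
) -> List[Example]:
    """Collect judgments whose verdict matches the synthetic label."""
    rng = random.Random(seed)
    kept: List[Example] = []
    for instruction in instructions:
        good, bad = make_pair(instruction)
        # Randomize the position of the better response to avoid position bias.
        if rng.random() < 0.5:
            a, b, label = good, bad, "A"
        else:
            a, b, label = bad, good, "B"
        # Sample the judge a few times and keep the first correct judgment.
        for _ in range(samples_per_pair):
            reasoning, verdict = judge(instruction, a, b)
            if verdict == label:
                kept.append(Example(instruction, a, b, label, reasoning))
                break
    return kept


def self_taught_loop(
    instructions: Sequence[str],
    make_pair: PairMaker,
    judge: Judge,
    finetune: Finetune,
    iterations: int = 3,
) -> Judge:
    """Repeat: label data with the current judge, then train the next judge."""
    for _ in range(iterations):
        data = build_training_set(instructions, make_pair, judge)
        judge = finetune(judge, data)
    return judge
```

No human label enters the loop: the only supervision is which response was constructed to be worse, so each iteration's judge produces the training data for the next one.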
The Self-Taught Evaluator improves substantially without any labeled preference data. It raises a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 on RewardBench, a gain of 12.9 points, or to 88.7 with majority vote. This surpasses commonly used LLM judges such as GPT-4 and matches the performance of top-performing reward models trained with labeled examples.
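Majority voting over a judge's sampled verdicts can be sketched as below. This is an illustrative sketch only: the summary does not state how many samples are drawn or how ties are handled, so `n_samples` and the tie rule (return `None`) are assumptions, and `sample_verdict` is a hypothetical callable that returns one verdict per call.

```python
from collections import Counter
from typing import Callable, Optional


def majority_vote(
    sample_verdict: Callable[[], str], n_samples: int = 5
) -> Optional[str]:
    """Sample the judge n_samples times and return the most common verdict.

    Returns None when the top two verdicts are tied (an assumed tie rule).
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    votes = Counter(sample_verdict() for _ in range(n_samples))
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

Using an odd `n_samples` with two possible verdicts avoids ties altogether.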
One key advantage of this approach is its scalability: it builds a generalist evaluator for LLM outputs from synthetic preferences, without relying on human annotations. Because it needs only unlabeled instructions, it can in principle be applied across domains and tasks at lower cost than collecting human judgments.
The study also explores the effect of combining synthetic preference data with human-labeled preference data. When synthetic preferences generated from WildChat prompts are merged with the human-labeled HelpSteer2 dataset, the models perform strongly across different mixing weights. This indicates that synthetic and human-labeled data can be used together when training evaluators, and that the result is not sensitive to the exact mixing ratio.
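One way to build such a mixture is sketched below. The summary does not define what a mixing weight means, so this sketch assumes it is the fraction of training examples drawn from the synthetic source; `mix_datasets` and its parameters are hypothetical names, and sampling with replacement is a simplification so that a small source can still fill its quota.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_datasets(
    synthetic: Sequence[T],
    human: Sequence[T],
    synthetic_weight: float,
    total: int,
    seed: int = 0,
) -> List[T]:
    """Draw `total` examples, a `synthetic_weight` share of them synthetic."""
    if not 0.0 <= synthetic_weight <= 1.0:
        raise ValueError("synthetic_weight must be between 0 and 1")
    rng = random.Random(seed)
    n_synthetic = round(total * synthetic_weight)
    n_human = total - n_synthetic
    # Sample with replacement from each source, then shuffle the combined set.
    mixed = rng.choices(synthetic, k=n_synthetic) + rng.choices(human, k=n_human)
    rng.shuffle(mixed)
    return mixed
```

Sweeping `synthetic_weight` over several values and training one evaluator per mixture is one way to reproduce a comparison across mixing weights.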
Furthermore, an analysis of instruction complexity compares the length distribution of the curated training set with that of the full dataset before instruction selection. The curated dataset contains more complex instructions involving logical reasoning and science, while the full dataset leans towards relationships and entertainment content. Instruction selection therefore shifts the training data towards harder, reasoning-heavy prompts.
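A length comparison of this kind can be sketched as follows. The summary does not say how instruction length is measured, so whitespace-separated word count is used here as an assumed proxy, and `length_summary` is a hypothetical helper.

```python
from statistics import mean, median
from typing import Dict, Iterable


def length_summary(instructions: Iterable[str]) -> Dict[str, float]:
    """Summarize instruction lengths, measured in whitespace-separated words."""
    lengths = [len(text.split()) for text in instructions]
    if not lengths:
        raise ValueError("need at least one instruction")
    return {
        "count": len(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "max": max(lengths),
    }
```

Calling `length_summary` once on the curated set and once on the full set gives two summaries that can be compared side by side.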
In conclusion, the paper presents an approach to model-based evaluation that uses synthetic preferences in place of human annotations. Iterative training over synthetic preferences substantially improves the accuracy of a strong seed LLM (Llama3-70B-Instruct), setting a new benchmark among generative LLM-as-a-Judge methods. Because it scales and improves without human annotation, the approach has clear potential for model development and evaluation in other domains.