Self-Taught Evaluators

AI-generated keywords: Model-based evaluation, Synthetic training data, Self-Taught Evaluator, LLM-as-a-Judge, Iterative training

AI-generated Key Points

  • Model-based evaluation is crucial for successful model development
  • Traditionally, training evaluators involves collecting human preference judgments, which is costly and yields data that becomes stale as models improve
  • An innovative approach using synthetic training data exclusively is introduced to enhance evaluators without relying on human annotations
  • Without any labeled preference data, the Self-Taught Evaluator improves Llama3-70B-Instruct from 75.4 to 88.3 on RewardBench
  • Combining synthetic preference data with human-labeled preference data shows strong performance across different mixing weights
  • Analysis of instruction complexity shows that the curated training set contains more complex instructions (logical reasoning, science) than the full dataset before instruction selection, which leans towards relationships and entertainment
  • The approach is scalable and offers a way to build a generalist evaluator of LLM outputs from synthetic preferences, without relying on human annotation

Authors: Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, Xian Li

License: CC ZERO 1.0

Abstract: Model-based evaluation is at the heart of successful model development -- as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly and the data becomes stale as models improve. In this work, we present an approach that aims to improve evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.
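
The iterative scheme in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `ToyJudge`, `build_pairs`, and `self_taught_loop` are hypothetical names, the judge is a random stub rather than an LLM, and "fine-tuning" is simulated by nudging a skill parameter. The sketch only shows the control flow: build contrasting pairs from unlabeled instructions, sample judgments, keep those that agree with the synthetic label, retrain, and repeat.

```python
import random
from dataclasses import dataclass


@dataclass
class Pair:
    instruction: str
    chosen: str    # response assumed to be better (synthetic label)
    rejected: str  # contrasting response assumed to be worse


@dataclass
class Judgment:
    reasoning: str
    verdict: str   # "A" or "B"


class ToyJudge:
    """Hypothetical stand-in for an LLM-as-a-Judge; `skill` is its chance of being right."""

    def __init__(self, skill: float):
        self.skill = skill

    def judge(self, pair: Pair, rng: random.Random) -> Judgment:
        # Simplification: response A is always the synthetic "chosen" response.
        verdict = "A" if rng.random() < self.skill else "B"
        return Judgment(f"toy reasoning for {pair.instruction!r}", verdict)

    def finetune(self, examples: list) -> "ToyJudge":
        # Placeholder for fine-tuning: more kept examples -> slightly better judge.
        return ToyJudge(min(0.99, self.skill + 0.01 * len(examples)))


def build_pairs(instructions):
    # Placeholder for generating contrasting outputs with a model.
    return [Pair(x, f"good answer to {x}", f"off-target answer to {x}")
            for x in instructions]


def self_taught_loop(instructions, judge, iterations=3, samples=4, seed=0):
    rng = random.Random(seed)
    pairs = build_pairs(instructions)
    history = [judge.skill]
    for _iteration in range(iterations):
        kept = []
        for pair in pairs:
            for _sample in range(samples):
                judgment = judge.judge(pair, rng)
                if judgment.verdict == "A":  # agrees with the synthetic label
                    kept.append((pair, judgment))
                    break
        judge = judge.finetune(kept)  # train on the judge's own filtered outputs
        history.append(judge.skill)
    return judge, history
```

Calling `self_taught_loop(["q1", "q2"], ToyJudge(0.6))` returns the final judge and the skill value after each iteration. In a real system the stubs would be replaced by model generation and fine-tuning calls, and response order would typically be randomized to avoid position bias.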

Submitted to arXiv on 05 Aug. 2024

Results of the summarizing process for the arXiv paper: 2408.02666v1

Model-based evaluation is central to model development: it serves both as a reward model for training and as a substitute for human evaluation. Evaluators are traditionally trained on large collections of human preference judgments over model responses, which are costly to gather and become stale as models improve. This work instead improves evaluators using synthetic training data only, with no human annotations.

The process starts from unlabeled instructions. An iterative self-improvement scheme generates contrasting model outputs for each instruction, and an LLM-as-a-Judge is trained to produce reasoning traces and final judgments over these outputs. Training is repeated at each iteration using the improved predictions.

Without any labeled preference data, the Self-Taught Evaluator improves a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of top-performing reward models trained with labeled examples.

The study also examines the effect of combining synthetic and human-labeled preference data. When synthetic preferences generated from WildChat prompts are merged with the human-labeled HelpSteer2 dataset, the models perform strongly across different mixing weights. An analysis of instruction complexity further compares the length distribution of the curated training set with that of the full dataset before instruction selection. The curated set contains more complex instructions involving logical reasoning and science, while the full dataset leans towards relationships and entertainment content.

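
The summary does not say how the mixing weights are applied. One plausible reading, shown here as a minimal sketch, is that the weight is the probability of drawing each training example from the synthetic pool rather than the human-labeled pool. The function name `mix_preferences` and its parameters are hypothetical.

```python
import random


def mix_preferences(synthetic, human, synthetic_weight, total, seed=0):
    """Sample `total` examples; each comes from `synthetic` with
    probability `synthetic_weight`, otherwise from `human`."""
    if not 0.0 <= synthetic_weight <= 1.0:
        raise ValueError("synthetic_weight must be in [0, 1]")
    rng = random.Random(seed)
    mixed = []
    for _ in range(total):
        source = synthetic if rng.random() < synthetic_weight else human
        mixed.append(rng.choice(source))  # sampling with replacement
    return mixed
```

For example, `mix_preferences(wildchat_pairs, helpsteer2_pairs, 0.5, 10_000)` would draw roughly half of the examples from each source, where both list names are placeholders for preference data loaded elsewhere.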
In conclusion, this scalable approach offers a way to build a generalist evaluator of LLM outputs from synthetic preferences, without relying on human annotation. Iterative training over synthetic preferences significantly improves the accuracy of a strong seed LLM (Llama3-70B-Instruct), setting a new reference point for generative LLM-as-a-Judge methods in model-based evaluation.
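
To make the accuracy and majority-vote figures concrete, the sketch below assumes a pairwise setting in which the judge is sampled several times per response pair and the most common verdict is compared with a gold label. The function names and the data layout are illustrative assumptions, not taken from the paper or from RewardBench's code.

```python
from collections import Counter


def majority_vote(verdicts):
    """Return the most common verdict; ties go to the verdict seen first."""
    return Counter(verdicts).most_common(1)[0][0]


def pairwise_accuracy(sampled_verdicts, gold):
    """Fraction of pairs whose majority verdict matches the gold label."""
    hits = sum(majority_vote(v) == g for v, g in zip(sampled_verdicts, gold))
    return hits / len(gold)
```

With a single sample per pair this reduces to plain accuracy; sampling more judgments per pair smooths out individual noisy verdicts.
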
Created on 22 Aug. 2024

