Shepherd: A Critic for Language Model Generation

AI-generated keywords: Shepherd, Language Model, Critique, Evaluation, Annotation

AI-generated Key Points

  • Shepherd is a language model designed to critique responses and provide improvement suggestions.
  • It surpasses untuned models by identifying diverse errors and offering remedies.
  • Shepherd's critiques are either equivalent or preferred to those from established models such as ChatGPT.
  • Despite having only 7B parameters, Shepherd achieves an average win-rate of 53-87% compared to competitive alternatives when evaluated using GPT-4.
  • In human evaluation, Shepherd consistently outperforms other models and closely matches ChatGPT's performance on average.
  • To ensure the quality of the feedback dataset used to train Shepherd, an independent vendor (RWS Moravia) was employed instead of crowd-sourcing.
  • Expert reviewers were chosen because the annotation task demanded meticulous and nuanced annotations.
  • Each example was annotated by one expert with human-in-the-loop quality assessment.
  • Postprocessing steps were taken after annotation to ensure high-quality data, including removing flagged examples and excluding unhelpful feedback related to certain error types.
  • The dataset ended up with a total of 1,317 high-quality examples.
  • The evaluation process for feedback on model-generated answers involves assigning scores based on the quality of the feedback using a scale ranging from 1 to 7.

Authors: Tianlu Wang, Ping Yu, Xiaoqing Ellen Tan, Sean O'Brien, Ramakanth Pasunuru, Jane Dwivedi-Yu, Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-Zarandi, Asli Celikyilmaz

7 figures, 7 tables
License: CC BY 4.0

Abstract: As large language models improve, there is increasing interest in techniques that leverage these models' capabilities to refine their own outputs. In this work, we introduce Shepherd, a language model specifically tuned to critique responses and suggest refinements, extending beyond the capabilities of an untuned model to identify diverse errors and provide suggestions to remedy them. At the core of our approach is a high quality feedback dataset, which we curate from community feedback and human annotations. Even though Shepherd is small (7B parameters), its critiques are either equivalent or preferred to those from established models including ChatGPT. Using GPT-4 for evaluation, Shepherd reaches an average win-rate of 53-87% compared to competitive alternatives. In human evaluation, Shepherd strictly outperforms other models and on average closely ties with ChatGPT.

Submitted to arXiv on 08 Aug. 2023


Results of the summarizing process for the arXiv paper: 2308.04592v1

Shepherd is a language model specifically designed to critique responses and provide suggestions for improvement. It surpasses untuned models by identifying diverse errors and offering remedies. The model's critiques are either equivalent or preferred to those of established models like ChatGPT. Shepherd, despite having only 7B parameters, achieves an average win-rate of 53-87% compared to competitive alternatives when evaluated using GPT-4. In human evaluation, Shepherd consistently outperforms other models and closely matches ChatGPT's performance on average.

To ensure the quality of the feedback dataset used in training Shepherd, an independent vendor (RWS Moravia) was employed instead of crowd-sourcing. Expert reviewers were chosen due to the demanding nature of the annotation task, which required meticulous and nuanced annotations. Each example was annotated by one expert with human-in-the-loop quality assessment. The annotation process involved defining different error types in a taxonomy table and providing detailed instructions and guidelines for human annotators. The specifics of these instructions can be found in Appendix A.

After annotation, postprocessing steps were taken to ensure high-quality data. Examples flagged with "Errors in the correct output" and "The context is too complex to work on" were removed from the dataset. Feedback related to error types such as "Redundancy" and "Consistency with context" was also excluded as it was deemed unhelpful. This resulted in a total of 1,317 high-quality examples.

The evaluation process for feedback on model-generated answers involves assigning scores based on the quality of the feedback. The scoring scale ranges from 1 to 7:

  • 1-3: Incorrect judgment - When the answer is incorrect but the feedback incorrectly confirms its correctness or vice versa.
  • 4-7: Correct judgment - When the feedback accurately confirms the correctness or incorrectness of the answer.
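The postprocessing rules and the two score bands described above can be illustrated with a minimal Python sketch. It is not the authors' code: the `Example` and `Feedback` classes, their field names, and the function names are hypothetical and exist only for this illustration. Only the two removal flags, the two excluded error types, and the 1-3 / 4-7 split come from the summary. The sketch also assumes that an example whose feedback is entirely excluded is still kept, which the summary does not specify.

```python
from dataclasses import dataclass, field

# Flags and error types quoted in the summary above.
REMOVED_FLAGS = {
    "Errors in the correct output",
    "The context is too complex to work on",
}
EXCLUDED_ERROR_TYPES = {"Redundancy", "Consistency with context"}


@dataclass
class Feedback:
    # Hypothetical record: one piece of annotated feedback.
    error_type: str
    text: str


@dataclass
class Example:
    # Hypothetical record: one annotated example.
    question: str
    flags: set = field(default_factory=set)
    feedback: list = field(default_factory=list)


def postprocess(examples):
    """Drop flagged examples, then drop feedback of excluded error types."""
    kept = []
    for ex in examples:
        if ex.flags & REMOVED_FLAGS:
            continue  # the whole example is removed
        useful = [
            fb for fb in ex.feedback
            if fb.error_type not in EXCLUDED_ERROR_TYPES
        ]
        # Assumption: the example is kept even if no feedback remains.
        kept.append(Example(ex.question, set(ex.flags), useful))
    return kept


def judgment_band(score):
    """Map a 1-7 feedback score to the band named in the summary."""
    if not isinstance(score, int) or not 1 <= score <= 7:
        raise ValueError("score must be an integer from 1 to 7")
    return "incorrect judgment" if score <= 3 else "correct judgment"
```

For instance, `judgment_band(3)` falls in the incorrect-judgment band and `judgment_band(4)` in the correct-judgment band. The finer distinctions within each band are not given in this summary, so the sketch does not model them.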
Created on 19 Oct. 2023

