Shepherd: A Critic for Language Model Generation

AI-generated keywords: Shepherd, Language Model, Critique, Evaluation, Annotation

AI-generated Key Points

Shepherd is a language model designed to critique responses and provide improvement suggestions.
It surpasses untuned models by identifying diverse errors and offering remedies.
Shepherd's critiques are either equivalent or preferred to those from established models, including ChatGPT.
Despite having only 7B parameters, Shepherd achieves an average win-rate of 53-87% compared to competitive alternatives when evaluated using GPT-4.
In human evaluation, Shepherd consistently outperforms other models and closely matches ChatGPT's performance on average.
To ensure the quality of the feedback dataset used to train Shepherd, an independent vendor (RWS Moravia) was employed instead of crowd-sourcing.
Expert reviewers were chosen because the annotation task demands meticulous and nuanced annotations.
Each example was annotated by one expert with human-in-the-loop quality assessment.
Postprocessing steps were taken after annotation to ensure high-quality data, including removing flagged examples and excluding unhelpful feedback related to certain error types.
The dataset ended up with a total of 1,317 high-quality examples.
The evaluation process for feedback on model-generated answers involves assigning scores based on the quality of the feedback using a scale ranging from 1 to 7.

Authors: Tianlu Wang, Ping Yu, Xiaoqing Ellen Tan, Sean O'Brien, Ramakanth Pasunuru, Jane Dwivedi-Yu, Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-Zarandi, Asli Celikyilmaz

arXiv: 2308.04592v1 (cs.CL)

7 figures, 7 tables

License: CC BY 4.0

Abstract: As large language models improve, there is increasing interest in techniques that leverage these models' capabilities to refine their own outputs. In this work, we introduce Shepherd, a language model specifically tuned to critique responses and suggest refinements, extending beyond the capabilities of an untuned model to identify diverse errors and provide suggestions to remedy them. At the core of our approach is a high quality feedback dataset, which we curate from community feedback and human annotations. Even though Shepherd is small (7B parameters), its critiques are either equivalent or preferred to those from established models including ChatGPT. Using GPT-4 for evaluation, Shepherd reaches an average win-rate of 53-87% compared to competitive alternatives. In human evaluation, Shepherd strictly outperforms other models and on average closely ties with ChatGPT.

Submitted to arXiv on 08 Aug. 2023

Results of the summarizing process for the arXiv paper: 2308.04592v1

Shepherd is a language model specifically designed to critique responses and provide suggestions for improvement. It surpasses untuned models by identifying diverse errors and offering remedies. The model's critiques are either equivalent or preferred to those of established models like ChatGPT. Despite having only 7B parameters, Shepherd achieves an average win-rate of 53-87% compared to competitive alternatives when evaluated using GPT-4. In human evaluation, Shepherd consistently outperforms other models and closely matches ChatGPT's performance on average.

To ensure the quality of the feedback dataset used in training Shepherd, an independent vendor (RWS Moravia) was employed instead of crowd-sourcing. Expert reviewers were chosen due to the demanding nature of the annotation task, which required meticulous and nuanced annotations. Each example was annotated by one expert with human-in-the-loop quality assessment. The annotation process involved defining different error types in a taxonomy table and providing detailed instructions and guidelines for human annotators. The specifics of these instructions can be found in Appendix A of the paper.

After annotation, postprocessing steps were taken to ensure high-quality data. Examples flagged with "Errors in the correct output" and "The context is too complex to work on" were removed from the dataset. Feedback related to error types such as "Redundancy" and "Consistency with context" was also excluded as it was deemed unhelpful. This resulted in a total of 1,317 high-quality examples.

The evaluation process for feedback on model-generated answers involves assigning scores based on the quality of the feedback. The scoring scale ranges from 1 to 7:

- 1-3: Incorrect judgment. The answer is incorrect but the feedback confirms it as correct, or vice versa.
- 4-7: Correct judgment. The feedback accurately confirms the correctness or incorrectness of the answer.

Shepherd is a computer program that reviews answers written by AI language models. It finds mistakes in an answer and suggests how to fix them. Shepherd is good at spotting many different kinds of mistakes and giving helpful advice. Even though Shepherd is much smaller than some other programs, its feedback is judged to be as good as or better than theirs. To train Shepherd, expert annotators wrote feedback on example answers, and that feedback was carefully checked for quality. To test Shepherd, the feedback it produces is given a score from 1 to 7 to show how good it is.

Definitions

- Language model: A computer program that processes and generates human language.
- Critique: To point out mistakes or suggest improvements.
- Parameters: The adjustable numbers inside a model that are learned during training; more parameters generally mean a larger model.
- Evaluation: The process of judging or assessing something based on certain criteria.
- Dataset: A collection of data used for training or testing a computer program.
- Annotations: Notes or comments added to provide extra information or explanation.
- Postprocessing: Additional steps taken after an initial process to improve the quality of the result.
- Feedback: Comments or suggestions given to help someone improve their work.
- Scale: A system for measuring or rating something using numbers.

Shepherd: A Language Model for Critiquing Responses and Providing Suggestions for Improvement

In recent years, language models have become increasingly powerful tools for natural language processing, and there is growing interest in using them to refine their own outputs. However, an untuned model is limited in its ability to identify errors in a response and suggest fixes. Shepherd is a language model specifically tuned to critique responses and offer suggestions for improvement. It goes beyond the capabilities of an untuned model by identifying diverse errors and offering remedies.

Performance Evaluation

To evaluate Shepherd, its critiques were compared with those of competitive alternatives such as ChatGPT, with GPT-4 serving as the evaluator. Despite having only 7B parameters, Shepherd achieved an average win-rate of 53-87% against these alternatives. In human evaluation, Shepherd consistently outperformed other models and on average closely tied with ChatGPT.
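As a rough illustration of how a win-rate like the one above can be computed from pairwise comparisons, here is a minimal Python sketch. The summary does not state the exact formula, so the convention used here (a tie counts as half a win) is an assumption, and the function name, outcome labels, and example data are all hypothetical.

```python
from collections import Counter


def win_rate(outcomes, tie_weight=0.5):
    """Return the win rate for a list of 'win' / 'tie' / 'loss' outcomes.

    Assumption: a tie contributes `tie_weight` of a win. The source does
    not specify how ties are handled.
    """
    if not outcomes:
        raise ValueError("need at least one comparison")
    counts = Counter(outcomes)
    unknown = set(counts) - {"win", "tie", "loss"}
    if unknown:
        raise ValueError(f"unexpected outcome labels: {sorted(unknown)}")
    return (counts["win"] + tie_weight * counts["tie"]) / len(outcomes)


# Made-up judgments for eight pairwise comparisons (not data from the paper).
example_outcomes = ["win", "win", "tie", "loss", "win", "tie", "loss", "win"]
example_rate = win_rate(example_outcomes)  # (4 wins + 0.5 * 2 ties) / 8
print(f"win rate: {example_rate:.3f}")
```

Setting `tie_weight=0` gives the stricter convention in which only outright wins count.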

Data Quality Assurance

To ensure the quality of the feedback dataset used in training Shepherd, an independent vendor (RWS Moravia) was employed instead of crowd-sourcing. Expert reviewers were chosen because the annotation task was demanding and required meticulous and nuanced annotations. Each example was annotated by one expert with human-in-the-loop quality assessment. Annotators followed a taxonomy table that defined the different error types, along with detailed instructions and guidelines (see Appendix A of the paper). After annotation, postprocessing removed examples flagged with "Errors in the correct output" or "The context is too complex to work on", as well as feedback related to error types such as "Redundancy" or "Consistency with context". This resulted in a total of 1,317 high-quality examples for training.
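The postprocessing described above can be sketched as a simple filter. This is only an illustration under stated assumptions: the record layout (`flags`, `feedback`, `error_type`) is hypothetical, the summary does not describe the actual data format, and dropping an example once all of its feedback has been excluded is an assumption as well.

```python
# Flags and error types quoted in the summary. The summary says "such as"
# for the error types, so this set may be incomplete.
REMOVED_FLAGS = {
    "Errors in the correct output",
    "The context is too complex to work on",
}
EXCLUDED_ERROR_TYPES = {"Redundancy", "Consistency with context"}


def postprocess(examples):
    """Drop flagged examples and strip feedback of excluded error types."""
    cleaned = []
    for example in examples:
        # Remove the whole example if it carries any of the removal flags.
        if REMOVED_FLAGS & set(example.get("flags", [])):
            continue
        kept = [
            fb for fb in example.get("feedback", [])
            if fb["error_type"] not in EXCLUDED_ERROR_TYPES
        ]
        # Assumption: an example with no remaining feedback is dropped.
        if not kept:
            continue
        cleaned.append({**example, "feedback": kept})
    return cleaned


# Made-up records using a hypothetical schema and a placeholder error type.
sample = [
    {"id": 1, "flags": [], "feedback": [
        {"error_type": "Placeholder error type", "text": "Step 2 is wrong."},
        {"error_type": "Redundancy", "text": "The answer repeats itself."},
    ]},
    {"id": 2, "flags": ["Errors in the correct output"], "feedback": [
        {"error_type": "Placeholder error type", "text": "Wrong total."},
    ]},
    {"id": 3, "flags": [], "feedback": [
        {"error_type": "Consistency with context", "text": "Off topic."},
    ]},
]
cleaned = postprocess(sample)
print([example["id"] for example in cleaned])
```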

Evaluation Process

The quality of feedback on model-generated answers is scored on a scale from 1 to 7. Scores of 1-3 indicate an incorrect judgment: the answer is incorrect but the feedback confirms it as correct, or vice versa. Scores of 4-7 indicate a correct judgment: the feedback accurately identifies whether the answer is correct or incorrect.
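The two bands of the scoring scale can be expressed as a small lookup. This is a minimal sketch: the function name and the returned labels are hypothetical, and the summary does not say how individual scores within each band are distinguished.

```python
def judgment_for_score(score):
    """Map a 1-7 feedback score to the judgment band described in the text."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise TypeError("score must be an integer")
    if not 1 <= score <= 7:
        raise ValueError("score must be between 1 and 7")
    return "incorrect judgment" if score <= 3 else "correct judgment"


# Print the band for every score on the scale.
for score in range(1, 8):
    print(score, judgment_for_score(score))
```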

Conclusion

Shepherd provides accurate critiques of responses and suggestions for improvement, thanks to a carefully curated dataset built from community feedback and expert annotation rather than crowd-sourcing. Evaluated with GPT-4, it achieved an average win-rate of 53-87% against competitive alternatives. In human evaluation, it outperformed other models and on average closely tied with ChatGPT.

Created on 19 Oct. 2023

