Shepherd is a language model specifically designed to critique responses and provide suggestions for improvement. It surpasses untuned models by identifying diverse errors and offering remedies. The model's critiques are either equivalent to or preferred over those of established models like ChatGPT. Shepherd, despite having only 7B parameters, achieves an average win rate of 53-87% compared to competitive alternatives when evaluated using GPT-4. In human evaluation, Shepherd consistently outperforms other models and closely matches ChatGPT's performance on average. To ensure the quality of the feedback dataset used in training Shepherd, an independent vendor (RWS Moravia) was employed instead of crowd-sourcing. Expert reviewers were chosen due to the demanding nature of the annotation task, which required meticulous and nuanced annotations. Each example was annotated by one expert with human-in-the-loop quality assessment. The annotation process involved defining different error types in a taxonomy table and providing detailed instructions and guidelines for human annotators. The specifics of these instructions can be found in Appendix A. After annotation, postprocessing steps were taken to ensure high-quality data. Examples flagged with "Errors in the correct output" and "The context is too complex to work on" were removed from the dataset. Feedback related to error types such as "Redundancy" and "Consistency with context" was also excluded as it was deemed unhelpful. This resulted in a total of 1,317 high-quality examples. The evaluation process for feedback on model-generated answers involves assigning scores based on the quality of the feedback. The scoring scale ranges from 1 to 7. Scores of 1-3 indicate an incorrect judgment: the answer is incorrect but the feedback confirms it as correct, or vice versa. Scores of 4-7 indicate a correct judgment: the feedback accurately identifies whether the answer is correct or incorrect.
- Shepherd is a language model designed to critique responses and provide improvement suggestions.
- It surpasses untuned models by identifying diverse errors and offering remedies.
- Shepherd's critiques are equivalent to or preferred over those of established models like ChatGPT.
- Despite having only 7B parameters, Shepherd achieves an average win rate of 53-87% compared to competitive alternatives when evaluated using GPT-4.
- In human evaluation, Shepherd consistently outperforms other models and closely matches ChatGPT's performance on average.
- An independent vendor (RWS Moravia) was employed instead of crowd-sourcing to ensure the quality of the feedback dataset used in training Shepherd.
- Expert reviewers were chosen because the annotation task required meticulous and nuanced annotations.
- Each example was annotated by one expert with human-in-the-loop quality assessment.
- Postprocessing steps were taken after annotation to ensure high-quality data, including removing flagged examples and excluding unhelpful feedback related to certain error types.
- The dataset ended up with a total of 1,317 high-quality examples.
- The evaluation process for feedback on model-generated answers involves assigning scores based on the quality of the feedback using a scale ranging from 1 to 7.
Shepherd is a smart computer program that checks answers written by other computer programs. It can find mistakes in an answer and give suggestions on how to make it better. Shepherd is really good at finding different kinds of mistakes and giving helpful advice. Even though Shepherd is smaller than some other programs, it still does a great job compared to them. Expert reviewers helped train Shepherd by writing careful feedback on example answers. Their work was checked to make sure the feedback was really good quality. When Shepherd was tested, its feedback was given scores from 1 to 7 to show how good it was.
Definitions
- Language model: A computer program that understands and generates human language.
- Critique: To point out mistakes or suggest improvements.
- Parameters: Numeric values a language model learns during training and uses to produce its output; "7B" means about 7 billion of them.
- Evaluation: The process of judging or assessing something based on certain criteria.
- Dataset: A collection of data used for training or testing a computer program.
- Annotations: Notes or comments added to provide extra information or explanation.
- Postprocessing: Additional steps taken after an initial process to improve the quality of the result.
- Feedback: Comments or suggestions given to help someone improve their work.
- Scale: A system for measuring or rating something using numbers.
Shepherd: A Language Model for Critiquing Responses and Providing Suggestions for Improvement
In recent years, language models have become increasingly powerful tools for natural language processing. However, models that have not been tuned for the task are limited in their ability to provide useful feedback on responses. Shepherd is a language model that has been specifically designed to critique responses and offer suggestions for improvement. It surpasses untuned models by identifying diverse errors and offering remedies.
Performance Evaluation
To evaluate the performance of Shepherd, its critiques were compared with those of competitive alternatives such as ChatGPT, with GPT-4 acting as the evaluator. Despite having only 7B parameters, Shepherd achieved an average win rate of 53-87% against these alternatives. Furthermore, in human evaluation, Shepherd consistently outperformed other models and closely matched ChatGPT's performance on average.
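A win rate over pairwise comparisons can be computed as in the minimal sketch below. The function name `win_rate`, the outcome labels, and the choice to count a tie as half a win are assumptions made for illustration; the summary above does not state how ties were handled.

```python
from collections import Counter


def win_rate(outcomes):
    """Return the win rate over a list of pairwise comparison outcomes.

    Each outcome is "win", "tie", or "loss" from the point of view of the
    model being evaluated. Counting a tie as half a win is an assumption
    of this sketch, not something the source specifies.
    """
    counts = Counter(outcomes)
    total = counts["win"] + counts["tie"] + counts["loss"]
    if total == 0:
        raise ValueError("no comparisons to score")
    return (counts["win"] + 0.5 * counts["tie"]) / total


# Hypothetical judgments for four comparisons (not real evaluation data).
example_outcomes = ["win", "win", "tie", "loss"]
example_rate = win_rate(example_outcomes)
print(f"{example_rate:.1%}")  # 62.5%
```

Under this convention, a win rate above 50% means the evaluated model's critiques were preferred more often than not.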
Data Quality Assurance
To ensure the quality of the feedback dataset used in training Shepherd, an independent vendor (RWS Moravia) was employed instead of crowd-sourcing. Expert reviewers were chosen due to the demanding nature of the annotation task which required meticulous and nuanced annotations. Each example was annotated by one expert with human-in-the-loop quality assessment following a taxonomy table which defined different error types along with detailed instructions and guidelines for annotators (see Appendix A). After annotation, postprocessing steps were taken to remove examples flagged with "Errors in the correct output" or "The context is too complex to work on" as well as feedback related to error types such as "Redundancy" or "Consistency with context". This resulted in a total of 1,317 high-quality examples being used for training purposes.
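The postprocessing described above can be sketched as a two-stage filter. The record layout (`flags`, `feedback`, `error_type`), the `postprocess` function, and the "Arithmetic" error type are hypothetical, and dropping an example once none of its feedback remains is an assumption of this sketch; only the two flag names and the two excluded error types come from the text.

```python
# Flags and error types named in the text above.
EXCLUDED_FLAGS = {
    "Errors in the correct output",
    "The context is too complex to work on",
}
EXCLUDED_ERROR_TYPES = {"Redundancy", "Consistency with context"}


def postprocess(examples):
    """Drop flagged examples, then drop feedback with excluded error types."""
    kept = []
    for ex in examples:
        # Stage 1: remove the whole example if it carries an excluded flag.
        if EXCLUDED_FLAGS & set(ex.get("flags", [])):
            continue
        # Stage 2: keep only feedback whose error type is not excluded.
        feedback = [
            fb for fb in ex.get("feedback", [])
            if fb["error_type"] not in EXCLUDED_ERROR_TYPES
        ]
        # Assumption: an example with no remaining feedback is dropped.
        if feedback:
            kept.append({**ex, "feedback": feedback})
    return kept


# Hypothetical records for illustration only.
sample = [
    {"id": 1, "flags": [], "feedback": [
        {"error_type": "Arithmetic", "text": "The sum is wrong."},
        {"error_type": "Redundancy", "text": "Repeats the question."},
    ]},
    {"id": 2, "flags": ["Errors in the correct output"], "feedback": [
        {"error_type": "Arithmetic", "text": "Off by one."},
    ]},
    {"id": 3, "flags": [], "feedback": [
        {"error_type": "Consistency with context", "text": "Ignores the passage."},
    ]},
]
cleaned = postprocess(sample)
print([ex["id"] for ex in cleaned])  # [1]
```

Keeping the excluded names in module-level sets makes the filtering rules easy to audit against the annotation taxonomy.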
Evaluation Process
The evaluation process for feedback on model-generated answers assigns each piece of feedback a quality score from 1 to 7. Scores of 1-3 indicate an incorrect judgment: the answer is incorrect but the feedback confirms it as correct, or vice versa. Scores of 4-7 indicate a correct judgment: the feedback accurately identifies whether the answer is correct or incorrect.
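The two score bands can be expressed as a small helper, shown here as a sketch. The function name `judgment_band` and the returned labels are hypothetical; only the 1-3 and 4-7 ranges come from the description above.

```python
def judgment_band(score):
    """Map a 1-7 feedback score to its judgment band.

    1-3: the feedback judged the answer's correctness wrongly.
    4-7: the feedback judged the answer's correctness rightly.
    """
    # Reject booleans explicitly, since bool is a subclass of int in Python.
    if isinstance(score, bool) or not isinstance(score, int):
        raise TypeError("score must be an integer")
    if not 1 <= score <= 7:
        raise ValueError("score must be between 1 and 7")
    return "incorrect judgment" if score <= 3 else "correct judgment"


print(judgment_band(3))  # incorrect judgment
print(judgment_band(4))  # correct judgment
```

The boundary between 3 and 4 is the only threshold the text defines; how scores differ within each band is not described here, so the helper does not model it.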
Conclusion
Shepherd provides accurate critiques of responses and offers suggestions for improvement, helped by a carefully curated dataset that was annotated by experts rather than crowd-sourced. It surpasses untuned models and achieves an average win rate of 53-87% against competitive alternatives when evaluated using GPT-4. In human evaluation, it consistently outperforms other models and closely matches ChatGPT's performance on average.