BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues
AI-generated Key Points
⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.
- Study title: "BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues"
- Authors: Haodong Duan, Jueqi Wei, Chonghua Wang, Hongwei Liu, Yixiao Fang, Songyang Zhang, Dahua Lin, and Kai Chen
- Challenge of manually evaluating large language models (LLMs) in multi-turn dialogues
- Proposal of an innovative LLM-based approach for assessing performance in human-style chatting scenarios
- Methodology involving using real-world human dialogues to create ChatSEED for LLMs to generate complete multi-turn dialogues
- Use of GPT-4 as a judge to evaluate dialogue quality
- GPT-4's proficiency in producing human-style multi-turn dialogues surpassing other models
- Difficulty for discriminators to differentiate between GPT-4 generated dialogues and authentic human interactions
- Challenges faced by other LLMs in generating satisfactory multi-turn dialogues due to issues like poor instruction-following abilities or lengthy utterances
- Availability of data and codes on GitHub for further exploration and evaluation of conversational capabilities of large language models
Authors: Haodong Duan, Jueqi Wei, Chonghua Wang, Hongwei Liu, Yixiao Fang, Songyang Zhang, Dahua Lin, Kai Chen
Abstract: Interacting with human via high-quality multi-turn dialogues is a key feature of large language models (LLMs). However, human-based evaluation of such capability involves intensive manual labor. This report provides a preliminary evaluation of existing large language models for human-style multi-turn chatting, through an LLM-based approach. We start from real-world human dialogues and keep the very first utterances as the ChatSEED. Then we prompt LLMs to generate a full multi-turn dialogue (tens of utterances) based on the ChatSEED, utterance by utterance. Finally, we adopt state-of-the-art LLMs (GPT-4, \etc) as the judge to evaluate the generated dialogues. With different evaluation protocols, we come to substantially identical conclusions. We find that GPT-4 can generate human-style multi-turn dialogues with impressive quality, significantly outperforms its counterparts. It's difficult for a discriminator to distinguish between GPT-4 generated dialogues and human dialogues. In contrast, other LLMs struggle to generate multi-turn dialogues of satisfactory quality due to poor instruction-following capability, tendency to generate lengthy utterances, or limited general capability. All data and codes will be provided in https://github.com/open-compass/BotChat/ and we hope they can serve as a valuable resource for evaluating multi-turn chatting capabilities of LLMs.
Ask questions about this paper to our AI assistant
You can also chat with multiple papers at once here.
⚠The license of the paper does not allow us to build upon its content and the AI assistant only knows about the paper metadata rather than the full article.
Assess the quality of the AI-generated content by voting
Score: 0
Why do we need votes?
Votes are used to determine whether we need to re-run our summarizing tools. If the count reaches -10, our tools can be restarted.
Similar papers summarized with our AI tools
Navigate through even more similar papers through a
tree representationLook for similar papers (in beta version)
By clicking on the button above, our algorithm will scan all papers in our database to find the closest based on the contents of the full papers and not just on metadata. Please note that it only works for papers that we have generated summaries for and you can rerun it from time to time to get a more accurate result while our database grows.
Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.