LLMs Get Lost In Multi-Turn Conversation

AI-generated keywords: Large Language Models, Multi-turn Conversations, Performance Evaluation, Simulation Experiments, Challenges

AI-generated Key Points

  • Study focused on evaluating Large Language Models (LLMs) in multi-turn conversations vs. single-turn interactions
  • Significant drop in performance in multi-turn settings, with an average decrease of 39%
  • Performance degradation attributed to minor loss in aptitude and significant increase in unreliability
  • LLMs tend to make assumptions in early turns and prematurely attempt final solutions, on which they then overly rely
  • Tasks included Data-to-Text and Summary generation, which test long-context capabilities
  • Different metrics used for evaluation: binary correctness and continuous scoring
  • Sharding process detailed in the appendix for reproducibility and future research
  • Previous-generation models not designed for multi-turn conversations, but interest growing with advancements like ChatGPT
  • Efforts made to evaluate LLMs' ability in longer conversations through crowd-sourced annotations and expanded frameworks

Authors: Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, Jennifer Neville

License: CC BY 4.0

Abstract: Large Language Models (LLMs) are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also to help them define, explore, and refine what they need through multi-turn conversational exchange. Although analysis of LLM conversation logs has confirmed that underspecification occurs frequently in user instructions, LLM evaluation has predominantly focused on the single-turn, fully-specified instruction setting. In this work, we perform large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. Our experiments confirm that all the top open- and closed-weight LLMs we test exhibit significantly lower performance in multi-turn conversations than single-turn, with an average drop of 39% across six generation tasks. Analysis of 200,000+ simulated conversations decomposes the performance degradation into two components: a minor loss in aptitude and a significant increase in unreliability. We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that *when LLMs take a wrong turn in a conversation, they get lost and do not recover*.

Submitted to arXiv on 09 May. 2025

Results of the summarizing process for the arXiv paper: 2505.06120v1

This study evaluates the performance of Large Language Models (LLMs) in multi-turn conversations compared with single-turn interactions. The researchers conducted large-scale simulation experiments with top open- and closed-weight LLMs across six generation tasks. Performance dropped significantly in multi-turn settings, by 39% on average. Analysis of over 200,000 simulated conversations decomposed the degradation into a minor loss in aptitude and a significant increase in unreliability. LLMs often make assumptions early in the conversation and prematurely attempt to generate final solutions, on which they then overly rely.

The tasks included Data-to-Text and Summary generation, which test long-context capabilities known to affect model performance. Evaluation metrics differed by task: binary correctness for certain tasks and continuous scoring for refinement tasks. The sharding process for each task is detailed in the appendix to support reproducibility and future research on LLM evaluation in diverse settings.

Previous-generation language models were not designed for multi-turn conversations, so evaluation focused on single-turn tasks. Since advancements like ChatGPT sparked interest in multi-turn assessment, efforts have been made to evaluate LLMs' ability in longer conversations through crowd-sourced annotations and expanded evaluation frameworks. The study offers insight into the challenges LLMs face in navigating multi-turn dialogues.
Created on 15 May. 2025
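
The split of the degradation into "aptitude" and "unreliability" can be made concrete with a small sketch. The summary above does not say how the two quantities are computed, so the definitions below are an assumption made for illustration: aptitude is taken as the 90th-percentile score over repeated simulations of the same task, and unreliability as the gap between the 90th and 10th percentiles. The function names and the score lists are hypothetical and the numbers are made up; they are not results from the paper.

```python
from statistics import mean


def percentile(values, q):
    """Percentile with linear interpolation; q is in [0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarize(scores):
    """Summarize repeated runs of one task (scores on a 0-100 scale).

    ASSUMPTION: aptitude = 90th percentile, unreliability = 90th minus
    10th percentile. The summary text does not give these definitions.
    """
    p10 = percentile(scores, 10)
    p90 = percentile(scores, 90)
    return {
        "average": mean(scores),
        "aptitude": p90,
        "unreliability": p90 - p10,
    }


def relative_drop(single_avg, multi_avg):
    """Relative performance drop in percent, single-turn as the baseline."""
    return 100 * (single_avg - multi_avg) / single_avg


# Made-up scores for ten simulated runs of the same task in each setting.
single_turn = [90, 85, 95, 88, 92, 91, 87, 93, 89, 90]
multi_turn = [85, 20, 90, 35, 80, 15, 88, 40, 75, 22]

single = summarize(single_turn)
multi = summarize(multi_turn)
print("single-turn:", single)
print("multi-turn: ", multi)
print("relative drop (%):", relative_drop(single["average"], multi["average"]))
```

With these invented numbers the best multi-turn runs score almost as well as the single-turn ones, so aptitude falls only slightly, while the spread between good and bad runs widens sharply. That is the pattern the summary describes: a minor loss in aptitude and a significant increase in unreliability.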
