LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

AI-generated keywords: AI agent capabilities, Model Context Protocol, LiveMCP-101, tool orchestration, real-world scenarios

AI-generated Key Points

  • The Model Context Protocol (MCP) provides a standardized framework for integrating tools in AI agent tasks.
  • LiveMCP-101 is a benchmark with 101 real-world queries that require multiple MCP tools for completion.
  • Evaluating AI agents on LiveMCP-101 shows that even frontier LLMs achieve success rates below 60%, highlighting major challenges in tool orchestration.
  • Detailed analysis reveals failure modes and inefficiencies in token usage within current models, offering insights for improvement.
  • The benchmark sets a rigorous standard for assessing real-world agent capabilities and aims to pave the way for autonomous AI systems proficient in executing complex tasks.
  • Additional complexities arise from scenarios like market research projects and personal requests, highlighting the need for advanced AI systems capable of navigating diverse tool chains efficiently.
  • Evaluation metrics include task success rate (TSR), average result score (ARS), average trajectory score (ATS), average tool calls, and average tokens used, which together give a comprehensive view of model performance across tasks.
  • An innovative LLM-as-a-Judge approach has been employed to assess final outputs and execution trajectories, with human-expert studies conducted to compare judgments between experts and LLM judges.

Authors: Ming Yin, Dinghan Shen, Silei Xu, Jianbing Han, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Simin Ma, Song Wang, Sathish Reddy Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song

License: CC BY 4.0

Abstract: Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.

Submitted to arXiv on 21 Aug. 2025

Results of the summarizing process for the arXiv paper: 2508.15760v1

Tool calling has become a critical capability for AI agents that must interact with the real world to solve complex tasks. The Model Context Protocol (MCP) offers a standardized framework for integrating tools, but there is still a gap in evaluating how effectively AI agents can solve multi-step tasks using diverse MCP tools in dynamic scenarios. To bridge this gap, LiveMCP-101 is introduced as a benchmark of 101 carefully curated real-world queries that require the coordinated use of multiple MCP tools such as web search, file operations, mathematical reasoning, and data analysis. The queries were refined through iterative LLM rewriting and manual review. The benchmark also introduces a novel evaluation approach that relies on ground-truth execution plans instead of raw API outputs, which better reflects the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analyses further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for improving current models. LiveMCP-101 sets a rigorous standard for assessing real-world agent capabilities and advances toward autonomous AI systems that reliably execute complex tasks through tool use. Two scenarios illustrate the difficulty: BrightPath Analytics' market research project on digital art engagement, commissioned by artist Lucia Moretti, and a father's personal request for NBA game tickets to surprise his son. Both demand multi-step composition of heterogeneous MCP tools with precise parameterization and output handling.
These scenarios underscore the need for AI systems that can navigate diverse tool chains to fulfill intricate user requests across domains. The evaluation framework reports five metrics: task success rate (TSR), average result score (ARS), average trajectory score (ATS), average tool calls, and average tokens used. An LLM-as-a-Judge approach assesses both final outputs and execution trajectories, and human-expert studies compare the experts' judgments with those of the LLM judges. Overall, the work shows how demanding it is to evaluate AI agent capabilities in real-world scenarios and how much successful task completion depends on efficient tool orchestration.
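To make the five reported metrics concrete, the following is a minimal sketch of how they might be aggregated from per-task records. It is not the paper's evaluation code. The `TaskRecord` type, the `summarize` function, the assumption that judge scores lie in [0, 1], and the rule that a task counts as successful when its result score reaches `success_threshold` are all illustrative assumptions; the summary does not state the exact definitions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRecord:
    """One evaluated task (hypothetical schema, not taken from the paper)."""
    result_score: float      # judge score for the final output, assumed in [0, 1]
    trajectory_score: float  # judge score for the execution trajectory, assumed in [0, 1]
    tool_calls: int          # number of tool calls the agent made
    tokens: int              # tokens the agent consumed


def summarize(records, success_threshold=1.0):
    """Aggregate per-task records into the five metrics named in the summary.

    Assumption: a task succeeds when result_score >= success_threshold.
    """
    if not records:
        raise ValueError("at least one task record is required")
    return {
        # TSR: fraction of tasks counted as successful
        "TSR": mean(1.0 if r.result_score >= success_threshold else 0.0
                    for r in records),
        # ARS / ATS: plain averages of the judge scores
        "ARS": mean(r.result_score for r in records),
        "ATS": mean(r.trajectory_score for r in records),
        "avg_tool_calls": mean(r.tool_calls for r in records),
        "avg_tokens": mean(r.tokens for r in records),
    }


if __name__ == "__main__":
    demo = [
        TaskRecord(result_score=1.0, trajectory_score=0.75, tool_calls=4, tokens=1000),
        TaskRecord(result_score=0.5, trajectory_score=0.25, tool_calls=6, tokens=3000),
    ]
    for name, value in summarize(demo).items():
        print(f"{name}: {value}")
```

Keeping TSR separate from ARS matters under these assumptions: a model can collect partial credit on many tasks (a reasonable ARS) while fully completing few of them (a low TSR), and the average tool calls and tokens show what that performance cost.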
Created on 06 Nov. 2025
