TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

AI-generated keywords: Digital Age

AI-generated Key Points

  • Interactions with computers are integral in personal and professional lives in the digital age
  • Large language models (LLMs) have enabled rapid evolution of AI agents to perform work-related tasks
  • TheAgentCompany benchmark assesses performance of LLM agents in a simulated professional environment
  • Competitive agent autonomously completed 24% of tasks, indicating effectiveness for simpler tasks but challenges for complex ones
  • TheAgentCompany offers a comprehensive evaluation framework for AI agents interacting like human workers
  • Collaborative effort behind TheAgentCompany involved multiple institutions and individuals contributing to its development
  • Valuable tool for assessing AI agent performance in real-world work scenarios and advancing research in AI automation

Authors: Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, Graham Neubig

Preprint
License: CC BY 4.0

Abstract: We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and effect change in their surrounding environments. But how performant are AI agents at helping to accelerate or even autonomously perform work-related tasks? The answer to this question has important implications for both industry looking to adopt AI into their workflows, and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the progress of these LLM agents' performance on real-world professional tasks, in this paper, we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that with the most competitive agent, 24% of the tasks can be completed autonomously. This paints a nuanced picture on task automation with LM agents -- in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.

Submitted to arXiv on 18 Dec. 2024


Results of the summarizing process for the arXiv paper: 2412.14161v1

In today's digital age, interactions with computers have become an integral part of both our personal and professional lives. With advances in large language models (LLMs), artificial intelligence (AI) agents have rapidly evolved to interact with and influence their environments. The question arises: how effective are these AI agents at accelerating or autonomously performing work-related tasks? The answer has significant implications for industries considering AI integration into their workflows and for policymakers seeking to understand the impact on the labor market.
To assess the performance of LLM agents on real-world professional tasks, the paper introduces a new benchmark called TheAgentCompany. The benchmark simulates a digital worker's activities within a small software company environment, including web browsing, coding, program execution, and communication with colleagues. Baseline agents powered by closed API-based and open-weights language models were tested within this environment. The most competitive agent autonomously completed 24% of the tasks. This suggests that a good portion of simpler tasks can be solved autonomously by current systems, while more difficult long-horizon tasks remain beyond their reach.
The paper also compares TheAgentCompany with existing benchmarks in terms of task diversity, realism, interface capabilities, self-hosted environments, interaction requirements, checkpoint evaluations, and NPC agent interactions. The collaborative effort behind TheAgentCompany involved multiple institutions and individuals contributing to task design, infrastructure development, experiments, Sotopia integration, task development, ideation, discussions, and formulation, under the guidance of the project leads. The acknowledgments thank Open Philanthropy for funding support and various individuals for insightful discussions throughout the project. Overall, TheAgentCompany is a tool for assessing AI agent performance in real-world work scenarios and contributes to research on AI automation in professional settings.
Created on 09 Jan. 2025

