ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data

AI-generated keywords: Innovative Approach

AI-generated Key Points

  • Introduction of innovative approach to enhancing Large Language Model (LLM) agents by fine-tuning open-source models with real-world workflow data
  • Reformatting proprietary workflow data into next-step prediction format and training LLMs using LoRA framework
  • Development of specialized web agents, ScribeAgent, excelling in web understanding and planning
  • Superior performance demonstrated on public benchmarks Mind2Web and WebArena compared to existing models
  • State-of-the-art direct generation results achieved on Mind2Web with 32B-parameter ScribeAgent-Large
  • Significant improvement in task success rates on WebArena with 7B ScribeAgent-Small
  • Insights provided for future web agent research, including direct fine-tuning on structured inputs like HTML-DOM and effective HTML preprocessing strategies
  • Analysis of design choices in fine-tuning process, such as LLM backbone selection and context window optimization
  • Potential benefits of developing specialized web agents through fine-tuning with large-scale real-world data for improved performance and reduced serving costs
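
The reformatting of workflow data into a next-step prediction format, mentioned in the key points above, can be illustrated with a minimal sketch. The `Step` schema, the prompt layout, and the function name `to_next_step_examples` are all assumptions made for illustration; the summary does not describe the actual data schema or prompt template used for ScribeAgent.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    """One recorded action in a workflow (hypothetical schema)."""
    html: str    # page snapshot observed before the action
    action: str  # e.g. "click" or "type"
    target: str  # identifier of the element acted on


def to_next_step_examples(objective: str, steps: List[Step]) -> List[Dict[str, str]]:
    """Turn one workflow into one (prompt, completion) pair per step.

    Each prompt holds the objective, the actions taken so far, and the
    current page; the completion is the action that was actually taken.
    """
    examples: List[Dict[str, str]] = []
    history: List[str] = []
    for step in steps:
        past = "\n".join(history) if history else "(none)"
        prompt = (
            f"Objective: {objective}\n"
            f"Previous actions:\n{past}\n"
            f"Current page:\n{step.html}\n"
            "Next action:"
        )
        completion = f" {step.action} {step.target}"
        examples.append({"prompt": prompt, "completion": completion})
        # The action just emitted becomes history for the following step.
        history.append(f"{step.action} {step.target}")
    return examples
```

Under these assumptions, a workflow with N recorded steps yields N training examples, each of which could then be passed to a fine-tuning framework of choice.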

Authors: Junhong Shen, Atishay Jain, Zedian Xiao, Ishan Amlekar, Mouad Hadji, Aaron Podolny, Ameet Talwalkar

License: CC BY 4.0

Abstract: Large Language Model (LLM) agents are rapidly improving to handle increasingly complex web-based tasks. Most of these agents rely on general-purpose, proprietary models like GPT-4 and focus on designing better prompts to improve their planning abilities. However, general-purpose LLMs are not specifically trained to understand specialized web contexts such as HTML, and they often struggle with long-horizon planning. We explore an alternative approach that fine-tunes open-source LLMs using production-scale workflow data collected from over 250 domains corresponding to 6 billion tokens. This simple yet effective approach shows substantial gains over prompting-based agents on existing benchmarks -- ScribeAgent achieves state-of-the-art direct generation performance on Mind2Web and improves the task success rate by 14.1% over the previous best text-only web agents on WebArena. We further perform detailed ablation studies on various fine-tuning design choices and provide insights into LLM selection, training recipes, context window optimization, and effect of dataset sizes.

Submitted to arXiv on 22 Nov. 2024


Results of the summarizing process for the arXiv paper: 2411.15004v1

In this study, we introduce an approach to enhancing Large Language Model (LLM) agents by fine-tuning open-source models on a large dataset of real-world workflow data. Our methodology reformats the proprietary workflow data into a next-step prediction format and trains the LLMs using the parameter-efficient LoRA framework. With 6 billion tokens in our training dataset, collected from over 250 domains, we develop specialized web agents, called ScribeAgent, that excel in web understanding and planning.

Through extensive evaluations on the public benchmarks Mind2Web and WebArena, ScribeAgent demonstrates superior performance compared to existing GPT-4-based models and multi-stage agents. The 32B-parameter ScribeAgent-Large achieves state-of-the-art direct generation results on Mind2Web, surpassing baseline performance by 5-10% across all test sets. On WebArena, our 7B ScribeAgent-Small improves the task success rate from 37.2% to 51.3%, a gain of 14.1 percentage points, establishing itself as the top-performing text-only LLM agent.

Our study also offers insights for future web agent research. We highlight the feasibility and benefits of direct fine-tuning on structured inputs like HTML-DOM to enhance target identification accuracy. Additionally, we propose strategies for HTML preprocessing that balance information retention against context length. Our analysis also examines design choices in fine-tuning, such as LLM backbone selection and context window optimization.

Overall, our work underscores the potential of developing specialized web agents through fine-tuning with large-scale real-world data. This approach not only enhances agent capabilities relative to prompt-engineered alternatives but also enables the creation of more efficient models with reduced serving costs. By showing that specialized fine-tuning improves agent performance across diverse benchmarks, we pave the way for advancements in web agent development and deployment strategies.
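
To make the trade-off between information retention and context length concrete, here is a minimal sketch of one possible HTML pruning pass using only Python's standard library. It is not the preprocessing pipeline from the paper, which this summary does not specify; the dropped tags, the attribute whitelist, and the names `Pruner` and `prune_html` are all assumptions chosen for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Assumed choices: which tags to drop entirely and which attributes to keep.
DROP_TAGS = {"script", "style", "noscript"}
KEEP_ATTRS = {"id", "name", "type", "role", "aria-label", "placeholder", "href", "value"}
VOID_TAGS = {"input", "br", "img", "hr", "meta", "link"}


class Pruner(HTMLParser):
    """Re-emit HTML without dropped tags, extra attributes, or extra whitespace."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped tag

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = "".join(
            f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs
            if k in KEEP_ATTRS and v is not None
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self.skip_depth:
            self.out.append(escape(text, quote=False))


def prune_html(html: str) -> str:
    """Return a shorter version of `html` that keeps structure and key attributes."""
    parser = Pruner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

In this sketch, element identifiers and accessibility labels are kept because an agent would need them to name a target element, while styling and scripts are dropped because they add tokens without describing the page structure. A real pipeline would likely need further steps, such as truncating very long pages.
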
Created on 18 May. 2025
