VeRO: A Harness for Agents to Optimize Agents
AI-generated Key Points
- The study examines whether VeRO can optimize agents for complex, long-horizon coding tasks.
- A case study uses the TerminalBench-2 benchmark with the Terminus-KIRA agent and the Claude Haiku 4.5 LLM.
- Two modes are tested for exposing execution traces and dataset content to the optimizer: a Tools interface and a Filesystem interface.
- Three optimization runs are conducted using Claude Code (Sonnet 4.5), with varying sample budgets under the different interfaces.
- Results show improved pass rates over the baseline agent, with the optimizer identifying fixes that enhance performance.
- Each run finds a different combination of fixes, which highlights the complexity of optimizing agent harnesses.
- Interpretability is explored through Git commit histories, which are used to investigate semantic trends in the optimization process across tasks.
- GPT-4.1 is used to tag the changes made by the coding agent during each optimization trajectory, providing insight into their impact on agent performance.
- The study emphasizes the effectiveness of VeRO in optimizing agents for long-horizon coding tasks and the importance of interpretability in understanding the optimization process.
Authors: Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan Xue, Samuel Marc Denton
Abstract: An important emerging application of coding agents is agent harness optimization: the iterative improvement of a target agent by editing and evaluating its code. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Harness optimization differs from conventional software engineering: agent harnesses interleave deterministic code with stochastic LLM completions, requiring structured capture of both intermediate execution traces and downstream outcomes. To address these challenges, we introduce (1) VeRO (Versioning, Rewards, and Observations), an outer harness that provides versioned snapshots, budget-controlled evaluation, and structured execution traces of target harnesses, and (2) VeRO-Bench, a benchmark suite of target agents and tasks with reference evaluation procedures. Using VeRO, we conduct an empirical study comparing optimizers across tasks and analyzing which modifications reliably improve target agent harnesses. We release VeRO to support research on agent optimization as a core capability for coding agents. Code is available at https://github.com/scaleapi/vero.
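The three ingredients named in the abstract (versioned snapshots, budget-controlled evaluation, and structured execution traces) can be illustrated with a small toy loop. This is only a sketch under stated assumptions, not VeRO's actual API: the names `Snapshot`, `BudgetedEvaluator`, `make_agent`, and `optimize` are hypothetical, the "agent" is a deterministic stand-in with no LLM calls, and candidate edits are supplied as a fixed list rather than proposed by a coding agent.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """One accepted version of the target agent (hypothetical structure)."""
    version: int
    config: dict
    score: float


@dataclass
class BudgetedEvaluator:
    """Counts task evaluations against a fixed budget and records traces."""
    budget: int
    used: int = 0
    traces: list = field(default_factory=list)

    def can_afford(self, n_tasks):
        return self.used + n_tasks <= self.budget

    def evaluate(self, agent, tasks):
        if not self.can_afford(len(tasks)):
            raise RuntimeError("evaluation budget exhausted")
        self.used += len(tasks)
        passed = 0
        for task in tasks:
            ok = agent(task)
            # A structured trace entry: which task ran and whether it passed.
            self.traces.append({"task": task, "passed": ok})
            passed += ok
        return passed / len(tasks)


def make_agent(config):
    """Toy agent (assumption): solves a task iff its difficulty is low enough."""
    return lambda task: task <= config["max_difficulty"]


def optimize(baseline, candidates, tasks, budget):
    """Evaluate candidate edits in order, keeping only those that improve."""
    evaluator = BudgetedEvaluator(budget)
    baseline_score = evaluator.evaluate(make_agent(baseline), tasks)
    history = [Snapshot(0, baseline, baseline_score)]
    for config in candidates:
        if not evaluator.can_afford(len(tasks)):
            break  # stop once another full evaluation would exceed the budget
        score = evaluator.evaluate(make_agent(config), tasks)
        if score > history[-1].score:
            history.append(Snapshot(len(history), config, score))
    return history, evaluator


history, evaluator = optimize(
    baseline={"max_difficulty": 1},
    candidates=[{"max_difficulty": 3}, {"max_difficulty": 2}, {"max_difficulty": 4}],
    tasks=[1, 2, 3, 4],
    budget=12,
)
for snap in history:
    print(snap.version, snap.config, snap.score)
```

With a budget of 12 evaluations and four tasks, the loop can afford the baseline plus two candidates: the first candidate is accepted as version 1, the second is rejected because it scores lower, and the third is never evaluated. A real harness optimizer would replace the fixed candidate list with code edits proposed by a coding agent that reads the recorded traces.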