VeRO: A Harness for Agents to Optimize Agents

AI-generated keywords: VeRO, agent optimization, long-horizon target tasks, TerminalBench-2, interpretability

AI-generated Key Points

  • Study explores applicability of VeRO in optimizing agents for complex and long-horizon coding tasks
  • Case study conducted using TerminalBench-2 benchmark with Terminus-KIRA agent and Claude Haiku 4.5 LLM
  • Two modes tested: Tools interface and Filesystem interface for exposing execution traces and dataset content to optimizer
  • Three optimization runs conducted using Claude Code (Sonnet 4.5) with varying sample budgets under different interfaces
  • Results show improvements in pass rates over baseline agent, identifying fixes that enhance performance
  • Nuanced dynamics observed with different combinations of fixes found in each run, highlighting complexity of optimizing agent harnesses
  • Interpretability explored through Git commit histories to investigate semantic trends in optimization process for various tasks
  • GPT-4.1 used to tag changes made by coding agent during each optimization trajectory, providing insights into impact on agent performance
  • Study emphasizes effectiveness of VeRO in optimizing agents for long-horizon coding tasks and importance of interpretability in understanding optimization processes

Authors: Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan Xue, Samuel Marc Denton

Accepted to the Forty-Third International Conference on Machine Learning (ICML), 2026
License: CC BY 4.0

Abstract: An important emerging application of coding agents is agent harness optimization: the iterative improvement of a target agent by editing and evaluating its code. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Harness optimization differs from conventional software engineering: agent harnesses interleave deterministic code with stochastic LLM completions, requiring structured capture of both intermediate execution traces and downstream outcomes. To address these challenges, we introduce (1) VeRO (Versioning, Rewards, and Observations), an outer harness that provides versioned snapshots, budget-controlled evaluation, and structured execution traces of target harnesses, and (2) VeRO-Bench, a benchmark suite of target agents and tasks with reference evaluation procedures. Using VeRO, we conduct an empirical study comparing optimizers across tasks and analyzing which modifications reliably improve target agent harnesses. We release VeRO to support research on agent optimization as a core capability for coding agents. Code is available at https://github.com/scaleapi/vero.
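The three components named in the abstract (versioning, rewards, observations) can be illustrated with a toy outer loop. This is only a minimal sketch under stated assumptions, not VeRO's actual API: `ToyOuterHarness`, `Snapshot`, `Trace`, `BudgetExceeded`, and `toy_agent` are hypothetical names invented here, the "harness code" is reduced to a plain string, and the LLM-driven agent is replaced by a deterministic stand-in so the example runs offline.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trace:
    """Observation: what happened on one task (hypothetical structure)."""
    task_id: str
    steps: List[str]
    reward: float


@dataclass
class Snapshot:
    """Versioned record of one evaluated harness variant."""
    version: str
    code: str
    mean_reward: float
    traces: List[Trace]


class BudgetExceeded(RuntimeError):
    """Raised when an evaluation would exceed the sample budget."""


class ToyOuterHarness:
    """Hypothetical outer harness: versions, rewards, observations."""

    def __init__(self, tasks: List[str],
                 run_agent: Callable[[str, str], Trace],
                 sample_budget: int) -> None:
        if not tasks:
            raise ValueError("at least one task is required")
        self.tasks = list(tasks)
        self.run_agent = run_agent
        self.sample_budget = sample_budget
        self.samples_used = 0
        self.history: List[Snapshot] = []

    def evaluate(self, code: str) -> Snapshot:
        # Budget control: refuse the run if it would overspend.
        if self.samples_used + len(self.tasks) > self.sample_budget:
            raise BudgetExceeded("sample budget exhausted")
        traces = [self.run_agent(code, task) for task in self.tasks]
        self.samples_used += len(traces)
        # Versioning: a content hash stands in for a Git commit id.
        version = hashlib.sha1(code.encode("utf-8")).hexdigest()[:8]
        mean_reward = sum(t.reward for t in traces) / len(traces)
        snapshot = Snapshot(version, code, mean_reward, traces)
        self.history.append(snapshot)
        return snapshot

    def best(self) -> Snapshot:
        # Raises ValueError if nothing has been evaluated yet.
        return max(self.history, key=lambda s: s.mean_reward)


def toy_agent(code: str, task_id: str) -> Trace:
    """Deterministic stand-in for an LLM agent: a task passes
    only if its name appears in the harness 'code' string."""
    solved = task_id in code
    steps = [f"attempt {task_id}", "pass" if solved else "fail"]
    return Trace(task_id, steps, 1.0 if solved else 0.0)
```

An optimizer in this sketch would call `evaluate` on each candidate edit, read the returned traces to decide what to change next, and call `best()` once the budget runs out.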

Submitted to arXiv on 25 Feb. 2026

Results of the summarizing process for the arXiv paper: 2602.22480v4

This study examines whether VeRO can optimize agents for complex, long-horizon coding tasks. The case study uses TerminalBench-2, a benchmark of 89 terminal tasks evaluated in sandboxed containers. The base target agent is Terminus-KIRA with Claude Haiku 4.5 as the underlying LLM. Two ways of exposing execution traces and dataset content to the optimizer are tested: a Tools interface and a Filesystem interface. Three optimization runs use Claude Code (Sonnet 4.5) as the optimizer, with varying sample budgets under the different interfaces.

Pass rates improve over the baseline Terminus-KIRA agent, and both the Tools and Filesystem interfaces identify fixes that enhance performance. However, different runs find different combinations of fixes, which indicates how complex agent harness optimization is.

Interpretability is explored through Git commit histories, which are used to investigate semantic trends in the optimization process across tasks. GPT-4.1 tags the changes made by the coding agent during each optimization trajectory, providing insight into how those changes affect agent performance. Overall, the study presents VeRO as effective for optimizing agents on long-horizon coding tasks and underscores the importance of interpretability in understanding optimization processes. The findings support research on agent optimization as a core capability for coding agents.
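The commit-tagging step can be pictured as mapping each commit message to one or more tags and counting them over a trajectory. The sketch below is an illustration only: the summary says GPT-4.1 performs the tagging, whereas here a keyword lookup stands in for the LLM, and the tag names, the function names `keyword_tagger` and `tag_trajectory`, and the sample commit messages are all hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, List


def keyword_tagger(message: str) -> List[str]:
    """Stand-in for an LLM tagger. Tag names are hypothetical."""
    rules = {
        "prompt": "prompt",
        "retry": "error-handling",
        "timeout": "error-handling",
        "tool": "tooling",
    }
    lowered = message.lower()
    # A set removes duplicate tags triggered by several keywords.
    tags = sorted({tag for kw, tag in rules.items() if kw in lowered})
    return tags or ["other"]


def tag_trajectory(
    messages: Iterable[str],
    tagger: Callable[[str], List[str]] = keyword_tagger,
) -> Counter:
    """Count how often each tag appears across one trajectory."""
    counts: Counter = Counter()
    for message in messages:
        counts.update(tagger(message))
    return counts


# Made-up commit messages, used only to exercise the functions.
example = tag_trajectory([
    "Tighten system prompt",
    "Add retry on timeout",
    "Refactor",
])
```

Swapping `keyword_tagger` for a function that calls an LLM would leave `tag_trajectory` unchanged, which is why the tagger is passed in as a parameter.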
Created on 13 Jun. 2026
