VeRO: A Harness for Agents to Optimize Agents

AI-generated keywords: VeRO agent optimization long-horizon target tasks TerminalBench-2 interpretability

AI-generated Key Points

Study explores applicability of VeRO in optimizing agents for complex and long-horizon coding tasks
Case study conducted using TerminalBench-2 benchmark with Terminus-KIRA agent and Claude Haiku 4.5 LLM
Two modes tested: Tools interface and Filesystem interface for exposing execution traces and dataset content to optimizer
Three optimization runs conducted using Claude Code (Sonnet 4.5) with varying sample budgets under different interfaces
Results show improvements in pass rates over baseline agent, identifying fixes that enhance performance
Nuanced dynamics observed with different combinations of fixes found in each run, highlighting complexity of optimizing agent harnesses
Interpretability explored through Git commit histories to investigate semantic trends in optimization process for various tasks
GPT-4.1 used to tag changes made by coding agent during each optimization trajectory, providing insights into impact on agent performance
Study emphasizes effectiveness of VeRO in optimizing agents for long-horizon coding tasks and importance of interpretability in understanding optimization processes


Authors: Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan Xue, Samuel Marc Denton

arXiv: 2602.22480v4 - DOI (cs.AI)

Accepted to the Forty-Third International Conference on Machine Learning (ICML), 2026

License: CC BY 4.0

Abstract: An important emerging application of coding agents is agent harness optimization: the iterative improvement of a target agent by editing and evaluating its code. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Harness optimization differs from conventional software engineering: agent harnesses interleave deterministic code with stochastic LLM completions, requiring structured capture of both intermediate execution traces and downstream outcomes. To address these challenges, we introduce (1) VeRO (Versioning, Rewards, and Observations), an outer harness that provides versioned snapshots, budget-controlled evaluation, and structured execution traces of target harnesses, and (2) VeRO-Bench, a benchmark suite of target agents and tasks with reference evaluation procedures. Using VeRO, we conduct an empirical study comparing optimizers across tasks and analyzing which modifications reliably improve target agent harnesses. We release VeRO to support research on agent optimization as a core capability for coding agents. Code is available at https://github.com/scaleapi/vero.
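The three components named in the abstract (versioned snapshots, budget-controlled evaluation, and structured execution traces) can be pictured with a minimal sketch. Everything below is an assumption made for illustration: the class `OuterHarness`, the `Trace` record, and the toy task runner are hypothetical names and are not taken from the VeRO codebase. The toy runner is deterministic so the example is repeatable, whereas a real target agent involves stochastic LLM completions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical names throughout; this is NOT the VeRO API.


@dataclass
class Trace:
    """Structured record of one task attempt."""
    task_id: str
    steps: List[str]
    passed: bool


@dataclass
class OuterHarness:
    budget: int  # maximum number of task evaluations allowed
    snapshots: Dict[int, str] = field(default_factory=dict)
    traces: Dict[int, List[Trace]] = field(default_factory=dict)
    used: int = 0

    def commit(self, harness_source: str) -> int:
        """Store a versioned snapshot of the target harness and return its id."""
        version = len(self.snapshots)
        self.snapshots[version] = harness_source
        return version

    def evaluate(
        self,
        version: int,
        run_task: Callable[[str, str], Trace],
        task_ids: List[str],
    ) -> float:
        """Run the snapshot on the tasks, keep the traces, return the pass rate."""
        if self.used + len(task_ids) > self.budget:
            raise RuntimeError("evaluation budget exhausted")
        self.used += len(task_ids)
        source = self.snapshots[version]
        results = [run_task(source, task_id) for task_id in task_ids]
        self.traces.setdefault(version, []).extend(results)
        return sum(r.passed for r in results) / len(results)


# Toy stand-in for an agent rollout: a task passes when the harness source
# mentions the fix that task needs (an empty requirement always passes).
REQUIRED_FIX = {"t1": "retry", "t2": "timeout", "t3": ""}


def toy_run_task(source: str, task_id: str) -> Trace:
    passed = REQUIRED_FIX[task_id] in source
    return Trace(task_id=task_id, steps=[f"attempt {task_id}"], passed=passed)


if __name__ == "__main__":
    harness = OuterHarness(budget=6)
    baseline = harness.commit("base")
    candidate = harness.commit("base retry")
    tasks = list(REQUIRED_FIX)
    print(harness.evaluate(baseline, toy_run_task, tasks))
    print(harness.evaluate(candidate, toy_run_task, tasks))
```

In this sketch the budget is counted in task evaluations, so an optimizer that evaluates too often is stopped before it can overfit by brute force; how VeRO itself accounts for budget is not stated in the abstract.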

Submitted to arXiv on 25 Feb. 2026


Results of the summarizing process for the arXiv paper: 2602.22480v4


This study explores the applicability of VeRO in optimizing agents for complex and long-horizon coding tasks. A case study is conducted using TerminalBench-2, a benchmark consisting of 89 terminal tasks evaluated in sandboxed containers. The base target agent used is Terminus-KIRA with Claude Haiku 4.5 as the underlying LLM. Two modes of exposing execution traces and dataset content to the optimizer are tested: Tools interface and Filesystem interface. Three optimization runs are conducted using Claude Code (Sonnet 4.5) as the optimizer with varying sample budgets under different interfaces. Results show improvements in pass rates over the baseline Terminus-KIRA agent, with both Tools and Filesystem interfaces identifying fixes that enhance performance. However, nuanced dynamics are observed where different runs find different combinations of fixes, indicating the complexity of optimizing agent harnesses. Interpretability is also explored by leveraging Git commit histories to investigate semantic trends in the optimization process for various tasks. GPT-4.1 is used to tag changes made by the coding agent during each optimization trajectory, providing insights into how optimizations impact agent performance. Overall, this study highlights the effectiveness of VeRO in optimizing agents for long-horizon coding tasks and underscores the importance of interpretability in understanding optimization processes. The findings contribute to advancing research on agent optimization as a crucial capability for coding agents.
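The summary names two ways of exposing execution traces to the optimizer but does not describe how they work. One plausible reading, sketched below under stated assumptions, is that a Tools interface answers queries through function calls while a Filesystem interface writes the same records to disk for the optimizer to read with ordinary file tools. The class `ToolsInterface`, the function `export_to_filesystem`, the trace fields, and the example data are all hypothetical and do not come from the paper.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

# Invented example traces; the field names are assumptions, not VeRO's schema.
EXAMPLE_TRACES: List[Dict] = [
    {"task_id": "task-a", "passed": False, "steps": ["ls", "make"]},
    {"task_id": "task-b", "passed": True, "steps": ["cat README", "pytest"]},
]


class ToolsInterface:
    """Hypothetical tool-style access: the optimizer calls functions."""

    def __init__(self, traces: List[Dict]) -> None:
        self._by_id = {t["task_id"]: t for t in traces}

    def list_failed_tasks(self) -> List[str]:
        return sorted(tid for tid, t in self._by_id.items() if not t["passed"])

    def get_trace(self, task_id: str) -> Dict:
        return self._by_id[task_id]


def export_to_filesystem(traces: List[Dict], root: Path) -> List[Path]:
    """Hypothetical file-style access: one JSON file per trace under root."""
    root.mkdir(parents=True, exist_ok=True)
    paths = []
    for trace in traces:
        path = root / f"{trace['task_id']}.json"
        path.write_text(json.dumps(trace, indent=2))
        paths.append(path)
    return paths


def pass_rate(traces: List[Dict]) -> float:
    """Fraction of traces whose task passed."""
    return sum(t["passed"] for t in traces) / len(traces)


if __name__ == "__main__":
    tools = ToolsInterface(EXAMPLE_TRACES)
    print(tools.list_failed_tasks())
    with tempfile.TemporaryDirectory() as tmp:
        for path in export_to_filesystem(EXAMPLE_TRACES, Path(tmp) / "traces"):
            print(path.name)
    print(pass_rate(EXAMPLE_TRACES))
```

Both routes carry the same information here; the difference is only in how the optimizer reaches it, which is the variable the case study compares.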


Summary

A study looked at how to make computer programs better at solving difficult coding tasks that take a long time. They used a special test called TerminalBench-2 with two different ways for the computer program to show its work. The study tried changing the program in three different ways and saw improvements in how well it did its job compared to before. By looking at the history of changes made by the program, they learned more about how to make it work even better.

Definitions

- Applicability: How useful something is in a specific situation.
- Optimizing: Making something work better or more efficiently.
- Agents: Programs or systems that can perform tasks on their own.
- Benchmark: A standard test or measurement used for comparison.
- Interface: The way two things connect or communicate with each other.
- Dataset: A collection of data or information.
- Optimization runs: Different attempts at improving something.
- Baseline agent: The original version of a program used for comparison.
- Interpretability: Being able to understand and explain something clearly.
- Semantic trends: Patterns related to the meaning of words or data.
- Trajectory: The path or course taken during a process.

VeRO (Versioning, Rewards, and Observations) is an outer harness for optimizing agents: it provides versioned snapshots, budget-controlled evaluation, and structured execution traces of a target agent harness. A recent study explores whether VeRO applies to agents built for complex, long-horizon coding tasks.

The study uses TerminalBench-2, a benchmark consisting of 89 terminal tasks evaluated in sandboxed containers. The base target agent is Terminus-KIRA with Claude Haiku 4.5 as the underlying LLM (large language model). Two modes of exposing execution traces and dataset content to the optimizer are tested: a Tools interface and a Filesystem interface. Three optimization runs are then conducted using Claude Code (Sonnet 4.5) as the optimizer, with varying sample budgets under the different interfaces.

The results show improvements in pass rates over the baseline Terminus-KIRA agent under both the Tools and Filesystem interfaces. This indicates that the optimizer, working through VeRO, was able to identify fixes that enhance performance on complex, long-horizon tasks. The dynamics are nuanced, however: different runs found different combinations of fixes, which highlights how complex it is to optimize agent harnesses.

The study also explores interpretability, that is, understanding how the optimizer's changes affect agent performance. Git commit histories were used to investigate semantic trends across optimization trajectories for various tasks, and GPT-4.1 was used to tag the changes made by the coding agent during each trajectory.

Overall, this study contributes to advancing research on agent optimization as a crucial capability for coding agents. The findings support VeRO as a harness for optimizing agents on complex, long-horizon tasks, and they emphasize the role of interpretability in understanding and improving the optimization process.
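The commit-tagging analysis can be illustrated with a small sketch. The study uses GPT-4.1 to assign tags; the code below swaps in a simple keyword matcher as a stand-in so that it runs without any model access. The tag names, keyword lists, and commit messages are invented for illustration and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical tag vocabulary; the paper's actual tags are not given here.
TAG_KEYWORDS: Dict[str, List[str]] = {
    "prompt": ["prompt", "instruction"],
    "tooling": ["tool", "command"],
    "error-handling": ["retry", "timeout", "error"],
}


def tag_commit(message: str) -> List[str]:
    """Return every tag whose keywords appear in the message, else 'other'."""
    lowered = message.lower()
    tags = [
        tag
        for tag, words in TAG_KEYWORDS.items()
        if any(word in lowered for word in words)
    ]
    return tags or ["other"]


def tag_history(messages: List[str]) -> Counter:
    """Count how often each tag occurs across one optimization trajectory."""
    counts: Counter = Counter()
    for message in messages:
        counts.update(tag_commit(message))
    return counts


if __name__ == "__main__":
    # Invented commit messages standing in for one optimizer run.
    history = [
        "Rewrite system prompt to list constraints first",
        "Add retry on command timeout",
        "Bump version",
    ]
    for tag, count in tag_history(history).most_common():
        print(tag, count)
```

Aggregating tags per run in this way is one route to comparing which kinds of changes each optimization trajectory favored; a model-based tagger would replace `tag_commit` while the counting step stays the same.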

Created on 13 Jun. 2026

