Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

AI-generated keywords: Coding-agent performance, Harness engineering, Agentic Harness Engineering (AHE), Observability pillars, Autonomous harness evolution

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

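  • Harnesses are central to coding-agent performance: they mediate how models interact with tools and execution environments.
  • Harness engineering remains manual because automating it faces a heterogeneous action space across editable components, voluminous trajectories that bury actionable signal, and edits whose effect is hard to attribute.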
  • Agentic Harness Engineering (AHE) introduces a closed-loop system with three observability pillars: component observability, experience observability, and decision observability.
  • AHE turns every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error.
  • Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO.
  • The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success with 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp gains across three alternate model families.
  • Ablation studies attribute improvement to tools, middleware, and long-term memory rather than system prompts.
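The headline comparison reduces to simple percentage-point arithmetic. The following is a minimal Python sketch that uses only the figures reported in the abstract (69.7%, 77.0%, and 71.9% for Codex-CLI); the helper names `pp_gain` and `relative_gain` are illustrative and do not come from the paper.

```python
def pp_gain(before: float, after: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Relative improvement over the starting value, in percent."""
    return round(100.0 * (after - before) / before, 1)


# pass@1 on Terminal-Bench 2, as reported in the abstract
SEED_PASS_AT_1 = 69.7       # seed harness
EVOLVED_PASS_AT_1 = 77.0    # after ten AHE iterations
CODEX_CLI_PASS_AT_1 = 71.9  # human-designed baseline

gain_over_seed = pp_gain(SEED_PASS_AT_1, EVOLVED_PASS_AT_1)        # 7.3 pp
gain_over_codex = pp_gain(CODEX_CLI_PASS_AT_1, EVOLVED_PASS_AT_1)  # 5.1 pp
relative_over_seed = relative_gain(SEED_PASS_AT_1, EVOLVED_PASS_AT_1)

print(f"+{gain_over_seed}pp over the seed, +{gain_over_codex}pp over Codex-CLI")
```

Rounding to one decimal place matches the precision of the reported scores and avoids floating-point residue in the subtraction.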

Authors: Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Zhiheng Xi, Xuanjing Huang, Hang Yan, Zhenhua Han, Tao Gui, Yu-Gang Jiang

Abstract: Harnesses are now central to coding-agent performance, mediating how models interact with tools and execution environments. Yet harness engineering remains a manual craft, because automating it faces a heterogeneous action space across editable components, voluminous trajectories that bury actionable signal, and edits whose effect is hard to attribute. We introduce Agentic Harness Engineering (AHE), a closed loop that addresses these challenges through three matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning. Ablations localize the gain to tools, middleware, and long-term memory rather than the system prompt, suggesting factual harness structure transfers while prose-level strategy does not.
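The third pillar, decision observability, can be illustrated with a small sketch. The abstract says only that each edit carries a self-declared prediction that is later verified against the next round's task-level outcomes; the record layout, the `EditContract` and `verify` names, and the pass/fail dictionaries below are assumptions made for illustration, not the paper's implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EditContract:
    """Hypothetical record: a harness edit plus its self-declared prediction."""
    edit_id: str
    component: str                  # file-level component that was edited
    predicted_fixes: set[str]       # tasks expected to flip from fail to pass
    predicted_regressions: set[str] = field(default_factory=set)


def verify(contract: EditContract,
           before: dict[str, bool],
           after: dict[str, bool]) -> dict:
    """Check the prediction against task-level outcomes of the next round."""
    fixed = {t for t, ok in after.items() if ok and not before.get(t, False)}
    broken = {t for t, ok in after.items() if not ok and before.get(t, False)}
    missed = contract.predicted_fixes - fixed
    unexpected = broken - contract.predicted_regressions
    return {
        "confirmed": contract.predicted_fixes & fixed,
        "missed": missed,
        "unexpected_regressions": unexpected,
        # The contract holds only if nothing predicted failed to happen
        # and nothing unpredicted broke.
        "holds": not missed and not unexpected,
    }


# Assumed toy outcomes: task name -> passed?
before = {"task_a": False, "task_b": True, "task_c": False}
after = {"task_a": True, "task_b": False, "task_c": False}

contract = EditContract("edit-1", "tools/shell.py", {"task_a", "task_c"})
report = verify(contract, before, after)
```

In this toy round the prediction is falsified: `task_c` did not flip and `task_b` regressed unexpectedly, so an evolving agent would have concrete evidence for revising or reverting the edit rather than guessing.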

Submitted to arXiv on 28 Apr. 2026

Results of the summarizing process for the arXiv paper: 2604.25850v4

This paper's license doesn't allow us to build upon its content and the summarizing process is here made with the paper's metadata rather than the article.

Harnesses play a pivotal role in coding-agent performance: they mediate how models interact with tools and execution environments. Despite this, harness engineering remains a predominantly manual process, because automating it faces a heterogeneous action space across editable components, voluminous trajectories that bury actionable signal, and edits whose effect is hard to attribute.

To address these obstacles, the paper introduces Agentic Harness Engineering (AHE), a closed-loop system built on three matched observability pillars. Component observability gives each editable harness component a file-level representation, making the action space explicit and revertible. Experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can consume. Decision observability pairs every edit with a self-declared prediction, which is later verified against the next round's task-level outcomes. By turning each edit into a falsifiable contract, these pillars enable autonomous harness evolution without collapsing into trial-and-error.

Empirically, ten AHE iterations raise pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing both the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness also transfers without re-evolution: on SWE-bench-verified it achieves the top aggregate success rate with 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields cross-family gains of +5.1 to +10.1pp across three alternate model families, indicating that the evolved components encode general engineering experience rather than benchmark-specific tuning.

Ablation studies localize the gain to tools, middleware, and long-term memory rather than the system prompt, suggesting that factual harness structure transfers while prose-level strategy does not. The paper, "Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses", is authored by Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Zhiheng Xi, Xuanjing Huang, Hang Yan, Zhenhua Han, Tao Gui, and Yu-Gang Jiang.
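The idea behind component observability, an explicit and revertible file-level action space, can be sketched as follows. This is a minimal in-memory stand-in under stated assumptions: the `HarnessWorkspace` class, its methods, and the example file paths are hypothetical and are not taken from the paper, which is summarized here from metadata only.

```python
from __future__ import annotations


class HarnessWorkspace:
    """Hypothetical file-per-component view of a harness with revertible edits."""

    def __init__(self, files: dict[str, str]) -> None:
        self._files = dict(files)
        self._history: list[dict[str, str]] = []  # snapshots taken before each edit

    def components(self) -> list[str]:
        """The explicit action space: every editable component is a file path."""
        return sorted(self._files)

    def read(self, path: str) -> str:
        return self._files[path]

    def edit(self, path: str, new_text: str) -> int:
        """Snapshot the current state, apply the edit, return a revision id."""
        self._history.append(dict(self._files))
        self._files[path] = new_text
        return len(self._history) - 1

    def revert(self, revision: int) -> None:
        """Restore the state from just before `revision` was applied."""
        self._files = dict(self._history[revision])
        del self._history[revision:]


# Assumed example components, one file each
ws = HarnessWorkspace({
    "system_prompt.md": "You are a coding agent.",
    "tools/shell.py": "TIMEOUT = 30",
    "memory/notes.md": "",
})

rev = ws.edit("tools/shell.py", "TIMEOUT = 120")
edited_text = ws.read("tools/shell.py")
ws.revert(rev)
restored_text = ws.read("tools/shell.py")
```

Keeping a snapshot per edit is the simplest way to make every change undoable; a real system would more plausibly rely on files on disk under version control, which this sketch does not attempt to model.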
Created on 13 Jun. 2026
