Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

AI-generated keywords: Coding-agent performance, Harness engineering, Agentic Harness Engineering (AHE), Observability pillars, Autonomous harness evolution

AI-generated Key Points

⚠ The license of the paper does not allow us to build upon its content, so the key points are generated from the paper's metadata rather than the full article.

Harnesses play a pivotal role in coding-agent performance by mediating how models interact with tools and execution environments.
Automating harness engineering is hard because of a heterogeneous action space across editable components, voluminous trajectories that bury actionable signal, and edits whose effects are hard to attribute.
Agentic Harness Engineering (AHE) introduces a closed loop with three observability pillars: component observability, experience observability, and decision observability.
AHE turns every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error.
Ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO.
The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp gains across three alternate model families.
Ablation studies localize the gain to tools, middleware, and long-term memory rather than the system prompt.

Authors: Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Zhiheng Xi, Xuanjing Huang, Hang Yan, Zhenhua Han, Tao Gui, Yu-Gang Jiang

arXiv: 2604.25850v4 - DOI (cs.CL)

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: Harnesses are now central to coding-agent performance, mediating how models interact with tools and execution environments. Yet harness engineering remains a manual craft, because automating it faces a heterogeneous action space across editable components, voluminous trajectories that bury actionable signal, and edits whose effect is hard to attribute. We introduce Agentic Harness Engineering (AHE), a closed loop that addresses these challenges through three matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning. Ablations localize the gain to tools, middleware, and long-term memory rather than the system prompt, suggesting factual harness structure transfers while prose-level strategy does not.
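The first pillar in the abstract, a file-level representation that makes the action space "explicit and revertible", can be illustrated with a small sketch. This is not the authors' implementation: the class `HarnessWorkspace`, its methods, and the example file names are all hypothetical, and the only assumption carried over from the abstract is that each editable component maps to a file.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


class HarnessWorkspace:
    """Hypothetical: every editable harness component is one file under `root`."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self._snapshots: list[dict[str, str]] = []

    def components(self) -> list[str]:
        # The explicit action space: relative paths of all component files.
        return sorted(
            p.relative_to(self.root).as_posix()
            for p in self.root.rglob("*")
            if p.is_file()
        )

    def snapshot(self) -> int:
        # Record the content of every component and return the snapshot id.
        state = {name: (self.root / name).read_text() for name in self.components()}
        self._snapshots.append(state)
        return len(self._snapshots) - 1

    def edit(self, name: str, content: str) -> None:
        # Create or overwrite one component file.
        path = self.root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content)

    def revert(self, snapshot_id: int) -> None:
        state = self._snapshots[snapshot_id]
        # Delete components created after the snapshot, then restore the rest.
        for name in self.components():
            if name not in state:
                (self.root / name).unlink()
        for name, content in state.items():
            self.edit(name, content)


def demo() -> tuple[list[str], list[str], str]:
    """Edit two components, then roll the whole harness back (made-up files)."""
    with tempfile.TemporaryDirectory() as tmp:
        ws = HarnessWorkspace(Path(tmp))
        ws.edit("system_prompt.md", "You are a coding agent.")
        ws.edit("tools/shell.py", "TIMEOUT_S = 30")
        before = ws.snapshot()
        ws.edit("tools/shell.py", "TIMEOUT_S = 120")
        ws.edit("memory/notes.md", "Prefer small diffs.")
        during = ws.components()
        ws.revert(before)
        after = ws.components()
        restored = (ws.root / "tools/shell.py").read_text()
        return during, after, restored


if __name__ == "__main__":
    print(demo())
```

In practice a version-control system would likely play the role of `snapshot` and `revert`; the in-memory snapshots here only keep the sketch self-contained.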

Submitted to arXiv on 28 Apr. 2026

Results of the summarizing process for the arXiv paper: 2604.25850v4

Comprehensive Summary

Harnesses play a pivotal role in coding-agent performance, mediating how models interact with their tools and execution environments. Despite this significance, harness engineering remains a predominantly manual process, because automating it faces a heterogeneous action space across editable components, voluminous trajectories that bury actionable signal, and edits whose impact is difficult to attribute. To address these obstacles, Agentic Harness Engineering (AHE) is introduced as a closed loop built on three matched observability pillars. The first, component observability, gives each editable harness component a file-level representation so that the action space is explicit and revertible. The second, experience observability, distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can consume. The third, decision observability, pairs every edit with a self-declared prediction, which is later verified against the next round's task-level outcomes. By turning each edit into a falsifiable contract, these pillars let harness evolution proceed autonomously without collapsing into trial-and-error.

Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing both the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness also transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields cross-family gains of +5.1 to +10.1pp across three alternate model families, indicating that the evolved components encode general engineering experience rather than benchmark-specific tuning. Ablation studies localize the gain to tools, middleware, and long-term memory rather than the system prompt, suggesting that factual harness structure transfers while prose-level strategy does not. The paper, "Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses", is by Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Zhiheng Xi, Xuanjing Huang, Hang Yan, Zhenhua Han, Tao Gui, and Yu-Gang Jiang.

Layman's Summary

- A harness is the software that connects an AI coding model to its tools, and it strongly affects how well a coding agent performs.
- Building harnesses is still mostly manual work, because there are many different parts to change, very long activity logs to read, and changes whose effects are hard to trace.
- Agentic Harness Engineering (AHE) uses a loop with three observability parts to improve the harness automatically.
- AHE turns each change into a testable prediction, so improvements do not depend on guessing.
- After ten rounds, AHE raises the success rate on the Terminal-Bench 2 test from 69.7% to 77.0%, beating a human-designed harness and two other automatic methods.

Definitions

- Coding-agent performance: How well an AI agent completes programming tasks.
- Observability: Being able to see and understand what is happening in a system.
- Falsifiable: Something that can be shown to be wrong through testing.
- Empirical results: Information gathered through observation and experimentation.
- Ablation studies: Experiments where parts of a system are removed to see their impact.

Introduction

Harnesses play a crucial role in coding-agent performance, mediating how models interact with their tools and execution environments. However, harness engineering remains predominantly manual, because automating it faces a heterogeneous action space across editable components, voluminous trajectories that bury actionable signal, and edits whose impact is difficult to attribute. To address these obstacles, Jiahang Lin and co-authors introduce Agentic Harness Engineering (AHE). This closed loop rests on three matched observability pillars that let the harness evolve autonomously.

The Three Observability Pillars

The first pillar, component observability, gives each editable harness component a file-level representation, which makes the action space explicit and every edit revertible. The second pillar, experience observability, distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume. The third pillar, decision observability, pairs every edit with a self-declared prediction, which is later verified against the next round's task-level outcomes.
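Decision observability, as described above, pairs an edit with a prediction and checks it against the next round's task-level outcomes. The sketch below shows one way such a check could work. The record layout (`EditContract`), the form of the prediction (tasks expected to be fixed or to stay passing), and the task ids are assumptions made for illustration; the source does not specify them.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EditContract:
    """Hypothetical record: an edit plus the outcome change its author predicts."""

    component: str                    # file that was edited
    rationale: str                    # why the edit should help
    expect_fixed: frozenset[str]      # tasks predicted to flip from fail to pass
    expect_unbroken: frozenset[str]   # tasks predicted to keep passing


def verify(
    contract: EditContract,
    before: dict[str, bool],
    after: dict[str, bool],
) -> dict[str, object]:
    """Compare the prediction with per-task outcomes of two consecutive rounds.

    `before` and `after` map task id -> passed. A missing task counts as failed.
    """
    fixed = {
        t for t in contract.expect_fixed
        if not before.get(t, False) and after.get(t, False)
    }
    regressed = {
        t for t in contract.expect_unbroken
        if before.get(t, False) and not after.get(t, False)
    }
    return {
        "confirmed": fixed == set(contract.expect_fixed) and not regressed,
        "missed": sorted(set(contract.expect_fixed) - fixed),
        "regressed": sorted(regressed),
    }


def demo() -> dict[str, object]:
    # Made-up task ids and outcomes, for illustration only.
    contract = EditContract(
        component="tools/shell.py",
        rationale="A longer timeout should rescue slow builds.",
        expect_fixed=frozenset({"build-kernel"}),
        expect_unbroken=frozenset({"git-rebase"}),
    )
    before = {"build-kernel": False, "git-rebase": True}
    after = {"build-kernel": True, "git-rebase": False}
    return verify(contract, before, after)


if __name__ == "__main__":
    print(demo())
```

In the demo the predicted fix happens but a task that was expected to keep passing regresses, so the contract is not confirmed. A refuted contract of this kind is what makes the edit falsifiable and points to a change that may need reverting.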

Transforming Edits into Falsifiable Contracts

By turning each edit into a falsifiable contract, AHE lets harness evolution proceed autonomously without collapsing into trial-and-error. Ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing both the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. Furthermore, the frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed. On Terminal-Bench 2, cross-family gains of +5.1 to +10.1pp across three alternate model families indicate that the evolved components encode general engineering experience rather than benchmark-specific tuning.
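The figures above mix percentage points (pp) with relative percentages, so a short worked calculation may help. The snippet uses only the pass@1 numbers quoted in this summary; the helper names `pp_gain` and `relative_gain` are illustrative.

```python
# pass@1 on Terminal-Bench 2, in percent, as quoted in the summary.
SEED = 69.7
AHE_AFTER_TEN_ITERATIONS = 77.0
CODEX_CLI = 71.9


def pp_gain(new: float, old: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(new - old, 1)


def relative_gain(new: float, old: float) -> float:
    """Relative improvement over `old`, in percent, rounded to one decimal."""
    return round(100 * (new - old) / old, 1)


if __name__ == "__main__":
    print(pp_gain(AHE_AFTER_TEN_ITERATIONS, SEED))        # 7.3 pp over the seed
    print(pp_gain(AHE_AFTER_TEN_ITERATIONS, CODEX_CLI))   # 5.1 pp over Codex-CLI
    print(relative_gain(AHE_AFTER_TEN_ITERATIONS, SEED))  # 10.5 % relative
```

The gain over the seed harness is therefore 7.3pp in absolute terms, or about 10.5% relative to the 69.7% starting point.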

Collaborative Efforts and Empirical Results

The research paper "Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses" is a collaborative effort by Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Zhiheng Xi, Xuanjing Huang, Hang Yan, Zhenhua Han, Tao Gui, and Yu-Gang Jiang. Empirical results from their experiments support the efficacy of AHE: after ten iterations, the evolved harness reaches 77.0% pass@1 on Terminal-Bench 2, up from 69.7% for the seed and above both the human-designed Codex-CLI harness (71.9%) and the self-evolving baselines ACE and TF-GRPO. This suggests that automated, observability-driven evolution can match or exceed manual harness engineering on this benchmark.

Ablation Studies

Ablation studies were also conducted to pinpoint the source of improvement in agent performance. The results showed that tools, middleware, and long-term memory played crucial roles in enhancing performance rather than system prompts. This suggests that factual harness structure transfers effectively while prose-level strategies do not exhibit similar benefits.

Conclusion

In conclusion, Agentic Harness Engineering (AHE) improves coding-agent performance through its three observability pillars: component observability, experience observability, and decision observability. By turning edits into falsifiable contracts and enabling autonomous evolution without trial-and-error, AHE surpasses both the human-designed harness Codex-CLI and the self-evolving baselines ACE and TF-GRPO in pass@1 on Terminal-Bench 2. The frozen harness also transfers to SWE-bench-verified and to other model families without re-evolution, and ablations attribute the gain to tools, middleware, and long-term memory rather than the system prompt.

Created on 13 Jun. 2026
