Self-Harness: Harnesses That Improve Themselves

AI-generated keywords: Language model-based agents

AI-generated Key Points

  • Performance of LLM-based agents depends on both the base model and the harness that mediates interaction with the environment
  • Harnesses are still largely hand-engineered by human experts, an approach that scales poorly as LLMs diversify and evolve
  • Self-Harness paradigm introduced: an LLM-based agent improves its own operating harness, without human engineers or stronger external agents
  • Three key stages of Self-Harness: Weakness Mining, Harness Proposal, Proposal Validation
  • Implementation of Self-Harness on Terminal-Bench-2.0 with three distinct base models
  • Held-out pass rates improve for all three models: 40.5% to 61.9% (MiniMax M2.5), 23.8% to 38.1% (Qwen3.5-35B-A3B), and 42.9% to 57.1% (GLM-5)
  • Qualitative analyses show transformation of weaknesses into tangible harness improvements
  • Results suggest a path toward LLM-based agents that take part in reshaping their own harnesses

Authors: Hangfan Zhang, Shao Zhang, Kangcong Li, Chen Zhang, Yang Chen, Yiqun Zhang, Lei Bai, Shuyue Hu

License: CC BY 4.0

Abstract: The performance of LLM-based agents is jointly shaped by their base models and the harnesses that mediate their interaction with the environment. Because different models exhibit distinct behaviors, effective harness design is inherently model-specific. Yet agent harnesses are still largely engineered by human experts, a paradigm that scales poorly as modern LLMs become increasingly diverse and rapidly evolving. In this paper, we introduce Self-Harness, a new paradigm in which an LLM-based agent improves its own operating harness, without relying on human engineers or stronger external agents. We operationalize Self-Harness as an iterative loop with three stages: Weakness Mining, which identifies model-specific failure patterns from execution traces; Harness Proposal, which generates diverse yet minimal harness modifications tied to these failures; and Proposal Validation, which accepts candidate edits only after regression testing. We instantiate Self-Harness on Terminal-Bench-2.0 using a minimal initial harness and three base models from diverse families: MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5. Across all three models, Self-Harness consistently improves performance, with held-out pass rates increasing from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1%, respectively. Qualitative analyses further show that Self-Harness does not simply add generic instructions, but effectively turns model-specific weaknesses into concrete, executable harness changes. These results suggest a path toward LLM-based agents that are not merely shaped by their harnesses, but can also participate in reshaping them.
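The three-stage loop described in the abstract can be illustrated with a minimal sketch. Everything below is hypothetical and is not the authors' implementation: the names (Trace, mine_weaknesses, propose_edits, validate, self_harness), the dictionary used as a stand-in harness, and the toy test suite are all assumptions made for illustration. In particular, the paper has the LLM itself mine weaknesses and propose edits, whereas this sketch substitutes a fixed lookup table ("playbook"), and the acceptance rule (no regressions plus at least one newly passing task) is only one plausible reading of "accepts candidate edits only after regression testing".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: a harness is a dict of settings, a trace is one task run.
Harness = Dict[str, str]


@dataclass(frozen=True)
class Trace:
    task_id: str
    passed: bool
    failure_tag: str = ""  # empty when the task passed


def mine_weaknesses(traces: List[Trace]) -> List[str]:
    """Weakness Mining: rank failure tags by how often they occur."""
    counts: Dict[str, int] = {}
    for t in traces:
        if not t.passed:
            counts[t.failure_tag] = counts.get(t.failure_tag, 0) + 1
    return sorted(counts, key=lambda tag: (-counts[tag], tag))


def propose_edits(harness: Harness, weaknesses: List[str],
                  playbook: Dict[str, Tuple[str, str]]) -> List[Harness]:
    """Harness Proposal: one minimal (single-setting) edit per weakness.

    Assumption: a lookup table replaces the LLM that would write the edit.
    """
    proposals = []
    for w in weaknesses:
        if w in playbook:
            key, value = playbook[w]
            if harness.get(key) != value:
                proposals.append({**harness, key: value})
    return proposals


def validate(candidate: Harness, baseline: Harness,
             run_suite: Callable[[Harness], List[Trace]]) -> bool:
    """Proposal Validation: no previously passing task may regress,
    and at least one additional task must pass (assumed criterion)."""
    before = {t.task_id for t in run_suite(baseline) if t.passed}
    after = {t.task_id for t in run_suite(candidate) if t.passed}
    return before <= after and len(after) > len(before)


def self_harness(harness: Harness,
                 run_suite: Callable[[Harness], List[Trace]],
                 playbook: Dict[str, Tuple[str, str]],
                 max_rounds: int = 3) -> Harness:
    """Iterate mine -> propose -> validate until no edit is accepted."""
    for _ in range(max_rounds):
        weaknesses = mine_weaknesses(run_suite(harness))
        accepted = False
        for candidate in propose_edits(harness, weaknesses, playbook):
            if validate(candidate, harness, run_suite):
                harness, accepted = candidate, True
                break
        if not accepted:
            break
    return harness


def toy_suite(harness: Harness) -> List[Trace]:
    """Toy environment: t2 needs a long timeout, t3 needs a verify step."""
    def run(task_id: str, ok: bool, tag: str) -> Trace:
        return Trace(task_id, ok, "" if ok else tag)
    return [
        run("t1", True, ""),
        run("t2", harness.get("timeout") == "long", "timeout"),
        run("t3", harness.get("verify") == "on", "unverified_output"),
    ]


PLAYBOOK = {"timeout": ("timeout", "long"),
            "unverified_output": ("verify", "on")}
final = self_harness({}, toy_suite, PLAYBOOK)
print(final)
```

Starting from an empty harness, the toy run accepts one edit per round (first the timeout setting, then the verify setting) and stops in the third round because no failures remain to mine. Keeping each proposal to a single changed setting mirrors the abstract's emphasis on minimal modifications, and makes it easy to attribute any regression to one edit.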

Submitted to arXiv on 08 Jun. 2026

Results of the summarizing process for the arXiv paper: 2606.09498v1

The performance of LLM-based agents depends on both their underlying base models and the harnesses that mediate their interaction with the environment. Harnesses are still largely engineered by human experts, an approach that scales poorly as LLMs diversify and evolve rapidly. To address this, the paper introduces Self-Harness, a paradigm in which an LLM-based agent improves its own operating harness without relying on human engineers or stronger external agents. Self-Harness is an iterative loop with three stages: Weakness Mining, which identifies model-specific failure patterns from execution traces; Harness Proposal, which generates minimal harness modifications tied to those failures; and Proposal Validation, which accepts candidate edits only after regression testing. The study implements Self-Harness on Terminal-Bench-2.0, starting from a minimal initial harness, with three base models: MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5. Across all three models, held-out pass rates increase: from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1%, respectively. Qualitative analyses indicate that Self-Harness turns model-specific weaknesses into concrete, executable harness changes rather than simply adding generic instructions. These findings suggest a path toward LLM-based agents that take part in reshaping their harnesses rather than only being shaped by them. By deriving improvements from identified weaknesses, Self-Harness offers a way to adapt harnesses as LLMs change, with less reliance on manual engineering.
Created on 23 Jun. 2026
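
To put the held-out pass rates reported in the abstract in perspective, the short calculation below derives the absolute gain (in percentage points) and the relative gain for each model. The figures are taken directly from the abstract; the helper name gains and the one-decimal rounding are choices made here for illustration, not something the paper specifies.

```python
# Held-out pass rates in percent, as reported in the abstract: (before, after).
RATES = {
    "MiniMax M2.5": (40.5, 61.9),
    "Qwen3.5-35B-A3B": (23.8, 38.1),
    "GLM-5": (42.9, 57.1),
}


def gains(before: float, after: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent), rounded to 1 decimal."""
    absolute = round(after - before, 1)
    relative = round(100 * (after - before) / before, 1)
    return absolute, relative


for model, (before, after) in RATES.items():
    abs_pts, rel_pct = gains(before, after)
    print(f"{model}: +{abs_pts} points ({rel_pct}% relative)")
```

The absolute gains come out to 21.4, 14.3, and 14.2 percentage points. The relative gain is largest for Qwen3.5-35B-A3B (about 60%), which starts from the lowest baseline.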