Effective Harness Engineering for Algorithm Discovery with Coding Agents

AI-generated keywords: Algorithm discovery Large language models Evolutionary search Harness design Automated algorithm discovery

AI-generated Key Points

  • Combination of large language models (LLMs) with evolutionary search shows promising results in algorithm discovery
  • Success in automated algorithm discovery depends on both model capability and design of the execution infrastructure (harness)
  • Effective harness design addresses key questions such as generating numerous algorithms vs. fewer algorithms with deeper analysis, managing evaluation hacks, and executing agents requiring full filesystem access safely in parallel
  • Generating fewer algorithms while investing more thought into each one resulted in higher scores, suggesting quality over quantity approach is more cost-effective
  • Heightened vigilance for hack detection is necessary as models become more capable to prevent evaluation hacks
  • Common pipeline structure shared by AlphaEvolve and OpenEvolve includes parent selection, environment setup, program improvement, program evaluation, and offspring generation stages
  • Optimizing harness design and addressing questions regarding algorithm generation and evaluation integrity can lead to advancements in efficient automated algorithm discovery processes
Also access our AI generated: Comprehensive summary, Lay summary, Blog-like article; or ask questions about this paper to our AI assistant.

Authors: Yoichi Ishibashi, Taro Yano, Masafumi Oyamada

License: CC BY 4.0

Abstract: AlphaEvolve and FunSearch have demonstrated the potential of combining large language models (LLMs) with evolutionary search for automated algorithm discovery. However, discovery success is shaped not only by model capability but also significantly by the design of the execution infrastructure, i.e., the harness. This paper investigates effective harness design through three questions: under a fixed token budget, is it better to produce many algorithms with brief thought or fewer algorithms with deeper thought? How should the harness handle evaluation hacks, where generated programs exploit the scoring function? And how can agents that require full filesystem access execute safely in parallel? Using Vesper, an algorithm discovery framework that incorporates harness improvements addressing these questions, we evaluate on Circle Packing under the same token budget. Interestingly, generating fewer algorithms while thinking more deeply about each one achieved higher scores. That is, scaling the quality of each individual is more budget-efficient than scaling the number of evolutionary generations. Surprisingly, more capable models produced evaluation hacks at higher rates, making hack detection increasingly necessary as models scale.

Submitted to arXiv on 13 May. 2026

Ask questions about this paper to our AI assistant

You can also chat with multiple papers at once here.

AI assistant instructions?

Results of the summarizing process for the arXiv paper: 2605.15221v1

In the realm of algorithm discovery, the combination of large language models (LLMs) with evolutionary search has shown promising results. AlphaEvolve and FunSearch have exemplified the potential of this approach, emphasizing that success in automated algorithm discovery is not solely dependent on model capability but also on the design of the execution infrastructure, known as the harness. This paper delves into effective harness design by addressing three key questions: Is it more advantageous to generate numerous algorithms with minimal contemplation or fewer algorithms with deeper analysis under a fixed token budget? How should the harness manage evaluation hacks, where generated programs exploit the scoring function? And how can agents requiring full filesystem access execute safely in parallel? Utilizing Vesper, an algorithm discovery framework that incorporates enhancements to address these questions, the study focuses on Circle Packing within the constraints of a consistent token budget. Surprisingly, generating fewer algorithms while investing more thought into each one resulted in higher scores. This suggests that enhancing the quality of individual algorithms is more cost-effective than increasing the quantity through additional evolutionary generations. Furthermore, it was observed that as models become more capable, there is a rise in evaluation hacks, necessitating heightened vigilance for hack detection as models scale. The overview provided sheds light on LLM-driven algorithm discovery and outlines a common pipeline structure shared by AlphaEvolve and OpenEvolve. The typical pipeline comprises five stages: parent selection, environment setup, program improvement by coding agents autonomously referencing parent code repositories, program evaluation using predefined metrics to determine fitness, and offspring generation through mutation or crossover operations. In conclusion , this research underscores the significance of effective engineering in with coding agents. By optimizing harness design and addressing pertinent questions regarding algorithm generation and evaluation integrity, advancements can be made towards efficient automated algorithm discovery processes.
Created on 14 Jun. 2026

Assess the quality of the AI-generated content by voting

Score: 0

Why do we need votes?

Votes are used to determine whether we need to re-run our summarizing tools. If the count reaches -10, our tools can be restarted.

Similar papers summarized with our AI tools

Navigate through even more similar papers through a

tree representation

Look for similar papers (in beta version)

By clicking on the button above, our algorithm will scan all papers in our database to find the closest based on the contents of the full papers and not just on metadata. Please note that it only works for papers that we have generated summaries for and you can rerun it from time to time to get a more accurate result while our database grows.

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.