AutoHarness: improving LLM agents by automatically synthesizing a code harness

AI-generated keywords: Kaggle GameArena chess competition, illegal moves, language models, code harnesses, LLM agents

AI-generated Key Points

  • 78% of the losses by the Gemini-2.5-Flash model in the Kaggle GameArena chess competition were attributed to illegal moves, a common failure mode when language models are used as agents.
  • Such failures are usually prevented by manually writing "harnesses" around the model.
  • The "AutoHarness" study by Xinghua Lou and colleagues demonstrates that Gemini-2.5-Flash can generate a code harness automatically, using a small number of rounds of iterative code refinement guided by feedback from the game environment.
  • The resulting harness prevented all illegal moves in 145 TextArena games (both single-player and two-player), enabling Gemini-2.5-Flash to outperform larger models such as Gemini-2.5-Pro.
  • Pushing the technique further, the researchers had Gemini-2.5-Flash generate an entire policy in code, eliminating the need to call the language model at decision-making time.
  • The resulting code-policy received a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena single-player games, while also being more cost-effective.
  • Challenges remain: two-player games may require strategic reasoning about the opponent's policy, which could potentially be addressed with Monte Carlo Tree Search (MCTS) at runtime, but developing a code world model to support such search in text-based games is still an open problem.

Authors: Xinghua Lou, Miguel Lázaro-Gredilla, Antoine Dedieu, Carter Wendelken, Wolfgang Lehrach, Kevin P. Murphy

agent harness, code synthesis, self-improvement, code-as-policy, text games
License: CC BY 4.0

Abstract: Despite significant strides in language models in the last few years, when used as agents, such models often try to perform actions that are not just suboptimal for a given state, but are strictly prohibited by the external environment. For example, in the recent Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves. Often people manually write "harnesses" around LLMs to prevent such failures. In this paper, we demonstrate that Gemini-2.5-Flash can automatically synthesize such a code harness, using a small number of rounds of iterative code refinement given feedback from the (game) environment. The resulting harness prevents all illegal moves in 145 different TextArena games (both 1-player and 2-player), enabling the smaller Gemini-2.5-Flash model to outperform larger models, such as Gemini-2.5-Pro. Pushing our technique to the limit, we can get Gemini-2.5-Flash to generate the entire policy in code, thus eliminating the need to use the LLM at decision making time. The resulting code-policy receives a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena 1-player games. Our results show that using a smaller model to synthesize a custom code harness (or entire policy) can outperform a much larger model, while also being more cost effective.
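To make the idea of a harness concrete, here is a minimal sketch, not the authors' implementation. It assumes a toy Nim-like game (remove 1 to 3 stones per turn) and uses a stand-in function in place of the LLM; the names `legal_moves`, `harnessed_policy`, `always_four`, and `play` are hypothetical. The harness checks each proposed move against the rules, retries on rejection, and falls back to a known-legal move.

```python
import random
from typing import Callable, List

Policy = Callable[[int, random.Random], int]


def legal_moves(stones: int) -> List[int]:
    """Harness-side rule check: moves the toy environment would accept."""
    return [k for k in (1, 2, 3) if k <= stones]


def always_four(stones: int, rng: random.Random) -> int:
    """Stand-in for an agent that keeps proposing a prohibited move."""
    return 4


def harnessed_policy(stones: int, policy: Policy, rng: random.Random,
                     max_retries: int = 5) -> int:
    """Query the policy, reject illegal proposals, fall back to a legal move."""
    allowed = legal_moves(stones)
    for _ in range(max_retries):
        move = policy(stones, rng)
        if move in allowed:
            return move
    return allowed[0]  # the fallback guarantees a legal move


def play(policy: Policy, stones: int = 10, seed: int = 0) -> int:
    """Run one episode; return 1 if it ended on an illegal move, else 0."""
    rng = random.Random(seed)
    while stones > 0:
        move = policy(stones, rng)
        if move not in legal_moves(stones):
            return 1  # the environment would score this as a loss
        stones -= move
    return 0


raw_result = play(always_four)
wrapped_result = play(lambda s, r: harnessed_policy(s, always_four, r))
print(raw_result, wrapped_result)
```

In this sketch the legality check is written by hand; the paper's contribution is having the model synthesize that kind of code itself.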

Submitted to arXiv on 10 Feb. 2026


Results of the summarizing process for the arXiv paper: 2603.03329v1

In the recent Kaggle GameArena chess competition, 78% of the losses by the Gemini-2.5-Flash model were attributed to illegal moves. This reflects a common problem with language models used as agents: they often attempt actions that are not merely suboptimal but strictly prohibited by the environment. Such failures are usually prevented by manually writing "harnesses" around the model.

In "AutoHarness: improving LLM agents by automatically synthesizing a code harness," Xinghua Lou and colleagues demonstrate that Gemini-2.5-Flash can generate such a code harness automatically, using a small number of rounds of iterative code refinement guided by feedback from the game environment. The resulting harness prevented all illegal moves in 145 different TextArena games, both single-player and two-player, and enabled the smaller Gemini-2.5-Flash model to outperform larger models such as Gemini-2.5-Pro.

Pushing the technique further, the researchers had Gemini-2.5-Flash generate an entire policy in code, which eliminates the need to call the language model at decision-making time. The resulting code-policy received a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena single-player games, at lower cost than using the larger models.

Challenges remain. Two-player games may require strategic reasoning about the opponent's policy, which could potentially be addressed with Monte Carlo Tree Search (MCTS) at runtime, but developing a code world model to support such search in text-based games is still an open problem. Overall, the study indicates that using a smaller model to synthesize a custom code harness, or an entire policy, can outperform a much larger model while being more cost-effective.
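The iterative refinement described above can be illustrated with a toy loop. This is a sketch under stated assumptions, not the paper's procedure: the environment is a hypothetical Nim-like game, the three `draft_*` functions stand in for successive code revisions that an LLM might produce, and `collect_feedback` and `refine` are illustrative names. Each round tests the current draft against the environment and stops once no rule violations are reported.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Harness = Callable[[int], List[int]]  # maps stones remaining to allowed moves


def env_is_legal(stones: int, move: int) -> bool:
    """Ground-truth rule of the toy game, seen only through feedback."""
    return 1 <= move <= min(3, stones)


# Hypothetical stand-ins for successive harness revisions.
def draft_v0(stones: int) -> List[int]:
    return [0, 1, 2, 3, 4]  # allows out-of-range moves


def draft_v1(stones: int) -> List[int]:
    return [1, 2, 3]  # forgets the pile-size limit


def draft_v2(stones: int) -> List[int]:
    return [k for k in (1, 2, 3) if k <= stones]


def collect_feedback(harness: Harness, states: Iterable[int]) -> List[str]:
    """Report moves the harness wrongly allows or wrongly forbids."""
    errors = []
    for s in states:
        allowed = set(harness(s))
        for move in range(0, 5):  # candidate move range is an assumption
            if move in allowed and not env_is_legal(s, move):
                errors.append(f"state={s}: move {move} rejected by environment")
            if move not in allowed and env_is_legal(s, move):
                errors.append(f"state={s}: legal move {move} was blocked")
    return errors


def refine(drafts: List[Harness], states: Iterable[int],
           max_rounds: int = 5) -> Tuple[Optional[Harness], int]:
    """Return the first draft with no reported errors and its round index."""
    states = list(states)
    tried = drafts[:max_rounds]
    for round_idx, harness in enumerate(tried):
        feedback = collect_feedback(harness, states)
        if not feedback:
            return harness, round_idx
        # A real system would send `feedback` to the model to request a revision.
    return None, len(tried)


final_harness, rounds_used = refine([draft_v0, draft_v1, draft_v2], range(1, 11))
print(final_harness.__name__, rounds_used)
```

Checking both directions matters in this sketch: a harness that allowed nothing would never trigger an illegal move, yet it would also be useless to the agent.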
Created on 13 Jun. 2026

