AutoHarness: improving LLM agents by automatically synthesizing a code harness

AI-generated keywords: Kaggle GameArena chess competition, illegal moves, language models, code harnesses, LLM agents

AI-generated Key Points

- 78% of the losses by the Gemini-2.5-Flash model in the Kaggle GameArena chess competition were attributed to illegal moves, a common problem when language models are used as agents.
- Researchers have traditionally written "harnesses" around these models by hand to prevent such failures.
- The study "AutoHarness" by Xinghua Lou and colleagues demonstrates that Gemini-2.5-Flash can generate a code harness automatically, through a small number of rounds of iterative code refinement based on feedback from the game environment.
- The resulting harness prevented all illegal moves in 145 TextArena games (both single-player and two-player), enabling Gemini-2.5-Flash to outperform larger models such as Gemini-2.5-Pro.
- Pushing the technique further, the researchers got Gemini-2.5-Flash to generate an entire policy in code, eliminating the need to call the language model at decision-making time.
- The resulting code-policy received a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena single-player games, while also being more cost-effective.
- Two-player games may require strategic reasoning about opponents' policies, which Monte Carlo Tree Search (MCTS) at runtime could potentially address; developing a code world model to search over in text-based games remains a challenge.

Authors: Xinghua Lou, Miguel Lázaro-Gredilla, Antoine Dedieu, Carter Wendelken, Wolfgang Lehrach, Kevin P. Murphy

arXiv: 2603.03329v1 - DOI (cs.CL)

agent harness, code synthesis, self-improvement, code-as-policy, text games

License: CC BY 4.0

Abstract: Despite significant strides in language models in the last few years, when used as agents, such models often try to perform actions that are not just suboptimal for a given state, but are strictly prohibited by the external environment. For example, in the recent Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves. Often people manually write "harnesses" around LLMs to prevent such failures. In this paper, we demonstrate that Gemini-2.5-Flash can automatically synthesize such a code harness, using a small number of rounds of iterative code refinement given feedback from the (game) environment. The resulting harness prevents all illegal moves in 145 different TextArena games (both 1-player and 2-player), enabling the smaller Gemini-2.5-Flash model to outperform larger models, such as Gemini-2.5-Pro. Pushing our technique to the limit, we can get Gemini-2.5-Flash to generate the entire policy in code, thus eliminating the need to use the LLM at decision making time. The resulting code-policy receives a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena 1-player games. Our results show that using a smaller model to synthesize a custom code harness (or entire policy) can outperform a much larger model, while also being more cost effective.
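To illustrate what a "harness" does, here is a minimal hand-written sketch for a toy tic-tac-toe board. It is not the harness from the paper, which is synthesized by the model itself; the names `legal_moves`, `harnessed_move`, and `sloppy_proposer` are hypothetical, and `sloppy_proposer` merely stands in for an LLM call. The idea shown is only that the wrapper checks each proposed action against the rules, feeds the error back, and never lets an illegal move reach the environment.

```python
from typing import Callable, List, Optional

Board = List[str]  # 9 cells, each "X", "O", or " " (empty)


def legal_moves(board: Board) -> List[int]:
    """Indices of empty cells: the rule the harness enforces."""
    return [i for i, cell in enumerate(board) if cell == " "]


def harnessed_move(
    board: Board,
    propose: Callable[[Board, Optional[str]], int],
    max_retries: int = 3,
) -> int:
    """Ask the proposer for a move, reject illegal ones, retry with feedback."""
    allowed = legal_moves(board)
    if not allowed:
        raise ValueError("no legal moves left")
    feedback: Optional[str] = None
    for _ in range(max_retries):
        move = propose(board, feedback)
        if move in allowed:
            return move
        # Tell the proposer what went wrong before the next attempt.
        feedback = f"Move {move} is illegal; legal moves are {allowed}."
    # Fallback so that an illegal move never reaches the environment.
    return allowed[0]


def sloppy_proposer(board: Board, feedback: Optional[str]) -> int:
    """Hypothetical stand-in for an LLM: always asks for the centre cell."""
    return 4
```

With an empty board the proposer's choice passes straight through; once the centre is taken, the harness exhausts its retries and falls back to a legal cell instead of forfeiting the game.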

Submitted to arXiv on 10 Feb. 2026

Results of the summarizing process for the arXiv paper: 2603.03329v1

Comprehensive Summary
Key points
Layman's Summary
Blog article

In the recent Kaggle GameArena chess competition, 78% of the losses by the Gemini-2.5-Flash model were attributed to illegal moves. This reflects a common problem when language models are used as agents: they attempt actions that are not merely suboptimal but strictly prohibited by the environment. Researchers have traditionally written "harnesses" around these models by hand to prevent such failures.

In the study "AutoHarness: improving LLM agents by automatically synthesizing a code harness," Xinghua Lou and colleagues demonstrate that Gemini-2.5-Flash can generate such a code harness automatically, through a small number of rounds of iterative code refinement based on feedback from the game environment. The resulting harness prevented all illegal moves in 145 different TextArena games, including both single-player and two-player games, and enabled the smaller Gemini-2.5-Flash model to outperform larger models such as Gemini-2.5-Pro.

Pushing the technique further, the researchers got Gemini-2.5-Flash to generate an entire policy in code, eliminating the need to call the language model at decision-making time. The resulting code-policy received a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena single-player games, while also being more cost-effective than using the larger models.

Two-player games may require strategic reasoning about opponents' policies, which could potentially be addressed through Monte Carlo Tree Search (MCTS) methods at runtime; however, challenges remain in developing a code world model for search purposes in text-based games.

Overall, the study presents a promising avenue for improving LLM agents by automatically synthesizing code harnesses or entire policies: a smaller model that writes its own harness or policy can outperform a much larger model at lower cost.
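The refinement loop described in this summary can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the environment rule `env_accepts`, the probe set, and the `fake_llm_refine` stub (which returns pre-written drafts instead of calling a model) are all hypothetical. What it shows is the shape of the loop: compile a candidate harness, compare its verdicts with the environment's, and hand the disagreements back as feedback for the next draft.

```python
from typing import Callable, List, Set, Tuple

State = Set[int]  # cells already taken in a toy 9-cell game
Probe = Tuple[State, int]


def env_accepts(state: State, action: int) -> bool:
    """Toy environment rule (assumed): pick a free cell numbered 0..8."""
    return 0 <= action <= 8 and action not in state


# Hypothetical stand-ins for successive model-written drafts of the harness.
DRAFTS = [
    "def is_legal(state, action):\n    return True\n",
    "def is_legal(state, action):\n    return 0 <= action <= 8\n",
    "def is_legal(state, action):\n"
    "    return 0 <= action <= 8 and action not in state\n",
]


def fake_llm_refine(round_index: int, feedback: List[str]) -> str:
    """A real system would send `feedback` to the model; this stub ignores it."""
    return DRAFTS[min(round_index, len(DRAFTS) - 1)]


def compile_harness(source: str) -> Callable[[State, int], bool]:
    """Turn harness source code into a callable legality check."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["is_legal"]


def collect_feedback(is_legal: Callable[[State, int], bool],
                     probes: List[Probe]) -> List[str]:
    """List every probe where the harness and the environment disagree."""
    return [
        f"state={sorted(state)} action={action}: harness disagreed with environment"
        for state, action in probes
        if is_legal(state, action) != env_accepts(state, action)
    ]


def synthesize(probes: List[Probe], max_rounds: int = 5) -> Tuple[str, int]:
    """Refine until a draft agrees with the environment on all probes."""
    feedback: List[str] = []
    for round_index in range(max_rounds):
        source = fake_llm_refine(round_index, feedback)
        feedback = collect_feedback(compile_harness(source), probes)
        if not feedback:
            return source, round_index + 1
    raise RuntimeError("no draft passed within the round budget")


PROBES: List[Probe] = [(set(), 4), ({4}, 4), ({1, 2}, 9), ({0}, -1), ({3}, 5)]
final_source, rounds_used = synthesize(PROBES)
```

In this toy run the first two drafts are rejected because they accept an occupied or out-of-range cell, and the third draft is kept. A practical system would gather feedback from actual game rollouts rather than a fixed probe list.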

Summary

In a chess competition, a computer program called Gemini-2.5-Flash lost many games because it made illegal moves. People usually had to manually fix these mistakes in the past. But now, a new method called "AutoHarness" helps Gemini-2.5-Flash learn from its errors and play better. With this new technique, Gemini-2.5-Flash can make good moves without making any mistakes like before. This improvement helped Gemini-2.5-Flash win more games and be better than other big computer models.

Definitions

1. Illegal moves: Actions in the game that are against the rules.
2. Models: Computer programs designed to perform specific tasks or solve problems.
3. Harnesses: Protective measures put in place to prevent failures or mistakes.
4. Code harness: A set of instructions that guide a computer program on how to behave in certain situations.
5. Policy: A set of rules or guidelines that dictate decision-making processes.
6. Monte Carlo Tree Search (MCTS): A method used in artificial intelligence for decision-making and strategic planning based on random sampling and analysis.

In recent years, language models have been making headlines for their performance on tasks such as text generation and question answering. When these models are used as agents in games, however, a common problem arises: illegal moves. The recent Kaggle GameArena chess competition made this plain, with 78% of the losses by the Gemini-2.5-Flash model attributed to illegal moves.

Researchers have usually dealt with this by writing "harnesses": hand-written code that surrounds the language model and prevents it from making illegal moves. This approach works, but it can be time-consuming and may not always result in optimal performance.

In the study "AutoHarness: improving LLM agents by automatically synthesizing a code harness," Xinghua Lou and colleagues present a different solution. They demonstrate that Gemini-2.5-Flash can generate a code harness automatically, through a small number of rounds of iterative code refinement based on feedback from the game environment. Instead of relying on manual coding, the model revises the harness code in response to the errors the environment reports.

The generated harness prevented all illegal moves in 145 different TextArena games, including both single-player and two-player games. It also enabled the smaller Gemini-2.5-Flash model to outperform larger models such as Gemini-2.5-Pro.

Pushing the technique further, the researchers got Gemini-2.5-Flash to generate an entire policy in code, eliminating the need to call the language model at decision-making time. The resulting code-policy received a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena single-player games, while also being more cost-effective than using these larger models. In other words, the smaller model is both cheaper to run and better-performing on these games.

There are still challenges to address. Two-player games may require strategic reasoning about opponents' policies, which could potentially be handled with Monte Carlo Tree Search (MCTS) methods at runtime. Developing a code world model for search purposes in text-based games also remains a challenge.

In conclusion, the study shows how language models used as game-playing agents can be improved through automatic code generation, and how a smaller model that synthesizes its own harness or policy can outperform a much larger one at lower cost. With further research in this area, we may see stronger results from language models as game-playing agents in the future.
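To make the code-as-policy idea concrete, here is a small sketch in which every decision comes from ordinary code and no model is called while playing. Everything in it is an assumption for illustration: `GuessGame` is a made-up single-player environment (not a TextArena API), and `code_policy` is a hand-written bisection rule standing in for a policy that, in the paper, the model would write.

```python
from dataclasses import dataclass


@dataclass
class GuessGame:
    """Toy single-player game (hypothetical): find `secret` in [low, high]."""
    secret: int
    low: int = 1
    high: int = 20
    turns_left: int = 5

    def step(self, guess: int) -> str:
        """Consume one turn and reply 'correct', 'higher', or 'lower'."""
        if not self.low <= guess <= self.high:
            raise ValueError("illegal guess")
        self.turns_left -= 1
        if guess == self.secret:
            return "correct"
        return "higher" if guess < self.secret else "lower"


def code_policy(low: int, high: int) -> int:
    """Pure-code policy: bisect the interval that is still possible."""
    return (low + high) // 2


def play(game: GuessGame) -> float:
    """Run one episode; reward is 1.0 if the secret is found in time."""
    low, high = game.low, game.high
    while game.turns_left > 0:
        guess = code_policy(low, high)
        reply = game.step(guess)
        if reply == "correct":
            return 1.0
        if reply == "higher":
            low = guess + 1
        else:
            high = guess - 1
    return 0.0


# Average reward over every possible secret in the default range.
average_reward = sum(play(GuessGame(secret=s)) for s in range(1, 21)) / 20
```

Because the policy is plain code, each decision costs a few arithmetic operations rather than a model call, which is the source of the cost advantage the article mentions.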

Created on 13 Jun. 2026

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.