AutoHarness: improving LLM agents by automatically synthesizing a code harness
AI-generated Key Points
- 78% of losses by the Gemini-2.5-Flash model in the Kaggle GameArena chess competition were due to illegal moves, highlighting a common issue with using language models as agents.
- Practitioners often hand-write "harnesses" around LLMs to prevent failures such as illegal moves.
- In "AutoHarness", Xinghua Lou and colleagues show that Gemini-2.5-Flash can synthesize such a code harness automatically, using a small number of rounds of iterative code refinement driven by feedback from the game environment.
- The resulting harness prevents all illegal moves in 145 different TextArena games (both 1-player and 2-player), enabling the smaller Gemini-2.5-Flash to outperform larger models such as Gemini-2.5-Pro.
- By further pushing the technique, researchers got Gemini-2.5-Flash to generate an entire policy in code, eliminating the need for using the language model at decision-making time.
- The resulting code-policy receives a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena single-player games, while also being more cost-effective.
- Challenges remain in developing a code world model for search purposes in text-based two-player games requiring strategic reasoning about opponents' policies, which could potentially be addressed through Monte Carlo Tree Search (MCTS) methods at runtime.
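To make the idea of a harness concrete, here is a minimal sketch of a legality-checking wrapper around a move proposer. It is an illustration under assumptions, not the paper's implementation: the toy stone-taking game, the names `legal_moves`, `is_legal_action`, `harnessed_policy`, and `sloppy_proposer`, and the retry-then-fallback strategy are all hypothetical. In the paper the harness code is written by the LLM itself; here it is hand-written so the wrapping pattern is visible.

```python
from typing import Callable, List

def legal_moves(pile: int) -> List[int]:
    """Toy rule (assumed): remove 1, 2, or 3 stones, never more than remain."""
    return [n for n in (1, 2, 3) if n <= pile]

def is_legal_action(pile: int, move: int) -> bool:
    """The harness's legality check for a proposed move."""
    return move in legal_moves(pile)

def harnessed_policy(
    propose: Callable[[int, int], int],  # stand-in for an LLM call: (pile, attempt) -> move
    pile: int,
    max_retries: int = 3,
) -> int:
    """Ask the proposer for a move, rejecting illegal proposals.

    If every retry is illegal, fall back to the first legal move so that
    the environment never receives an illegal action.
    """
    for attempt in range(max_retries):
        move = propose(pile, attempt)
        if is_legal_action(pile, move):
            return move
    return legal_moves(pile)[0]

def sloppy_proposer(pile: int, attempt: int) -> int:
    """Hypothetical proposer that ignores the pile size on its first tries."""
    return 3 - attempt

if __name__ == "__main__":
    print(harnessed_policy(sloppy_proposer, pile=2))        # first proposal (3) is rejected
    print(harnessed_policy(lambda p, a: 7, pile=2))         # always illegal, so the fallback is used
```

The design choice worth noting is that the wrapper always returns a legal move: the proposer can only influence which legal move is chosen, not whether the move is legal.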
Authors: Xinghua Lou, Miguel Lázaro-Gredilla, Antoine Dedieu, Carter Wendelken, Wolfgang Lehrach, Kevin P. Murphy
Abstract: Despite significant strides in language models in the last few years, when used as agents, such models often try to perform actions that are not just suboptimal for a given state, but are strictly prohibited by the external environment. For example, in the recent Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves. Often people manually write "harnesses" around LLMs to prevent such failures. In this paper, we demonstrate that Gemini-2.5-Flash can automatically synthesize such a code harness, using a small number of rounds of iterative code refinement given feedback from the (game) environment. The resulting harness prevents all illegal moves in 145 different TextArena games (both 1-player and 2-player), enabling the smaller Gemini-2.5-Flash model to outperform larger models, such as Gemini-2.5-Pro. Pushing our technique to the limit, we can get Gemini-2.5-Flash to generate the entire policy in code, thus eliminating the need to use the LLM at decision making time. The resulting code-policy receives a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena 1-player games. Our results show that using a smaller model to synthesize a custom code harness (or entire policy) can outperform a much larger model, while also being more cost effective.
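The iterative refinement described in the abstract can be pictured as a loop: generate a candidate harness, run it against the environment, and feed the reported errors into the next round. The sketch below is only an assumption-laden illustration of that loop shape. The toy game, `env_is_legal`, `collect_feedback`, `mock_synthesizer`, and `refine` are hypothetical names, and the "synthesizer" is a fixed list of progressively better candidates standing in for an LLM call; it does not reproduce the paper's prompts, feedback format, or stopping criterion.

```python
from typing import Callable, List, Tuple

# A candidate harness answers: "is this move legal in this state?"
Harness = Callable[[int, int], bool]

def env_is_legal(pile: int, move: int) -> bool:
    """Ground-truth rules of the toy game, known only to the environment."""
    return 1 <= move <= min(3, pile)

def collect_feedback(harness: Harness) -> List[str]:
    """Compare the harness with the environment and describe each mismatch as text."""
    errors = []
    for pile in range(1, 7):
        for move in range(0, 5):
            truth = env_is_legal(pile, move)
            if harness(pile, move) != truth:
                errors.append(f"pile={pile} move={move}: environment says legal={truth}")
    return errors

def mock_synthesizer(feedback: List[str], round_idx: int) -> Harness:
    """Stand-in for the LLM. A real system would condition on `feedback`;
    this mock ignores it and simply returns the next prewritten candidate."""
    candidates: List[Harness] = [
        lambda pile, move: True,                       # accepts everything
        lambda pile, move: 1 <= move <= 3,             # forgets the pile size
        lambda pile, move: 1 <= move <= min(3, pile),  # matches the rules
    ]
    return candidates[min(round_idx, len(candidates) - 1)]

def refine(max_rounds: int = 5) -> Tuple[Harness, int]:
    """Refine until the environment reports no errors; return the harness and rounds used."""
    feedback: List[str] = []
    for round_idx in range(max_rounds):
        harness = mock_synthesizer(feedback, round_idx)
        feedback = collect_feedback(harness)
        if not feedback:
            return harness, round_idx + 1
    raise RuntimeError("no candidate harness passed within the round budget")

if __name__ == "__main__":
    harness, rounds = refine()
    print(f"accepted a harness after {rounds} rounds")
```

In this toy setting the feedback comes from an exhaustive comparison over a tiny state space; in a real game the environment would more plausibly surface errors only for moves actually attempted during play.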