Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics

AI-generated keywords: Automated theorem proving

AI-generated Key Points

  • Development of Large Language Models (LLMs) for reasoning across diverse scientific domains is a significant challenge in automated theorem proving
  • Specialized prover models derived from cutting-edge LLMs show impressive performance on math benchmarks but face limitations in adapting to evolving mathematical libraries
  • General-purpose LLMs like Claude and GPT possess broad knowledge spanning various domains, exhibit strong natural language understanding and problem-solving skills, but lack explicit training for formalizing statements or constructing proofs in Lean
  • Ax-Prover emerges as a multi-agent system designed for automated theorem proving in Lean, bridging the gap between specialized provers and general-purpose LLMs through the Model Context Protocol (MCP)
  • Ax-Prover demonstrates competitive performance on public math datasets and showcases superior capabilities on novel challenges, offering a generalizable methodology for formal verification across diverse scientific domains

Authors: Marco Del Tredici, Jacob McCarran, Benjamin Breen, Javier Aspuru Mijares, Weichen Winston Yin, Jacob M. Taylor, Frank Koppens, Dirk Englund

License: CC BY 4.0

Abstract: We present Ax-Prover, a multi-agent system for automated theorem proving in Lean that can solve problems across diverse scientific domains and operate either autonomously or collaboratively with human experts. To achieve this, Ax-Prover approaches scientific problem solving through formal proof generation, a process that demands both creative reasoning and strict syntactic rigor. Ax-Prover meets this challenge by equipping Large Language Models (LLMs), which provide knowledge and reasoning, with Lean tools via the Model Context Protocol (MCP), which ensure formal correctness. To evaluate its performance as an autonomous prover, we benchmark our approach against frontier LLMs and specialized prover models on two public math benchmarks and on two Lean benchmarks we introduce in the fields of abstract algebra and quantum theory. On public datasets, Ax-Prover is competitive with state-of-the-art provers, while it largely outperforms them on the new benchmarks. This shows that, unlike specialized systems that struggle to generalize, our tool-based agentic theorem prover approach offers a generalizable methodology for formal verification across diverse scientific domains. Furthermore, we demonstrate Ax-Prover's assistant capabilities in a practical use case, showing how it enabled an expert mathematician to formalize the proof of a complex cryptography theorem.

Submitted to arXiv on 14 Oct. 2025

Results of the summarizing process for the arXiv paper: 2510.12787v1

Developing Large Language Models (LLMs) that can reason across diverse scientific domains is a significant challenge in automated theorem proving. LLM-based formal reasoning systems have made remarkable progress in mathematics, notably with Lean, an open-source programming language and interactive proof assistant, but much work remains to make them generalize beyond specific domains. Specialized prover models derived from cutting-edge LLMs have shown impressive performance on math benchmarks such as miniF2F and PutnamBench, but they struggle to adapt to evolving mathematical libraries such as Mathlib.
General-purpose LLMs such as Claude and GPT, on the other hand, possess broad knowledge spanning mathematics, physics, and computer science. They exhibit strong natural language understanding and problem-solving skills, and they are easy to deploy and integrate into different workflows through APIs. However, they lack explicit training for formalizing statements or constructing proofs in Lean, which hinders their ability to interface with the formal reasoning infrastructure that theorem proving requires.
Ax-Prover, a multi-agent system for automated theorem proving in Lean, is designed to bridge this gap between specialized provers and general-purpose LLMs. By equipping LLMs with Lean tools via the Model Context Protocol (MCP), Ax-Prover supports formal proof generation, which demands both creative reasoning and strict syntactic rigor. The authors benchmark it against frontier LLMs and specialized prover models on two public math datasets and on two new Lean benchmarks, which they introduce, in abstract algebra and quantum theory. Ax-Prover is competitive with state-of-the-art provers on the public datasets and largely outperforms them on the new benchmarks.
Moreover, Ax-Prover's assistant capabilities are highlighted through a practical use case where it aids an expert mathematician in formalizing a complex cryptography theorem. This showcases how Ax-Prover's tool-based agentic approach offers a generalizable methodology for formal verification across diverse scientific domains while enabling collaboration between human experts and AI systems. Overall, Ax-Prover represents a promising advancement towards scalable and flexible automated theorem proving systems that can operate autonomously or collaboratively with human experts across various scientific disciplines.
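To make the phrase "creative reasoning and strict syntactic rigor" concrete, here is a minimal Lean 4 sketch of what a formal statement and its proof look like. The theorem names `toy_add_comm` and `toy_add_comm'` are hypothetical and chosen only for illustration; the statements are toy examples and are not drawn from the paper or its benchmarks. Only `Nat.add_comm`, a lemma from Lean's core library, is assumed to exist.

```lean
-- Toy statements for illustration only; not taken from the paper's benchmarks.

-- Tactic-style proof: rewrite the left-hand side with the library lemma
-- `Nat.add_comm`; the resulting goal `b + a = b + a` is closed by reflexivity.
theorem toy_add_comm (a b : Nat) : a + b = b + a := by
  rw [Nat.add_comm]

-- Term-style proof of the same fact: apply the library lemma directly.
theorem toy_add_comm' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Choosing which lemma or tactic to use is the reasoning step. The rigor comes from the Lean checker, which rejects the file if a lemma name is misspelled or a goal is left open. This accept-or-reject feedback is the kind of signal that Lean tooling can return to a language model.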
Created on 15 Oct. 2025
