Baldur: Whole-Proof Generation and Repair with Large Language Models

AI-generated keywords: Baldur, Automated Proof Synthesis, Formal Verification, Transformers, Repair Model

AI-generated Key Points

The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.

  • Baldur is a new method for automating formal verification.
  • It uses large language models to generate a complete proof for a theorem in one pass, rather than one proof step at a time.
  • It combines a proof generation model with a fine-tuned repair model to increase proving power.
  • Whole-proof generation using transformers is possible and is as effective as search-based techniques, without requiring costly search.
  • Giving the model additional context, such as a prior failed proof attempt and the ensuing error message, enables proof repair and further improves automated proof generation.
  • Baldur is evaluated on a benchmark of 6,336 Isabelle/HOL theorems and their proofs, establishing a new state of the art for fully automated proof synthesis.
  • Baldur improves on the state-of-the-art tool Thor by automatically generating proofs for an additional 8.7% of the theorems; together, Baldur and Thor prove 65.7% of the theorems fully automatically.
  • This approach represents a major advance in automating formal verification through automated proof synthesis.
  • It reduces the laborious work associated with formal verification, which can improve software reliability and security in domains where formal methods are used to ensure correctness.
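
To put the reported percentages in concrete terms, the short calculation below converts them into approximate theorem counts. It is only a back-of-the-envelope sketch: it assumes that both 8.7% and 65.7% are fractions of the full 6,336-theorem benchmark, and that "additional" means percentage points on top of Thor's own result, which would put Thor alone at about 57.0%. The constant and function names are illustrative.

```python
# Back-of-the-envelope conversion of the reported percentages into theorem counts.
# Assumption: all percentages are taken over the full 6,336-theorem benchmark.

BENCHMARK_SIZE = 6336      # theorems in the benchmark (from the abstract)
ADDITIONAL_TENTHS = 87     # 8.7%, stored in tenths of a percent to avoid float noise
COMBINED_TENTHS = 657      # 65.7%, Baldur and Thor together

# Assumption: "additional" means percentage points on top of Thor's own result.
THOR_ALONE_TENTHS = COMBINED_TENTHS - ADDITIONAL_TENTHS  # 570, i.e. 57.0%


def theorem_count(tenths_of_percent: int, total: int = BENCHMARK_SIZE) -> int:
    """Return the approximate number of theorems for a given share of the benchmark."""
    return round(total * tenths_of_percent / 1000)


if __name__ == "__main__":
    print("Added by Baldur:  ", theorem_count(ADDITIONAL_TENTHS))
    print("Thor alone:       ", theorem_count(THOR_ALONE_TENTHS))
    print("Baldur plus Thor: ", theorem_count(COMBINED_TENTHS))
```

Under these assumptions, the additional 8.7% corresponds to roughly 551 theorems and the combined 65.7% to roughly 4,163 theorems.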

Authors: Emily First, Markus N. Rabe, Talia Ringer, Yuriy Brun

Abstract: Formally verifying software properties is a highly desirable but labor-intensive task. Recent work has developed methods to automate formal verification using proof assistants, such as Coq and Isabelle/HOL, e.g., by training a model to predict one proof step at a time, and using that model to search through the space of possible proofs. This paper introduces a new method to automate formal verification: We use large language models, trained on natural language text and code and fine-tuned on proofs, to generate whole proofs for theorems at once, rather than one step at a time. We combine this proof generation model with a fine-tuned repair model to repair generated proofs, further increasing proving power. As its main contributions, this paper demonstrates for the first time that: (1) Whole-proof generation using transformers is possible and is as effective as search-based techniques without requiring costly search. (2) Giving the learned model additional context, such as a prior failed proof attempt and the ensuing error message, results in proof repair and further improves automated proof generation. (3) We establish a new state of the art for fully automated proof synthesis. We reify our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs. In addition to empirically showing the effectiveness of whole-proof generation, repair, and added context, we show that Baldur improves on the state-of-the-art tool, Thor, by automatically generating proofs for an additional 8.7% of the theorems. Together, Baldur and Thor can prove 65.7% of the theorems fully automatically. This paper paves the way for new research into using large language models for automating formal verification.
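
The generate-then-repair idea described in the abstract can be sketched as a small control loop. This is a minimal illustration, not Baldur's implementation: the function names (`prove_with_repair`, `toy_generate`, `toy_check`, `toy_repair`), the `CheckResult` type, and the sampling parameter are all hypothetical, and the toy stubs stand in for the fine-tuned language models and the Isabelle/HOL proof checker.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    """Outcome of checking a candidate proof (hypothetical type)."""
    ok: bool
    error: str = ""


def prove_with_repair(
    theorem: str,
    generate: Callable[[str], str],
    check: Callable[[str, str], CheckResult],
    repair: Callable[[str, str, str], str],
    num_samples: int = 1,
) -> Optional[str]:
    """Try whole-proof generation; on failure, attempt one repair per sample."""
    for _ in range(num_samples):
        # Step 1: generate a complete proof at once, with no step-by-step search.
        proof = generate(theorem)
        result = check(theorem, proof)
        if result.ok:
            return proof
        # Step 2: give the repair model the failed attempt and the error message.
        repaired = repair(theorem, proof, result.error)
        if check(theorem, repaired).ok:
            return repaired
    return None  # no checked proof found within the sample budget


# Toy stand-ins so the sketch runs; a real system would call models and a prover.
def toy_generate(theorem: str) -> str:
    return "by simp"


def toy_check(theorem: str, proof: str) -> CheckResult:
    if proof == "by auto":
        return CheckResult(ok=True)
    return CheckResult(ok=False, error="Failed to finish proof")


def toy_repair(theorem: str, failed_proof: str, error: str) -> str:
    return "by auto"


if __name__ == "__main__":
    print(prove_with_repair("lemma example: ...", toy_generate, toy_check, toy_repair))
```

In this toy run the first attempt fails the check, and the repair step, which sees both the failed proof and the error message, produces the proof that passes. The proof checker remains the final judge in both branches, so an incorrect model output is rejected rather than accepted.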

Submitted to arXiv on 08 Mar. 2023

Results of the summarizing process for the arXiv paper: 2303.04910v1

Baldur is a new method for automating formal verification that uses large language models to generate complete proofs for theorems in one go rather than step by step. This approach combines a proof generation model with a fine-tuned repair model to increase proving power. The authors demonstrate that whole-proof generation using transformers is possible and as effective as search-based techniques without requiring costly search. They also show that providing the learned model with additional context such as a prior failed proof attempt and the ensuing error message can result in proof repair and further improve automated proof generation.

The paper establishes a new state of the art for fully automated proof synthesis by evaluating Baldur on a benchmark of 6,336 Isabelle/HOL theorems and their proofs. The authors empirically show the effectiveness of whole-proof generation, repair, and added context. They also demonstrate that Baldur improves on the state-of-the-art tool Thor by automatically generating proofs for an additional 8.7% of the theorems. Together, Baldur and Thor can prove 65.7% of the theorems fully automatically.

This paper opens up new possibilities for research into using large language models for automating formal verification. The authors' approach represents a major breakthrough in this field by enabling faster and more efficient formal verification through automated proof synthesis. By reducing laborious tasks associated with formal verification, this method has important implications for improving software reliability and security in various domains where formal methods are used to ensure correctness.
Created on 16 May. 2023
