Teaching Large Language Models to Self-Debug

AI-generated keywords: Self-Debugging, Code Generation, Performance, Feedback Messages, Error Messages

AI-generated Key Points

  • Large language models (LLMs) have impressive performance in code generation, but generating correct solutions for complex programming tasks can be challenging.
  • Prior works have designed program repair approaches to improve code generation performance.
  • The authors propose Self-Debugging, an approach that teaches a large language model to debug its predicted program via few-shot demonstrations.
  • The approach enables the model to perform rubber duck debugging without any feedback on code correctness or error messages.
  • Self-Debugging achieves state-of-the-art performance on several code generation benchmarks including text-to-SQL generation, C++-to-Python translation and text-to-Python generation.
  • On the Spider benchmark where there are no unit tests to verify prediction correctness, Self-Debugging with code explanation consistently improves the baseline by 2–3% and improves prediction accuracy on problems of the hardest label by 9%.
  • On TransCoder and MBPP where unit tests are available, Self-Debugging improves baseline accuracy by up to 12%.
  • By leveraging feedback messages and reusing failed predictions, Self-Debugging notably improves sample efficiency and can match or outperform baseline models that generate more than 10x as many candidate programs.
  • Future work includes improving the model's ability to understand code, identify errors, and follow error messages to fix bugs; better code explanation ability in particular is expected to lead to better debugging performance.

Authors: Xinyun Chen, Maxwell Lin, Nathanael Schärli, Denny Zhou

License: CC BY 4.0

Abstract: Large language models (LLMs) have achieved impressive performance on code generation. However, for complex programming tasks, generating the correct solution in one go becomes challenging, thus some prior works have designed program repair approaches to improve code generation performance. In this work, we propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations. In particular, we demonstrate that Self-Debugging can teach the large language model to perform rubber duck debugging; i.e., without any feedback on the code correctness or error messages, the model is able to identify its mistakes by explaining the generated code in natural language. Self-Debugging achieves the state-of-the-art performance on several code generation benchmarks, including the Spider dataset for text-to-SQL generation, TransCoder for C++-to-Python translation, and MBPP for text-to-Python generation. On the Spider benchmark where there are no unit tests to verify the correctness of predictions, Self-Debugging with code explanation consistently improves the baseline by 2-3%, and improves the prediction accuracy on problems of the hardest label by 9%. On TransCoder and MBPP where unit tests are available, Self-Debugging improves the baseline accuracy by up to 12%. Meanwhile, by leveraging feedback messages and reusing failed predictions, Self-Debugging notably improves sample efficiency, and can match or outperform baseline models that generate more than 10x candidate programs.

Submitted to arXiv on 11 Apr. 2023


Results of the summarizing process for the arXiv paper: 2304.05128v1

Large language models (LLMs) have shown impressive performance in code generation, but generating the correct solution for complex programming tasks in a single attempt can be challenging. To address this, prior works have designed program repair approaches to improve code generation performance. In this work, the authors propose Self-Debugging, an approach that teaches a large language model to debug its predicted program via few-shot demonstrations. The approach enables the model to perform rubber duck debugging: without any feedback on code correctness or error messages, the model identifies its mistakes by explaining the generated code in natural language.

The authors demonstrate that Self-Debugging achieves state-of-the-art performance on several code generation benchmarks, including Spider for text-to-SQL generation, TransCoder for C++-to-Python translation, and MBPP for text-to-Python generation. On the Spider benchmark, where there are no unit tests to verify prediction correctness, Self-Debugging with code explanation consistently improves the baseline by 2-3% and improves prediction accuracy on problems of the hardest label by 9%. On TransCoder and MBPP, where unit tests are available, Self-Debugging improves baseline accuracy by up to 12%. By leveraging feedback messages and reusing failed predictions, Self-Debugging notably improves sample efficiency and can match or outperform baseline models that generate more than 10x as many candidate programs.

The authors highlight the promise of improving the coding performance of large language models by teaching them to iteratively debug their own predictions instead of requiring them to generate correct code from scratch. Self-Debugging instructs the model to understand the code, identify errors, and follow error messages to fix bugs. Future work includes improving the model's ability to carry out all of these steps; in particular, better code explanation ability is expected to lead to better debugging performance. One direction is instructing models to describe the high-level semantic meaning of code along with implementation details in their explanations; another is including additional debugging information in the feedback, such as descriptions of potential bugs. Preliminary results suggest that model-generated feedback messages on semantic errors do not provide additional benefits on top of line-by-line code explanation, and future work can explore techniques to predict more informative error messages.
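
The generate, explain, and repair cycle described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `model` callable, the prompt wording, the `run_unit_tests` helper, and the `max_turns` limit are all assumptions made for illustration, and the paper's actual few-shot prompts and feedback formats are not reproduced here. The sketch covers only the setting where unit tests are available (as on TransCoder and MBPP).

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical model interface (an assumption): prompt text in, generated text out.
Model = Callable[[str], str]


def run_unit_tests(code: str, tests: List[str]) -> Optional[str]:
    """Run candidate code plus assert-style tests.

    Returns None if all tests pass, otherwise a short error message.
    Note: exec() on model output is unsafe outside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        for test in tests:
            exec(test, namespace)
    except Exception as exc:  # AssertionError, SyntaxError, NameError, ...
        return f"{type(exc).__name__}: {exc}"
    return None


def self_debug(model: Model, task: str, tests: List[str],
               max_turns: int = 3) -> Tuple[str, bool]:
    """Generate code, then repeatedly explain and repair it using test feedback.

    Returns the final code and whether it passed the tests.
    """
    code = model(f"Task: {task}\nWrite a Python solution.")
    for turn in range(max_turns + 1):
        error = run_unit_tests(code, tests)
        if error is None:
            return code, True
        if turn == max_turns:
            break  # debugging budget exhausted
        # Rubber-duck step: ask the model to explain its own code.
        explanation = model(f"Explain this code line by line:\n{code}")
        # Repair step: reuse the failed prediction together with the feedback.
        code = model(
            f"Task: {task}\nPrevious code:\n{code}\n"
            f"Explanation:\n{explanation}\nFeedback: {error}\n"
            "Write a corrected Python solution."
        )
    return code, False


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM: buggy on the first try, fixed once it sees feedback."""
    if prompt.startswith("Explain"):
        return "The function returns a - b."
    if "Feedback:" in prompt:
        return "def add(a, b):\n    return a + b"
    return "def add(a, b):\n    return a - b"


final_code, passed = self_debug(toy_model, "Add two numbers",
                                ["assert add(2, 3) == 5"])
```

With the toy model, the first prediction fails the test, and the second prediction, produced after the explanation and feedback are added to the prompt, passes. In the Spider setting, where no unit tests exist, the test call would have to be replaced by the model's own judgment based on its explanation; that variant is not shown here.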
Created on 12 Apr. 2023
