Teaching Large Language Models to Self-Debug
AI-generated Key Points
- Large language models (LLMs) have impressive performance in code generation, but generating correct solutions for complex programming tasks can be challenging.
- Prior works have designed program repair approaches to improve code generation performance.
- The authors propose a novel approach called Self-Debugging which teaches a large language model to debug its predicted program via few-shot demonstrations.
- The approach enables the model to perform rubber duck debugging: without any feedback on code correctness or error messages, the model identifies its mistakes by explaining the generated code in natural language.
- Self-Debugging achieves state-of-the-art performance on several code generation benchmarks including text-to-SQL generation, C++-to-Python translation and text-to-Python generation.
- On the Spider benchmark where there are no unit tests to verify prediction correctness, Self-Debugging with code explanation consistently improves the baseline by 2–3% and improves prediction accuracy on problems of the hardest label by 9%.
- On TransCoder and MBPP where unit tests are available, Self-Debugging improves baseline accuracy by up to 12%.
- By leveraging feedback messages and reusing failed predictions, Self-Debugging notably improves sample efficiency and can match or outperform baseline models that generate more than 10x as many candidate programs.
- Future work includes improving the model's code explanation ability, which is expected to lead to better debugging performance.
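The feedback loop described in the key points can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Model` callable, the prompt wording, `run_tests`, `self_debug`, and `toy_model` are all hypothetical names introduced here, and the toy model stands in for a real LLM call. The sketch covers only the case where unit tests are available; the explanation-only variant would replace `run_tests` with a step that asks the model to explain its own code.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical model interface (an assumption): prompt text in, program source out.
Model = Callable[[str], str]
TestCase = Tuple[tuple, object]  # (positional arguments, expected return value)


def run_tests(source: str, tests: List[TestCase], func_name: str) -> Optional[str]:
    """Run the candidate program; return a feedback message, or None if all tests pass."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # define the candidate function
        func = namespace[func_name]
        for args, expected in tests:
            actual = func(*args)
            if actual != expected:
                return f"{func_name}{args!r} returned {actual!r}, expected {expected!r}"
    except Exception as exc:  # syntax errors, missing function, runtime errors
        return f"{type(exc).__name__}: {exc}"
    return None


def self_debug(model: Model, task: str, tests: List[TestCase],
               func_name: str, max_turns: int = 3) -> Tuple[str, bool, int]:
    """Generate a program, then feed failures back to the model for up to max_turns fixes."""
    prompt = task
    source = ""
    for turn in range(max_turns + 1):
        source = model(prompt)
        feedback = run_tests(source, tests, func_name)
        if feedback is None:
            return source, True, turn
        # Reuse the failed prediction together with the feedback message.
        prompt = (f"{task}\n\nPrevious attempt:\n{source}\n"
                  f"Feedback: {feedback}\nFix the code.")
    return source, False, max_turns


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM: buggy on the first call, fixed once it sees feedback."""
    if "Feedback:" in prompt:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b):\n    return a - b\n"


final_source, passed, debug_turns = self_debug(
    toy_model, "Write add(a, b) that returns the sum.", [((1, 2), 3)], "add")
```

With the toy model, the first candidate fails the single test, the failure message is appended to the prompt, and the second candidate passes after one debugging turn. Running model output through `exec` is acceptable for a sketch but would need sandboxing in any real setting.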
Authors: Xinyun Chen, Maxwell Lin, Nathanael Schärli, Denny Zhou
Abstract: Large language models (LLMs) have achieved impressive performance on code generation. However, for complex programming tasks, generating the correct solution in one go becomes challenging, thus some prior works have designed program repair approaches to improve code generation performance. In this work, we propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations. In particular, we demonstrate that Self-Debugging can teach the large language model to perform rubber duck debugging; i.e., without any feedback on the code correctness or error messages, the model is able to identify its mistakes by explaining the generated code in natural language. Self-Debugging achieves the state-of-the-art performance on several code generation benchmarks, including the Spider dataset for text-to-SQL generation, TransCoder for C++-to-Python translation, and MBPP for text-to-Python generation. On the Spider benchmark where there are no unit tests to verify the correctness of predictions, Self-Debugging with code explanation consistently improves the baseline by 2-3%, and improves the prediction accuracy on problems of the hardest label by 9%. On TransCoder and MBPP where unit tests are available, Self-Debugging improves the baseline accuracy by up to 12%. Meanwhile, by leveraging feedback messages and reusing failed predictions, Self-Debugging notably improves sample efficiency, and can match or outperform baseline models that generate more than 10x candidate programs.