SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

AI-generated keywords: Language models, SWE-bench, software engineering, evaluation framework, advancements

AI-generated Key Points

Language models have advanced rapidly, surpassing our ability to evaluate them effectively.
Real-world software engineering has become a valuable testbed for assessing the next generation of language models.
SWE-bench is an evaluation framework with 2,294 software engineering problems sourced from GitHub issues and pull requests across 12 Python repositories.
Models are tasked with editing codebases to address issues described in problem statements, which often requires understanding and coordinating changes across multiple functions, classes, and files simultaneously.
State-of-the-art proprietary models and the fine-tuned model SWE-Llama can resolve only the simplest issues; the best-performing model, Claude 2, solves a mere 1.96% of them.
Progress on this benchmark signifies advancements towards more practical, intelligent, and autonomous language models.
The source code has been anonymized and organized into separate directories for reproducibility purposes.
Plans are in place to release SWE-bench as an open-source repository with comprehensive documentation outlining its structure and usage.
The dataset's continually updatable nature allows for ongoing evaluation on new task instances created after model training dates.
Evaluation measures ensure proposed solutions not only address the stated issue but also maintain prior functionality through numerous tests.

Authors: Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan

arXiv: 2310.06770v3 (cs.CL)

Data, code, and leaderboard are available at https://www.swebench.com. Published at ICLR 2024: https://openreview.net/forum?id=VTF8yNQM66

License: CC BY 4.0

Abstract: Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of $2,294$ software engineering problems drawn from real GitHub issues and corresponding pull requests across $12$ popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere $1.96$% of the issues. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.

Submitted to arXiv on 10 Oct. 2023

Results of the summarizing process for the arXiv paper: 2310.06770v3

Comprehensive Summary
Key points
Layman's Summary
Blog article

Language models have advanced rapidly, outpacing our ability to evaluate them effectively. To guide their future development, it is crucial to study the frontier of their capabilities. Real-world software engineering has emerged as a valuable testbed for assessing the next generation of language models. SWE-bench is an evaluation framework comprising 2,294 software engineering problems sourced from real GitHub issues and their corresponding pull requests across 12 popular Python repositories. In SWE-bench, a language model is given a codebase and a description of an issue, and is tasked with editing the codebase to resolve that issue. These tasks often require understanding and coordinating changes across multiple functions, classes, and even files simultaneously. Models must interact with execution environments, process extremely long contexts, and carry out complex reasoning that goes well beyond traditional code generation tasks. Evaluation results show that both state-of-the-art proprietary models and the authors' fine-tuned model, SWE-Llama, can resolve only the simplest issues. The best-performing model, Claude 2, solves a mere 1.96% of the problems in SWE-bench. Progress on this benchmark signifies advancements towards more practical, intelligent, and autonomous language models. For reproducibility, the full source code has been anonymized and organized into separate directories corresponding to different sections of the paper. Inline documentation explains the purpose and usage of the components within the codebase. All 2,294 task instances are included alongside technical details on dataset collection and evaluation procedures. The authors plan to release SWE-bench as an open-source repository with comprehensive documentation outlining the benchmark's structure and usage. The collection framework will be part of this open-sourced codebase for easy maintenance and reproducibility.
SWE-bench differs from traditional NLP benchmarks in offering realistic software engineering tasks that demand skills comparable to those of experienced engineers. Because the dataset can be continually updated, models can be evaluated on new task instances created after their training cutoff dates. Issue descriptions are detailed and codebases are large, so models must identify the specific lines to edit within an extensive context. Robust evaluation measures ensure that proposed solutions not only address the stated issue but also preserve prior functionality, as verified by numerous tests. Cross-context code editing removes a constraint common to earlier benchmarks: edits may be needed across various parts of a codebase rather than within a single function or file. Overall, advancements on SWE-bench represent significant strides towards making language models more practical, intelligent, and autonomous in real-world software engineering scenarios.
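The evaluation criterion described above, where a solution must address the issue while preserving prior functionality, can be illustrated with a minimal sketch. The names `is_resolved`, `resolved_rate`, `fail_to_pass`, and `pass_to_pass`, as well as the test names in the usage lines, are assumptions made for illustration and are not taken from this summary. The sketch assumes each task comes with two groups of tests: tests that fail before the fix and should pass after it, and tests that already pass and must keep passing.

```python
def is_resolved(passed_after_patch: set[str],
                fail_to_pass: set[str],
                pass_to_pass: set[str]) -> bool:
    """Return True only if the patch fixes the issue and breaks nothing.

    passed_after_patch: names of tests that pass once the candidate patch is applied.
    fail_to_pass: (assumed) tests that failed before the fix and must now pass.
    pass_to_pass: (assumed) tests that passed before the fix and must still pass.
    """
    fixes_issue = fail_to_pass <= passed_after_patch      # subset check
    keeps_behavior = pass_to_pass <= passed_after_patch   # subset check
    return fixes_issue and keeps_behavior


def resolved_rate(results: list[bool]) -> float:
    """Percentage of task instances whose candidate patch was judged resolved."""
    return 100.0 * sum(results) / len(results) if results else 0.0


# Hypothetical usage: the second patch fixes the bug but breaks an existing test.
outcomes = [
    is_resolved({"test_bug", "test_old"}, {"test_bug"}, {"test_old"}),
    is_resolved({"test_bug"}, {"test_bug"}, {"test_old"}),
]
print(resolved_rate(outcomes))
```

Under this all-or-nothing rule, a patch that fixes the reported bug but breaks an existing test does not count as a success, which is one reason headline figures such as 1.96% can be so low.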

Summary
- Language models have improved a lot, but it's hard to check how good they are.
- Real-world software problems are a useful way to test new language models.
- SWE-bench is a collection of 2,294 real software issues from GitHub used to test models.
- Models need to understand and fix code problems in different parts of a program at the same time.
- Even advanced models can fix only the simplest problems, showing there's still a lot of room for improvement.

Definitions
- Language models: Programs that can understand and generate human language.
- Software engineering: Creating and maintaining computer programs.
- Evaluation framework: A system used to test and measure how well something works.
- Codebases: Collections of code that make up a program or software project.
- Proprietary models: Models owned by specific companies or organizations.

Language models have been a hot topic in natural language processing (NLP) for quite some time. These models, which are designed to understand and generate human language, have advanced at an astonishing pace in recent years. That rapid advancement brings a new challenge: how do we effectively evaluate increasingly capable language models? To further their development, it is crucial to test them at the frontier of their capabilities, in real-world scenarios.

This is where SWE-bench comes into play. It is an evaluation framework designed to assess the next generation of language models using real-world software engineering problems. SWE-bench comprises 2,294 problems sourced from actual GitHub issues and pull requests across 12 popular Python repositories. Each task asks a language model to edit a codebase to address the issue described in a problem statement. Doing so often requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, which goes well beyond traditional code generation. Models must also interact with execution environments, process extremely long contexts, and carry out complex reasoning, skills comparable to those of experienced engineers.

Evaluations of state-of-the-art proprietary models and the authors' fine-tuned model, SWE-Llama, show that they can resolve only the simplest issues. The best-performing model, Claude 2, solved a mere 1.96% of the problems, highlighting the need for further advances towards more practical, intelligent, and autonomous language models.

For reproducibility, all source code used in this research has been anonymized and organized into separate directories corresponding to different sections of the paper. Inline documentation explains the purpose and usage of the components within the codebase. All 2,294 task instances are included alongside technical details on dataset collection and evaluation procedures, so other researchers can replicate the study and analyze it further. The authors plan to release SWE-bench as an open-source repository with comprehensive documentation outlining the benchmark's structure and usage. The collection framework will also be part of this open-sourced codebase for easy maintenance and reproducibility.

What sets SWE-bench apart from traditional NLP benchmarks is its focus on realistic software engineering tasks. The dataset is continually updatable, allowing ongoing evaluation on new task instances created after a model's training cutoff date, so models can be tested on problems they cannot have seen during training. Issue descriptions in SWE-bench are detailed and codebases are large, requiring models to identify the specific lines to edit within an extensive context. Robust evaluation measures ensure that proposed solutions not only address the stated issue but also preserve prior functionality, as verified by numerous tests. Another distinctive aspect of SWE-bench is cross-context code editing: edits may be needed across various parts of a codebase rather than within a single function or file. This allows a more comprehensive assessment of a language model's ability to handle complex real-world scenarios.
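The idea of evaluating only on task instances created after a model's training cutoff can be sketched as a simple date filter. This is an illustration under stated assumptions, not the benchmark's actual tooling: the `TaskInstance` class, its `instance_id` and `created_at` fields, the `unseen_instances` helper, and the example identifiers and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TaskInstance:
    """Hypothetical record for one benchmark task."""
    instance_id: str   # assumed identifier, for illustration only
    created_at: date   # assumed creation date of the underlying issue or pull request


def unseen_instances(instances: list[TaskInstance],
                     training_cutoff: date) -> list[TaskInstance]:
    """Keep only tasks created strictly after the model's training cutoff."""
    return [task for task in instances if task.created_at > training_cutoff]


# Hypothetical usage with made-up identifiers and dates.
tasks = [
    TaskInstance("example-repo-101", date(2022, 3, 1)),
    TaskInstance("example-repo-202", date(2023, 9, 15)),
]
fresh = unseen_instances(tasks, training_cutoff=date(2023, 1, 1))
print([task.instance_id for task in fresh])
```

Filtering this way means a model cannot have encountered the reference fix for the remaining tasks in its training data, at the cost of a smaller evaluation set.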
In conclusion, advancements made on SWE-bench represent significant strides towards enhancing language models' practical applicability in real-world software engineering scenarios through improved intelligence and autonomy capabilities. By providing a realistic testbed for evaluating these models, we can continue to push their boundaries and pave the way for even more advanced language models in the future.

Created on 30 Jan. 2025

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.