SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

AI-generated keywords: Software Engineering

AI-generated Key Points

Large Language Model (LLM)-powered agents have shown impressive capabilities in automating tasks like static bug fixing
Traditional one-shot repair paradigms are insufficient for handling long-term software development with requirement changes and feature iterations
The SWE-CI benchmark shifts the evaluation paradigm from short-term functional correctness to long-term maintainability in code generation
SWE-CI is the first repository-level benchmark based on Continuous Integration, assessing agents' ability to sustain code quality over extended periods of evolution
The benchmark comprises 100 tasks from real-world repositories with an average evolution history of 233 days and 71 consecutive commits
SWE-CI focuses on dynamic maintainability, offering insights into an agent's ability to adapt and evolve code over time
EvoScore in SWE-CI allows nuanced assessment of coding capabilities by measuring performance on future modifications
State-of-the-art models excel in functional correctness but struggle with sustaining code quality over prolonged evolution periods

Authors: Jialong Chen, Xander Xu, Hu Wei, Chuan Chen, Bing Zhao

arXiv: 2603.03823v1 (cs.SE)

License: CC BY 4.0

Abstract: Large language model (LLM)-powered agents have demonstrated strong capabilities in automating software engineering tasks such as static bug fixing, as evidenced by benchmarks like SWE-bench. However, in the real world, the development of mature software is typically predicated on complex requirement changes and long-term feature iterations -- a process that static, one-shot repair paradigms fail to capture. To bridge this gap, we propose SWE-CI, the first repository-level benchmark built upon the Continuous Integration loop, aiming to shift the evaluation paradigm for code generation from static, short-term functional correctness toward dynamic, long-term maintainability. The benchmark comprises 100 tasks, each corresponding on average to an evolution history spanning 233 days and 71 consecutive commits in a real-world code repository. SWE-CI requires agents to systematically resolve these tasks through dozens of rounds of analysis and coding iterations. SWE-CI provides valuable insights into how well agents can sustain code quality throughout long-term evolution.

Submitted to arXiv on 04 Mar. 2026

Results of the summarizing process for the arXiv paper: 2603.03823v1

Comprehensive Summary

In the realm of software engineering, Large Language Model (LLM)-powered agents have showcased impressive capabilities in automating tasks such as static bug fixing. This has been demonstrated by benchmarks like SWE-bench. However, the real-world scenario of developing mature software involves intricate requirement changes and long-term feature iterations. Traditional one-shot repair paradigms fail to capture this process.

To address this gap, the innovative SWE-CI benchmark has been introduced, marking a shift in the evaluation paradigm for code generation from short-term functional correctness to long-term maintainability. SWE-CI stands out as the first repository-level benchmark built upon the Continuous Integration loop. Its aim is to assess how well agents can sustain code quality throughout extended periods of evolution. The benchmark comprises 100 tasks derived from real-world code repositories with an average evolution history spanning 233 days and 71 consecutive commits. SWE-CI challenges agents to systematically resolve these tasks through multiple rounds of analysis and coding iterations. By focusing on dynamic maintainability rather than static fixes, SWE-CI offers valuable insights into an agent's ability to adapt and evolve code over time.

The motivation behind designing benchmarks like SWE-CI stems from the understanding that software quality naturally degrades over time as maintenance progresses. With maintenance activities accounting for a significant portion of total software lifecycle costs, there is a pressing need to evaluate models based on their capacity to maintain code effectively. The existing snapshot-style evaluation protocols used in benchmarks like HumanEval and LiveCodeBench overlook the crucial aspect of long-term code evolution. Agents that produce quick fixes may pass initial tests but struggle when faced with evolving requirements and changing interfaces.

Through extensive experiments involving more than 10 billion tokens, it was observed that while state-of-the-art models excel in functional correctness tasks, they encounter challenges in sustaining code quality over prolonged evolution periods. The introduction of EvoScore as a proxy metric in SWE-CI enables a nuanced assessment of an agent's coding capabilities by measuring its performance on future modifications. This comprehensive evaluation approach sheds light on the distinctive diagnostic value of SWE-CI in gauging an agent's ability to maintain codebase integrity amidst evolving requirements.

In conclusion, SWE-CI represents a groundbreaking initiative in evaluating LLM-based agents' long-term coding proficiency through continuous integration processes. By emphasizing maintainability alongside functional correctness, this benchmark offers valuable insights into how well agents can adapt and evolve codebases over extended periods of time.

Layman's Summary

- Big smart computer programs have gotten really good at fixing mistakes in computer code.
- The old way of fixing mistakes all at once doesn't work well for making software that changes a lot.
- A new test called SWE-CI looks at how well these programs can keep code working well as it changes over time.
- This test uses real tasks from computer projects and sees how the programs handle them over many days and changes.
- SWE-CI helps us see if these programs can keep up with changing code and make it better over time.

Definitions

- Large Language Model (LLM): A computer program trained on large amounts of text so that it can read and write language, including computer code.
- Static bug fixing: Fixing a mistake in one fixed snapshot of the code in a single attempt, without following how the code changes afterwards.
- Continuous Integration (CI): A practice in software development where changes are frequently integrated into the main project and automatically tested to catch problems early.
- Maintainability: How easy it is to keep something working well over time, like software code.

Blog article

Introduction

In recent years, Large Language Model (LLM)-powered agents have shown impressive capabilities in automating tasks such as static bug fixing. This has been demonstrated by benchmarks like SWE-bench. However, the real-world scenario of developing mature software involves intricate requirement changes and long-term feature iterations. Traditional one-shot repair paradigms fail to capture this process, leading to a gap in evaluating an agent's coding proficiency over extended periods of time. To address this issue, researchers have introduced the innovative SWE-CI benchmark. This benchmark marks a shift in the evaluation paradigm for code generation from short-term functional correctness to long-term maintainability. It is the first repository-level benchmark built upon the Continuous Integration loop and aims to assess how well agents can sustain code quality throughout extended periods of evolution.

The Motivation Behind SWE-CI

The motivation behind designing benchmarks like SWE-CI stems from the understanding that software quality naturally degrades over time as maintenance progresses. With maintenance activities accounting for a significant portion of total software lifecycle costs, there is a pressing need to evaluate models based on their capacity to maintain code effectively. Existing snapshot-style evaluation protocols used in benchmarks like HumanEval and LiveCodeBench overlook the crucial aspect of long-term code evolution. Agents that produce quick fixes may pass initial tests but struggle when faced with evolving requirements and changing interfaces.

The Importance of Long-Term Code Evolution

Through extensive experiments involving more than 10 billion tokens, it was observed that while state-of-the-art models excel in functional correctness tasks, they encounter challenges in sustaining code quality over prolonged evolution periods. This highlights the importance of evaluating an agent's ability to adapt and evolve codebases over extended periods of time.

What Sets SWE-CI Apart?

SWE-CI stands out as the first repository-level benchmark built upon the Continuous Integration loop. Its aim is to assess how well agents can sustain code quality throughout extended periods of evolution. The benchmark comprises 100 tasks derived from real-world code repositories with an average evolution history spanning 233 days and 71 consecutive commits.
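The summary does not describe the benchmark's actual harness, so the following is only a toy sketch of what a Continuous Integration style evaluation loop might look like: requirements arrive in rounds, the agent edits the code each round, and every test seen so far is re-run. All names here (`run_ci_rounds`, `careful_agent`, `hacky_agent`) and the dict-based "codebase" are hypothetical stand-ins, not part of SWE-CI.

```python
from typing import Callable, Dict, List

Codebase = Dict[str, int]                          # toy stand-in for a repository
Test = Callable[[Codebase], bool]                  # a test inspects the codebase
Agent = Callable[[Codebase, List[str]], Codebase]  # hypothetical agent interface


def run_ci_rounds(codebase: Codebase,
                  rounds: List[Dict[str, Test]],
                  agent: Agent) -> List[float]:
    """Return the pass rate over all accumulated tests after each round."""
    active: Dict[str, Test] = {}
    pass_rates: List[float] = []
    for new_tests in rounds:
        active.update(new_tests)                   # requirements accumulate
        failing = [name for name, t in active.items() if not t(codebase)]
        codebase = agent(codebase, failing)        # one analysis + coding iteration
        passed = sum(t(codebase) for t in active.values())
        pass_rates.append(passed / len(active))    # old tests are re-run too
    return pass_rates


def careful_agent(cb: Codebase, failing: List[str]) -> Codebase:
    """Adds what the failing tests need while keeping earlier work."""
    new_cb = dict(cb)
    for name in failing:
        new_cb[name.split("_", 1)[1]] = 1
    return new_cb


def hacky_agent(cb: Codebase, failing: List[str]) -> Codebase:
    """Rewrites the codebase from scratch, discarding earlier features."""
    return {name.split("_", 1)[1]: 1 for name in failing}


rounds = [
    {"has_add": lambda c: c.get("add") == 1},
    {"has_sub": lambda c: c.get("sub") == 1},
]

print(run_ci_rounds({}, rounds, careful_agent))    # [1.0, 1.0]
print(run_ci_rounds({}, rounds, hacky_agent))      # [1.0, 0.5]
```

Both toy agents pass the first round, but only the one that preserves earlier work keeps passing once older tests are re-run, which is the kind of difference a snapshot-style evaluation would not reveal.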

A Shift in Evaluation Paradigm

The introduction of SWE-CI marks a shift in the evaluation paradigm for code generation from short-term functional correctness to long-term maintainability. By focusing on dynamic maintainability rather than static fixes, SWE-CI offers valuable insights into an agent's ability to adapt and evolve code over time.

EvoScore: A Comprehensive Metric

One of the key features of SWE-CI is the introduction of EvoScore as a proxy metric. This metric enables a nuanced assessment of an agent's coding capabilities by measuring its performance on future modifications. It takes into account not only initial fixes but also subsequent changes made to the codebase, providing a more comprehensive evaluation approach.
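This summary does not state how EvoScore is computed. Purely as an illustration of the general idea of "measuring performance on future modifications", the sketch below averages per-round pass rates while giving later rounds more weight. The function name `future_weighted_score`, the `gamma` parameter, and the geometric weighting are assumptions made for this example and should not be read as the paper's definition.

```python
from typing import Sequence


def future_weighted_score(pass_rates: Sequence[float], gamma: float = 1.5) -> float:
    """Weighted mean of per-round pass rates; round i gets weight gamma**i.

    With gamma > 1, later rounds count more than earlier ones.
    This is a hypothetical illustration, not the EvoScore formula.
    """
    if not pass_rates:
        raise ValueError("pass_rates must not be empty")
    weights = [gamma ** i for i in range(len(pass_rates))]
    return sum(w * r for w, r in zip(weights, pass_rates)) / sum(weights)


# An agent that starts well but degrades scores lower than one that
# starts poorly but holds up in later rounds.
degrading = future_weighted_score([1.0, 0.5], gamma=2.0)   # (1*1.0 + 2*0.5) / 3
improving = future_weighted_score([0.5, 1.0], gamma=2.0)   # (1*0.5 + 2*1.0) / 3
print(degrading < improving)                               # True
```

Under this toy weighting, two agents with the same average pass rate are separated by when their failures occur, which is one way a metric could reward code that stays healthy as it evolves.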

Conclusion

In conclusion, SWE-CI represents a groundbreaking initiative in evaluating LLM-based agents' long-term coding proficiency through continuous integration processes. By emphasizing maintainability alongside functional correctness, this benchmark offers valuable insights into how well agents can adapt and evolve codebases over extended periods of time. With its unique approach and comprehensive metrics, SWE-CI provides researchers with a powerful tool for assessing an agent's coding capabilities in real-world scenarios.

Created on 10 Mar. 2026
