SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

AI-generated keywords: Software Engineering

AI-generated Key Points

  • Large Language Model (LLM)-powered agents have shown impressive capabilities in automating tasks like static bug fixing
  • Traditional one-shot repair paradigms are insufficient for handling long-term software development with requirement changes and feature iterations
  • SWE-CI benchmark shifts the evaluation paradigm from short-term functional correctness to long-term maintainability in code generation
  • SWE-CI is the first repository-level benchmark based on Continuous Integration, assessing agents' ability to sustain code quality over extended periods of evolution
  • The benchmark comprises 100 tasks from real-world repositories with an average evolution history of 233 days and 71 consecutive commits
  • SWE-CI focuses on dynamic maintainability, offering insights into an agent's ability to adapt and evolve code over time
  • EvoScore in SWE-CI allows nuanced assessment of coding capabilities by measuring performance on future modifications
  • State-of-the-art models excel in functional correctness but struggle with sustaining code quality over prolonged evolution periods

Authors: Jialong Chen, Xander Xu, Hu Wei, Chuan Chen, Bing Zhao

License: CC BY 4.0

Abstract: Large language model (LLM)-powered agents have demonstrated strong capabilities in automating software engineering tasks such as static bug fixing, as evidenced by benchmarks like SWE-bench. However, in the real world, the development of mature software is typically predicated on complex requirement changes and long-term feature iterations, a process that static, one-shot repair paradigms fail to capture. To bridge this gap, we propose SWE-CI, the first repository-level benchmark built upon the Continuous Integration loop, aiming to shift the evaluation paradigm for code generation from static, short-term functional correctness toward dynamic, long-term maintainability. The benchmark comprises 100 tasks, each corresponding on average to an evolution history spanning 233 days and 71 consecutive commits in a real-world code repository. SWE-CI requires agents to systematically resolve these tasks through dozens of rounds of analysis and coding iterations. SWE-CI provides valuable insights into how well agents can sustain code quality throughout long-term evolution.

Submitted to arXiv on 04 Mar. 2026

Results of the summarizing process for the arXiv paper: 2603.03823v1

Large Language Model (LLM)-powered agents have shown strong capabilities in automating software engineering tasks such as static bug fixing, as demonstrated by benchmarks like SWE-bench. Developing mature software in the real world, however, involves intricate requirement changes and long-term feature iterations, a process that traditional one-shot repair paradigms fail to capture. To address this gap, the SWE-CI benchmark shifts the evaluation paradigm for code generation from short-term functional correctness to long-term maintainability.

SWE-CI is the first repository-level benchmark built upon the Continuous Integration loop. It aims to assess how well agents can sustain code quality throughout extended periods of evolution. The benchmark comprises 100 tasks derived from real-world code repositories, with an average evolution history spanning 233 days and 71 consecutive commits. Agents must resolve these tasks systematically through multiple rounds of analysis and coding iterations. By focusing on dynamic maintainability rather than static fixes, SWE-CI offers insight into an agent's ability to adapt and evolve code over time.

The motivation for benchmarks like SWE-CI is that software quality naturally degrades over time as maintenance progresses. Because maintenance accounts for a significant portion of total software lifecycle costs, models need to be evaluated on their capacity to maintain code effectively. The snapshot-style evaluation protocols used in benchmarks like HumanEval and LiveCodeBench overlook long-term code evolution: agents that produce quick fixes may pass initial tests but struggle when requirements evolve and interfaces change.

Experiments involving more than 10 billion tokens showed that state-of-the-art models excel at functional correctness tasks but have difficulty sustaining code quality over prolonged evolution periods. EvoScore, introduced as a proxy metric in SWE-CI, enables a nuanced assessment of an agent's coding capabilities by measuring its performance on future modifications. This gives SWE-CI distinctive diagnostic value in gauging an agent's ability to maintain codebase integrity amid evolving requirements.

In conclusion, SWE-CI evaluates the long-term coding proficiency of LLM-based agents through continuous integration processes, emphasizing maintainability alongside functional correctness.
Created on 10 Mar. 2026
