cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree

AI-generated keywords: Retrieval-Augmented Generation (RAG)

AI-generated Key Points

  • Retrieval-Augmented Generation (RAG) enhances large-scale code generation tasks by grounding predictions in external code corpora
  • Traditional line-based chunking methods disrupt semantic structures, leading to a degradation in the quality of generated code
  • Chunking via Abstract Syntax Trees (CAST) breaks down large AST nodes into smaller, more manageable chunks and consolidates sibling nodes while adhering to predefined size limits
  • The CAST method generates self-contained and semantically coherent units across various programming languages and tasks, enhancing performance on diverse code generation tasks
  • CAST has been shown to boost Recall@5 by 4.3 points on RepoEval retrieval and Pass@1 by 2.67 points on SWE-bench generation
  • Structure-aware chunking is crucial for scaling up retrieval-enhanced code intelligence
  • Design goals for CAST include maintaining syntactic integrity, maximizing information density within each chunk, ensuring language invariance, and enabling seamless integration within existing RAG pipelines
  • AST Parsing supports syntax-aware chunking, accurately identifying and extracting meaningful chunks from source code while preserving its structural formatting

Authors: Yilin Zhang, Xinran Zhao, Zora Zhiruo Wang, Chenyang Yang, Jiayi Wei, Tongshuang Wu

License: CC BY-SA 4.0

Abstract: Retrieval-Augmented Generation (RAG) has become essential for large-scale code generation, grounding predictions in external code corpora to improve factuality. However, a critical yet underexplored aspect of RAG pipelines is chunking -- the process of dividing documents into retrievable units. Existing line-based chunking heuristics often break semantic structures, splitting functions or merging unrelated code, which can degrade generation quality. We propose chunking via Abstract Syntax Trees (cAST), a structure-aware method that recursively breaks large AST nodes into smaller chunks and merges sibling nodes while respecting size limits. This approach generates self-contained, semantically coherent units across programming languages and tasks, improving performance on diverse code generation tasks, e.g., boosting Recall@5 by 4.3 points on RepoEval retrieval and Pass@1 by 2.67 points on SWE-bench generation. Our work highlights the importance of structure-aware chunking for scaling retrieval-enhanced code intelligence.
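For readers unfamiliar with the retrieval metric quoted in the abstract, here is a minimal sketch of one common definition of Recall@k: the fraction of relevant items that appear among the top k retrieved items. The function name `recall_at_k` and the list-of-identifiers input format are assumptions made for illustration; the paper's exact evaluation protocol is not given on this page and may differ.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant items found in the top-k retrieved items.

    `retrieved` is a ranked list of chunk identifiers (best first) and
    `relevant` is a collection of ground-truth identifiers. Both formats
    are assumptions for this sketch, not the paper's data layout.
    """
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0  # convention chosen here: no ground truth means zero recall
    top_k = set(retrieved[:k])
    return len(top_k & relevant_set) / len(relevant_set)
```

Under this definition, a gain of "4.3 points" would presumably mean the score, expressed as a percentage, rises by 4.3.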

Submitted to arXiv on 18 Jun. 2025

Results of the summarizing process for the arXiv paper: 2506.15655v1

In the realm of code generation, Retrieval-Augmented Generation (RAG) has emerged as a crucial technique for enhancing large-scale code generation tasks by grounding predictions in external code corpora to improve accuracy. However, a key yet often overlooked aspect of RAG pipelines is the process of chunking, which involves dividing documents into retrievable units. Traditional line-based chunking methods frequently disrupt semantic structures by splitting functions or merging unrelated code segments, leading to a degradation in the quality of generated code. To address this issue, we introduce chunking via Abstract Syntax Trees (CAST), a structure-aware approach that breaks down large AST nodes into smaller, more manageable chunks and consolidates sibling nodes while adhering to predefined size limits. This method generates self-contained and semantically coherent units across various programming languages and tasks, ultimately enhancing performance on diverse code generation tasks. For instance, our approach has been shown to boost Recall@5 by 4.3 points on RepoEval retrieval and Pass@1 by 2.67 points on SWE-bench generation. Our work underscores the significance of structure-aware chunking in scaling up retrieval-enhanced code intelligence. By focusing on the initial stage of the RAG pipeline - chunking - we aim to parse source code into meaningful units such as functions or classes while preserving the overall structure of the code. These units are then grouped into coherent chunks that serve as retrievable contexts for subsequent retrieval processes and language model prompts.
The design goals for CAST revolve around four key principles: maintaining syntactic integrity by aligning chunk boundaries with complete syntactic units whenever possible; maximizing information density within each chunk up to a specified size limit; ensuring language invariance so that the algorithm can be applied uniformly across different programming languages and tasks; and enabling seamless integration within existing RAG pipelines through plug-and-play compatibility. AST Parsing plays a crucial role in supporting syntax-aware chunking, allowing us to accurately identify and extract meaningful chunks from source code while preserving its structural formatting. Our experiments demonstrate promising results, showcasing how leveraging AST-based chunking can significantly improve retrieval-enhanced code intelligence by enhancing semantic coherence and overall performance across various coding tasks.
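The split-then-merge idea described in the summary can be illustrated with a rough sketch. This is not the authors' implementation: it uses Python's built-in `ast` module (so it only handles Python source), measures chunk size as the count of non-whitespace characters, and the names `ast_chunks`, `chunk_range`, `stmt_children`, and `size` are all hypothetical choices made for this example. Sibling nodes are greedily merged while they fit within `max_size`; a node that is too large is split by recursing into its child statements.

```python
import ast


def size(text):
    """Chunk size as the number of non-whitespace characters (an assumption)."""
    return sum(1 for ch in text if not ch.isspace())


def stmt_children(node):
    """Direct child statements (and except handlers) of a node, in source order."""
    kids = [c for c in ast.iter_child_nodes(node)
            if isinstance(c, (ast.stmt, ast.excepthandler))]
    return sorted(kids, key=lambda c: c.lineno)


def chunk_range(nodes, lines, start, end, max_size):
    """Cover lines[start:end] with contiguous (start, end) line ranges.

    Each node's span begins where the previous sibling ended, so comments,
    blank lines, and enclosing headers stay attached to the node after them.
    """
    ranges = []
    window_start = pos = start  # the chunk being built is lines[window_start:pos]
    for i, node in enumerate(nodes):
        # The last sibling absorbs any trailing lines of the parent range.
        node_end = end if i == len(nodes) - 1 else node.end_lineno
        node_end = max(node_end, pos)
        piece = size("".join(lines[pos:node_end]))
        if size("".join(lines[window_start:pos])) + piece <= max_size:
            pos = node_end  # merge this sibling into the current chunk
            continue
        if pos > window_start:
            ranges.append((window_start, pos))  # flush the current chunk
        if piece <= max_size:
            window_start, pos = pos, node_end  # start a new chunk with this node
            continue
        children = stmt_children(node)
        if children:
            # Node too large: split it by recursing into its child statements.
            ranges.extend(chunk_range(children, lines, pos, node_end, max_size))
        else:
            # Nothing to split on (e.g. one huge statement): keep it oversized.
            ranges.append((pos, node_end))
        window_start = pos = node_end
    if pos > window_start:
        ranges.append((window_start, pos))
    return ranges


def ast_chunks(source, max_size=200):
    """Split Python source into chunks whose boundaries follow AST statements."""
    lines = source.splitlines(keepends=True)
    body = ast.parse(source).body
    if not body:
        return [source] if source else []
    ranges = chunk_range(body, lines, 0, len(lines), max_size)
    return ["".join(lines[a:b]) for a, b in ranges]
```

Because the ranges are contiguous, concatenating the returned chunks reproduces the input text, which loosely mirrors the stated goal of preserving structural formatting. A language-agnostic version would need a multi-language parser in place of `ast`, and a different size measure (for example, token counts) could be swapped in for `size`.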
Created on 11 Dec. 2025
