A parallel corpus of Python functions and documentation strings for automated code documentation and code generation

AI-generated keywords: Automated Code Documentation, Automated Code Generation, Parallel Corpus, Neural Machine Translation, Data Augmentation

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors address automated documentation of programming source code and code generation from natural language
  • Introduce a large and diverse parallel corpus of Python functions and docstrings
  • Dataset aims to provide a valuable resource for research in code documentation and code generation
  • Employ neural machine translation techniques for baseline results
  • Experiment with data augmentation techniques to increase the amount of training data
  • Release datasets and processing scripts to encourage further research
  • Offer the corpus as a basis for further work on automated code documentation and code generation

Authors: Antonio Valerio Miceli Barone, Rico Sennrich

5 pages, 1 figure, 3 tables

Abstract: Automated documentation of programming source code and automated code generation from natural language are challenging tasks of both practical and scientific interest. Progress in these areas has been limited by the low availability of parallel corpora of code and natural language descriptions, which tend to be small and constrained to specific domains. In this work we introduce a large and diverse parallel corpus of over a hundred thousand Python functions with their documentation strings ("docstrings") generated by scraping open source repositories on GitHub. We describe baseline results for the code documentation and code generation tasks obtained by neural machine translation. We also experiment with data augmentation techniques to further increase the amount of training data. We release our datasets and processing scripts in order to stimulate research in these areas.

Submitted to arXiv on 07 Jul. 2017


Results of the summarizing process for the arXiv paper: 1707.02275v1

The paper's license does not allow us to build upon its content, so this summary was generated from the paper's metadata rather than the full article.

In their paper "A parallel corpus of Python functions and documentation strings for automated code documentation and code generation," Antonio Valerio Miceli Barone and Rico Sennrich address two challenging tasks: automated documentation of programming source code and automated code generation from natural language. Progress on both has been held back by the scarcity of parallel corpora of code and natural language descriptions, which tend to be small and restricted to specific domains. To address this, the authors introduce a large and diverse parallel corpus of over a hundred thousand Python functions paired with their documentation strings ("docstrings"), scraped from open source repositories on GitHub. They report baseline results for code documentation and code generation obtained with neural machine translation, and they experiment with data augmentation techniques to increase the amount of training data. The datasets and processing scripts are released to stimulate further research in these areas.
Created on 26 Dec. 2023



Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.