Evaluating Large Language Models Trained on Code

AI-generated keywords: Codex GPT-3 GPT-J HumanEval Deployment

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Codex is a GPT language model that has been fine-tuned on publicly available code from GitHub.
  • The authors focus on Codex's capabilities in writing Python code.
  • A distinct production version of Codex powers GitHub Copilot.
  • The authors introduce a new evaluation set called HumanEval to measure functional correctness in synthesizing programs from docstrings.
  • Codex solves 28.8% of the HumanEval problems, compared with 0% for GPT-3 and 11.4% for GPT-J.
  • Repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts: with 100 samples per problem, 70.2% of the problems are solved.
  • However, Codex has limitations, including difficulty with docstrings that describe long chains of operations and with binding operations to variables.
  • The paper discusses potential broader impacts of deploying powerful code generation technologies like Codex, addressing concerns related to safety, security, and economics.
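
To make "functional correctness in synthesizing programs from docstrings" concrete, here is a minimal sketch of how such a check could work. A candidate completion counts as correct only if the assembled function passes unit tests. The problem, the two completions, the tests, and the helper `passes_tests` are all hypothetical illustrations and are not taken from HumanEval or from the paper.

```python
# Hypothetical HumanEval-style problem: a function signature plus a docstring.
PROMPT = '''def running_max(xs):
    """Return a list whose i-th element is the maximum of xs[:i + 1]."""
'''

# Two hypothetical model completions (function bodies); the first is wrong.
COMPLETIONS = [
    "    return sorted(xs)\n",
    "    out, best = [], None\n"
    "    for x in xs:\n"
    "        best = x if best is None or x > best else best\n"
    "        out.append(best)\n"
    "    return out\n",
]

# Hypothetical unit tests that define "functionally correct" for this problem.
TESTS = '''
assert running_max([]) == []
assert running_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
assert running_max([2, 1]) == [2, 2]
'''


def passes_tests(prompt, completion, tests):
    """Return True if prompt + completion runs and satisfies every assertion."""
    namespace = {}
    try:
        # WARNING: exec runs arbitrary code. Real model output would need
        # an isolated sandbox; this sketch only runs the strings above.
        exec(prompt + completion + tests, namespace)
    except Exception:
        return False
    return True


results = [passes_tests(PROMPT, c, TESTS) for c in COMPLETIONS]
print(results)  # [False, True]
```

The point of the design is that correctness is judged by behavior under tests rather than by textual similarity to a reference solution, so a completion that looks plausible but returns the wrong values is rejected.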

Authors: Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harri Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, Will Guss, Alex Nichol, Igor Babuschkin, Suchir Balaji, Shantanu Jain, Andrew Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba

Abstract: We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.
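
The abstract reports one solve rate for a single attempt and another for 100 samples per problem, but the metadata does not say how those rates are computed. One plausible way to quantify the benefit of repeated sampling, sketched here as an assumption, is to draw n samples for a problem, count the c samples that pass the tests, and estimate the probability that at least one of k randomly chosen samples is correct. The function name `pass_at_k` and its parameters are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples is correct), given n samples with c correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains
    # no correct sample; comb returns 0 when k exceeds n - c.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


# Example: 100 samples for one problem, 5 of which pass the tests.
for k in (1, 10, 100):
    print(k, pass_at_k(100, 5, k))
```

Averaging this quantity over all problems in a benchmark gives a single score for each k. With the example above, the estimate rises from about 0.05 at k = 1 to 1.0 at k = 100, which illustrates why drawing many samples can solve far more problems than a single attempt.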

Submitted to arXiv on 07 Jul. 2021


Results of the summarizing process for the arXiv paper: 2107.03374v1


The authors introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its ability to write Python code. A distinct production version of Codex powers GitHub Copilot. To evaluate the model, the authors release HumanEval, a new evaluation set that measures functional correctness in synthesizing programs from docstrings. Codex solves 28.8% of the HumanEval problems, while GPT-3 solves 0% and GPT-J solves 11.4%. The authors also find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts: with 100 samples per problem, 70.2% of the problems are solved.

Careful investigation of Codex also reveals limitations; for instance, it struggles with docstrings that describe long chains of operations and with binding operations to variables. Finally, the paper discusses the potential broader impacts of deploying powerful code generation technologies like Codex, covering safety, security, and economics.
Created on 02 Nov. 2023


Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.