How is ChatGPT's behavior changing over time?

AI-generated keywords: ChatGPT GPT-3.5 GPT-4 language models behavioral shifts

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Study by Lingjiao Chen, Matei Zaharia, and James Zou on evolving behavior of ChatGPT's large language models (LLMs), focusing on GPT-3.5 and GPT-4
  • Lack of transparency in update process highlighted
  • Evaluations conducted across various tasks to assess performance and behavior changes over time
  • Significant variations in capabilities of GPT-3.5 and GPT-4 observed
  • Performance example: GPT-4's accuracy at distinguishing prime from composite numbers fell from 84% (March 2023) to 51% (June 2023), partly explained by a drop in its willingness to follow chain-of-thought prompting
  • GPT-3.5, by contrast, was much better in June than in March on the same prime-number task
  • Behavioral shifts noted, such as GPT-4 becoming less willing to answer sensitive questions but performing better on multi-hop knowledge-intensive questions
  • Both models exhibit more formatting errors in code generation tasks in June compared to March
  • Importance of continuous monitoring emphasized to track changes accurately
  • Necessity for ongoing scrutiny and evaluation of LLMs stressed for reliability and effectiveness across diverse tasks

Authors: Lingjiao Chen, Matei Zaharia, James Zou

add more evaluations on instruction following

Abstract: GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on several diverse tasks: 1) math problems, 2) sensitive/dangerous questions, 3) opinion surveys, 4) multi-hop knowledge-intensive questions, 5) generating code, 6) US Medical License tests, and 7) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was reasonable at identifying prime vs. composite numbers (84% accuracy) but GPT-4 (June 2023) was poor on these same questions (51% accuracy). This is partly explained by a drop in GPT-4's amenity to follow chain-of-thought prompting. Interestingly, GPT-3.5 was much better in June than in March in this task. GPT-4 became less willing to answer sensitive questions and opinion survey questions in June than in March. GPT-4 performed better at multi-hop questions in June than in March, while GPT-3.5's performance dropped on this task. Both GPT-4 and GPT-3.5 had more formatting mistakes in code generation in June than in March. We provide evidence that GPT-4's ability to follow user instructions has decreased over time, which is one common factor behind the many behavior drifts. Overall, our findings show that the behavior of the "same" LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLMs.
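The prime vs. composite comparison described in the abstract can be illustrated with a small scoring routine. This is only a sketch under stated assumptions, not the authors' evaluation code: the function names (`is_prime`, `parse_yes_no`, `accuracy`), the rule of taking the last "yes" or "no" in a response as the model's answer, and the toy responses are all hypothetical.

```python
import re


def is_prime(n: int) -> bool:
    """Ground-truth primality by trial division (adequate for small n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def parse_yes_no(response: str):
    """Assumed parsing rule: the last standalone 'yes'/'no' is the answer.

    Returns True for yes, False for no, and None if neither word appears.
    """
    matches = re.findall(r"\b(yes|no)\b", response.lower())
    if not matches:
        return None
    return matches[-1] == "yes"


def accuracy(numbers, responses) -> float:
    """Fraction of responses whose parsed answer matches the ground truth.

    Unparseable responses (None) never equal True or False, so they count
    as wrong.
    """
    correct = sum(parse_yes_no(r) == is_prime(n) for n, r in zip(numbers, responses))
    return correct / len(numbers)


# Toy, made-up responses from two hypothetical snapshots of one service.
numbers = [7, 9, 11, 15]
snapshot_a = ["Step 1: check divisors... [Yes]", "[No]", "Yes.", "[No]"]
snapshot_b = ["[No]", "[No]", "[No]", "I cannot tell"]

print(accuracy(numbers, snapshot_a), accuracy(numbers, snapshot_b))
```

Scoring both snapshots against the same fixed question set is what makes the two numbers comparable; a real evaluation would also need a far larger sample and a parsing rule matched to the actual prompt format.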

Submitted to arXiv on 18 Jul. 2023

Results of the summarizing process for the arXiv paper: 2307.09009v3

In their study, Lingjiao Chen, Matei Zaharia, and James Zou investigate how the behavior of the large language models (LLMs) behind ChatGPT, specifically GPT-3.5 and GPT-4, changes over time. They compare the March 2023 and June 2023 versions of both models and highlight the lack of transparency in how the models are updated. The evaluations span tasks including math problems, sensitive questions, opinion surveys, multi-hop knowledge-intensive questions, code generation, and more.

The analysis reveals large variations in the capabilities of GPT-3.5 and GPT-4 over time. For example, GPT-4 (March 2023) is reasonably accurate at distinguishing prime numbers from composite numbers (84%), but its accuracy drops to 51% in June 2023. This decline is partly explained by a drop in GPT-4's willingness to follow chain-of-thought prompting. GPT-3.5, by contrast, is much better on this task in June than in March. GPT-4 also becomes less willing to answer sensitive questions and opinion survey questions in June than in March, yet it performs better on multi-hop knowledge-intensive questions, while GPT-3.5's performance on multi-hop questions declines. Both models make more formatting errors in code generation in June than in March.

One notable finding is that GPT-4's ability to follow user instructions decreased over time, which the authors identify as a common factor behind many of the observed behavior drifts. Overall, Chen et al.'s research shows that the behavior of the "same" LLM service can change substantially in a relatively short time, and it underscores the need for continuous monitoring of LLMs.
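The continuous monitoring that the summary calls for could, in its simplest form, compare per-task metrics between two snapshots and flag large moves. The sketch below is an illustration only: the `Drift` class, the `find_drifts` function, the 0.05 threshold, and the `toy_task` entry are hypothetical, and only the 84% and 51% figures come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drift:
    """A metric change for one task between two snapshots."""
    task: str
    before: float
    after: float

    @property
    def delta(self) -> float:
        return self.after - self.before


def find_drifts(before: dict, after: dict, threshold: float = 0.05) -> list:
    """Return tasks whose metric moved by more than `threshold`.

    Only tasks present in both snapshots are compared; the threshold is an
    arbitrary illustrative choice, not a value from the paper.
    """
    drifts = []
    for task in sorted(before.keys() & after.keys()):
        d = Drift(task, before[task], after[task])
        if abs(d.delta) > threshold:
            drifts.append(d)
    return drifts


# GPT-4 prime vs. composite accuracy from the abstract, plus one made-up task.
march = {"prime_vs_composite": 0.84, "toy_task": 0.90}
june = {"prime_vs_composite": 0.51, "toy_task": 0.91}

for d in find_drifts(march, june):
    print(f"{d.task}: {d.before:.2f} -> {d.after:.2f} ({d.delta:+.2f})")
```

A fixed threshold is the crudest possible alert rule; with small test sets, a confidence interval on each accuracy would be a safer basis for declaring a drift.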
Created on 25 Jun. 2024
