GPTEval: NLG Evaluation using GPT-4 with Better Human Alignment

AI-generated keywords: GPTEval, NLG Evaluation, GPT-4, Human Alignment, LLM-based Evaluators

AI-generated Key Points

The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.

  • The paper addresses the challenge of automatically measuring the quality of texts generated by NLG systems.
  • Traditional reference-based metrics like BLEU and ROUGE have limited correlation with human judgments, especially for creative and diverse tasks.
  • Recent LLM-based, reference-free evaluators apply to new tasks that lack human references, but they still correspond less well with human judgments than medium-size neural evaluators.
  • GPTEval is introduced as a framework that utilizes large language models with chain-of-thoughts (CoT) and a form-filling paradigm for NLG evaluation.
  • GPTEval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the text summarization task, outperforming all previous methods by a large margin.
  • LLM-based evaluators may exhibit bias towards LLM-generated texts, highlighting the need for further investigation into mitigating biases in evaluation methods.
  • GPTEval contributes to advancing NLG evaluation by leveraging large language models to improve alignment with human judgments.
  • The findings have implications for enhancing the assessment of NLG system outputs in domains requiring creativity and diversity.
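The Spearman correlation cited in the key points measures how well an evaluator's ranking of outputs agrees with the ranking given by human raters. The following is a minimal sketch of that computation in plain Python, not the authors' evaluation code. The function names and the sample scores are hypothetical and chosen only for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)


def spearman(metric_scores, human_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))


if __name__ == "__main__":
    # Hypothetical scores for five summaries (illustration only).
    evaluator = [4.2, 3.1, 4.8, 2.0, 3.9]
    human = [4, 3, 5, 2, 3]
    print(f"Spearman correlation: {spearman(evaluator, human):.3f}")
```

Because the statistic depends only on ranks, an evaluator can use a different scale from the human raters and still score well, as long as it orders the outputs the same way.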

Authors: Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu

License: CC BY-NC-ND 4.0

Abstract: The quality of texts generated by natural language generation (NLG) systems is hard to measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity. Recent studies suggest using large language models (LLMs) as reference-free metrics for NLG evaluation, which have the benefit of being applicable to new tasks that lack human references. However, these LLM-based evaluators still have lower human correspondence than medium-size neural evaluators. In this work, we present GPTEval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm, to assess the quality of NLG outputs. We experiment with two generation tasks, text summarization and dialogue generation. We show that GPTEval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human on summarization task, outperforming all previous methods by a large margin. We also propose preliminary analysis on the behavior of LLM-based evaluators, and highlight the potential issue of LLM-based evaluators having a bias towards the LLM-generated texts.

Submitted to arXiv on 29 Mar. 2023

Results of the summarizing process for the arXiv paper: 2303.16634v1

This paper's license does not allow us to build upon its content, so the summary below was generated from the paper's metadata rather than the full article.

The paper titled "GPTEval: NLG Evaluation using GPT-4 with Better Human Alignment" addresses the challenge of automatically measuring the quality of texts generated by natural language generation (NLG) systems. Traditional reference-based metrics like BLEU and ROUGE have shown limited correlation with human judgments, particularly for tasks that require creativity and diversity. Recent studies propose using large language models (LLMs) as reference-free metrics for NLG evaluation, which can be applied to new tasks lacking human references. However, these LLM-based evaluators still exhibit lower human correspondence than medium-size neural evaluators.

To overcome these limitations, the authors introduce GPTEval, a framework that uses large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs. The framework is evaluated on two generation tasks: text summarization and dialogue generation. The results show that GPTEval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin.

Additionally, the paper presents a preliminary analysis of the behavior of LLM-based evaluators and highlights a potential bias towards LLM-generated texts. This finding points to the need for further investigation into mitigating biases in LLM-based evaluation methods. Overall, the research contributes to NLG evaluation by introducing GPTEval as a framework that leverages large language models to improve alignment with human judgments, with implications for assessing NLG system outputs in domains that require creativity and diversity.
Created on 05 Jul. 2023
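The summary describes a form-filling paradigm combined with chain-of-thought steps but, being based on metadata only, gives no prompt details. The sketch below shows one way such a prompt could be assembled and its reply parsed. It is an assumption-laden illustration, not the prompt used in the paper: the section labels, the 1 to 5 scale, the function names, and the sample inputs are all hypothetical, and no LLM is called.

```python
import re


def build_eval_prompt(task_intro, criterion, definition, steps, source, output):
    """Assemble a form-filling evaluation prompt (hypothetical layout)."""
    numbered_steps = "\n".join(
        f"{i}. {step}" for i, step in enumerate(steps, start=1)
    )
    return (
        f"{task_intro}\n\n"
        f"Evaluation criterion:\n{criterion} (1-5): {definition}\n\n"
        f"Evaluation steps:\n{numbered_steps}\n\n"
        f"Source text:\n{source}\n\n"
        f"Generated text:\n{output}\n\n"
        # The prompt ends with an open form field for the model to fill in.
        f"Evaluation form (scores ONLY):\n- {criterion}:"
    )


def parse_score(reply, low=1, high=5):
    """Extract the first number from the reply; None if absent or out of range."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    score = float(match.group())
    return score if low <= score <= high else None


if __name__ == "__main__":
    prompt = build_eval_prompt(
        task_intro="You will be given one summary written for a news article.",
        criterion="Coherence",
        definition="the collective quality of all sentences in the summary",
        steps=[
            "Read the source text and identify its main points.",
            "Check whether the summary presents them in a clear, logical order.",
            "Assign a score from 1 to 5.",
        ],
        source="(article text goes here)",
        output="(summary text goes here)",
    )
    print(prompt)
    print(parse_score(" 4"))  # a reply such as " 4" parses to 4.0
```

Ending the prompt on an open form field nudges the model to answer with a bare score, which keeps parsing simple; replies that contain no in-range number are returned as `None` so the caller can retry or discard them.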
