A Survey of Evaluation Metrics Used for NLG Systems

AI-generated keywords: NLG, Evaluation Metrics, Automatic Evaluation, Image Captioning, Transformer Models

AI-generated Key Points

⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

Deep Learning has led to increased interest in Natural Language Generation (NLG) tasks, including image captioning.
Automatic evaluation metrics are needed to track advancements in NLG systems.
Early heuristic-based metrics like BLEU and ROUGE are inadequate for capturing nuances in NLG tasks.
Since 2014, there has been a surge in proposed evaluation metrics due to the increasing number of NLG models and limitations of current metrics.
There is a shift from using predetermined heuristic formulas to trained transformer models for evaluation.
This survey highlights challenges in automatically evaluating NLG systems and provides a taxonomy of existing evaluation metrics.
The survey describes various metrics, their contributions, shortcomings, and the methodology used for evaluation.
Suggestions and recommendations are offered for improving automatic evaluation metrics.
The survey aims to help researchers quickly familiarize themselves with developments in NLG evaluation.


Authors: Ananya B. Sai, Akash Kumar Mohankumar, Mitesh M. Khapra

arXiv: 2008.12009v2 (cs.CL)

A condensed version of this paper is submitted to ACM CSUR

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: The success of Deep Learning has created a surge in interest in a wide range of Natural Language Generation (NLG) tasks. Deep Learning has not only pushed the state of the art in several existing NLG tasks but has also enabled researchers to explore various newer NLG tasks such as image captioning. Such rapid progress in NLG has necessitated the development of accurate automatic evaluation metrics that would allow us to track the progress in the field of NLG. However, unlike classification tasks, automatically evaluating NLG systems in itself is a huge challenge. Several works have shown that early heuristic-based metrics such as BLEU and ROUGE are inadequate for capturing the nuances in the different NLG tasks. The expanding number of NLG models and the shortcomings of the current metrics have led to a rapid surge in the number of evaluation metrics proposed since 2014. Moreover, various evaluation metrics have shifted from using pre-determined heuristic-based formulae to trained transformer models. This rapid change in a relatively short time has led to the need for a survey of the existing NLG metrics to help existing and new researchers to quickly come up to speed with the developments that have happened in NLG evaluation in the last few years. Through this survey, we first wish to highlight the challenges and difficulties in automatically evaluating NLG systems. Then, we provide a coherent taxonomy of the evaluation metrics to organize the existing metrics and to better understand the developments in the field. We also describe the different metrics in detail and highlight their key contributions. Later, we discuss the main shortcomings identified in the existing metrics and describe the methodology used to evaluate evaluation metrics. Finally, we discuss our suggestions and recommendations on the next steps forward to improve the automatic evaluation metrics.

Submitted to arXiv on 27 Aug. 2020


Results of the summarizing process for the arXiv paper: 2008.12009v2


Comprehensive Summary

The success of Deep Learning has sparked a growing interest in various Natural Language Generation (NLG) tasks, including image captioning. As NLG progresses rapidly, there is a need for accurate automatic evaluation metrics to track advancements in the field. However, evaluating NLG systems automatically presents significant challenges. Early heuristic-based metrics like BLEU and ROUGE have proven inadequate for capturing the nuances of different NLG tasks. The increasing number of NLG models and the limitations of current metrics have led to a surge in proposed evaluation metrics since 2014. Additionally, there has been a shift from using predetermined heuristic formulas to trained transformer models for evaluation.

To address these developments and assist researchers, this survey highlights the challenges involved in automatically evaluating NLG systems and provides a coherent taxonomy of existing evaluation metrics to better understand the field's progress. It describes various metrics in detail, emphasizing their key contributions, then discusses the main shortcomings identified in current metrics and outlines the methodology used to evaluate the evaluation metrics themselves. Finally, the survey offers suggestions and recommendations for improving automatic evaluation metrics going forward. By providing an overview of existing NLG metrics and their advancements over recent years, this survey aims to help both new and established researchers quickly familiarize themselves with developments in NLG evaluation.


Layman's Summary

Deep Learning is a type of technology that has made people more interested in tasks like describing pictures with words. Automatic evaluation metrics are tools that help us measure how well these tasks are being done. Some older tools for measuring these tasks, like BLEU and ROUGE, aren't good enough because they can't capture all the details. Since 2014, many new tools have been created to measure these tasks because there are more models and the old tools have problems. Now, instead of using set formulas, some of the newer tools use specially trained models to measure how well these tasks are done. This survey talks about the challenges of measuring these tasks and organizes the different ways of doing it. It also gives suggestions for making the measurements better. The survey is meant to help researchers learn about the latest developments in measuring these tasks.

Exploring the Challenges of Automatically Evaluating Natural Language Generation Systems

In recent years, the success of deep learning has sparked a growing interest in various natural language generation (NLG) tasks such as image captioning. As NLG progresses rapidly, there is an increasing need for accurate automatic evaluation metrics to track advancements in the field. However, evaluating NLG systems automatically presents significant challenges due to the complexity and nuances of different NLG tasks.

The Limitations of Heuristic-Based Metrics

Early heuristic-based metrics such as BLEU and ROUGE have proven inadequate for capturing these complexities and nuances. The limitations of current metrics combined with the increasing number of NLG models have led to a surge in proposed evaluation metrics since 2014. Additionally, there has been a shift from using predetermined heuristic formulas to trained transformer models for evaluation.
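To make the limitation concrete, here is a minimal sketch of the clipped n-gram precision that sits at the core of BLEU-style metrics. It is not full BLEU (there is no brevity penalty and no averaging across n-gram orders), and the function names and example sentences are hypothetical, chosen only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the
    reference (the "clipping" used by BLEU-style metrics).
    Tokenization is a plain lowercase whitespace split, an assumption
    made to keep the sketch short.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total


# Hypothetical caption data for illustration only.
reference = "a man is riding a horse on the beach"
paraphrase = "someone rides a pony along the shore"   # similar meaning, different words
scrambled = "beach the on horse a riding is man a"    # same words, no sensible order

for name, text in [("paraphrase", paraphrase), ("scrambled", scrambled)]:
    uni = clipped_precision(text, reference, n=1)
    bi = clipped_precision(text, reference, n=2)
    print(f"{name}: unigram={uni:.2f} bigram={bi:.2f}")
```

In this toy case the reasonable paraphrase matches only two of its seven words, while the scrambled sentence gets a perfect unigram score; only the bigram score penalizes it. Surface overlap rewards shared words rather than shared meaning, which is the kind of nuance the survey says these metrics miss.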

A Coherent Taxonomy for Existing Evaluation Metrics

To address these developments and assist researchers, this survey highlights the challenges involved in automatically evaluating NLG systems and provides a coherent taxonomy of existing evaluation metrics. It describes various metrics in detail, emphasizing their key contributions, then discusses the main shortcomings identified in current metrics and outlines the methodology used to evaluate the evaluation metrics themselves.
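The summary does not spell out how an evaluation metric is itself evaluated. One common approach in the field is to check how well a metric's scores correlate with human ratings of the same outputs. The sketch below assumes that approach; the scores and ratings are made-up numbers, and the helper names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up data: one metric score and one human rating per system output.
metric_scores = [0.10, 0.40, 0.35, 0.80, 0.60]
human_ratings = [1, 3, 2, 5, 4]

print(f"Pearson:  {pearson(metric_scores, human_ratings):.3f}")
print(f"Spearman: {spearman(metric_scores, human_ratings):.3f}")
```

Rank correlation only asks whether the metric orders outputs the way humans do, while Pearson also rewards a linear relationship between the raw values, so reporting both is a reasonable default.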

Suggestions & Recommendations Going Forward

Finally, this survey offers suggestions and recommendations for improving automatic evaluation metrics going forward. By providing an overview of existing NLG metrics and their advancements over recent years, this survey aims to help both new and established researchers quickly familiarize themselves with developments in NLG evaluation.

Created on 05 Sep. 2023


