What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation

AI-generated keywords: Story Evaluation, Artificial Intelligence, Large Language Models, Narrative Content, Human-AI Collaboration

AI-generated Key Points

The paper's license does not allow us to build upon its content, so the key points are generated from the paper's metadata rather than the full article.

  • Authors Dingyi Yang and Qin Jin explore story evaluation in the age of artificial intelligence
  • Advancements in Large Language Models (LLMs) have led to an increase in both quantity and quality of automatically generated stories
  • Automatic story evaluation methods are needed to assess the generative capabilities of computing systems and to analyze the quality of both machine-generated and human-written stories
  • Evaluating a story presents unique challenges compared to other generation evaluation tasks like machine translation, requiring a nuanced approach focusing on coherence, character development, and interestingness
  • The authors review existing research on storytelling tasks, including text-to-text, visual-to-text, and text-to-visual scenarios
  • Various human criteria are identified for measuring stories along with benchmark datasets for evaluation purposes
  • The authors propose a taxonomy to organize evaluation metrics that have been developed for story evaluation or can be adopted for it, and discuss the merits and limitations of these metrics
  • The survey discusses the potential for human-AI collaboration in story evaluation and generation processes
  • Suggested future research directions extend from story evaluation to general evaluation

Authors: Dingyi Yang, Qin Jin

Abstract: With the development of artificial intelligence, particularly the success of Large Language Models (LLMs), the quantity and quality of automatically generated stories have significantly increased. This has led to the need for automatic story evaluation to assess the generative capabilities of computing systems and analyze the quality of both automatic-generated and human-written stories. Evaluating a story can be more challenging than other generation evaluation tasks. While tasks like machine translation primarily focus on assessing the aspects of fluency and accuracy, story evaluation demands complex additional measures such as overall coherence, character development, interestingness, etc. This requires a thorough review of relevant research. In this survey, we first summarize existing storytelling tasks, including text-to-text, visual-to-text, and text-to-visual. We highlight their evaluation challenges, identify various human criteria to measure stories, and present existing benchmark datasets. Then, we propose a taxonomy to organize evaluation metrics that have been developed or can be adopted for story evaluation. We also provide descriptions of these metrics, along with the discussion of their merits and limitations. Later, we discuss the human-AI collaboration for story evaluation and generation. Finally, we suggest potential future research directions, extending from story evaluation to general evaluations.

Submitted to arXiv on 26 Aug. 2024

Results of the summarizing process for the arXiv paper: 2408.14622v1

In their survey "What Makes a Good Story and How Can We Measure It?", Dingyi Yang and Qin Jin examine story evaluation in the age of artificial intelligence. They note that advances in Large Language Models (LLMs) have significantly increased both the quantity and the quality of automatically generated stories. This growth creates a need for automatic story evaluation, both to assess the generative capabilities of computing systems and to analyze the quality of machine-generated and human-written stories.

The authors point out that evaluating a story can be more challenging than other generation evaluation tasks. Tasks such as machine translation primarily assess fluency and accuracy, whereas story evaluation demands additional measures such as overall coherence, character development, and interestingness. To address this, Yang and Jin review existing storytelling tasks, including text-to-text, visual-to-text, and text-to-visual. They highlight the evaluation challenges of these tasks, identify the human criteria used to measure stories, and present existing benchmark datasets.

The authors then propose a taxonomy to organize evaluation metrics that have been developed for story evaluation or can be adopted for it, describing each metric along with its merits and limitations. They also discuss human-AI collaboration in story evaluation and generation. Finally, they suggest future research directions that extend from story evaluation to general evaluation.
Created on 06 Sep. 2025