What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation

AI-generated keywords: Story Evaluation, Artificial Intelligence, Large Language Models, Narrative Content, Human-AI Collaboration

AI-generated Key Points

⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

Authors Dingyi Yang and Qin Jin explore story evaluation in the age of artificial intelligence
Advancements in Large Language Models (LLMs) have led to an increase in both quantity and quality of automatically generated stories
Automatic story evaluation methods are necessary to assess generative capabilities of computing systems and compare machine-generated narratives with human-crafted ones
Evaluating a story presents unique challenges compared to other generation evaluation tasks like machine translation, requiring a nuanced approach focusing on coherence, character development, and interestingness
The authors review existing research on storytelling tasks, including text-to-text, visual-to-text, and text-to-visual scenarios
Various human criteria are identified for measuring stories along with benchmark datasets for evaluation purposes
The authors propose a taxonomy that organizes evaluation metrics that have been developed for, or can be adopted for, story evaluation
The survey discusses the potential for human-AI collaboration in story evaluation and generation processes
Future research directions include expanding evaluations across different domains to enhance computational storytelling capabilities


Authors: Dingyi Yang, Qin Jin

arXiv: 2408.14622v1 (cs.CL)

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: With the development of artificial intelligence, particularly the success of Large Language Models (LLMs), the quantity and quality of automatically generated stories have significantly increased. This has led to the need for automatic story evaluation to assess the generative capabilities of computing systems and analyze the quality of both automatic-generated and human-written stories. Evaluating a story can be more challenging than other generation evaluation tasks. While tasks like machine translation primarily focus on assessing the aspects of fluency and accuracy, story evaluation demands complex additional measures such as overall coherence, character development, interestingness, etc. This requires a thorough review of relevant research. In this survey, we first summarize existing storytelling tasks, including text-to-text, visual-to-text, and text-to-visual. We highlight their evaluation challenges, identify various human criteria to measure stories, and present existing benchmark datasets. Then, we propose a taxonomy to organize evaluation metrics that have been developed or can be adopted for story evaluation. We also provide descriptions of these metrics, along with the discussion of their merits and limitations. Later, we discuss the human-AI collaboration for story evaluation and generation. Finally, we suggest potential future research directions, extending from story evaluation to general evaluations.

Submitted to arXiv on 26 Aug. 2024


Results of the summarizing process for the arXiv paper: 2408.14622v1

⚠This paper's license doesn't allow us to build upon its content and the summarizing process is here made with the paper's metadata rather than the article.


In their survey "What Makes a Good Story and How Can We Measure It?", authors Dingyi Yang and Qin Jin examine story evaluation in the age of artificial intelligence. They note that advances in Large Language Models (LLMs) have significantly increased both the quantity and the quality of automatically generated stories. This growth creates a need for automatic story evaluation, both to assess the generative capabilities of computing systems and to analyze the quality of machine-generated and human-written stories.

The authors point out that evaluating a story can be harder than other generation evaluation tasks. Machine translation evaluation focuses primarily on fluency and accuracy, whereas story evaluation also has to account for overall coherence, character development, and interestingness. To address this, Yang and Jin summarize existing storytelling tasks, including text-to-text, visual-to-text, and text-to-visual, highlight their evaluation challenges, identify human criteria used to measure stories, and present existing benchmark datasets.

They then propose a taxonomy that organizes evaluation metrics that have been developed for, or can be adopted for, story evaluation, and describe these metrics along with their merits and limitations. They also discuss human-AI collaboration for story evaluation and generation. Finally, they suggest future research directions that extend from story evaluation to general evaluation. The survey is intended as a resource for researchers working on computational storytelling.


Summary
- Authors Dingyi Yang and Qin Jin study how computers can evaluate stories.
- Computers have gotten better at making stories using Large Language Models (LLMs).
- We need ways to check if computer-made stories are good compared to human-made ones.
- Checking a story is tricky and needs a special approach focusing on coherence, character development, and interest.
- The authors look at different types of research on storytelling tasks.

Definitions
- Authors: People who write books or articles.
- Artificial Intelligence: Computer systems that can perform tasks that normally require human intelligence.
- Large Language Models (LLMs): Advanced computer programs that help generate text.
- Evaluate: To examine or judge something carefully.
- Coherence: Making sure things in a story make sense together.
- Character Development: How well the characters in a story change and grow.
- Interest/Interestingness: Keeping readers engaged and curious about the story.

Introduction

In recent years, there has been a significant increase in the use of artificial intelligence (AI) for generating stories. This surge is due to advancements in Large Language Models (LLMs), which have greatly improved the quality and quantity of automatically generated narratives. However, with this rise in machine-generated storytelling comes the need for effective methods to evaluate and compare these stories with those crafted by humans. In their comprehensive survey titled "What Makes a Good Story and How Can We Measure It?", authors Dingyi Yang and Qin Jin delve into the realm of story evaluation in the age of AI.

The Need for Automatic Story Evaluation

The authors highlight that evaluating a story poses unique challenges compared to other generation evaluation tasks like machine translation. While fluency and accuracy are key metrics in translation tasks, assessing a story's overall coherence, character development, and interestingness requires a more nuanced approach. Additionally, as LLMs continue to improve, it becomes increasingly difficult for human evaluators to keep up with the volume of machine-generated stories. To address these complexities, Yang and Jin conduct a thorough review of existing research on storytelling tasks, including text-to-text, visual-to-text, and text-to-visual scenarios. They also identify various human criteria used to measure stories and present benchmark datasets for evaluation purposes.

Evaluation Metrics

To organize the many evaluation metrics available for story assessment, Yang and Jin propose a taxonomy that categorizes existing metrics into three main categories: surface-level metrics (e.g., word count), content-level metrics (e.g., plot structure), and reader-based metrics (e.g., perceived enjoyment). The taxonomy also covers metrics that were developed for other tasks but can be adopted for story evaluation.

Surface-Level Metrics

Surface-level metrics focus on basic characteristics such as word count or sentence length. These can provide insights into the overall structure and complexity of a story. However, they do not capture more nuanced aspects such as plot development or character emotions.
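As a rough illustration of what such surface-level statistics look like in practice, the sketch below computes word count, sentence count, and mean sentence length for a story. It is not taken from the survey: the function name `surface_metrics` and the naive regex-based sentence splitting are assumptions made for this example.

```python
import re


def surface_metrics(story: str) -> dict:
    """Compute simple surface-level statistics for a story (illustrative only)."""
    # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
    # This is a naive split, not a robust sentence tokenizer.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", story.strip()) if s]
    # Assumption: a word is a run of letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", story)
    n_sentences = len(sentences)
    n_words = len(words)
    return {
        "word_count": n_words,
        "sentence_count": n_sentences,
        # Guard against division by zero for empty input.
        "mean_sentence_length": n_words / n_sentences if n_sentences else 0.0,
    }


if __name__ == "__main__":
    print(surface_metrics("The fox ran. It was fast! Where did it go?"))
```

Two stories with identical values for all three statistics can still differ completely in plot and character, which is why statistics of this kind say little about narrative quality on their own.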

Content-Level Metrics

Content-level metrics assess the quality of a story's content, including elements such as plot structure, character development, and dialogue. These metrics require a deeper understanding of storytelling techniques and can provide valuable insights into the overall coherence and effectiveness of a narrative.

Reader-Based Metrics

Reader-based metrics focus on how readers perceive and engage with a story. These include measures like perceived enjoyment, emotional response, and engagement level. While these metrics may be subjective to individual readers, they provide important insights into the impact of a story on its audience.

Human-AI Collaboration in Storytelling

Yang and Jin also explore the potential for human-AI collaboration in both story evaluation and story generation. They suggest that combining the expertise of human evaluators with the computational power of AI systems can lead to more accurate evaluations and to better machine-generated stories. Furthermore, the authors propose incorporating human feedback into AI systems during the generation process so that their storytelling capabilities continue to improve. This approach could result in more engaging narratives that appeal to human audiences while still leveraging AI's efficiency.

Future Research Directions

For future research, Yang and Jin suggest extending from story evaluation to general evaluation across different domains. This could involve developing evaluation methods that consider multiple factors simultaneously, or exploring ways to account for cultural differences when evaluating stories. The authors also highlight the need for further research on evaluating non-linguistic aspects of storytelling, such as visual elements or audio components. As technology continues to advance, evaluation methods will need to cover all of these aspects of storytelling.

Conclusion

In conclusion, "What Makes a Good Story and How Can We Measure It?" provides a comprehensive overview of the current state of story evaluation in the age of AI. By shedding light on the intricacies of evaluating narrative content in an AI-driven landscape, this survey serves as a valuable resource for researchers seeking to enhance computational storytelling capabilities. The proposed taxonomy and detailed descriptions of various evaluation metrics provide a solid foundation for future research in this field. As AI continues to play an increasingly significant role in generating stories, effective evaluation methods will be crucial in ensuring the quality and impact of these narratives on human audiences.

Created on 06 Sep. 2025

