In their comprehensive survey "What Makes a Good Story and How Can We Measure It?", Dingyi Yang and Qin Jin examine story evaluation in the age of artificial intelligence. They note that advances in Large Language Models (LLMs) have significantly increased both the quantity and the quality of automatically generated stories. This surge calls for automatic story evaluation methods that can assess the generative capabilities of computing systems and compare machine-generated narratives with those written by humans.
The authors point out that evaluating a story poses challenges that other generation evaluation tasks, such as machine translation, do not. While fluency and accuracy are key criteria in translation, assessing a story's overall coherence, character development, and interestingness requires a more nuanced approach. To address these complexities, Yang and Jin review existing research on storytelling tasks, including text-to-text, visual-to-text, and text-to-visual scenarios. They also identify the human criteria used to measure stories and present benchmark datasets for evaluation.
To organize the many evaluation metrics available for story assessment, the authors propose a taxonomy that categorizes existing metrics and suggests new ones that can be adopted. They describe these metrics in detail and discuss their strengths and limitations. They also explore the potential for human-AI collaboration in both story evaluation and story generation. Looking ahead, Yang and Jin suggest expanding from story evaluation to general evaluation across different domains.
By shedding light on the intricacies of evaluating narrative content in an AI-driven landscape, this survey serves as a valuable resource for researchers seeking to enhance computational storytelling capabilities.
- Authors Dingyi Yang and Qin Jin explore story evaluation in the age of artificial intelligence
- Advancements in Large Language Models (LLMs) have led to an increase in both quantity and quality of automatically generated stories
- Automatic story evaluation methods are necessary to assess generative capabilities of computing systems and compare machine-generated narratives with human-crafted ones
- Evaluating a story presents unique challenges compared to other generation evaluation tasks like machine translation, requiring a nuanced approach focusing on coherence, character development, and interestingness
- The authors review existing research on storytelling tasks, including text-to-text, visual-to-text, and text-to-visual scenarios
- Various human criteria are identified for measuring stories along with benchmark datasets for evaluation purposes
- A taxonomy is proposed by the authors to categorize existing metrics for story assessment and suggest new ones that can be adopted
- The survey discusses the potential for human-AI collaboration in story evaluation and generation processes
- Future research directions include expanding evaluations across different domains to enhance computational storytelling capabilities
Summary
- Authors Dingyi Yang and Qin Jin study how computers can evaluate stories.
- Computers have gotten better at making stories using Large Language Models (LLMs).
- We need ways to check if computer-made stories are good compared to human-made ones.
- Checking a story is tricky and needs a special approach focusing on coherence, character development, and interest.
- The authors look at different types of research on storytelling tasks.
Definitions
- Authors: People who write books or articles.
- Artificial Intelligence: Computer systems that can perform tasks that normally require human intelligence.
- Large Language Models (LLMs): Advanced computer programs that help generate text.
- Evaluate: To examine or judge something carefully.
- Coherence: Making sure things in a story make sense together.
- Character Development: How well the characters in a story change and grow.
- Interest/Interestingness: Keeping readers engaged and curious about the story.
Introduction
In recent years, the use of artificial intelligence (AI) for generating stories has increased significantly. This surge is due to advancements in Large Language Models (LLMs), which have greatly improved the quality and quantity of automatically generated narratives. With this rise in machine-generated storytelling comes the need for effective methods to evaluate these stories and compare them with those crafted by humans. In their comprehensive survey "What Makes a Good Story and How Can We Measure It?", Dingyi Yang and Qin Jin examine story evaluation in the age of AI.
The Need for Automatic Story Evaluation
The authors highlight that evaluating a story poses unique challenges compared to other generation evaluation tasks like machine translation. While fluency and accuracy are key metrics in translation tasks, assessing a story's overall coherence, character development, and interestingness requires a more nuanced approach. Additionally, as LLMs continue to improve, it becomes increasingly difficult for human evaluators to keep up with the volume of machine-generated stories.
To address these complexities, Yang and Jin conduct a thorough review of existing research on storytelling tasks, including text-to-text, visual-to-text, and text-to-visual scenarios. They also identify various human criteria used to measure stories and present benchmark datasets for evaluation purposes.
Evaluation Metrics
To organize the many evaluation metrics available for story assessment, Yang and Jin propose a taxonomy that groups existing metrics into three main categories: surface-level metrics (e.g., word count), content-level metrics (e.g., plot structure), and reader-based metrics (e.g., perceived enjoyment). The authors also suggest new metrics that can be adopted based on their analysis of current research.
Surface-Level Metrics
Surface-level metrics focus on basic characteristics such as word count or sentence length. These can provide insights into the overall structure and complexity of a story. However, they do not capture more nuanced aspects such as plot development or character emotions.
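As a rough illustration of what such surface-level statistics look like in practice, the following Python sketch computes word count, sentence count, and average sentence length for a story. It is not taken from the survey: the function name `surface_metrics` and the regular-expression heuristics for splitting words and sentences are assumptions made for this example, and a real evaluation pipeline would likely use a proper tokenizer.

```python
import re


def surface_metrics(story: str) -> dict:
    """Compute simple surface-level statistics for a story (illustrative only)."""
    # Heuristic (assumption): a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", story.strip()) if s]
    # Heuristic (assumption): a word is a run of letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", story)

    word_count = len(words)
    sentence_count = len(sentences)
    # Guard against empty input to avoid dividing by zero.
    avg_sentence_length = word_count / sentence_count if sentence_count else 0.0

    return {
        "word_count": word_count,
        "sentence_count": sentence_count,
        "avg_sentence_length": avg_sentence_length,
    }


if __name__ == "__main__":
    example = "The fox ran. The dog slept! Did the cat watch?"
    print(surface_metrics(example))
```

Numbers like these are cheap to compute, which is their main appeal, but as noted above they say nothing about whether the plot or the characters work.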
Content-Level Metrics
Content-level metrics assess the quality of a story's content, including elements such as plot structure, character development, and dialogue. These metrics require a deeper understanding of storytelling techniques and can provide valuable insights into the overall coherence and effectiveness of a narrative.
Reader-Based Metrics
Reader-based metrics focus on how readers perceive and engage with a story. These include measures like perceived enjoyment, emotional response, and engagement level. While these metrics may be subjective to individual readers, they provide important insights into the impact of a story on its audience.
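Because these measures vary from reader to reader, one common way to work with them is to collect ratings from several readers and summarize each criterion by its average and its spread. The sketch below is a minimal illustration under that assumption; the function name `aggregate_ratings`, the criterion names, and the 1-to-5 rating scale are hypothetical and are not prescribed by the survey.

```python
from statistics import mean, stdev


def aggregate_ratings(ratings: dict[str, list[int]]) -> dict[str, dict[str, float]]:
    """Summarize per-criterion reader ratings by mean and sample standard deviation."""
    summary = {}
    for criterion, scores in ratings.items():
        # stdev needs at least two scores; report 0.0 for a single rating.
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary[criterion] = {"mean": mean(scores), "stdev": spread}
    return summary


if __name__ == "__main__":
    # Hypothetical ratings from three readers on a 1-5 scale.
    reader_ratings = {
        "enjoyment": [4, 5, 3],
        "engagement": [2, 2, 2],
    }
    for criterion, stats in aggregate_ratings(reader_ratings).items():
        print(criterion, stats)
```

A large standard deviation on a criterion signals that readers disagree, which is a useful reminder that a single averaged score can hide the subjectivity described above.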
Human-AI Collaboration in Storytelling
Yang and Jin also explore the potential for human-AI collaboration in both story evaluation and generation processes. They suggest that combining human evaluators' expertise with AI systems' computational power can lead to more accurate evaluations while also improving machine-generated stories' quality.
Furthermore, the authors propose incorporating human feedback into AI systems during the generation process to continually improve their storytelling capabilities. This approach could result in more engaging narratives that appeal to human audiences while still leveraging AI's efficiency.
Future Research Directions
Looking towards future research directions, Yang and Jin suggest expanding from story evaluation to encompass general evaluations across different domains. This expansion could involve developing new evaluation methods that consider multiple factors simultaneously or exploring ways to incorporate cultural differences in evaluating stories.
The authors also highlight the need for further research on evaluating non-linguistic aspects of storytelling, such as visual elements or audio components. As technology continues to advance, it is essential to develop comprehensive evaluation methods that account for all aspects of storytelling.
Conclusion
In conclusion, "What Makes a Good Story and How Can We Measure It?" provides a comprehensive overview of the current state of story evaluation in the age of AI. By shedding light on the intricacies of evaluating narrative content in an AI-driven landscape, this survey serves as a valuable resource for researchers seeking to enhance computational storytelling capabilities. The proposed taxonomy and detailed descriptions of various evaluation metrics provide a solid foundation for future research in this field. As AI continues to play an increasingly significant role in generating stories, effective evaluation methods will be crucial in ensuring the quality and impact of these narratives on human audiences.