Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator

AI-generated keywords: Meeting summaries

AI-generated Key Points

  • Meeting summaries are crucial in professional environments for providing references, updates for absentees, and reinforcing key topics discussed during meetings.
  • The integration of summarization services into popular digital meeting platforms has increased the importance of generating high-quality meeting summaries.
  • Traditional evaluation of meeting summaries is done through costly and time-consuming human assessment.
  • Existing automatic metrics like ROUGE, BERTScore, and BARTScore show low correlation with human judgment in evaluating meeting summaries.
  • MESA (Meeting Summary Assessor) is a multi-stage LLM-based framework designed to mimic human evaluation approaches for meeting summary assessment.

Authors: Frederic Kirstein, Terry Ruas, Bela Gipp

COLING 2025 Industry Track
License: CC BY-SA 4.0

Abstract: The quality of meeting summaries generated by natural language generation (NLG) systems is hard to measure automatically. Established metrics such as ROUGE and BERTScore have a relatively low correlation with human judgments and fail to capture nuanced errors. Recent studies suggest using large language models (LLMs), which have the benefit of better context understanding and adaption of error definitions without training on a large number of human preference judgments. However, current LLM-based evaluators risk masking errors and can only serve as a weak proxy, leaving human evaluation the gold standard despite being costly and hard to compare across studies. In this work, we present MESA, an LLM-based framework employing a three-step assessment of individual error types, multi-agent discussion for decision refinement, and feedback-based self-training to refine error definition understanding and alignment with human judgment. We show that MESA's components enable thorough error detection, consistent rating, and adaptability to custom error guidelines. Using GPT-4o as its backbone, MESA achieves mid to high Point-Biserial correlation with human judgment in error detection and mid Spearman and Kendall correlation in reflecting error impact on summary quality, on average 0.25 higher than previous methods. The framework's flexibility in adapting to custom error guidelines makes it suitable for various tasks with limited human-labeled data.

Submitted to arXiv on 27 Nov. 2024


Results of the summarizing process for the arXiv paper: 2411.18444v1

In this study, we address the challenge of automatically evaluating meeting summaries generated by natural language generation (NLG) systems. Meeting summaries play a crucial role in professional environments by providing references, updates for absentees, and reinforcing key topics discussed during meetings. With the integration of summarization services into popular digital meeting platforms like Zoom, Microsoft Teams, and Google Meet, the importance of generating high-quality meeting summaries has increased significantly.
Generated meeting summaries are traditionally evaluated through costly and time-consuming human assessment. To overcome this limitation, an automatic evaluator is needed that provides insights along with scores, enabling techniques such as feedback-based summary refinement and reinforcement learning from AI feedback. However, established automatic metrics like ROUGE, BERTScore, and BARTScore show a relatively low correlation with human judgment and often fail to detect errors accurately or to reflect the impact of errors on summary quality.
Large language models (LLMs) have emerged as potential evaluators for text summarization tasks by assigning Likert scores based on predefined guidelines. However, existing LLM-based approaches are limited in the meeting summarization setting because they overlook typical errors such as structural presentation issues and coreference problems. Moreover, subjective annotation guidelines may be interpreted inconsistently by LLMs, resulting in unreliable evaluations.
To address these challenges, we introduce MESA (Meeting Summary Assessor), a multi-stage LLM-based framework designed to mimic human evaluation approaches. MESA operates on three levels: error detection at the individual error type level, multi-agent discussion for decision refinement, and feedback-based self-training to improve error definition understanding and alignment with human judgment.
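The three levels described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable interface, the yes/no reply format, the prompt wording, and every function name are hypothetical, and simple majority voting stands in for whatever discussion protocol MESA actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface: any callable that maps a prompt to a text reply.
LLM = Callable[[str], str]


@dataclass
class Verdict:
    error_type: str
    present: bool
    rationale: str


def says_yes(reply: str) -> bool:
    """Treat a reply starting with 'yes' as a positive detection (assumed format)."""
    return reply.strip().lower().startswith("yes")


def detect(llm: LLM, error_type: str, definition: str,
           transcript: str, summary: str) -> Verdict:
    """Level 1: assess a single error type in isolation."""
    prompt = (
        f"Error type: {error_type}\nDefinition: {definition}\n"
        f"Transcript: {transcript}\nSummary: {summary}\n"
        "Does the summary contain this error? Answer yes or no, then explain."
    )
    reply = llm(prompt)
    return Verdict(error_type, says_yes(reply), reply)


def discuss(agents: List[LLM], verdict: Verdict) -> Verdict:
    """Level 2: several agents review the first verdict.

    Assumption: a strict majority vote replaces a real multi-turn discussion.
    """
    prompt = (
        f"A reviewer judged error '{verdict.error_type}' as "
        f"{'present' if verdict.present else 'absent'}: {verdict.rationale}\n"
        "Do you agree the error is present? Answer yes or no."
    )
    votes = [says_yes(agent(prompt)) for agent in agents]
    agreed = sum(votes) * 2 > len(votes)
    return Verdict(verdict.error_type, agreed, verdict.rationale)


def refine_definition(llm: LLM, definition: str, human_feedback: str) -> str:
    """Level 3: rewrite an error definition using feedback from human labels."""
    return llm(
        f"Current definition: {definition}\nFeedback: {human_feedback}\n"
        "Rewrite the definition so future judgments match the feedback."
    )


def assess(llm: LLM, agents: List[LLM], definitions: Dict[str, str],
           transcript: str, summary: str) -> Dict[str, bool]:
    """Run levels 1 and 2 for every error type and return the final flags."""
    return {
        name: discuss(agents, detect(llm, name, text, transcript, summary)).present
        for name, text in definitions.items()
    }
```

Handling each error type in its own call mirrors the idea of assessing error types individually, so that one error cannot mask another in a single overall score. In practice `llm` and `agents` would wrap calls to a model such as GPT-4o; here they can be any functions, which keeps the sketch testable with stubs.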
Using GPT-4o as its backbone model, MESA achieves mid to high Point-Biserial correlation with human judgment in error detection and mid Spearman and Kendall correlation in reflecting error impact on summary quality, on average 0.25 higher than previous methods. The framework enables thorough error detection and consistent rating, and its adaptability to custom error guidelines makes it suitable for various tasks with limited human-labeled data, reducing reliance on costly human evaluation.
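The abstract reports agreement with human judgment through Point-Biserial, Spearman, and Kendall correlations. The sketch below shows one way such coefficients can be computed in plain Python. The function names are illustrative assumptions, the Kendall variant is the simple tau-a (no tie correction), and nothing here reproduces the paper's evaluation code or data.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def point_biserial(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Point-biserial correlation: Pearson between 0/1 labels and scores."""
    return pearson(labels, scores)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            total += (prod > 0) - (prod < 0)
    return total / (n * (n - 1) / 2)
```

For example, `point_biserial(human_error_flags, evaluator_scores)` would relate binary human error labels to an evaluator's per-summary scores, while `spearman` and `kendall_tau_a` compare two lists of ordinal ratings. The variable names in that call are placeholders, not data from the paper.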
Created on 20 Mar. 2025
