Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator

AI-generated keywords: Meeting summaries

AI-generated Key Points

Meeting summaries are crucial in professional environments for providing references, updates for absentees, and reinforcing key topics discussed during meetings.
The integration of summarization services into popular digital meeting platforms has increased the importance of generating high-quality meeting summaries.
Traditional evaluation of meeting summaries is done through costly and time-consuming human assessment.
Existing automatic metrics like ROUGE, BERTScore, and BARTScore show low correlation with human judgment in evaluating meeting summaries.
MESA (Meeting Summary Assessor) is a multi-stage LLM-based framework designed to mimic human evaluation approaches for meeting summary assessment.


Authors: Frederic Kirstein, Terry Ruas, Bela Gipp

COLING 2025 Industry Track

arXiv: 2411.18444v1 - DOI (cs.CL)

License: CC BY-SA 4.0

Abstract: The quality of meeting summaries generated by natural language generation (NLG) systems is hard to measure automatically. Established metrics such as ROUGE and BERTScore have a relatively low correlation with human judgments and fail to capture nuanced errors. Recent studies suggest using large language models (LLMs), which have the benefit of better context understanding and adaption of error definitions without training on a large number of human preference judgments. However, current LLM-based evaluators risk masking errors and can only serve as a weak proxy, leaving human evaluation the gold standard despite being costly and hard to compare across studies. In this work, we present MESA, an LLM-based framework employing a three-step assessment of individual error types, multi-agent discussion for decision refinement, and feedback-based self-training to refine error definition understanding and alignment with human judgment. We show that MESA's components enable thorough error detection, consistent rating, and adaptability to custom error guidelines. Using GPT-4o as its backbone, MESA achieves mid to high Point-Biserial correlation with human judgment in error detection and mid Spearman and Kendall correlation in reflecting error impact on summary quality, on average 0.25 higher than previous methods. The framework's flexibility in adapting to custom error guidelines makes it suitable for various tasks with limited human-labeled data.

Submitted to arXiv on 27 Nov. 2024


Results of the summarizing process for the arXiv paper: 2411.18444v1


Comprehensive Summary

In this study, we address the challenge of automatically evaluating meeting summaries produced by natural language generation (NLG) systems. Meeting summaries play a crucial role in professional environments by providing references, updates for absentees, and reinforcing key topics discussed during meetings. With the integration of summarization services into popular digital meeting platforms like Zoom, Microsoft Teams, and Google Meet, the importance of generating high-quality meeting summaries has increased significantly.

Generated meeting summaries are traditionally evaluated through costly and time-consuming human assessment. To overcome this limitation, there is a need for an automatic evaluator that provides insights along with scores, enabling techniques such as feedback-based summary refinement and reinforcement learning from AI feedback. However, established automatic metrics like ROUGE, BERTScore, and BARTScore show a relatively low correlation with human judgment and often fail to detect errors accurately or to reflect the impact of errors on summary quality.

Large language models (LLMs) have emerged as potential evaluators for text summarization tasks by assigning Likert scores based on predefined guidelines. However, existing LLM-based approaches face limitations in meeting summarization because they overlook typical errors such as structural presentation issues and coreference problems. Moreover, subjective annotation guidelines may lead to inconsistent interpretations by LLMs, resulting in unreliable evaluations.

To address these challenges, we introduce MESA (Meeting Summary Assessor), a multi-stage LLM-based framework designed to mimic human evaluation approaches. MESA operates on three levels: error detection at the individual error type level, multi-agent discussion for decision refinement, and feedback-based self-training to improve error definition understanding and alignment with human judgment.

With GPT-4o as its backbone model, MESA achieves mid to high Point-Biserial correlation with human judgment in error detection and mid Spearman and Kendall correlation in reflecting error impact on summary quality, on average 0.25 higher than previous methods. The framework enables thorough error detection and consistent rating, and its adaptability to custom error guidelines makes it suitable for various tasks with limited human-labeled data, reducing reliance on costly human evaluation.
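The abstract reports agreement with human judgment as Point-Biserial, Spearman, and Kendall correlations. As a rough illustration of what two of these measure, the following is a minimal pure-Python sketch. The function names and all data are invented for this example and are not taken from the paper. Point-biserial correlation is Pearson correlation where one variable is binary (for instance, whether annotators marked an error), and Spearman correlation is Pearson correlation computed on ranks. Kendall correlation is omitted for brevity.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def point_biserial(binary_labels, scores):
    """Point-biserial correlation: Pearson's r with one 0/1 variable."""
    return pearson(binary_labels, scores)


def spearman(x, y):
    """Spearman rank correlation: Pearson's r on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data, invented for illustration only (not results from the paper).
human_error_present = [0, 0, 1, 1, 1, 0]  # 1 = annotators marked the error
evaluator_severity = [1, 2, 4, 5, 3, 1]   # hypothetical 1-5 evaluator rating
human_severity = [1, 1, 5, 4, 3, 2]       # hypothetical 1-5 human rating

print("point-biserial:", round(point_biserial(human_error_present, evaluator_severity), 3))
print("spearman:", round(spearman(evaluator_severity, human_severity), 3))
```

In practice, a statistics library would normally be used for these measures. The hand-written versions here only make the definitions explicit.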


Summary

Meeting summaries are important in workplaces for keeping track of what was discussed, updating those who were not there, and remembering key points. Some digital meeting platforms now have tools to help create good meeting summaries. Checking the quality of meeting summaries has usually been done by people, which takes a lot of time and money. Some automatic tools like ROUGE, BERTScore, and BARTScore don't always match up with human judgment when evaluating meeting summaries. MESA is a framework built on large language models that tries to evaluate meeting summaries like humans do.

Definitions

- Meeting summaries: Short notes or reports that capture the main points discussed during a meeting.
- Absentees: People who were not present at the meeting.
- Integration: Combining different things together to work as one.
- Evaluation: Assessing or judging something based on specific criteria.
- Framework: A structure or plan that helps organize and guide activities.

Introduction

In today's fast-paced professional world, meetings are an essential part of communication and decision-making processes. With the rise of remote work and virtual meetings, it has become even more crucial to have accurate and comprehensive meeting summaries for reference and updates. However, writing these summaries by hand can be time-consuming and costly. To address this issue, natural language generation (NLG) systems have been integrated into popular digital meeting platforms like Zoom, Microsoft Teams, and Google Meet. The evaluation of generated meeting summaries is traditionally done through human assessment. This process is not only time-consuming but also subjective, as different individuals may interpret the quality of a summary differently. Therefore, there is a need for an automatic evaluator that can provide insights along with scoring to enable techniques such as feedback-based summary refinement and reinforcement learning from AI feedback. In the paper "Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator," the authors propose MESA, a framework that aims to mimic human evaluation approaches in assessing the quality of meeting summaries automatically.

The Limitations of Existing Automatic Metrics

Existing automatic metrics like ROUGE, BERTScore, and BARTScore have been widely used in text summarization tasks but show low correlations with human judgment when applied to meeting summaries. These metrics often fail to detect errors accurately or reflect error impact on summary quality due to their focus on content overlap rather than structural presentation issues or coreference problems.
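To make the content-overlap limitation concrete, here is a minimal sketch of a unigram-overlap F1 score in the spirit of ROUGE-1. It is a simplification written for this illustration, not the official ROUGE implementation (which adds options such as stemming and also defines ROUGE-2 and ROUGE-L), and the two example sentences are invented. The second sentence swaps who sends the report to whom, yet it contains exactly the same words as the reference, so the unigram score cannot tell the two apart.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Simplified unigram-overlap F1 between two whitespace-tokenized strings."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Invented example: the candidate attributes the action to the wrong person.
reference = "alice will send the budget report to bob by friday"
wrong_attribution = "bob will send the budget report to alice by friday"

print(rouge1_f1(reference, wrong_attribution))
```

Higher-order variants such as bigram overlap would penalize this particular swap somewhat, but they still compare surface strings rather than checking who did what.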

The Emergence of Large Language Models (LLMs)

Large language models (LLMs) have emerged as potential evaluators for text summarization tasks by assigning Likert scores based on predefined guidelines. However, existing LLM-based approaches face limitations in meeting summarization contexts due to oversight of typical errors such as structural presentation issues and coreference problems. Moreover, subjective annotation guidelines may lead to inconsistent interpretations by LLMs, resulting in unreliable evaluations. This highlights the need for a more comprehensive and adaptable framework for meeting summary evaluation.

The MESA Framework

To address these challenges, the authors introduce MESA (Meeting Summary Assessor), a multi-stage LLM-based framework designed to mimic human evaluation approaches. MESA operates on three levels: error detection at the individual error type level, multi-agent discussion for decision refinement, and feedback-based self-training to improve error definition understanding and alignment with human judgment. MESA uses GPT-4o as its backbone model. According to the abstract, LLM-based evaluation can adapt to error definitions without training on a large number of human preference judgments, and MESA refines its understanding of those definitions through the feedback-based self-training step. With this setup, MESA achieves mid to high Point-Biserial correlation with human judgment in error detection and mid Spearman and Kendall correlation in reflecting error impact on summary quality, on average 0.25 higher than previous methods.
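The three levels described above can be pictured with a small control-flow sketch. This is not the authors' implementation: every name here (`Judge`, `make_judge`, `assess`, `refine_definition`), the error-type label, and its definition text are hypothetical, the judges are simple word-matching stubs standing in for LLM calls, a majority vote stands in for the multi-agent discussion, and a string append stands in for feedback-based self-training. It only shows how per-error-type checks, several judges, and a feedback step might fit together.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical judge interface: (error definition, transcript, summary) -> verdict.
# A real system would wrap an LLM call here; plain functions keep this runnable offline.
Judge = Callable[[str, str, str], bool]


@dataclass
class Assessment:
    error_type: str
    votes: List[bool]
    error_present: bool


def make_judge(max_unsupported: int) -> Judge:
    """Build a stub judge that flags an error when the summary contains more
    than `max_unsupported` words that never occur in the transcript."""
    def judge(definition: str, transcript: str, summary: str) -> bool:
        known = set(transcript.lower().split())
        unsupported = [w for w in summary.lower().split() if w not in known]
        return len(unsupported) > max_unsupported
    return judge


def assess(transcript: str, summary: str,
           error_definitions: Dict[str, str],
           judges: List[Judge]) -> List[Assessment]:
    """Check each error type separately, then combine the judges' verdicts."""
    results = []
    for error_type, definition in error_definitions.items():
        votes = [judge(definition, transcript, summary) for judge in judges]
        # Stand-in for multi-agent discussion: a strict majority vote.
        error_present = sum(votes) * 2 > len(votes)
        results.append(Assessment(error_type, votes, error_present))
    return results


def refine_definition(definition: str, predicted: bool, human_label: bool) -> str:
    """Stand-in for feedback-based self-training: annotate the definition
    whenever the prediction disagrees with a human label."""
    if predicted == human_label:
        return definition
    hint = "flag" if human_label else "do not flag"
    return f"{definition} (feedback: {hint} cases like the last one)"


# Invented example data and an illustrative, hypothetical error definition.
transcript = "alice will send the budget report to bob by friday"
summary = "carol will send the sales report to bob by friday"
error_definitions = {
    "unsupported content": "The summary states content not found in the transcript.",
}
judges = [make_judge(0), make_judge(1), make_judge(3)]

results = assess(transcript, summary, error_definitions, judges)
for result in results:
    print(result.error_type, result.votes, result.error_present)
```

Evaluating each error type on its own, rather than asking for one overall score, is what lets a disagreement with human labels be traced back to a single definition and fed into the refinement step.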

Flexibility and Adaptability

One of the key strengths of the MESA framework is its flexibility in adapting to custom error guidelines. This makes it suitable for various tasks with limited human-labeled data. The framework enables thorough error detection, consistent rating, and adaptability to custom error definitions while providing valuable insights into meeting summary quality assessment without relying solely on costly human evaluation methods.

Conclusion

In conclusion, the paper "Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator" presents an approach to automatically evaluating meeting summaries using large language models. The proposed framework, MESA, addresses limitations of existing automatic metrics by mimicking human evaluation approaches through multi-stage processing, with GPT-4o as its backbone model. Its correlation with human judgment in detecting errors and in reflecting their impact on summary quality makes it a useful tool for improving NLG systems that generate meeting summaries. Its flexibility in adapting to custom error guidelines also makes it suitable for various tasks with limited human-labeled data. Overall, the paper contributes to the advancement of meeting summary evaluation and has the potential to reduce reliance on costly human assessment in professional environments.

Created on 20 Mar. 2025


