Towards Explainable Evaluation Metrics for Machine Translation

AI-generated keywords: explainable machine translation metrics transparency evaluation ChatGPT

AI-generated Key Points

  • Shift from traditional lexical overlap metrics like BLEU to newer black-box models such as COMET and BERTScore
  • Preference for classical metrics due to transparent decision-making processes
  • Importance of explainability in promoting adoption of high-quality metrics
  • Taxonomy of previous efforts in explainable MT evaluation showcasing various methods and techniques used in the field
  • Discussion on underexplored research directions and potential future paths for improving MT evaluation through explainability
  • Utilization of large language models like ChatGPT and GPT4 in the context of explainable MT metrics
  • Guidance provided for researchers and metric developers to enhance understanding and explanation of MT metrics
  • Envisioned benefits including improved MT metrics, translation selection, and semi-automatic labeling
  • Background information on machine translation evaluation metrics, explaining key dimensions such as input type, granularity, quality aspect, and learning objective
  • Comprehensive guide offering insights into current trends, future possibilities, and potential advancements in the field

Authors: Christoph Leiter, Piyawat Lertvittayakumjorn, Marina Fomicheva, Wei Zhao, Yang Gao, Steffen Eger

Preprint. We published an earlier version of this paper (arXiv:2203.11131) under a different title. Both versions consider the conceptualization of explainable metrics and are overall similar. However, the new version puts a stronger emphasis on the survey of approaches for the explanation of MT metrics, including the latest LLM-based approaches.
License: CC BY 4.0

Abstract: Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics for machine translation (for example, COMET or BERTScore) are based on black-box large language models. They often achieve strong correlations with human judgments, but recent research indicates that the lower-quality classical metrics remain dominant, one of the potential reasons being that their decision processes are more transparent. To foster more widespread acceptance of novel high-quality metrics, explainability thus becomes crucial. In this concept paper, we identify key properties as well as key goals of explainable machine translation metrics and provide a comprehensive synthesis of recent techniques, relating them to our established goals and properties. In this context, we also discuss the latest state-of-the-art approaches to explainable metrics based on generative models such as ChatGPT and GPT4. Finally, we contribute a vision of next-generation approaches, including natural language explanations. We hope that our work can help catalyze and guide future research on explainable evaluation metrics and, mediately, also contribute to better and more transparent machine translation systems.

Submitted to arXiv on 22 Jun. 2023

Results of the summarizing process for the arXiv paper: 2306.13041v1

This paper examines explainable machine translation (MT) metrics, highlighting the shift from traditional lexical overlap metrics like BLEU to newer black-box models such as COMET and BERTScore. While these newer models show strong correlations with human judgments, classical metrics are still preferred because their decision-making processes are transparent. To address this disparity and promote the adoption of high-quality metrics, explainability becomes paramount.

The paper outlines a taxonomy of previous efforts in explainable MT evaluation, showcasing the methods and techniques used in the field. It also discusses underexplored research directions and potential future paths for improving MT evaluation through explainability, and explores the use of large language models like ChatGPT and GPT4 in this context. Furthermore, the authors aim to solidify the field of explainable MT metrics by providing guidance for researchers and metric developers looking to better understand and explain MT metrics. They envision that this work will not only lead to improved MT metrics but also benefit other applications such as translation selection and semi-automatic labeling.

The paper also provides background information on machine translation evaluation metrics, explaining key dimensions such as input type, granularity, quality aspect, and learning objective. This foundational knowledge sets the stage for understanding explainable MT metrics. Overall, the paper serves as a comprehensive guide to explainable machine translation metrics, offering insights into current trends, future possibilities, and potential advancements in the field. By emphasizing transparency and clarity in evaluating machine translations, the authors hope to contribute to better MT systems in the future.
Created on 16 Apr. 2024
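
The transparency attributed to lexical overlap metrics in the summary above can be illustrated with a minimal Python sketch of clipped n-gram precision, the core ingredient of BLEU-style metrics. It is not taken from the paper and is not a full BLEU implementation (it omits the brevity penalty and the geometric mean over n-gram orders); the function names `ngrams` and `clipped_precision` and the example sentences are assumptions chosen for illustration. The point is that every part of the score can be traced back to specific matching words, which is not directly possible for an embedding-based score.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n=1):
    """Clipped n-gram precision of a hypothesis against one reference.

    Returns the score together with the matched n-gram counts, so the
    score can be inspected n-gram by n-gram (illustrative sketch only).
    """
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    # A hypothesis n-gram is credited at most as often as it occurs in the reference.
    matches = {
        gram: min(count, ref_counts[gram])
        for gram, count in hyp_counts.items()
        if gram in ref_counts
    }
    total = sum(hyp_counts.values())
    score = sum(matches.values()) / total if total else 0.0
    return score, matches


if __name__ == "__main__":
    score, matches = clipped_precision(
        "the cat sat on the mat", "the cat is on the mat", n=1
    )
    print(f"unigram precision: {score:.3f}")
    for gram, count in matches.items():
        print(" ".join(gram), "matched", count, "time(s)")
```

The returned `matches` dictionary acts as a simple built-in explanation: it lists exactly which n-grams contributed to the score and how often, while unmatched words such as "sat" are absent from it.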

