Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation

AI-generated keywords: Positive-Augmented Contrastive Learning, Image and Video Captioning Evaluation, CLIP model, PAC-S, Cross-modal tasks

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara introduce a novel evaluation metric called Positive-Augmented Contrastive learning Score (PAC-S) for image captioning.
  • PAC-S combines contrastive visual-semantic space learning with the incorporation of generated images and text on curated data to enhance existing reference-based and reference-free metrics.
  • Through experiments across multiple datasets, PAC-S achieves the highest correlation with human judgments for both images and videos, surpassing existing metrics like CIDEr, SPICE, and CLIP-Score.
  • The study evaluates the system-level correlation of PAC-S when applied to popular image captioning approaches while examining the impact of utilizing different cross-modal features.
  • Selected as a highlight paper at CVPR 2023, this research advances evaluation methods for image and video captioning.
  • The authors release their source code and trained models on GitHub, which supports reproducibility and follow-up work in this domain.

Authors: Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

CVPR 2023 (highlight paper)

Abstract: The CLIP model has been recently proven to be very effective for a variety of cross-modal tasks, including the evaluation of captions generated from vision-and-language architectures. In this paper, we propose a new recipe for a contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S), that in a novel way unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data. Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos, outperforming existing reference-based metrics like CIDEr and SPICE and reference-free metrics like CLIP-Score. Finally, we test the system-level correlation of the proposed metric when considering popular image captioning approaches, and assess the impact of employing different cross-modal features. Our source code and trained models are publicly available at: https://github.com/aimagelab/pacscore.
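As a rough illustration of how a contrastive, reference-free caption metric of this kind can be computed, the sketch below scores a caption by the clipped, scaled cosine similarity between an image embedding and a caption embedding. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the scaling constant `w`, and the toy 3-dimensional vectors are all hypothetical, and a real metric would take its embeddings from a trained CLIP-style encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_u * norm_v)


def reference_free_score(image_emb, caption_emb, w=2.0):
    """Scaled cosine similarity, clipped at zero.

    `w` is an assumed scaling constant chosen for illustration only.
    """
    return w * max(cosine_similarity(image_emb, caption_emb), 0.0)


# Toy embeddings (hypothetical values, standing in for encoder outputs).
image = [1.0, 0.0, 0.0]
good_caption = [0.8, 0.6, 0.0]   # points roughly the same way as the image
bad_caption = [-1.0, 0.0, 0.0]   # points the opposite way

good_score = reference_free_score(image, good_caption)
bad_score = reference_free_score(image, bad_caption)
print(good_score, bad_score)
```

With these toy vectors the well-aligned caption scores higher than the misaligned one, whose negative similarity is clipped to zero.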

Submitted to arXiv on 21 Mar. 2023


Results of the summarizing process for the arXiv paper: 2303.12112v3

In "Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation," Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara introduce a new evaluation metric for image captioning, the Positive-Augmented Contrastive learning Score (PAC-S). The method unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data. In experiments across several datasets, PAC-S achieves the highest correlation with human judgments on both images and videos, outperforming reference-based metrics such as CIDEr and SPICE and reference-free metrics such as CLIP-Score. The study also measures the system-level correlation of PAC-S on popular image captioning approaches and assesses the impact of using different cross-modal features. The paper was selected as a highlight paper at CVPR 2023, and the authors' source code and trained models are publicly available on GitHub, which supports reproducibility and follow-up work.
Created on 20 Mar. 2024
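
The summary above reports results as correlation with human judgments. The metadata does not say which correlation coefficient is used; one common choice for comparing a metric's ranking with human ratings is Kendall's tau. The sketch below computes the simple tau-a variant (no tie correction) in plain Python. The function name and the score lists are hypothetical illustration data, not results from the paper.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1      # both sequences order this pair the same way
        elif sign < 0:
            discordant += 1      # the sequences disagree on this pair
        # pairs tied in either sequence count as neither
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical metric scores and human ratings for four captions.
metric_scores = [0.2, 0.5, 0.9, 0.4]
human_ratings = [1, 2, 4, 3]

tau = kendall_tau_a(metric_scores, human_ratings)
print(tau)
```

Here five of the six caption pairs are ranked the same way by the metric and by the human ratings and one pair is reversed, so tau is (5 - 1) / 6. A value closer to 1 means the metric orders captions more like human raters do.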
