Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings

AI-generated keywords: Text-to-image generative models

AI-generated Key Points

  • Growing need for text-to-image (T2I) generative models to align generated images closely with given prompts
  • Recent study introduces Gecko2K benchmark to categorize prompts into sub-skills and identify challenging areas for T2I models
  • Analysis of over 100K human ratings across templates and models reveals differences due to prompt ambiguity, metric quality, and model discrepancies
  • Introduction of a new QA-based auto-eval metric that shows better correlation with human ratings compared to existing metrics
  • Development of Gecko(S) prompt set with fine-grained skills coverage for identifying T2I model failures
  • Importance of using reliable prompts with high inter-annotator agreement for consistent model ordering
  • Fine-grained annotation templates yield more consistent results compared to coarse-grained ones
  • Significance of standardizing model evaluation processes by considering benchmark selection and annotation template quality

Authors: Olivia Wiles, Chuhan Zhang, Isabela Albuquerque, Ivana Kajić, Su Wang, Emanuele Bugliarello, Yasumasa Onoe, Chris Knutsen, Cyrus Rashtchian, Jordi Pont-Tuset, Aida Nematzadeh

Data and code will be released at: https://github.com/google-deepmind/gecko_benchmark_t2i
License: CC BY 4.0

Abstract: While text-to-image (T2I) generative models have become ubiquitous, they do not necessarily generate images that align with a given prompt. While previous work has evaluated T2I alignment by proposing metrics, benchmarks, and templates for collecting human judgements, the quality of these components is not systematically measured. Human-rated prompt sets are generally small and the reliability of the ratings -- and thereby the prompt set used to compare models -- is not evaluated. We address this gap by performing an extensive study evaluating auto-eval metrics and human templates. We provide three main contributions: (1) We introduce a comprehensive skills-based benchmark that can discriminate models across different human templates. This skills-based benchmark categorises prompts into sub-skills, allowing a practitioner to pinpoint not only which skills are challenging, but at what level of complexity a skill becomes challenging. (2) We gather human ratings across four templates and four T2I models for a total of >100K annotations. This allows us to understand where differences arise due to inherent ambiguity in the prompt and where they arise due to differences in metric and model quality. (3) Finally, we introduce a new QA-based auto-eval metric that is better correlated with human ratings than existing metrics for our new dataset, across different human templates, and on TIFA160.

Submitted to arXiv on 25 Apr. 2024

Results of the summarizing process for the arXiv paper: 2404.16820v1

In the realm of text-to-image (T2I) generative models, there is a growing need to ensure that generated images align closely with given prompts. Previous efforts have focused on evaluating T2I alignment through metrics, benchmarks, and human judgement templates. However, the quality and reliability of these components have not been systematically measured. This gap is addressed in a recent study that delves into auto-eval metrics and human templates to provide a more comprehensive understanding.

The study introduces Gecko2K, a detailed benchmark that categorizes prompts into sub-skills to pinpoint challenging areas for T2I models. By gathering over 100K human ratings across four templates and four T2I models, the research sheds light on where differences arise due to prompt ambiguity versus metric and model quality discrepancies. Additionally, a new QA-based auto-eval metric is introduced, showcasing better correlation with human ratings compared to existing metrics. Key contributions include the development of Gecko(S), a discriminative prompt set with fine-grained skills coverage for identifying T2I model failures.

The analysis highlights the impact of annotation templates on model evaluation and emphasizes the importance of using reliable prompts with high inter-annotator agreement for consistent model ordering. Furthermore, findings suggest that fine-grained annotation templates yield more consistent results compared to coarse-grained ones. Overall, the study underscores the significance of standardizing model evaluation processes by considering both benchmark selection and annotation template quality. While the proposed metric shows promise for reliable model comparisons, future directions may involve incorporating confidence thresholds alongside metric scores. Anecdotal evidence suggests that annotators spend more time rating prompt-image pairs using certain templates, indicating potential variations in evaluation efficiency based on template type.
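To make two ideas from the summary concrete (correlating an auto-eval metric with human ratings, and keeping only prompts on which annotators agree), here is a minimal Python sketch. It is not the paper's implementation: the function names, the toy data, and the use of rating variance as a stand-in for a formal inter-annotator agreement statistic are all assumptions made for illustration.

```python
from statistics import mean, pvariance


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(metric_scores, human_ratings):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(metric_scores), ranks(human_ratings))


def reliable_prompts(ratings_by_prompt, max_variance=0.1):
    """Keep prompts whose annotator ratings have low population variance.

    Assumption: variance is only a simple proxy for agreement; the
    threshold 0.1 is hypothetical and not taken from the paper.
    """
    return [
        prompt
        for prompt, ratings in ratings_by_prompt.items()
        if pvariance(ratings) <= max_variance
    ]


# Toy data (hypothetical): one score per prompt-image pair.
metric_scores = [0.9, 0.4, 0.7, 0.2]
human_ratings = [0.8, 0.5, 0.6, 0.1]
correlation = spearman(metric_scores, human_ratings)

# Toy data (hypothetical): three annotators per prompt, binary ratings.
ratings_by_prompt = {
    "a red cube": [1, 1, 1],
    "three cats": [0, 1, 1],
}
kept = reliable_prompts(ratings_by_prompt)
```

In this toy example the metric and the human ratings order the four pairs identically, so the rank correlation is at its maximum, and only the prompt with unanimous ratings survives the filter. A real study would more likely use an established agreement statistic and a library routine for the correlation.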
Created on 30 Apr. 2024
