MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning

AI-generated keywords: Knowledge Image Generation, Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark, Multimodal Reasoning, Factual Fidelity, FLUX-Reason

AI-generated Key Points

  • The Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) is introduced to assess reasoning capabilities of image generation models in knowledge image generation.
  • Knowledge images are essential for human civilization and learning processes, requiring complex multimodal reasoning to combine world knowledge with pixel-level grounding.
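  • MMMG consists of 4,456 expert-validated knowledge image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. A unified Knowledge Graph (KG) representation, which lists each target image's core entities and their dependencies, removes confounding complexity from evaluation.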
  • The MMMG-Score metric is introduced to evaluate generated knowledge images based on factual fidelity and visual clarity.
  • Evaluations of 16 state-of-the-art text-to-image generation models reveal serious reasoning deficits (low entity fidelity, weak relations, and clutter); GPT-4o reaches an MMMG-Score of only 50.20.
  • The authors release FLUX-Reason as an open baseline model trained on 16,000 curated knowledge image-prompt pairs to drive further advancements in the field.
  • Detailed analyses show varying levels of success in capturing essential entities and dependencies within tasks, emphasizing the importance of robust reasoning capabilities in knowledge image generation tasks.

Authors: Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue, Yuwei Yang, Fatima Zohra Daha, Ji Li, Zhouhui Lian

85 pages, 70 figures, code: https://github.com/MMMGBench/MMMG, project page: https://mmmgbench.github.io/
License: CC BY-SA 4.0

Abstract: In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning -- a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image's core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits -- low entity fidelity, weak relations, and clutter -- with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark's difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.
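
The abstract says that MMMG-Score combines factual fidelity, measured by graph-edit distance between KGs, with a visual clarity assessment, but it does not give the formula. The sketch below is only an illustration of that idea, not the authors' implementation. It assumes entities are matched by exact label (so the edit distance reduces to counting inserted and deleted nodes and edges), that the distance is normalized by the combined size of both graphs, and that fidelity and clarity are multiplied. The names `KnowledgeGraph`, `make_kg`, and `illustrative_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeGraph:
    """A toy KG: a set of entity labels and (head, relation, tail) triples."""
    entities: frozenset
    relations: frozenset


def make_kg(entities, relations):
    # Hypothetical helper: builds a KG from plain Python iterables.
    return KnowledgeGraph(frozenset(entities), frozenset(relations))


def label_matched_edit_distance(ref, gen):
    """Unit-cost insertions/deletions needed to turn `gen` into `ref`.

    Simplification (assumption): nodes are identified by their labels, so no
    substitution search is needed and the distance is the size of the
    symmetric differences of the node and edge sets.
    """
    node_edits = len(ref.entities ^ gen.entities)
    edge_edits = len(ref.relations ^ gen.relations)
    return node_edits + edge_edits


def normalized_distance(ref, gen):
    """Scale the edit distance to [0, 1] by the combined size of both graphs."""
    total = (len(ref.entities) + len(ref.relations)
             + len(gen.entities) + len(gen.relations))
    if total == 0:
        return 0.0  # two empty graphs are identical
    return label_matched_edit_distance(ref, gen) / total


def illustrative_score(ref, gen, clarity):
    """Assumed combination: 100 * (1 - normalized distance) * clarity.

    `clarity` is a visual-clarity value in [0, 1] supplied by some external
    assessment; how the paper computes it is not stated in the abstract.
    """
    if not 0.0 <= clarity <= 1.0:
        raise ValueError("clarity must lie in [0, 1]")
    return 100.0 * (1.0 - normalized_distance(ref, gen)) * clarity


# Reference KG for a toy "Earth-Moon-Sun" diagram.
reference = make_kg(
    {"Sun", "Earth", "Moon"},
    {("Earth", "orbits", "Sun"), ("Moon", "orbits", "Earth")},
)
# KG extracted from a generated image that omits the Moon.
generated = make_kg({"Sun", "Earth"}, {("Earth", "orbits", "Sun")})

score = illustrative_score(reference, generated, clarity=0.5)
```

In this toy case one entity and one relation are missing, so the edit distance is 2 out of a combined graph size of 8, and the illustrative score drops even before clarity is applied. A full graph-edit distance would also search over node substitutions, which this label-matched version deliberately skips.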

Submitted to arXiv on 12 Jun. 2025


Results of the summarizing process for the arXiv paper: 2506.10963v2

The authors present the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to assess the reasoning capabilities of image generation models on a new task, knowledge image generation. Knowledge images are central to human civilization and learning, and generating them requires multimodal reasoning that combines world knowledge with pixel-level grounding. MMMG consists of 4,456 expert-validated knowledge image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse formats such as charts, diagrams, and mind maps. Each pair is given a unified Knowledge Graph (KG) representation that lists the target image's core entities and their dependencies, which removes confounding complexity from evaluation. The authors also introduce the MMMG-Score metric, which combines factual fidelity, measured by graph-edit distance between KGs, with a visual clarity assessment. Evaluations of 16 state-of-the-art text-to-image generation models reveal serious reasoning deficits, including low entity fidelity, weak relations, and clutter; GPT-4o reaches an MMMG-Score of only 50.20. To encourage further progress, the authors release FLUX-Reason (MMMG-Score of 34.45), an open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs. Detailed analyses show that models vary in how well they capture the essential entities and dependencies of a given task, which highlights the importance of robust reasoning in knowledge image generation.
Created on 23 Sep. 2025

