Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

AI-generated keywords: Text-to-Image Generation

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Introduction of the Pathways Autoregressive Text-to-Image (Parti) model
  • Goal: Generate high-fidelity photorealistic images with complex compositions and world knowledge
  • Approach: Treat text-to-image generation as a sequence-to-sequence modeling problem, using image tokens as target outputs
  • Utilization of ViT-VQGAN, a Transformer-based image tokenizer, to encode images as sequences of discrete tokens
  • Scaling up the encoder-decoder Transformer model to 20 billion parameters for consistent quality improvements
  • Reported results: a new state-of-the-art zero-shot FID score of 7.23 and a finetuned FID score of 3.22 on MS-COCO
  • Effectiveness demonstrated through analysis on Localized Narratives and PartiPrompts (P2), a new benchmark of over 1600 English prompts
  • Acknowledgment of limitations of the Parti models, highlighting key areas for further improvement
  • Overall, a simple yet powerful approach for generating high-quality images from textual descriptions
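
To make "encode images as sequences of discrete tokens" concrete, here is a minimal sketch of the vector-quantization lookup that VQ-style tokenizers rely on: each patch embedding is replaced by the index of its nearest codebook entry. This is a toy illustration, not the ViT-VQGAN implementation. The function name `quantize`, the two-dimensional vectors, and the three-entry codebook are assumptions made up for the example. In a real tokenizer the patch vectors come from a learned encoder and the codebook is learned as well.

```python
from typing import List, Sequence


def quantize(patches: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each patch vector to the index of its nearest codebook entry."""

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        # Squared Euclidean distance between two vectors of equal length.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(patch, codebook[k]))
        for patch in patches
    ]


# Hypothetical codebook and patch embeddings (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
patches = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8], [1.1, 0.9]]

tokens = quantize(patches, codebook)
print(tokens)  # [0, 1, 2, 1]
```

The resulting list of integers is the kind of discrete token sequence a sequence-to-sequence model can then be trained to predict.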

Authors: Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, Yonghui Wu

Preprint

Abstract: We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrate the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.
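
The sequence-to-sequence framing in the abstract can be sketched as an autoregressive decoding loop: given the text tokens, the model predicts image tokens one at a time, each conditioned on the text and on the image tokens produced so far. The code below is a sketch under stated assumptions, not Parti's decoder. `greedy_decode` and `toy_scores` are hypothetical names, `toy_scores` is a deterministic stand-in for a trained Transformer, and greedy selection is used only to keep the example short (the metadata does not say which sampling strategy the paper uses).

```python
from typing import Callable, List, Sequence

ScoreFn = Callable[[Sequence[int], Sequence[int]], List[float]]


def greedy_decode(text_tokens: Sequence[int],
                  next_token_scores: ScoreFn,
                  num_image_tokens: int) -> List[int]:
    """Generate image tokens one at a time, conditioned on text and prior tokens."""
    image_tokens: List[int] = []
    for _ in range(num_image_tokens):
        scores = next_token_scores(text_tokens, image_tokens)
        # Greedy choice: take the highest-scoring token id.
        best = max(range(len(scores)), key=scores.__getitem__)
        image_tokens.append(best)
    return image_tokens


def toy_scores(text_tokens: Sequence[int],
               image_tokens: Sequence[int],
               vocab_size: int = 4) -> List[float]:
    """Hypothetical stand-in for a trained decoder (assumption, not a real model)."""
    target = (sum(text_tokens) + len(image_tokens)) % vocab_size
    return [1.0 if k == target else 0.0 for k in range(vocab_size)]


generated = greedy_decode([3, 5], toy_scores, 6)
print(generated)  # [0, 1, 2, 3, 0, 1]
```

In a full pipeline, the generated token ids would be passed to the image tokenizer's decoder to reconstruct pixels. That step is omitted here.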

Submitted to arXiv on 22 Jun. 2022

Results of the summarizing process for the arXiv paper: 2206.10789v1

This paper introduces the Pathways Autoregressive Text-to-Image (Parti) model, which aims to generate high-fidelity photorealistic images and to support content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, similar to machine translation, but with sequences of image tokens as the target outputs instead of text tokens in another language. This framing lets the approach draw on prior work on large language models, which have advanced through scaling data and model sizes. The approach has two parts. First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, the encoder-decoder Transformer is scaled up to 20 billion parameters, which yields consistent quality improvements, including a new state-of-the-art zero-shot FID score of 7.23 and a finetuned FID score of 3.22 on MS-COCO. The effectiveness of Parti is shown through detailed analysis on Localized Narratives and on PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts spanning a wide variety of categories and difficulty aspects. The paper also explores and highlights limitations of the models in order to identify key areas for further improvement. Overall, the paper presents a simple approach for generating high-quality images from textual descriptions. High-resolution images are available at https://parti.research.google/.
Created on 26 Dec. 2023


