InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation

AI-generated keywords: Text-to-image generation

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Diffusion models have revolutionized text-to-image generation with their exceptional quality and creativity.
  • Previous attempts to improve sampling speed and reduce computational costs through distillation have not resulted in a functional one-step model.
  • The paper explores Rectified Flow, a recent method whose core reflow procedure straightens probability-flow trajectories, and uses it in a text-conditioned pipeline that turns Stable Diffusion into an ultra-fast one-step text-to-image model named InstaFlow.
  • InstaFlow achieves an FID of $23.3$ on the MS COCO 2017-5k dataset, surpassing the previous state-of-the-art technique, progressive distillation, by a significant margin ($37.2$ $\rightarrow$ $23.3$ in FID).
  • An expanded network with 1.7B parameters further improves the FID to $22.4$.
  • On the MS COCO 2014-30k dataset, InstaFlow reaches an FID of $13.1$ in just $0.09$ second, the best result in the $\leq 0.1$ second regime, outperforming the recent StyleGAN-T ($13.9$ in $0.1$ second).
  • Training InstaFlow costs only 199 A100 GPU days.
  • Code and pre-trained models for InstaFlow are available at \url{github.com/gnobitab/InstaFlow}.
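
For readers unfamiliar with the metric quoted in the key points, the standard definition of FID is sketched below. It is not spelled out in the paper metadata; the symbols ($\mu_r, \Sigma_r$ for the mean and covariance of Inception features of real images, $\mu_g, \Sigma_g$ for those of generated images) are notation chosen here for illustration. Lower values indicate that the generated feature distribution is closer to the real one.

```latex
\mathrm{FID}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g
      - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```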

Authors: Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, Qiang Liu

ICLR 2024

Abstract: Diffusion models have revolutionized text-to-image generation with its exceptional quality and creativity. However, its multi-step sampling process is known to be slow, often requiring tens of inference steps to obtain satisfactory results. Previous attempts to improve its sampling speed and reduce computational costs through distillation have been unsuccessful in achieving a functional one-step model. In this paper, we explore a recent method called Rectified Flow, which, thus far, has only been applied to small datasets. The core of Rectified Flow lies in its \emph{reflow} procedure, which straightens the trajectories of probability flows, refines the coupling between noises and images, and facilitates the distillation process with student models. We propose a novel text-conditioned pipeline to turn Stable Diffusion (SD) into an ultra-fast one-step model, in which we find reflow plays a critical role in improving the assignment between noise and images. Leveraging our new pipeline, we create, to the best of our knowledge, the first one-step diffusion-based text-to-image generator with SD-level image quality, achieving an FID (Frechet Inception Distance) of $23.3$ on MS COCO 2017-5k, surpassing the previous state-of-the-art technique, progressive distillation, by a significant margin ($37.2$ $\rightarrow$ $23.3$ in FID). By utilizing an expanded network with 1.7B parameters, we further improve the FID to $22.4$. We call our one-step models \emph{InstaFlow}. On MS COCO 2014-30k, InstaFlow yields an FID of $13.1$ in just $0.09$ second, the best in $\leq 0.1$ second regime, outperforming the recent StyleGAN-T ($13.9$ in $0.1$ second). Notably, the training of InstaFlow only costs 199 A100 GPU days. Codes and pre-trained models are available at \url{github.com/gnobitab/InstaFlow}.
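
For context, the Rectified Flow objective that the abstract builds on can be sketched as follows. This formulation is taken from the general Rectified Flow literature rather than from this page's metadata, and the notation is an assumption made here for illustration: $X_0$ is a noise sample, $X_1$ an image (or its latent), $\mathcal{T}$ the text prompt, and $v_\theta$ the learned velocity field.

```latex
% Straight-line interpolation between a noise sample and an image
X_t = t\,X_1 + (1 - t)\,X_0, \qquad t \in [0, 1]

% Training objective: regress the velocity onto the straight-line direction
\min_\theta \int_0^1
  \mathbb{E}\left[ \left\lVert (X_1 - X_0)
    - v_\theta(X_t, t \mid \mathcal{T}) \right\rVert^2 \right] dt

% Sampling: solve the ODE from noise Z_0 \sim \mathcal{N}(0, I)
\frac{dZ_t}{dt} = v_\theta(Z_t, t \mid \mathcal{T})

% If the trajectories are straight, a single Euler step is accurate
Z_1 \approx Z_0 + v_\theta(Z_0, 0 \mid \mathcal{T})
```

In this reading, reflow retrains the same objective on pairs $(Z_0, Z_1)$ produced by the current flow, which is how the coupling between noise and images gets refined and the trajectories get straighter.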

Submitted to arXiv on 12 Sep. 2023

Results of the summarizing process for the arXiv paper: 2309.06380v2

Diffusion models now dominate text-to-image generation because of their image quality and creativity, but their multi-step sampling is slow: satisfactory results often require tens of inference steps. Previous attempts to speed up sampling and cut computational cost through distillation have not produced a functional one-step model.

In "InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation," Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, and Qiang Liu explore Rectified Flow, a recent method that had previously been applied only to small datasets. Its core is the \emph{reflow} procedure, which straightens the trajectories of probability flows, refines the coupling between noise and images, and makes distillation into student models easier. The authors propose a text-conditioned pipeline that turns Stable Diffusion (SD) into an ultra-fast one-step model, and they find that reflow plays a critical role in improving the assignment between noise and images.

With this pipeline they build InstaFlow, which they describe as, to the best of their knowledge, the first one-step diffusion-based text-to-image generator with SD-level image quality. InstaFlow reaches an FID (Frechet Inception Distance) of $23.3$ on MS COCO 2017-5k, well ahead of the previous state-of-the-art technique, progressive distillation ($37.2$ $\rightarrow$ $23.3$ in FID). An expanded network with 1.7B parameters improves the FID further to $22.4$. On MS COCO 2014-30k, InstaFlow yields an FID of $13.1$ in just $0.09$ second, the best result in the $\leq 0.1$ second regime and better than the recent StyleGAN-T ($13.9$ in $0.1$ second).

Training InstaFlow costs only 199 A100 GPU days. Code and pre-trained models are available at \url{github.com/gnobitab/InstaFlow}. The paper is listed under ICLR 2024.
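
The link between straight trajectories and one-step generation can be illustrated with a toy example. The sketch below is not the authors' code and involves no neural network: `euler_sample`, `straight_velocity`, and `curved_velocity` are hypothetical names, and both velocity fields are hand-written one-dimensional stand-ins. It only shows that an Euler solver needs a single step when trajectories are straight, and many steps when they are curved.

```python
import math


def euler_sample(velocity, x0, n_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with n_steps Euler steps."""
    x = x0
    h = 1.0 / n_steps
    for i in range(n_steps):
        x = x + h * velocity(x, i * h)
    return x


def straight_velocity(x, t):
    # Toy field (assumption): every trajectory is the straight line
    # x_t = (1 - t) * x0 + t * (2 * x0 + 1), so the velocity is constant
    # along each path and equals x1 - x0.
    return (x - t) / (1.0 + t) + 1.0


def curved_velocity(x, t):
    # Toy field (assumption): dx/dt = x, whose exact endpoint is x0 * e.
    # The velocity changes along the path, so trajectories are curved in (t, x).
    return x


x0 = 0.5
straight_exact = 2.0 * x0 + 1.0
curved_exact = x0 * math.e

for n in (1, 4, 50):
    err_straight = abs(euler_sample(straight_velocity, x0, n) - straight_exact)
    err_curved = abs(euler_sample(curved_velocity, x0, n) - curved_exact)
    print(f"steps={n:2d}  straight error={err_straight:.2e}  curved error={err_curved:.2e}")
```

For the straight field, one step already lands on the exact endpoint up to floating-point error, whereas the curved field needs many steps to approach it. This is the intuition behind using reflow to straighten trajectories before distilling into a one-step model.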
Created on 02 Aug. 2024
