Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
AI-generated Key Points
⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.
- Authors Qihao Liu, Xi Yin, Alan Yuille, Andrew Brown, and Mannat Singh investigate the impact of diffusion models and flow matching on media generation.
- Traditional approach involves learning a mapping from Gaussian noise to target media distribution.
- Proposed paradigm explores whether flow matching models can directly map one modality's distribution to another without relying on noise distributions or conditioning mechanisms.
- CrossFlow framework utilizes Variational Encoders for input data and enables Classifier-free guidance.
- CrossFlow with a vanilla transformer slightly outperforms standard flow matching for text-to-image tasks.
- CrossFlow demonstrates better scalability with training steps and model size while allowing for meaningful edits in output space through latent arithmetic operations.
- Competitive with or superior to state-of-the-art methods in various cross-modal and intra-modal mapping tasks such as image captioning, depth estimation, and image super-resolution.
Authors: Qihao Liu, Xi Yin, Alan Yuille, Andrew Brown, Mannat Singh
Abstract: Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt whilst including a conditioning mechanism in the model. One key and thus far relatively unexplored feature of flow matching is that, unlike Diffusion models, they are not constrained for the source distribution to be noise. Hence, in this paper, we propose a paradigm shift, and ask the question of whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and conditioning mechanism. We present a general and simple framework, CrossFlow, for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data, and introduce a method to enable Classifier-free guidance. Surprisingly, for text-to-image, CrossFlow with a vanilla transformer without cross attention slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing for interesting latent arithmetic which results in semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we also show that CrossFlow is on par with or outperforms the state-of-the-art for various cross-modal / intra-modal mapping tasks, viz. image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.
Ask questions about this paper to our AI assistant
You can also chat with multiple papers at once here.
⚠The license of the paper does not allow us to build upon its content and the AI assistant only knows about the paper metadata rather than the full article.
Assess the quality of the AI-generated content by voting
Score: 0
Why do we need votes?
Votes are used to determine whether we need to re-run our summarizing tools. If the count reaches -10, our tools can be restarted.
Similar papers summarized with our AI tools
Navigate through even more similar papers through a
tree representationLook for similar papers (in beta version)
By clicking on the button above, our algorithm will scan all papers in our database to find the closest based on the contents of the full papers and not just on metadata. Please note that it only works for papers that we have generated summaries for and you can rerun it from time to time to get a more accurate result while our database grows.
Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.