Goku: Flow Based Video Generative Foundation Models

AI-generated keywords: Goku, joint image-and-video generation models, rectified flow Transformers, high-quality visual generation, data curation pipeline

AI-generated Key Points

⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

Goku is a family of joint image-and-video generation models that utilize rectified flow Transformers for exceptional performance.
Foundational elements crucial for high-quality visual generation include the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for large-scale training.
Goku models demonstrate superior performance in qualitative and quantitative evaluations across various tasks.
Impressive scores achieved by Goku models: 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, as well as 84.85 on VBench for text-to-video tasks.
The extensive list of authors involved in the research includes Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu.
The work provides valuable insights and practical advancements for developing joint image-and-video generation models.

Authors: Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu

arXiv: 2502.04896v1 (cs.CV)

page: https://saiyan-world.github.io/goku/

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting new benchmarks across major tasks. Specifically, Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video tasks. We believe that this work provides valuable insights and practical advancements for the research community in developing joint image-and-video generation models.

Submitted to arXiv on 07 Feb. 2025

Results of the summarizing process for the arXiv paper: 2502.04896v1

Comprehensive Summary

This paper introduces Goku, a cutting-edge family of joint image-and-video generation models that leverage rectified flow Transformers to achieve exceptional performance in the field. The authors provide an in-depth exploration of the foundational elements crucial for enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models showcase superior performance through both qualitative and quantitative evaluations, setting new benchmarks across various tasks. Specifically, Goku achieves impressive scores of 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, as well as 84.85 on VBench for text-to-video tasks. The extensive list of authors involved in this research includes Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. This work not only presents valuable insights but also offers practical advancements for the research community in developing joint image-and-video generation models. For further information about Goku models and their applications, interested readers can visit the project page at https://saiyan-world.github.io/goku/.

Layman's Summary

- Goku is a family of models that creates both images and videos, built on Transformers for better performance.
- Important ingredients for making good pictures and videos are well-organized training data, a well-designed model, the flow formulation, and good infrastructure for training.
- Goku models perform very well on different tasks when compared to other models.
- Goku models got high scores in tests like GenEval and DPG-Bench (making images from text) and VBench (making videos from text).
- Many authors worked on this research to make better image and video generation models.

Definitions

- Model: A computer program trained on data to perform a task, here generating images and videos.
- Transformer: A type of neural network architecture widely used in modern AI models.
- Performance: How well something works or does its job.
- Data curation pipeline: The process of collecting, filtering, and organizing training data.
- Architecture design: Planning how the model will be structured.

Introducing Goku: A Cutting-Edge Family of Joint Image-and-Video Generation Models

Goku is a new family of joint image-and-video generation models. Developed by a team including Shoufa Chen, Chongjian Ge, and Yuqi Zhang, among others, Goku leverages rectified flow Transformers to achieve exceptional performance in the field.

The paper, titled "Goku: Flow Based Video Generative Foundation Models", details the foundational elements crucial for enabling high-quality visual generation. These include the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training.
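Since this summary is based on the paper's metadata only, the exact flow formulation used by Goku is not available here. As a rough illustration of what "rectified flow" generally means, the sketch below shows the textbook idea: a sample is moved along a straight line from noise to data, and a model is trained to predict the constant velocity of that line. All function names, the convention that `x0` is noise and `x1` is data, and the toy `oracle` model are assumptions made for this example, not details taken from the paper.

```python
# Minimal, generic rectified-flow sketch in pure Python (illustrative only;
# not the Goku implementation). Vectors are plain lists of floats.

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path, d x_t / dt = x1 - x0 (constant in t)."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x0)

def euler_sample(predict_velocity, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

if __name__ == "__main__":
    noise = [0.5, -1.0]   # hypothetical starting noise sample
    data = [2.0, 3.0]     # hypothetical single "data" point

    # Toy stand-in for a trained network: the exact velocity field
    # for a dataset that contains only `data`.
    def oracle(x, t):
        return [(b - a) / (1.0 - t) for a, b in zip(x, data)]

    print("loss:", flow_matching_loss(oracle, noise, data, 0.3))
    print("sample:", euler_sample(oracle, noise, steps=10))
```

In a real system the `oracle` would be replaced by a learned network (a Transformer, in Goku's case), the loss would be averaged over many noise/data pairs and random values of `t`, and the vectors would be image or video latents rather than two-element lists.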

Superior Performance Across Various Tasks

Goku's impressive performance has been demonstrated through both qualitative and quantitative evaluations across various tasks. In text-to-image generation tasks, it achieved a score of 0.76 on GenEval and 83.65 on DPG-Bench. For text-to-video tasks, it achieved an outstanding score of 84.85 on VBench.

This remarkable performance sets new benchmarks in the field of joint image-and-video generation models and showcases Goku as one of the leading contenders in this area.

The Team Behind Goku

The paper lists 22 authors: Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu.

The paper metadata used for this summary does not give the authors' institutional affiliations, so none are stated here.

Practical Advancements for the Research Community

The Goku models not only present valuable insights but also offer practical advancements for the research community in developing joint image-and-video generation models. The paper provides detailed information on the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure used in training these models.

This comprehensive guide will be beneficial for researchers looking to explore this field further or build upon the work done by the Goku team. It offers a solid foundation for future developments in joint image-and-video generation models.

Further Information and Applications

If you are interested in learning more about Goku models and their applications, you can visit the project page at https://saiyan-world.github.io/goku/, which may offer additional resources related to this research.

Goku has already made significant contributions to the field of joint image-and-video generation models with its exceptional performance and valuable insights. With further advancements and developments, it has the potential to revolutionize how we generate high-quality images and videos. We look forward to seeing what else this cutting-edge family of models has in store for us in the future.

Created on 11 Feb. 2025

Similar papers summarized with our AI tools

- 69.8%: Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning (cs.CV)
- 69.4%: Flowing from Words to Pixels: A Framework for Cross-Modality Evolution (cs.CV)
- 67.7%: Sora Generates Videos with Stunning Geometrical Consistency (cs.CV)
- 67.4%: Show and Tell: A Neural Image Caption Generator (cs.CV)
- 66.8%: SketchyCOCO: Image Generation from Freehand Scene Sketches (cs.CV)
- 66.6%: SketchyGAN: Towards Diverse and Realistic Sketch to Image Synthesis (cs.CV)
- 66.5%: Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation (cs.CV)
