Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation

AI-generated Key Points

⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

- Make-An-Animation is a text-conditioned human motion generation model
- Developed to improve the quality of generated motions in animation and robotics applications
- Diffusion models have improved the quality of generated motions, but existing approaches are limited by small-scale motion capture data
- Make-An-Animation is trained in two stages:
  - First stage: trained on a curated large-scale dataset of (text, static pseudo-pose) pairs extracted from image-text datasets
  - Second stage: fine-tuned on motion capture data, with additional layers added to model the temporal dimension

Authors: Samaneh Azadi, Akbar Shah, Thomas Hayes, Devi Parikh, Sonal Gupta

arXiv: 2305.09662v1 - DOI (cs.CV)

arXiv admin note: text overlap with arXiv:2304.07410

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: Text-guided human motion generation has drawn significant interest because of its impactful applications spanning animation and robotics. Recently, application of diffusion models for motion generation has enabled improvements in the quality of generated motions. However, existing approaches are limited by their reliance on relatively small-scale motion capture data, leading to poor performance on more diverse, in-the-wild prompts. In this paper, we introduce Make-An-Animation, a text-conditioned human motion generation model which learns more diverse poses and prompts from large-scale image-text datasets, enabling significant improvement in performance over prior works. Make-An-Animation is trained in two stages. First, we train on a curated large-scale dataset of (text, static pseudo-pose) pairs extracted from image-text datasets. Second, we fine-tune on motion capture data, adding additional layers to model the temporal dimension. Unlike prior diffusion models for motion generation, Make-An-Animation uses a U-Net architecture similar to recent text-to-video generation models. Human evaluation of motion realism and alignment with input text shows that our model reaches state-of-the-art performance on text-to-motion generation.

Submitted to arXiv on 16 May 2023

Results of the summarizing process for the arXiv paper: 2305.09662v1

Make-An-Animation is a text-conditioned human motion generation model that has been developed to improve the quality of generated motions in applications spanning animation and robotics. The recent application of diffusion models for motion generation has enabled significant improvements in the quality of generated motions. However, existing approaches are limited by their reliance on relatively small-scale motion capture data, leading to poor performance on more diverse, in-the-wild prompts. To address this limitation, Make-An-Animation has been trained on large-scale image-text datasets in two stages. In the first stage, the model is trained on a curated dataset of (text, static pseudo-pose) pairs extracted from image-text datasets. In the second stage, it is fine-tuned on motion capture data by adding additional layers to model the temporal dimension.

Summary: Make-An-Animation is a computer program that turns a written description into realistic human movement, which is useful for animation and robotics. It learns from large collections of pictures paired with text. Training has two parts: first, it learns body poses from pictures and their captions; then it learns how poses change over time by practicing with real motion capture data.

Definitions:
- Animation: a way of making pictures or objects appear to move
- Robotics: the study of robots and how they work
- Motion generation model: a computer program that creates movement
- Diffusion models: a type of generative model that learns to create data by gradually removing noise from a random starting point
- Motion capture data: information about how people or objects move, captured by special cameras or sensors

Make-An-Animation: A Text-Conditioned Human Motion Generation Model

The recent application of diffusion models to motion generation has enabled significant improvements in the quality of generated motions. However, existing approaches are limited by their reliance on relatively small-scale motion capture data, leading to poor performance on more diverse, in-the-wild prompts. Make-An-Animation, a text-conditioned human motion generation model, was developed to address this limitation in applications spanning animation and robotics.

Overview

Make-An-Animation is trained in two stages. In the first stage, it is trained on a curated large-scale dataset of (text, static pseudo-pose) pairs extracted from image-text datasets. In the second stage, it is fine-tuned on motion capture data, with additional layers added to model the temporal dimension. This is intended to improve performance when generating motions from more diverse, in-the-wild prompts.

First Stage Training

The first stage uses a curated dataset of (text, static pseudo-pose) pairs extracted from image-text datasets. From these pairs, Make-An-Animation learns to associate text descriptions with human poses. Because the poses are static, this stage does not yet model movement; its purpose is to expose the model to a wider range of poses and prompts than motion capture data alone, which is limited in size and scope.
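As an illustration only, a (text, static pseudo-pose) pair can be pictured as a caption plus a single frame of pose parameters. The sketch below is a minimal example under stated assumptions: the `PosePair` type, its `confidence` field, the `curate` threshold, and the pose layout are all hypothetical and are not taken from the paper, whose curation procedure is not described in the metadata.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PosePair:
    """Hypothetical container for one (text, static pseudo-pose) pair."""
    text: str
    pose: List[float]    # one frame of pose parameters (layout is assumed)
    confidence: float    # assumed quality score for the estimated pose


def curate(pairs: List[PosePair], min_confidence: float = 0.5) -> List[PosePair]:
    """Keep pairs with a non-empty caption and a sufficiently confident pose."""
    return [p for p in pairs if p.text.strip() and p.confidence >= min_confidence]


def as_single_frame_motion(pair: PosePair) -> List[List[float]]:
    """View a static pose as a motion clip of length one."""
    return [list(pair.pose)]


raw = [
    PosePair("a person doing yoga", [0.1, 0.2, 0.3], 0.9),
    PosePair("", [0.0, 0.0, 0.0], 0.8),            # dropped: empty caption
    PosePair("a man jumping", [0.4, 0.5, 0.6], 0.2),  # dropped: low confidence
]
kept = curate(raw)
```

Treating a static pose as a one-frame clip is one way to see why the same model can later be extended to sequences, though whether the authors frame it this way is an assumption.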

Second Stage Training

After the first stage, which uses only static poses, Make-An-Animation moves to a second stage in which additional layers are added to model the temporal dimension and the model is fine-tuned on motion capture data. These layers let the model produce sequences of poses rather than single frames. The aim is to retain the pose diversity learned in the first stage, so that the model handles more diverse prompts than approaches trained only on small-scale motion capture data.
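The idea of adding temporal layers on top of a per-frame model can be sketched in a few lines. The example below is a toy illustration, not the authors' architecture: `spatial_layer` stands in for a pretrained stage-one layer, and `temporal_layer` is a residual 1D convolution over the time axis. Initializing the temporal kernel to zero, so that the extended model initially reproduces the per-frame output, is an assumption borrowed from common practice when extending image models to video; the source only states that layers are added to model the temporal dimension.

```python
from typing import List

Frame = List[float]


def spatial_layer(frame: Frame) -> Frame:
    """Stand-in for a pretrained per-frame (stage-one) layer: a fixed scaling."""
    return [2.0 * x for x in frame]


def temporal_layer(frames: List[Frame], kernel: List[float]) -> List[Frame]:
    """Residual 1D convolution over time, applied independently per feature.

    The kernel has odd length; frames outside the clip are treated as zeros.
    """
    half = len(kernel) // 2
    out = []
    for t in range(len(frames)):
        mixed = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k, w in enumerate(kernel):
                s = t + k - half
                if 0 <= s < len(frames):
                    acc += w * frames[s][d]
            mixed.append(frames[t][d] + acc)  # residual connection
        out.append(mixed)
    return out


def model(frames: List[Frame], kernel: List[float]) -> List[Frame]:
    """Apply the per-frame layer, then mix information across time."""
    per_frame = [spatial_layer(f) for f in frames]
    return temporal_layer(per_frame, kernel)


clip = [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
zero_init = model(clip, [0.0, 0.0, 0.0])    # temporal layer is a no-op
mixed_clip = model(clip, [0.5, 0.0, 0.5])   # neighbours now influence each frame
```

With a zero kernel the output equals the per-frame result, so fine-tuning would start from the stage-one behaviour; a non-zero kernel lets each frame depend on its neighbours in time.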

Conclusion

Make-An-Animation targets animation and robotics applications that need high-quality human motion generated from text. By combining large-scale image-text datasets with motion capture data, it addresses the limited diversity of training on motion capture data alone. According to the abstract, human evaluation of motion realism and alignment with the input text shows that the model reaches state-of-the-art performance on text-to-motion generation.

Created on 17 May 2023
