A Survey on Vision-Language-Action Models for Embodied AI

AI-generated keywords: Embodied AI, Vision-Language-Action Models, Robotics, Survey, Research

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Embodied AI is a crucial component of artificial general intelligence focusing on controlling physical agents in the real world
  • Significant advancements have been made with vision-language-action models (VLAs) in recent years
  • VLAs generate actions based on language inputs to address language-conditioned robotic tasks
  • The survey by Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King categorizes VLAs into three main research lines: individual components, control policies for low-level action prediction, and high-level task planners
  • The high-level task planners decompose long-horizon tasks into sequences of subtasks, guiding VLAs to follow more general user instructions
  • The survey highlights various VLAs developed recently and emphasizes the importance of systematic analysis for this rapidly evolving field
  • The survey summarizes relevant resources for embodied AI research with VLAs, including datasets, simulators, and benchmarks
  • Challenges faced by VLAs in embodied AI are discussed along with promising future research directions
  • A project associated with the survey is available at https://github.com/yueen-ma/Awesome-VLA

Authors: Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, Irwin King

Project page: https://github.com/yueen-ma/Awesome-VLA

Abstract: Embodied AI is widely recognized as a key element of artificial general intelligence because it involves controlling embodied agents to perform tasks in the physical world. Building on the success of large language models and vision-language models, a new category of multimodal models -- referred to as vision-language-action models (VLAs) -- has emerged to address language-conditioned robotic tasks in embodied AI by leveraging their distinct ability to generate actions. In recent years, a myriad of VLAs have been developed, making it imperative to capture the rapidly evolving landscape through a comprehensive survey. To this end, we present the first survey on VLAs for embodied AI. This work provides a detailed taxonomy of VLAs, organized into three major lines of research. The first line focuses on individual components of VLAs. The second line is dedicated to developing control policies adept at predicting low-level actions. The third line comprises high-level task planners capable of decomposing long-horizon tasks into a sequence of subtasks, thereby guiding VLAs to follow more general user instructions. Furthermore, we provide an extensive summary of relevant resources, including datasets, simulators, and benchmarks. Finally, we discuss the challenges faced by VLAs and outline promising future directions in embodied AI. We have created a project associated with this survey, which is available at https://github.com/yueen-ma/Awesome-VLA.

Submitted to arXiv on 23 May 2024

Results of the summarizing process for the arXiv paper: 2405.14093v4

Embodied AI is a crucial component of artificial general intelligence that focuses on controlling physical agents to carry out tasks in the real world. In recent years, the field has advanced significantly with the emergence of vision-language-action models (VLAs), multimodal models that generate actions from language inputs to address language-conditioned robotic tasks. The survey by Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King, which the authors present as the first survey on VLAs for embodied AI, provides a detailed taxonomy organized into three main research lines: individual components of VLAs, control policies for low-level action prediction, and high-level task planners that decompose long-horizon tasks into sequences of subtasks. The task planners are what guide VLAs to follow more general user instructions. The survey also covers the many VLAs developed in recent years and argues that a comprehensive survey is needed to capture this rapidly evolving landscape. It summarizes relevant resources, including datasets, simulators, and benchmarks, discusses the challenges faced by VLAs, and outlines promising future directions in embodied AI. A project associated with the survey is available at https://github.com/yueen-ma/Awesome-VLA.
Created on 18 Mar. 2025

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.