MMToM-QA: Multimodal Theory of Mind Question Answering

AI-generated keywords: Artificial Intelligence, Theory of Mind, Multimodal Data, BIP-ALM, MMToM-QA

AI-generated Key Points

  • Theory of Mind (ToM), the ability to understand people's mental states, plays a crucial role in developing machines with human-level social intelligence
  • Recent advancements in machine learning, particularly large language models, show promise in understanding aspects of Theory of Mind (ToM)
  • Existing ToM benchmarks primarily rely on unimodal datasets, which do not fully capture the complexity of human ToM reasoning
  • A new benchmark called MMToM-QA has been introduced to evaluate machine ToM on both multimodal and unimodal data related to a person's activities in a household environment
  • A novel method known as BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models) has been proposed to enhance multimodal ToM capacity by extracting unified representations from multimodal data and leveraging language models for scalable Bayesian inverse planning
  • BIP-ALM showed promising results by effectively combining model-based mental inference with language models, outperforming state-of-the-art models like GPT-4
  • The evaluation protocol involved testing the models under three conditions (Multimodal QA, Text QA, and Video QA) in a zero-shot evaluation setting
  • BIP-ALM utilizes Bayesian inverse planning to infer a person's mental state based on video and text inputs, extending traditional Bayesian inverse planning methods
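
The idea of Bayesian inverse planning in the last key point can be illustrated with a minimal sketch. This is a toy example and not the authors' implementation: the 1-D corridor, the function names, the `beta` rationality parameter, and the hand-written Boltzmann policy are all assumptions made for illustration. In BIP-ALM, according to the summary, a language model plays the role that the hand-written action likelihood plays here.

```python
import math

def action_likelihood(state, action, goal, beta=2.0):
    """P(action | state, goal) under a Boltzmann-rational policy (assumed).

    The agent lives on a 1-D corridor and can step left (-1) or right (+1).
    An action's utility is the negative distance to the goal after the step.
    """
    actions = (-1, +1)
    utils = {a: -abs((state + a) - goal) for a in actions}
    norm = sum(math.exp(beta * u) for u in utils.values())
    return math.exp(beta * utils[action]) / norm

def infer_goal(start, actions, goals, prior=None, beta=2.0):
    """Posterior over candidate goals given an observed action sequence.

    Bayes' rule: P(goal | actions) is proportional to
    P(goal) * product over t of P(action_t | state_t, goal).
    """
    prior = prior or {g: 1.0 / len(goals) for g in goals}
    log_post = {}
    for g in goals:
        state = start
        lp = math.log(prior[g])
        for a in actions:
            lp += math.log(action_likelihood(state, a, g, beta))
            state += a  # deterministic transition in this toy world
        log_post[g] = lp
    # Normalize in log space for numerical stability.
    top = max(log_post.values())
    unnorm = {g: math.exp(lp - top) for g, lp in log_post.items()}
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}

# An agent starting at position 5 steps right three times.
posterior = infer_goal(start=5, actions=[+1, +1, +1], goals=[0, 10])
```

Here the posterior places most of its mass on the goal at position 10, because stepping right repeatedly is far more likely under that goal than under the goal at position 0.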

Authors: Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen-Ling Kuo, Zhiting Hu, Tomer Ullman, Antonio Torralba, Joshua B. Tenenbaum, Tianmin Shu

ACL 2024. 26 pages, 11 figures, 7 tables
License: CC BY 4.0

Abstract: Theory of Mind (ToM), the ability to understand people's mental states, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets - either video or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person's mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and utilizes language models for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that large language models and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results, by leveraging the power of both model-based mental inference and language models.

Submitted to arXiv on 16 Jan. 2024

Results of the summarizing process for the arXiv paper: 2401.08743v2

In the realm of artificial intelligence, Theory of Mind (ToM), the ability to understand people's mental states, plays a crucial role in developing machines with human-level social intelligence. Recent advancements in machine learning, particularly through large language models, have shown some promise in understanding aspects of ToM. However, existing ToM benchmarks primarily rely on unimodal datasets, such as video or text, which do not fully capture the complexity of human ToM reasoning.

To address this limitation, a new benchmark called MMToM-QA has been introduced. This benchmark evaluates machine ToM on both multimodal and unimodal data related to a person's activities in a household environment. To enhance multimodal ToM capacity, a novel method known as BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models) has been proposed. BIP-ALM extracts unified representations from multimodal data and leverages language models for scalable Bayesian inverse planning.

A systematic comparison was conducted between human performance, BIP-ALM, and state-of-the-art models like GPT-4. The results revealed that while large language models and large multimodal models still lack robust ToM capacity, BIP-ALM showed promising results by effectively combining model-based mental inference with language models.

The evaluation protocol involved testing the models under three conditions: Multimodal QA with both video and text inputs present, Text QA with only text input, and Video QA with only video input. In the zero-shot evaluation setting, models had to rely on knowledge generalized from their training data, with no task-specific examples provided.

The BIP-ALM model utilizes Bayesian inverse planning to infer a person's mental state based on video and text inputs. By building unified representations from multimodal inputs and fine-tuning language models for efficient inverse symbolic planning, BIP-ALM extends the capabilities of traditional Bayesian inverse planning methods.

Overall, the introduction of MMToM-QA, together with the BIP-ALM method, represents a significant contribution towards advancing machine understanding of Theory of Mind in complex real-world scenarios.
Created on 09 Nov. 2024
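
The three evaluation conditions described in the summary differ only in which inputs the model is allowed to see. A minimal sketch of that input selection follows; the `Example` fields, the condition keys, and `build_inputs` are hypothetical names chosen for illustration and are not taken from the benchmark's code.

```python
from dataclasses import dataclass
from typing import Optional, Dict

@dataclass
class Example:
    video: str     # hypothetical: identifier or path of the video clip
    text: str      # hypothetical: textual description of the scene and actions
    question: str  # the ToM question posed to the model

# Each condition maps to (use_video, use_text), following the summary:
# Multimodal QA sees both, Text QA sees only text, Video QA sees only video.
CONDITIONS = {
    "multimodal_qa": (True, True),
    "text_qa": (False, True),
    "video_qa": (True, False),
}

def build_inputs(example: Example, condition: str) -> Dict[str, Optional[str]]:
    """Return the inputs a model would receive under the given condition."""
    use_video, use_text = CONDITIONS[condition]
    return {
        "video": example.video if use_video else None,
        "text": example.text if use_text else None,
        "question": example.question,
    }

example = Example(video="clip_001", text="A person walks to the kitchen.",
                  question="Where does the person believe the item is?")
text_only = build_inputs(example, "text_qa")
```

In this sketch the question is always passed through, and a withheld modality is represented as `None` so that the same model interface can be used across all three conditions.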

