GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models

Résumés déjà disponibles dans d'autres langues : en

Auteurs : Haicheng Liao, Huanming Shen, Zhenning Li, Chengyue Wang, Guofa Li, Yiming Bie, Chengzhong Xu

Résumé : In the field of autonomous vehicles (AVs), accurately discerning commander intent and executing linguistic commands within a visual context presents a significant challenge. This paper introduces a sophisticated encoder-decoder framework, developed to address visual grounding in AVs.Our Context-Aware Visual Grounding (CAVG) model is an advanced system that integrates five core encoders-Text, Image, Context, and Cross-Modal-with a Multimodal decoder. This integration enables the CAVG model to adeptly capture contextual semantics and to learn human emotional features, augmented by state-of-the-art Large Language Models (LLMs) including GPT-4. The architecture of CAVG is reinforced by the implementation of multi-head cross-modal attention mechanisms and a Region-Specific Dynamic (RSD) layer for attention modulation. This architectural design enables the model to efficiently process and interpret a range of cross-modal inputs, yielding a comprehensive understanding of the correlation between verbal commands and corresponding visual scenes. Empirical evaluations on the Talk2Car dataset, a real-world benchmark, demonstrate that CAVG establishes new standards in prediction accuracy and operational efficiency. Notably, the model exhibits exceptional performance even with limited training data, ranging from 50% to 75% of the full dataset. This feature highlights its effectiveness and potential for deployment in practical AV applications. Moreover, CAVG has shown remarkable robustness and adaptability in challenging scenarios, including long-text command interpretation, low-light conditions, ambiguous command contexts, inclement weather conditions, and densely populated urban environments. The code for the proposed model is available at our Github.

Soumis à arXiv le 06 Déc. 2023

Posez des questions sur cet article à notre assistant IA

Vous pouvez aussi discutez avec plusieurs papiers à la fois ici.

Instructions pour utiliser l'assistant IA ?

Résultats du processus de synthèse de l'article arXiv : 2312.03543v1

Résumé Complet
Points clés
Résumé vulgarisé
Article de blog

Le résumé n'est pas encore prêt

Les points clés ne sont pas encore prêts

Le résumé vulgarisé n'est pas encore prêt

L'article de blog n'est pas encore prêt

Créé le 16 Mai. 2025

Disponible dans d'autres langues : en

Évaluez la qualité du contenu généré par l'IA en votant

Note : 0

Certains éléments de l'article ne sont pas encore résumés, vous pouvez relancer le processus de synthèse en cliquant sur le bouton Exécuter ci-dessous.

GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models

Posez des questions sur cet article à notre assistant IA

Résultats du processus de synthèse de l'article arXiv : 2312.03543v1

Articles similaires résumés avec nos outils d'IA