Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

AI-generated keywords: Large Audio-Language Models, Auditory Processing, Multimodal Reasoning, Taxonomy, Evaluation

AI-generated Key Points

  • Large audio-language models (LALMs) extend traditional large language models (LLMs) with auditory processing.
  • LALMs are expected to handle a wide range of auditory tasks, such as speech recognition, audio generation, and multimodal reasoning.
  • The survey proposes a taxonomy for evaluating LALMs across four dimensions: General Auditory Awareness and Processing, Knowledge and Reasoning, Dialogue-oriented Ability, and Fairness, Safety, and Trustworthiness.
  • The taxonomy aims to provide clear guidelines for researchers and practitioners in the field to evaluate LALMs systematically.
  • Categorizing evaluations based on objectives offers a holistic view of LALM capabilities and identifies areas for improvement and future research directions.

Authors: Chih-Kai Yang, Neo S. Ho, Hung-yi Lee

Project Website: https://github.com/b08202033/LALM-Evaluation-Survey
License: CC BY 4.0

Abstract: With advancements in large audio-language models (LALMs), which enhance large language models (LLMs) with auditory capabilities, these models are expected to demonstrate universal proficiency across various auditory tasks. While numerous benchmarks have emerged to assess LALMs' performance, they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide detailed overviews within each category and highlight challenges in this field, offering insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on the evaluations of LALMs, providing clear guidelines for the community. We will release the collection of the surveyed papers and actively maintain it to support ongoing advancements in the field.

Submitted to arXiv on 21 May 2025

Results of the summarizing process for the arXiv paper: 2505.15957v1

Large audio-language models (LALMs) extend traditional large language models (LLMs) with auditory processing and are expected to handle a wide range of auditory tasks, including speech recognition, audio generation, and multimodal reasoning. Many benchmarks have emerged to evaluate LALM performance, but they remain fragmented and lack a structured taxonomy. To address this gap, the survey proposes a systematic taxonomy that categorizes LALM evaluations by objective into four dimensions: General Auditory Awareness and Processing, Knowledge and Reasoning, Dialogue-oriented Ability, and Fairness, Safety, and Trustworthiness. For each dimension, the survey gives a detailed overview and discusses the associated challenges. According to the authors, this is the first survey focused specifically on LALM evaluation, and it aims to provide clear guidelines for researchers and practitioners. Categorizing evaluations by objective gives a holistic view of LALM capabilities and highlights areas for improvement and future research directions. The authors plan to release the collection of surveyed papers and actively maintain it to support ongoing work in the field.
Created on 17 Jul. 2025
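
As an illustration only, the four-dimension taxonomy described above could be represented as a small data structure for organizing evaluation results. The sketch below is not taken from the paper: the `Dimension` enum, the `group_by_dimension` helper, and the benchmark names are all hypothetical placeholders, and only the four dimension labels come from the survey.

```python
from collections import defaultdict
from enum import Enum


class Dimension(Enum):
    """The four evaluation dimensions named in the survey's taxonomy."""
    AUDITORY = "General Auditory Awareness and Processing"
    KNOWLEDGE = "Knowledge and Reasoning"
    DIALOGUE = "Dialogue-oriented Ability"
    TRUST = "Fairness, Safety, and Trustworthiness"


def group_by_dimension(benchmarks):
    """Group (name, Dimension) pairs into a dict keyed by dimension.

    Every dimension appears in the result, so dimensions with no
    benchmark show up as empty lists (i.e. coverage gaps).
    """
    grouped = defaultdict(list)
    for name, dimension in benchmarks:
        grouped[dimension].append(name)
    return {dimension: grouped.get(dimension, []) for dimension in Dimension}


# Hypothetical placeholder names, not benchmarks from the surveyed papers.
example_benchmarks = [
    ("benchmark_a", Dimension.AUDITORY),
    ("benchmark_b", Dimension.KNOWLEDGE),
    ("benchmark_c", Dimension.AUDITORY),
]

for dimension, names in group_by_dimension(example_benchmarks).items():
    print(f"{dimension.value}: {names}")
```

Listing every dimension, even an empty one, is a design choice made here so that unevaluated areas stay visible, in the spirit of the holistic view the survey argues for.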