CAMEL-Bench: A Comprehensive Arabic LMM Benchmark

AI-generated keywords: Multimodal Models

AI-generated Key Points

  • Growing interest in developing large multimodal models (LMMs) for visual reasoning and understanding tasks
  • Introduction of benchmarks for evaluating LMMs, primarily focused on English-centric evaluations
  • Need for comprehensive evaluation benchmarks tailored to the Arabic language due to over 400 million Arabic speakers worldwide
  • Development of CAMEL-Bench specifically for evaluating Arabic LMMs across eight diverse domains
  • CAMEL-Bench includes 38 sub-domains with over 29,000 questions whose quality was manually verified by native Arabic speakers
  • Data filtering and verification process ensures QA text is originally in Arabic or accurately translated from English
  • Evaluation framework includes specialized metrics to assess the performance of both closed-source models, such as the GPT-4 series, and open-source LMMs
  • Analysis shows a need for substantial improvement, especially among the best open-source models; even the closed-source GPT-4o achieves an overall score of 62%
  • The open-sourced benchmark and evaluation scripts aim to facilitate advancements in Arabic language modeling and promote inclusivity in multimodal model evaluations across different languages

Authors: Sara Ghaboura, Ahmed Heakl, Omkar Thawakar, Ali Alharthi, Ines Riahi, Abduljalil Saif, Jorma Laaksonen, Fahad S. Khan, Salman Khan, Rao M. Anwer

10 pages, 5 figures, NAACL
License: CC BY-SA 4.0

Abstract: Recent years have witnessed a significant interest in developing large multimodal models (LMMs) capable of performing various visual reasoning and understanding tasks. This has led to the introduction of multiple LMM benchmarks to evaluate LMMs on different tasks. However, most existing LMM evaluation benchmarks are predominantly English-centric. In this work, we develop a comprehensive LMM evaluation benchmark for the Arabic language to represent a large population of over 400 million speakers. The proposed benchmark, named CAMEL-Bench, comprises eight diverse domains and 38 sub-domains including, multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding to evaluate broad scenario generalizability. Our CAMEL-Bench comprises around 29,036 questions that are filtered from a larger pool of samples, where the quality is manually verified by native speakers to ensure reliable model assessment. We conduct evaluations of both closed-source, including GPT-4 series, and open-source LMMs. Our analysis reveals the need for substantial improvement, especially among the best open-source models, with even the closed-source GPT-4o achieving an overall score of 62%. Our benchmark and evaluation scripts are open-sourced.

Submitted to arXiv on 24 Oct. 2024

Results of the summarizing process for the arXiv paper: 2410.18976v1

In recent years, there has been growing interest in developing large multimodal models (LMMs) capable of performing various visual reasoning and understanding tasks. Multiple benchmarks have been introduced to evaluate these models, but most are English-centric. With over 400 million Arabic speakers worldwide, there is a need for comprehensive evaluation benchmarks tailored to the Arabic language. In response, a new benchmark called CAMEL-Bench has been developed specifically for evaluating Arabic LMMs. It covers eight diverse domains: multimodal understanding and reasoning, OCR and document understanding, charts and diagrams, videos, cultural-specific content, medical images, agricultural images, and remote sensing understanding. Within these domains, CAMEL-Bench encompasses 38 sub-domains. The data filtering and verification process checks that the QA text is either originally in Arabic or accurately translated from English, and subsets of the data undergo manual verification by native speakers to maintain quality standards. This process yields 29,036 high-quality questions for evaluation. The evaluation framework includes specialized metrics and is applied to both closed-source models, such as the GPT-4 series, and open-source LMMs. The analysis reveals the need for substantial improvement, especially among the best open-source models; even the closed-source GPT-4o achieves an overall score of 62%. By open-sourcing the benchmark and evaluation scripts, the authors aim to facilitate advancements in Arabic language modeling and promote inclusivity in multimodal model evaluations across different languages.
Created on 29 Jan. 2025
