Vision language models are blind

AI-generated keywords: Vision Language Models, Limitations, Visual Tasks, Benchmarks, BlindTest

AI-generated Key Points

  • Authors highlight limitations of large language models with vision capabilities (VLMs) like GPT-4o and Gemini 1.5 Pro
  • VLMs power many image-text applications and score high on vision-understanding benchmarks, yet struggle with basic visual tasks
  • Four state-of-the-art VLMs performed poorly on 7 visual tasks that are easy for humans
  • Vision capabilities of models likened to a person with myopia or an intelligent person making educated guesses while blind
  • Authors propose new benchmark called BlindTest for low-level visual tasks without prior knowledge or complex reasoning

Authors: Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen

License: CC BY 4.0

Abstract: Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, are powering countless image-text applications and scoring high on many vision-understanding benchmarks. Yet, we find that VLMs fail on 7 visual tasks absurdly easy to humans such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) counting the number of circles in an Olympic-like logo. The shockingly poor performance of four state-of-the-art VLMs suggests their vision is, at best, like that of a person with myopia seeing fine details as blurry, and at worst, like that of an intelligent person who is blind making educated guesses. Code is available at: https://vlmsareblind.github.io/
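
To make tasks (a) and (b) from the abstract concrete, here is a minimal sketch of how ground-truth answers for such images could be computed from the shapes' coordinates. It is an illustration only, not the authors' code: the function names, the coordinate conventions, and the choice to count touching shapes as overlapping or intersecting are all assumptions made for this example.

```python
import math


def circles_overlap(center1, radius1, center2, radius2):
    """Return True if the two circles (treated as filled disks) share a point.

    Assumption: circles that merely touch count as overlapping.
    """
    distance = math.hypot(center1[0] - center2[0], center1[1] - center2[1])
    return distance <= radius1 + radius2


def _orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1, or 0 if collinear."""
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (cross > 0) - (cross < 0)


def _on_segment(p, q, r):
    """Given collinear points p, q, r, check whether r lies on segment pq."""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
            and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))


def segments_intersect(a, b, c, d):
    """Return True if segment ab and segment cd share at least one point."""
    o1, o2 = _orientation(a, b, c), _orientation(a, b, d)
    o3, o4 = _orientation(c, d, a), _orientation(c, d, b)
    # General case: each segment's endpoints lie on opposite sides of the other.
    if o1 != o2 and o3 != o4:
        return True
    # Collinear special cases: an endpoint lies on the other segment.
    if o1 == 0 and _on_segment(a, b, c):
        return True
    if o2 == 0 and _on_segment(a, b, d):
        return True
    if o3 == 0 and _on_segment(c, d, a):
        return True
    if o4 == 0 and _on_segment(c, d, b):
        return True
    return False
```

Both checks take a few lines of arithmetic once the coordinates are known, which is the sense in which the abstract calls these tasks easy: the difficulty for a VLM lies in perceiving the shapes, not in reasoning about them.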

Submitted to arXiv on 09 Jul. 2024


Results of the summarizing process for the arXiv paper: 2407.06581v1

In their paper "Vision language models are blind," authors Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen highlight the limitations of large language models with vision capabilities (VLMs) such as GPT-4o and Gemini 1.5 Pro. These models power many image-text applications and score high on vision-understanding benchmarks, yet they struggle with basic visual tasks that are trivial for humans. The authors evaluated four state-of-the-art VLMs on these tasks and found their performance to be shockingly poor. They liken the models' vision, at best, to that of a person with myopia who sees fine details as blurry and, at worst, to that of an intelligent person who is blind and makes educated guesses. The code for the study is available at https://vlmsareblind.github.io/.

The authors also discuss existing benchmarks used to evaluate VLMs' vision understanding, which cover college-level topics, charts, documents, and videos. Although VLMs have progressed rapidly on these benchmarks, the benchmarks often rely on real-world data that requires extensive prior knowledge, and they can suffer from data leakage, where a VLM answers correctly without needing the input image.

To address these limitations, the authors propose a new benchmark called BlindTest, which focuses on low-level visual tasks that are extremely easy for humans and require neither prior knowledge nor complex reasoning. The benchmark aims to assess VLMs' vision more accurately by testing whether they perceive images as humans do rather than relying heavily on language processing.
Created on 26 Aug. 2024
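
The summary's point about answering without the image can be illustrated with a simple image-blind baseline: always guess the most frequent label and compare that accuracy with a model's. The sketch below is a hypothetical illustration, not part of the paper or its code; the function names and the toy labels are invented for this example and are not results from the study.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("labels must not be empty")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def majority_baseline_accuracy(labels):
    """Accuracy of a guesser that never sees the image and always answers
    with the most frequent label."""
    if not labels:
        raise ValueError("labels must not be empty")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Toy data, invented for illustration only (not results from the paper).
toy_labels = ["yes", "no", "yes", "yes"]
toy_predictions = ["yes", "yes", "yes", "no"]

model_score = accuracy(toy_predictions, toy_labels)
blind_score = majority_baseline_accuracy(toy_labels)
print(f"model: {model_score:.2f}, image-blind baseline: {blind_score:.2f}")
```

A model whose accuracy is close to, or below, such a baseline gives little evidence that it is using the image at all.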

