Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam

AI-generated keywords: ChatGPT-4 Vision, ENADE, visual reasoning, multimodal AI models, educational assessments

AI-generated Key Points

  • The study evaluated ChatGPT-4 Vision on the Bachelor in Computer Science section of Brazil's 2021 National Undergraduate Exam (ENADE)
  • ChatGPT-4 Vision significantly outperformed the average exam participant, placing within the top 10 best score percentile
  • The model encountered challenges with question interpretation, logical reasoning, and visual acuity
  • Generative AI tools (ChatGPT-4, GitHub Copilot, and Gemini Advanced) were used for manuscript revision and data analysis code generation
  • Prior evaluations of Visual Language Models (VLMs) highlighted challenges in visual deductive reasoning tasks
  • Insights from those previous studies can improve understanding of ChatGPT-4 Vision's capabilities in multimodal reasoning tasks that go beyond purely visual problems
  • The study suggests developing a MathVerse-like benchmark based on ENADE questions to further evaluate the model's performance in real-world educational assessments
  • Integrating visual capabilities into Large Language Models (LLMs) has the potential to play a pivotal role in science and technology education
  • Human oversight remains crucial for verifying the accuracy of models like ChatGPT-4 Vision and ensuring fairness in high-stakes exams

Authors: Nabor C. Mendonça

ACM Transactions on Computing Education, June 2024
Accepted for publication
License: CC BY 4.0

Abstract: The recent integration of visual capabilities into Large Language Models (LLMs) has the potential to play a pivotal role in science and technology education, where visual elements such as diagrams, charts, and tables are commonly used to improve the learning experience. This study investigates the performance of ChatGPT-4 Vision, OpenAI's most advanced visual model at the time the study was conducted, on the Bachelor in Computer Science section of Brazil's 2021 National Undergraduate Exam (ENADE). By presenting the model with the exam's open and multiple-choice questions in their original image format and allowing for reassessment in response to differing answer keys, we were able to evaluate the model's reasoning and self-reflecting capabilities in a large-scale academic assessment involving textual and visual content. ChatGPT-4 Vision significantly outperformed the average exam participant, positioning itself within the top 10 best score percentile. While it excelled in questions that incorporated visual elements, it also encountered challenges with question interpretation, logical reasoning, and visual acuity. The involvement of an independent expert panel to review cases of disagreement between the model and the answer key revealed some poorly constructed questions containing vague or ambiguous statements, calling attention to the critical need for improved question design in future exams. Our findings suggest that while ChatGPT-4 Vision shows promise in multimodal academic evaluations, human oversight remains crucial for verifying the model's accuracy and ensuring the fairness of high-stakes educational exams. The paper's research materials are publicly available at https://github.com/nabormendonca/gpt-4v-enade-cs-2021.

Submitted to arXiv on 14 Jun. 2024

Results of the summarizing process for the arXiv paper: 2406.09671v1

In this study, we evaluated the performance of ChatGPT-4 Vision, OpenAI's most advanced visual model at the time the study was conducted, on the Bachelor in Computer Science section of Brazil's 2021 National Undergraduate Exam (ENADE). By presenting the model with the exam's open and multiple-choice questions in their original image format and allowing it to reassess its answers when they differed from the answer key, we assessed its reasoning and self-reflective capabilities in a large-scale academic assessment involving textual and visual content. Our findings showed that ChatGPT-4 Vision significantly outperformed the average exam participant, placing within the top 10 best score percentile. However, it faced challenges in question interpretation, logical reasoning, and visual acuity.

This study contributes to existing research by exploring ChatGPT-4 Vision's self-reflective abilities on high-stakes academic exams. We utilized generative AI tools like ChatGPT-4, GitHub Copilot, and Gemini Advanced for manuscript revision, table creation, figure formatting, and data analysis code generation. The evaluation of Visual Language Models (VLMs) by Zhang et al. highlighted challenges in visual deductive reasoning tasks and emphasized the need for improved spatial reasoning techniques proposed by Mayne and Wu. While our work differs from previous studies focusing on visual deductive reasoning tasks, insights from these studies can enhance our understanding of ChatGPT-4 Vision's capabilities in multimodal reasoning tasks that go beyond purely visual problems.

To further evaluate the model's performance in real-world educational assessments, we suggest developing a MathVerse-like benchmark based on ENADE questions. The integration of visual capabilities into Large Language Models (LLMs) has the potential to play a pivotal role in science and technology education. However, human oversight remains crucial to verify the accuracy of models like ChatGPT-4 Vision and ensure fairness in high-stakes exams. Our findings also underscore the importance of improved question design, since the expert review of disagreements between the model and the answer key revealed vague or ambiguous questions, and they highlight areas where further research is needed to maximize the potential of multimodal AI models in educational assessments.
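
The assess-then-reassess protocol described above can be illustrated with a minimal Python sketch. This is not the study's actual code (the research materials are in the repository linked in the abstract). The names `Outcome`, `assess`, and `stub_model`, the file names, and the three status labels are all hypothetical, and the model call is replaced by a canned stub so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    question_id: str
    first_answer: str
    final_answer: str
    status: str  # "agree", "revised", or "needs_expert_review" (hypothetical labels)


# Hypothetical interface: (question image path, optional answer-key hint) -> chosen option.
AskModel = Callable[[str, Optional[str]], str]


def assess(question_id: str, image_path: str, key: str, ask_model: AskModel) -> Outcome:
    """Ask once; on disagreement with the key, allow one reassessment."""
    first = ask_model(image_path, None)
    if first == key:
        return Outcome(question_id, first, first, "agree")
    # The model is shown the differing answer key and may revise or keep its answer.
    second = ask_model(image_path, key)
    if second == key:
        return Outcome(question_id, first, second, "revised")
    # Persistent disagreement is escalated to human experts.
    return Outcome(question_id, first, second, "needs_expert_review")


def stub_model(image_path: str, key_hint: Optional[str]) -> str:
    """Stand-in for a real vision model call (assumption: canned answers only)."""
    canned = {"q1.png": "B", "q2.png": "C", "q3.png": "A"}
    if image_path == "q2.png" and key_hint is not None:
        return key_hint  # simulates the model accepting the answer key
    return canned[image_path]  # simulates the model keeping its original answer


results = [
    assess("Q1", "q1.png", "B", stub_model),
    assess("Q2", "q2.png", "D", stub_model),
    assess("Q3", "q3.png", "E", stub_model),
]
for r in results:
    print(r.question_id, r.first_answer, r.final_answer, r.status)
```

In this sketch, only the questions that end in `needs_expert_review` would be passed to a human panel, mirroring the role the abstract gives to the independent expert review.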
Created on 17 Jun. 2024
