GPT vs Human for Scientific Reviews: A Dual Source Review on Applications of ChatGPT in Science

AI-generated keywords: Large Language Models, Scientific Reviews, GPT, Cross-Disciplinary Connections, Ethical Concerns

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Large Language Models (LLMs) could greatly speed up scientific reviews, possibly by using more unbiased quantitative metrics, fostering cross-disciplinary connections, and identifying emerging trends and research gaps from large volumes of data.
  • Current LLMs lack a deep understanding of complex methodologies, have difficulty evaluating innovative claims, and cannot assess ethical issues or conflicts of interest.
  • The study focused on 13 GPT-related papers from various scientific domains reviewed by both a human reviewer and SciSpace, a large language model.
  • Findings revealed that 50% of SciSpace's responses to objective questions aligned with those of a human reviewer.
  • On objective questions, GPT-4 (the informed evaluator) often rated the human reviewer higher in accuracy and SciSpace higher in structure, clarity, and completeness.
  • On subjective questions, the uninformed evaluators (GPT-3.5 and the crowd panel) showed varying preferences between SciSpace and human responses, with the crowd panel preferring the human responses.
  • On subjective questions, GPT-4 rated SciSpace and human responses equally in accuracy and structure but favored SciSpace for completeness.
  • The study highlights strengths such as structuring information comprehensively but also points out limitations in areas requiring deep understanding of methodologies and ethical considerations.
  • Further research is needed to enhance the capabilities of large language models for more effective scientific review processes across disciplines.

Authors: Chenxi Wu, Alan John Varghese, Vivek Oommen, George Em Karniadakis

License: CC BY-NC-ND 4.0

Abstract: The new polymath Large Language Models (LLMs) can speed-up greatly scientific reviews, possibly using more unbiased quantitative metrics, facilitating cross-disciplinary connections, and identifying emerging trends and research gaps by analyzing large volumes of data. However, at the present time, they lack the required deep understanding of complex methodologies, they have difficulty in evaluating innovative claims, and they are unable to assess ethical issues and conflicts of interest. Herein, we consider 13 GPT-related papers across different scientific domains, reviewed by a human reviewer and SciSpace, a large language model, with the reviews evaluated by three distinct types of evaluators, namely GPT-3.5, a crowd panel, and GPT-4. We found that 50% of SciSpace's responses to objective questions align with those of a human reviewer, with GPT-4 (informed evaluator) often rating the human reviewer higher in accuracy, and SciSpace higher in structure, clarity, and completeness. In subjective questions, the uninformed evaluators (GPT-3.5 and crowd panel) showed varying preferences between SciSpace and human responses, with the crowd panel showing a preference for the human responses. However, GPT-4 rated them equally in accuracy and structure but favored SciSpace for completeness.

Submitted to arXiv on 05 Dec. 2023

Results of the summarizing process for the arXiv paper: 2312.03769v1

In their paper titled "GPT vs Human for Scientific Reviews: A Dual Source Review on Applications of ChatGPT in Science," authors Chenxi Wu, Alan John Varghese, Vivek Oommen, and George Em Karniadakis explore the potential of Large Language Models (LLMs) to speed up scientific reviews. By analyzing large volumes of data, LLMs could apply more unbiased quantitative metrics, foster cross-disciplinary connections, and identify emerging trends and research gaps. However, the authors note that current LLMs lack a deep understanding of complex methodologies, have difficulty evaluating innovative claims, and cannot assess ethical issues or conflicts of interest.

The study considers 13 GPT-related papers from different scientific domains, each reviewed by both a human reviewer and SciSpace, a large language model. The reviews were then evaluated by three types of evaluators: GPT-3.5 and a crowd panel (uninformed evaluators), and GPT-4 (informed evaluator). The findings show that 50% of SciSpace's responses to objective questions align with those of the human reviewer. On these questions, GPT-4 often rated the human reviewer higher in accuracy and SciSpace higher in structure, clarity, and completeness. On subjective questions, the uninformed evaluators showed varying preferences between SciSpace and human responses, with the crowd panel preferring the human responses. GPT-4, however, rated the two equally in accuracy and structure but favored SciSpace for completeness.

The study sheds light on the strengths and limitations of using large language models like SciSpace in scientific reviews. They show promise in structuring information comprehensively, but there is still room for improvement in areas that require a deep understanding of methodologies and ethical considerations. Further research may help enhance the capabilities of these models for more effective scientific review processes across disciplines.
Created on 25 Mar. 2026

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.