Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

AI-generated keywords: Large Language Models, Logical Reasoning, Natural Language Processing, Question-Answering Dataset, Non-Monotonic Logics

AI-generated Key Points

  • Recent advancements in large language models (LLMs) have shown impressive capabilities in natural language understanding tasks
  • Logical reasoning is crucial for question-answering systems and conversational agents
  • LogicBench is a question-answering dataset in which each instance tests a single inference rule; it covers 25 distinct reasoning patterns across propositional logic, first-order logic, and non-monotonic logics
  • This work explores non-monotonic reasoning within NLP, and the dataset also covers classical inference rules such as hypothetical syllogism and disjunctive syllogism
  • Existing LLMs struggle with complex reasoning instances and negations on LogicBench
  • The availability of data and code on GitHub facilitates reproducibility and encourages collaboration towards advancing logical reasoning abilities in LLMs

Authors: Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, Chitta Baral

29 Pages
License: CC BY 4.0

Abstract: Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But, can they really "reason" over the natural language? This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied. However, the crucial skill pertaining to 'logical reasoning' has remained underexplored. Existing work investigating this reasoning ability of LLMs has focused only on a couple of inference rules (such as modus ponens and modus tollens) of propositional and first-order logic. Addressing the above limitation, we comprehensively evaluate the logical reasoning ability of LLMs on 25 different reasoning patterns spanning over propositional, first-order, and non-monotonic logics. To enable systematic evaluation, we introduce LogicBench, a natural language question-answering dataset focusing on the use of a single inference rule. We conduct detailed analysis with a range of LLMs such as GPT-4, ChatGPT, Gemini, Llama-2, and Mistral using chain-of-thought prompting. Experimental results show that existing LLMs do not fare well on LogicBench; especially, they struggle with instances involving complex reasoning and negations. Furthermore, they sometimes overlook contextual information necessary for reasoning to arrive at the correct conclusion. We believe that our work and findings facilitate future research for evaluating and enhancing the logical reasoning ability of LLMs. Data and code are available at https://github.com/Mihir3009/LogicBench.
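To make the notion of an inference rule concrete, the rules named above (modus ponens, modus tollens) and in the key points (hypothetical syllogism, disjunctive syllogism) can be checked for validity by brute-force truth tables. The following is a minimal sketch and not part of LogicBench; the helper names `implies`, `is_valid`, and `RULES` are hypothetical and introduced only for illustration.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material implication: a -> b."""
    return (not a) or b


def is_valid(premises, conclusion, num_vars: int) -> bool:
    """A rule is valid if the conclusion holds in every truth
    assignment that satisfies all of the premises."""
    for values in product([False, True], repeat=num_vars):
        if all(p(*values) for p in premises) and not conclusion(*values):
            return False  # counterexample found
    return True


# Each entry: (premises, conclusion, number of propositional variables).
# Illustrative encoding only; this is an assumption, not the dataset's format.
RULES = {
    "modus ponens": (
        [lambda p, q: implies(p, q), lambda p, q: p],
        lambda p, q: q,
        2,
    ),
    "modus tollens": (
        [lambda p, q: implies(p, q), lambda p, q: not q],
        lambda p, q: not p,
        2,
    ),
    "hypothetical syllogism": (
        [lambda p, q, r: implies(p, q), lambda p, q, r: implies(q, r)],
        lambda p, q, r: implies(p, r),
        3,
    ),
    "disjunctive syllogism": (
        [lambda p, q: p or q, lambda p, q: not p],
        lambda p, q: q,
        2,
    ),
    # A well-known fallacy, included as a negative control.
    "affirming the consequent": (
        [lambda p, q: implies(p, q), lambda p, q: q],
        lambda p, q: p,
        2,
    ),
}

if __name__ == "__main__":
    for name, (premises, conclusion, n) in RULES.items():
        print(f"{name}: valid={is_valid(premises, conclusion, n)}")
```

The four named rules come out valid, while the negative control does not, since the assignment p = False, q = True satisfies its premises but not its conclusion.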

Submitted to arXiv on 23 Apr. 2024


Results of the summarizing process for the arXiv paper: 2404.15522v1

Introduction: Recent advancements in large language models (LLMs) have shown impressive capabilities in natural language understanding tasks. However, their ability to engage in logical reasoning remains a topic of ongoing research. Logical reasoning is crucial for question-answering systems and conversational agents. Existing datasets evaluating LLMs' logical reasoning abilities often focus on limited aspects or combine it with other forms of reasoning. To address this gap, we introduce LogicBench, a dataset designed specifically to evaluate LLMs' logical reasoning, in which each instance tests a single inference rule. It covers 25 distinct reasoning patterns across propositional logic, first-order logic, and non-monotonic logics. This is the first work exploring non-monotonic reasoning within NLP; the dataset also covers classical inference rules such as hypothetical syllogism and disjunctive syllogism. We also analyze the prevalence of these rules in pre-training data and their impact on model performance. The evaluation tasks include Binary Question-Answering (BQA) and Multiple-Choice Question-Answering (MCQA). Our results reveal that existing LLMs struggle with complex reasoning instances and negations on LogicBench. By identifying these limitations through systematic evaluation, our work offers a basis for future research on enhancing logical reasoning abilities in LLMs. The availability of data and code on GitHub facilitates reproducibility and encourages collaboration towards advancing logical reasoning abilities in LLMs.
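Non-monotonic reasoning, mentioned in the summary above, differs from classical logic in that adding a premise can retract an earlier conclusion. The sketch below illustrates this with the textbook "birds fly by default" example. It is an assumption-laden illustration, not taken from LogicBench; the function `flies` and the fact names are hypothetical.

```python
def flies(facts: set) -> bool:
    """Default rule: a bird is assumed to fly unless a known
    exception (here, being a penguin) is among the facts."""
    is_bird = "bird" in facts
    is_exception = "penguin" in facts
    return is_bird and not is_exception


# With only the fact "bird", the default conclusion is drawn.
facts = {"bird"}
before = flies(facts)

# Adding a premise withdraws the conclusion. In classical (monotonic)
# logic, adding premises can never invalidate a derived conclusion.
facts_extended = facts | {"penguin"}
after = flies(facts_extended)

if __name__ == "__main__":
    print("bird only:", before)
    print("bird and penguin:", after)
```

Here `before` is `True` and `after` is `False`: the larger set of premises supports fewer conclusions, which is the defining property of non-monotonic inference.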
Created on 26 Aug. 2024

