Evaluating Cognitive Maps and Planning in Large Language Models with CogEval

AI-generated keywords: CogEval, Cognitive Maps, Planning Ability, LLMs, Evaluation Protocols

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Growing interest in exploring cognitive abilities of large language models (LLMs)
  • Proposal of CogEval - a cognitive science-inspired protocol for systematic evaluation of cognitive capacities in LLMs
  • Application of CogEval to evaluate cognitive maps and planning ability across eight different LLMs
  • Task prompts based on human experiments with established construct validity
  • Findings reveal significant failure modes in planning tasks, such as hallucinations of invalid trajectories and getting trapped in loops
  • Lack of emergent out-of-the-box planning ability in LLMs
  • Possible explanation: LLMs lack understanding of latent relational structures underlying planning problems (cognitive maps)
  • Importance of rigorous evaluation protocols like CogEval to assess true cognitive capabilities of LLMs
  • Authors discuss the limitations of LLMs' planning abilities and their implications

Authors: Ida Momennejad, Hosein Hasanbeig, Felipe Vieira, Hiteshi Sharma, Robert Osazuwa Ness, Nebojsa Jojic, Hamid Palangi, Jonathan Larson

License: CC BY-NC-ND 4.0

Abstract: Recently an influx of studies claim emergent cognitive abilities in large language models (LLMs). Yet, most rely on anecdotes, overlook contamination of training sets, or lack systematic evaluation involving multiple tasks, control conditions, multiple iterations, and statistical robustness tests. Here we make two major contributions. First, we propose CogEval, a cognitive science-inspired protocol for the systematic evaluation of cognitive capacities in Large Language Models. The CogEval protocol can be followed for the evaluation of various abilities. Second, here we follow CogEval to systematically evaluate cognitive maps and planning ability across eight LLMs (OpenAI GPT-4, GPT-3.5-turbo-175B, davinci-003-175B, Google Bard, Cohere-xlarge-52.4B, Anthropic Claude-1-52B, LLaMA-13B, and Alpaca-7B). We base our task prompts on human experiments, which offer both established construct validity for evaluating planning, and are absent from LLM training sets. We find that, while LLMs show apparent competence in a few planning tasks with simpler structures, systematic evaluation reveals striking failure modes in planning tasks, including hallucinations of invalid trajectories and getting trapped in loops. These findings do not support the idea of emergent out-of-the-box planning ability in LLMs. This could be because LLMs do not understand the latent relational structures underlying planning problems, known as cognitive maps, and fail at unrolling goal-directed trajectories based on the underlying structure. Implications for application and future directions are discussed.

Submitted to arXiv on 25 Sep. 2023

Results of the summarizing process for the arXiv paper: 2309.15129v1

Interest in the cognitive abilities of large language models (LLMs) has grown in recent studies. To address the limitations of existing studies, which rely on anecdotal evidence or lack systematic evaluation methods, the authors propose CogEval, a cognitive science-inspired protocol for the systematic evaluation of cognitive capacities in LLMs. The study applies CogEval to evaluate cognitive maps and planning ability across eight LLMs. The task prompts are based on human experiments that offer established construct validity for evaluating planning and are absent from LLM training sets. While LLMs demonstrate apparent competence in a few planning tasks with simpler structures, systematic evaluation uncovers striking failure modes, including hallucinations of invalid trajectories and getting trapped in loops. These findings do not support the notion of emergent out-of-the-box planning ability in LLMs. One possible explanation is that LLMs do not understand the latent relational structures underlying planning problems, known as cognitive maps, and fail at unrolling goal-directed trajectories based on that structure. The authors discuss implications for application and future directions. Overall, the study highlights the importance of rigorous evaluation protocols like CogEval for assessing the cognitive capabilities of LLMs and provides insight into the limits of their planning abilities.
Created on 28 Sep. 2023
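
The two failure modes named in the summary, hallucinated invalid trajectories and getting trapped in loops, can be checked mechanically when the graph underlying a planning task is known. The following is a minimal Python sketch under that assumption. It is not the authors' evaluation code: the function name check_trajectory, the example graph, and the node labels are all hypothetical illustrations.

```python
from typing import Dict, List, Set

# Hypothetical directed graph: node -> set of nodes reachable in one step.
Graph = Dict[str, Set[str]]


def check_trajectory(graph: Graph, trajectory: List[str], start: str, goal: str) -> dict:
    """Flag unknown nodes, invalid steps, and revisited nodes in a proposed path."""
    # Nodes the model mentioned that do not exist in the graph at all.
    unknown_nodes = [n for n in trajectory if n not in graph]
    # Consecutive pairs that are not edges in the graph (hallucinated moves).
    invalid_steps = [
        (a, b)
        for a, b in zip(trajectory, trajectory[1:])
        if b not in graph.get(a, set())
    ]
    # Nodes visited more than once, a simple signal of looping behavior.
    seen: Set[str] = set()
    revisited: List[str] = []
    for node in trajectory:
        if node in seen and node not in revisited:
            revisited.append(node)
        seen.add(node)
    # The path must begin at the start node and end at the goal node.
    reaches_goal = (
        bool(trajectory) and trajectory[0] == start and trajectory[-1] == goal
    )
    return {
        "unknown_nodes": unknown_nodes,
        "invalid_steps": invalid_steps,
        "revisited_nodes": revisited,
        "valid": reaches_goal and not unknown_nodes and not invalid_steps,
    }


# Assumed toy graph for illustration only.
example_graph: Graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"D"}, "D": set()}

print(check_trajectory(example_graph, ["A", "B", "C", "D"], "A", "D"))
print(check_trajectory(example_graph, ["A", "C", "D"], "A", "D"))
print(check_trajectory(example_graph, ["A", "B", "A", "B", "C", "D"], "A", "D"))
```

In this sketch a looping path can still count as valid if every step is a real edge and it ends at the goal; revisited nodes are reported separately so that looping can be scored on its own terms.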
