Automated reproducibility assessments in the social and behavioral sciences using large language models

AI-generated keywords: Reproducibility, Social and Behavioral Sciences, Large Language Models, Automated Assessments, Validity

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Reproducibility of research findings is crucial for building a solid foundation of knowledge in social and behavioral sciences.
  • Traditional reproducibility assessments are labor-intensive and challenging to scale across a large number of studies.
  • A recent study by Tobias Holtdirk et al. introduces an automated approach that uses large language models (LLMs) for reproducibility assessments in the social and behavioral sciences.
  • The study evaluated 76 published studies with predefined claims, comparing LLM-generated analyses with the original findings and with human reanalyses; for 7 studies, the LLM could not produce a viable effect size estimate.
  • For the remaining studies, the LLM pipeline recovered the original effect sizes in 41% of studies (within a +/-0.05 tolerance in Cohen's d) and reached the same qualitative conclusion as the original study in 96% of cases.
  • Human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases, so the LLM pipeline outperformed them on both measures.
  • The authors conclude that LLMs can serve as a scalable tool for automated reproducibility assessment and as a foundation for systematic auditing of empirical results.

Authors: Tobias Holtdirk, Pietro Marcolongo, Anna Steinberg Schulten, Felix Henninger, Stefan Rose, Sarah Ball, Bolei Ma, Frauke Kreuter, Markus Weinmann, Stefan Feuerriegel

Abstract: Reproducibility in the social and behavioral sciences is typically evaluated by independent researchers who reanalyze the original data to assess whether the published findings can be recovered. However, such approaches are resource-intensive and difficult to scale. Here, we show that large language models (LLMs) can automate reproducibility assessments. Using N=76 published studies with predefined claims from the behavioral and social sciences, we compare LLM-generated analysis with the original findings and human reanalysis. For 7 studies, the LLM could not produce a viable effect size estimate. For the remaining studies, our LLM pipeline recovered the original effect sizes in 41% of studies using a +/-0.05 tolerance in Cohen's d. Further, our LLM pipeline reached the same qualitative conclusion as the original study in 96% of cases, where conclusions indicate whether the reanalysis supports the original claim. For comparison, human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases. Together, these results show that LLMs can serve as a scalable tool for automated reproducibility assessment and provide a foundation for systematic auditing of empirical results in the social and behavioral sciences.
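
For readers unfamiliar with the effect size measure named in the abstract, here is a minimal Python sketch of Cohen's d for two independent groups, using the standard pooled-standard-deviation definition. The metadata on this page does not say how the paper's pipeline computes or converts effect sizes, so the function name `cohens_d` and the sample data are illustrative assumptions, not the authors' method.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d for two independent groups (pooled standard deviation).

    Standard textbook definition; assumed here for illustration only.
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")

    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")

    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Made-up data: means 4 and 3, both sample variances 4, so d = 1 / 2.
treatment = [2.0, 4.0, 6.0]
control = [1.0, 3.0, 5.0]
d = cohens_d(treatment, control)
```

On this scale, the +/-0.05 tolerance mentioned in the abstract means a reanalysis counts as recovering the original effect only if its d differs from the published d by at most 0.05.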

Submitted to arXiv on 11 Jun. 2026

Results of the summarizing process for the arXiv paper: 2606.13670v1

In the social and behavioral sciences, reproducible findings are the basis for a reliable body of knowledge. Reproducibility is typically assessed by independent researchers who reanalyze the original data to check whether the published results can be recovered. This process is labor-intensive and difficult to scale across a large number of studies.

A recent study by Tobias Holtdirk, Pietro Marcolongo, Anna Steinberg Schulten, Felix Henninger, Stefan Rose, Sarah Ball, Bolei Ma, Frauke Kreuter, Markus Weinmann, and Stefan Feuerriegel shows that large language models (LLMs) can automate these assessments. The researchers evaluated 76 published studies with predefined claims from the social and behavioral sciences, comparing LLM-generated analyses with both the original findings and human reanalyses. For 7 studies, the LLM could not produce a viable effect size estimate. For the remaining studies, the LLM pipeline recovered the original effect sizes in 41% of studies, using a tolerance of +/-0.05 in Cohen's d, and reached the same qualitative conclusion as the original study in 96% of cases, where a conclusion indicates whether the reanalysis supports the original claim. By comparison, human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases.

These results suggest that LLMs can serve as a scalable tool for automated reproducibility assessment and provide a foundation for systematic auditing of empirical results in the social and behavioral sciences.
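
The two measures reported above, effect size recovery and qualitative agreement, can be sketched in Python as follows. This is a hypothetical illustration, not the authors' pipeline: the record type `Reanalysis`, its field names, and the example values are assumptions, as are the choices to treat the tolerance bound as inclusive and to exclude studies without a viable estimate from both denominators.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

TOLERANCE = 0.05  # tolerance in Cohen's d, as reported in the abstract


@dataclass
class Reanalysis:
    """Hypothetical record for one study; field names are illustrative."""
    original_d: float
    reanalysis_d: Optional[float]  # None if no viable estimate was produced
    original_supports_claim: bool
    reanalysis_supports_claim: Optional[bool]


def effect_size_recovered(r: Reanalysis, tol: float = TOLERANCE) -> bool:
    """True if the reanalysis d lies within +/- tol of the original d."""
    # Tiny slack so that values exactly on the bound survive float rounding.
    return abs(r.original_d - r.reanalysis_d) <= tol + 1e-12


def summarize(records: List[Reanalysis]) -> Tuple[float, float]:
    """Return (recovery rate, qualitative agreement rate) over viable studies."""
    # Assumption: studies without a viable estimate are left out entirely.
    viable = [r for r in records if r.reanalysis_d is not None]
    if not viable:
        raise ValueError("no study has a viable effect size estimate")
    recovered = sum(effect_size_recovered(r) for r in viable)
    agree = sum(
        r.original_supports_claim == r.reanalysis_supports_claim for r in viable
    )
    return recovered / len(viable), agree / len(viable)


# Made-up records, for illustration only.
records = [
    Reanalysis(0.30, 0.32, True, True),   # recovered, same conclusion
    Reanalysis(0.50, 0.41, True, True),   # not recovered, same conclusion
    Reanalysis(0.20, None, True, None),   # no viable estimate, excluded
    Reanalysis(0.10, 0.10, True, False),  # recovered, different conclusion
]
recovery_rate, agreement_rate = summarize(records)
```

The second record shows how the two measures can diverge: an estimate may fall outside the +/-0.05 window while still supporting the original claim, which would be consistent with a qualitative agreement rate well above the recovery rate.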
Created on 13 Jun. 2026
