Automated reproducibility assessments in the social and behavioral sciences using large language models

AI-generated keywords: Reproducibility; Social and Behavioral Sciences; Large Language Models; Automated Assessments; Validity

AI-generated Key Points

⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

Reproducibility of research findings is crucial for building a solid foundation of knowledge in social and behavioral sciences.
Traditional reproducibility assessments are labor-intensive and challenging to scale across a large number of studies.
A recent study by Tobias Holtdirk et al. introduced an automated approach using large language models (LLMs) for reproducibility assessments in social and behavioral sciences.
The study evaluated 76 published studies with predefined claims, comparing LLM-generated analyses with the original findings and with human reanalysis.
For 7 studies, the LLM could not produce a viable effect size estimate; for the remaining studies, it recovered the original effect size (within +/-0.05 in Cohen's d) in 41% of studies and reached the same qualitative conclusion as the original study in 96% of cases.
Human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases, so the LLM pipeline outperformed them on both measures.
The study suggests that LLMs can serve as a scalable tool for automated reproducibility assessment and a foundation for systematic auditing of empirical results.


Authors: Tobias Holtdirk, Pietro Marcolongo, Anna Steinberg Schulten, Felix Henninger, Stefan Rose, Sarah Ball, Bolei Ma, Frauke Kreuter, Markus Weinmann, Stefan Feuerriegel

arXiv: 2606.13670v1 (cs.AI)

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: Reproducibility in the social and behavioral sciences is typically evaluated by independent researchers who reanalyze the original data to assess whether the published findings can be recovered. However, such approaches are resource-intensive and difficult to scale. Here, we show that large language models (LLMs) can automate reproducibility assessments. Using N=76 published studies with predefined claims from the behavioral and social sciences, we compare LLM-generated analysis with the original findings and human reanalysis. For 7 studies, the LLM could not produce a viable effect size estimate. For the remaining studies, our LLM pipeline recovered the original effect sizes in 41% of studies using a +/-0.05 tolerance in Cohen's d. Further, our LLM pipeline reached the same qualitative conclusion as the original study in 96% of cases, where conclusions indicate whether the reanalysis supports the original claim. For comparison, human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases. Together, these results show that LLMs can serve as a scalable tool for automated reproducibility assessment and provide a foundation for systematic auditing of empirical results in the social and behavioral sciences.

Submitted to arXiv on 11 Jun. 2026


Results of the summarizing process for the arXiv paper: 2606.13670v1


Comprehensive Summary

In the social and behavioral sciences, reproducibility of research findings is crucial for building a solid foundation of knowledge. Traditionally, reproducibility assessments have been conducted by independent researchers who reanalyze the original data to determine whether the published results can be recovered. However, this process is labor-intensive and difficult to scale across a large number of studies. A recent study by Tobias Holtdirk, Pietro Marcolongo, Anna Steinberg Schulten, Felix Henninger, Stefan Rose, Sarah Ball, Bolei Ma, Frauke Kreuter, Markus Weinmann, and Stefan Feuerriegel introduces an approach to automating reproducibility assessments using large language models (LLMs). The researchers evaluated 76 published studies in the social and behavioral sciences that had predefined claims, comparing LLM-generated analyses with the original findings and with human reanalysis. For 7 studies, the LLM could not produce a viable effect size estimate. For the remaining studies, the LLM pipeline recovered the original effect sizes in 41% of studies, using a tolerance of +/-0.05 in Cohen's d. Additionally, the LLM pipeline reached the same qualitative conclusion as the original study in 96% of cases, where the conclusion indicates whether the reanalysis supports the original claim. For comparison, human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases, so the LLM pipeline outperformed human reanalysis on both measures. This suggests that LLMs have the potential to serve as a scalable tool for automating reproducibility assessments in the social and behavioral sciences. Overall, the study highlights the role that LLMs can play in streamlining reproducibility evaluations and laying the groundwork for systematic auditing of empirical results in social and behavioral research.


Layman's Summary

- Making sure that published research findings can be checked and confirmed is very important in social and behavioral sciences.
- The usual way of checking, where other researchers reanalyze the original data, is hard and takes a lot of work.
- A new study by Tobias Holtdirk and others used large language models to do this checking automatically in social and behavioral sciences.
- The study looked at 76 published studies and compared the language models' reanalyses with the original results and with reanalyses done by people.
- The language models reached the same overall conclusion as the original study in 96% of cases and matched the original effect size in 41% of studies, which was better than the human reanalysts (74% and 34%).

Definitions

- Reproducibility: The ability to recover a study's published findings by reanalyzing its original data.
- Assessments: Evaluations or judgments made about something based on certain criteria.
- Automated: Done by machines or computers without needing human input for each step.
- Large language models (LLMs): Advanced computer programs designed to understand and generate human language.

Introduction

In the social and behavioral sciences, reproducibility is crucial for building a solid foundation of knowledge. Reproducibility assessments involve independent researchers reanalyzing the original data to determine whether the published results can be recovered. However, this process is labor-intensive and difficult to scale across a large number of studies. Recently, a team of researchers (Tobias Holtdirk, Pietro Marcolongo, Anna Steinberg Schulten, Felix Henninger, Stefan Rose, Sarah Ball, Bolei Ma, Frauke Kreuter, Markus Weinmann, and Stefan Feuerriegel) introduced an approach to automating reproducibility assessments using large language models (LLMs). Their study evaluated 76 published studies in the social and behavioral sciences that had predefined claims.

The Study

The goal of this study was to determine whether LLMs could automate reproducibility assessments in the social and behavioral sciences. The researchers used an LLM pipeline to reanalyze the 76 selected studies and compared the LLM-generated analyses with both the original findings and human reanalysis. An original effect size counted as recovered when the reanalysis estimate fell within a tolerance of +/-0.05 in Cohen's d, a measure of the standardized difference between two means. For 7 studies, the LLM could not produce a viable effect size estimate.
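To make the tolerance criterion concrete, here is a minimal Python sketch of how a +/-0.05 check in Cohen's d could look. It is an illustration under stated assumptions, not the authors' pipeline: the function names `cohens_d` and `within_tolerance`, the pooled-standard-deviation form of Cohen's d for two independent groups, and the toy numbers are all hypothetical choices made for this example.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled standard deviation.

    Assumption: the textbook pooled-SD formula; the paper's exact estimator
    is not described in the abstract.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def within_tolerance(original_d, reanalysis_d, tol=0.05):
    """True when the reanalysis estimate lies within +/-tol of the original."""
    return abs(original_d - reanalysis_d) <= tol


# Hypothetical toy data, not taken from any study in the paper.
treated = [2.0, 4.0, 6.0]
control = [1.0, 3.0, 5.0]
reanalysis_d = cohens_d(treated, control)

print(within_tolerance(0.52, reanalysis_d))  # reported d close to the reanalysis
print(within_tolerance(0.80, reanalysis_d))  # reported d far from the reanalysis
```

A fixed absolute tolerance is a strict criterion: a reanalysis can support the same claim as the original study and still miss the effect size by more than 0.05, which is why effect size recovery and qualitative agreement are reported separately.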

Results

For 7 of the 76 studies, the LLM could not produce a viable effect size estimate. For the remaining studies, the LLM pipeline recovered the original effect sizes, within a tolerance of +/-0.05 in Cohen's d, in 41% of studies. Moreover, the LLM pipeline reached the same qualitative conclusion as the original study in 96% of cases, where the conclusion indicates whether the reanalysis supports the original claim. This means that even when the exact effect size was not recovered, the LLM reanalysis usually still supported the original claim. For comparison, human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases, so the LLM pipeline outperformed human reanalysis on both measures. This suggests that LLMs have the potential to serve as a scalable tool for automating reproducibility assessments in the social and behavioral sciences.
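The two headline rates can be read as simple proportions over the studies with a viable estimate (76 - 7 = 69 studies). The Python sketch below shows one way such rates could be tallied. The `Reanalysis` record, the `summarize` function, and the toy entries are hypothetical, and using the viable studies as the denominator for both rates is an assumption, since the abstract does not spell out the denominator for the qualitative conclusions.

```python
from typing import NamedTuple, Optional


class Reanalysis(NamedTuple):
    """One study's outcome (hypothetical record layout for illustration)."""
    original_d: float
    reanalysis_d: Optional[float]  # None when no viable estimate was produced
    same_conclusion: bool


def summarize(results, tol=0.05):
    """Tally effect size recovery and qualitative agreement over viable studies."""
    viable = [r for r in results if r.reanalysis_d is not None]
    if not viable:
        raise ValueError("no study has a viable effect size estimate")
    recovered = sum(abs(r.original_d - r.reanalysis_d) <= tol for r in viable)
    agreed = sum(r.same_conclusion for r in viable)
    return {
        "n_total": len(results),
        "n_viable": len(viable),
        "effect_size_recovery_rate": recovered / len(viable),
        "conclusion_agreement_rate": agreed / len(viable),
    }


# Toy data only; these are not results from the paper.
toy_results = [
    Reanalysis(0.40, 0.42, True),    # effect size recovered, same conclusion
    Reanalysis(0.25, 0.60, True),    # effect size missed, same conclusion
    Reanalysis(0.10, None, False),   # no viable estimate, excluded from rates
    Reanalysis(0.55, 0.10, False),   # effect size missed, different conclusion
    Reanalysis(0.30, 0.31, True),    # effect size recovered, same conclusion
]

summary = summarize(toy_results)
print(summary)
```

In the toy data, the second entry shows how the two measures can diverge: the estimate misses the tolerance band, yet the reanalysis still supports the original claim.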

Implications

The study has significant implications for the social and behavioral sciences. By leveraging LLMs, researchers can assess the reproducibility of scientific findings more efficiently. LLMs could streamline reproducibility evaluations and make them easier to scale across a large number of studies, which matters given the increasing volume of research published in these fields. With automated reproducibility assessments, researchers can save time and resources when checking whether published findings can be recovered. Moreover, the study points to how LLMs can lay the groundwork for systematic auditing of empirical results in social and behavioral research: automated assessments can flag discrepancies between published findings and reanalyses more efficiently, contributing to a more robust body of knowledge.

Conclusion

In conclusion, the study by Holtdirk et al. shows that large language models (LLMs) can automate reproducibility assessments in the social and behavioral sciences. The LLM pipeline recovered the original effect sizes within +/-0.05 in Cohen's d in 41% of studies and reached the same qualitative conclusion as the original study in 96% of cases, compared with 34% and 74% for human reanalysts. The study also opens up possibilities for future research on using LLMs to automate other aspects of the research process. Overall, it highlights the promising role that LLMs can play in streamlining reproducibility evaluations and laying the groundwork for systematic auditing of empirical results in social and behavioral research.

Created on 13 Jun. 2026



Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.