StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?

AI-generated keywords: Large Language Models, Stock Trading, Financial Agents, StockBench, Performance Evaluation

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Study titled "StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?" explores Large Language Models (LLMs) as autonomous agents in the finance domain.
  • Evaluates proprietary models like GPT-5 and Claude-4, as well as open-weight models such as Qwen3, Kimi-K2, and GLM-4.5, within the StockBench framework.
  • Introduces StockBench benchmark to evaluate LLM agents in realistic multi-month stock trading environments.
  • Agents make sequential buy-sell-hold decisions based on daily market signals including prices, fundamentals, and news updates.
  • Performance evaluation based on financial metrics like cumulative return, maximum drawdown, and Sortino ratio.
  • Results show most LLM agents struggle to outperform a simple buy-and-hold baseline, though several models show potential for higher returns and better risk management.
  • Authors release StockBench as an open-source resource to facilitate reproducibility and further research in LLM-powered financial agent development.

Authors: Yanxu Chen, Zijun Yao, Yantao Liu, Jin Ye, Jianing Yu, Lei Hou, Juanzi Li

Abstract: Large language models (LLMs) have recently demonstrated strong capabilities as autonomous agents, showing promise in reasoning, tool use, and sequential decision-making. While prior benchmarks have evaluated LLM agents in domains such as software engineering and scientific discovery, the finance domain remains underexplored, despite its direct relevance to economic value and high-stakes decision-making. Existing financial benchmarks primarily test static knowledge through question answering, but they fall short of capturing the dynamic and iterative nature of trading. To address this gap, we introduce StockBench, a contamination-free benchmark designed to evaluate LLM agents in realistic, multi-month stock trading environments. Agents receive daily market signals -- including prices, fundamentals, and news -- and must make sequential buy, sell, or hold decisions. Performance is assessed using financial metrics such as cumulative return, maximum drawdown, and the Sortino ratio. Our evaluation of state-of-the-art proprietary (e.g., GPT-5, Claude-4) and open-weight (e.g., Qwen3, Kimi-K2, GLM-4.5) models shows that while most LLM agents struggle to outperform the simple buy-and-hold baseline, several models demonstrate the potential to deliver higher returns and manage risk more effectively. These findings highlight both the challenges and opportunities in developing LLM-powered financial agents, showing that excelling at static financial knowledge tasks does not necessarily translate into successful trading strategies. We release StockBench as an open-source resource to support reproducibility and advance future research in this domain.
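
The three metrics named in the abstract can be illustrated with a minimal Python sketch. The source does not give the benchmark's exact definitions, so the choices below (daily simple returns, a target return of zero, no annualization, downside deviation averaged over all days) are assumptions, and the function names and sample values are hypothetical.

```python
import math

def cumulative_return(values):
    """Total return over the period: final value / initial value - 1."""
    return values[-1] / values[0] - 1.0

def maximum_drawdown(values):
    """Largest peak-to-trough decline as a negative fraction (0.0 if none)."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)                  # highest value seen so far
        worst = min(worst, v / peak - 1.0)   # decline relative to that peak
    return worst

def sortino_ratio(values, target=0.0):
    """Mean excess daily return over downside deviation (not annualized).

    Requires at least two portfolio values. Only returns below `target`
    contribute to the denominator, so upside volatility is not penalized.
    """
    returns = [b / a - 1.0 for a, b in zip(values, values[1:])]
    excess = [r - target for r in returns]
    downside_sq = [min(e, 0.0) ** 2 for e in excess]
    downside_dev = math.sqrt(sum(downside_sq) / len(excess))
    if downside_dev == 0.0:
        # No losing days: the ratio is undefined; this convention is an assumption.
        return float("inf") if sum(excess) > 0 else 0.0
    return (sum(excess) / len(excess)) / downside_dev

# Made-up daily portfolio values, for illustration only.
portfolio = [100.0, 110.0, 99.0, 104.0]
print(cumulative_return(portfolio))
print(maximum_drawdown(portfolio))
print(sortino_ratio(portfolio))
```

Unlike the Sharpe ratio, the Sortino ratio divides by downside deviation only, which is why it is often preferred when gains and losses should not be treated as equally risky.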

Submitted to arXiv on 02 Oct. 2025

Results of the summarizing process for the arXiv paper: 2510.02209v1

In their recent study titled "StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?", authors Yanxu Chen, Zijun Yao, Yantao Liu, Jin Ye, Jianing Yu, Lei Hou, and Juanzi Li explore the capabilities of Large Language Models (LLMs) as autonomous agents in the finance domain. The study evaluates state-of-the-art proprietary models such as GPT-5 and Claude-4 alongside open-weight models such as Qwen3, Kimi-K2, and GLM-4.5 within the StockBench framework.

LLMs have shown promise as autonomous agents, specifically in reasoning, tool use, and sequential decision-making. However, their application in high-stakes decision-making areas like finance remains underexplored. Existing financial benchmarks primarily focus on static knowledge assessment through question-answering tasks but fail to capture the dynamic and iterative nature of trading. To bridge this gap, the authors introduce StockBench, a contamination-free benchmark specifically designed to evaluate LLM agents in realistic multi-month stock trading environments.

In this setup, agents receive daily market signals encompassing prices, fundamentals, and news updates, and are required to make sequential buy, sell, or hold decisions. Performance evaluation is based on financial metrics including cumulative return, maximum drawdown, and the Sortino ratio.

The results indicate that while most LLM agents struggle to outperform a simple buy-and-hold baseline, several models exhibit potential for delivering higher returns and managing risk more effectively. This highlights the challenges and opportunities associated with developing LLM-powered financial agents: excelling at static financial knowledge tasks does not necessarily translate into successful trading strategies.

In conclusion, the authors release StockBench as an open-source resource to facilitate reproducibility and further research in LLM-powered financial agent development. The study sheds light on the complexities of using language models for real-world stock trading and underscores the need for approaches that perform better in dynamic trading environments.
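
The evaluation setup described above (daily signals in, one buy, sell, or hold decision out, compared against buy-and-hold) can be sketched as a simple loop. This is not the StockBench implementation. The single-stock portfolio, all-in position sizing, absence of transaction costs, the `sentiment` field, and the rule-based `toy_agent` standing in for an LLM call are all hypothetical simplifications.

```python
from typing import Callable, Dict, List

Signal = Dict[str, float]  # hypothetical daily observation for one stock
Policy = Callable[[Signal, float, float], str]

def run_backtest(signals: List[Signal], decide: Policy,
                 cash: float = 10_000.0) -> List[float]:
    """Step through the days in order; return portfolio value after each day."""
    shares = 0.0
    values = []
    for signal in signals:
        price = signal["price"]
        action = decide(signal, cash, shares)  # the agent sees only today's signal
        if action == "buy" and cash > 0:
            shares += cash / price             # all-in sizing (simplification)
            cash = 0.0
        elif action == "sell" and shares > 0:
            cash += shares * price
            shares = 0.0
        # "hold", or an infeasible action, leaves the portfolio unchanged
        values.append(cash + shares * price)
    return values

def buy_and_hold(signal: Signal, cash: float, shares: float) -> str:
    """Baseline: buy on the first day, then never trade again."""
    return "buy" if shares == 0 else "hold"

def toy_agent(signal: Signal, cash: float, shares: float) -> str:
    """Placeholder for an LLM call: follows the sign of a news sentiment score."""
    if signal["sentiment"] > 0:
        return "buy"
    if signal["sentiment"] < 0:
        return "sell"
    return "hold"

# Made-up data, for illustration only.
signals = [
    {"price": 100.0, "sentiment": 1.0},
    {"price": 110.0, "sentiment": -1.0},
    {"price": 99.0, "sentiment": 0.0},
    {"price": 104.0, "sentiment": 1.0},
]
baseline_values = run_backtest(signals, buy_and_hold)
agent_values = run_backtest(signals, toy_agent)
print(baseline_values[-1], agent_values[-1])
```

Because each decision depends on the portfolio state left by earlier decisions, the two value series can then be compared with the same metrics (cumulative return, maximum drawdown, Sortino ratio) to judge the agent against the baseline.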
Created on 06 Oct. 2025

