StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?

AI-generated keywords: Large Language Models, Stock Trading, Financial Agents, StockBench, Performance Evaluation

AI-generated Key Points

⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

Study titled "StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?" explores Large Language Models (LLMs) in finance domain.
Evaluates proprietary models like GPT-5 and Claude-4, as well as open-weight models such as Qwen3, Kimi-K2, and GLM-4.5 within StockBench framework.
Introduces StockBench benchmark to evaluate LLM agents in realistic multi-month stock trading environments.
Agents make sequential buy-sell-hold decisions based on daily market signals including prices, fundamentals, and news updates.
Performance evaluation based on financial metrics like cumulative return, maximum drawdown, and Sortino ratio.
Results show most LLM agents struggle to outperform a simple buy-and-hold baseline, though several models show potential for higher returns and better risk management.
Authors release StockBench as open-source resource to facilitate reproducibility and further research in LLM-powered financial agent development.

Authors: Yanxu Chen, Zijun Yao, Yantao Liu, Jin Ye, Jianing Yu, Lei Hou, Juanzi Li

arXiv: 2510.02209v1 (cs.LG)

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: Large language models (LLMs) have recently demonstrated strong capabilities as autonomous agents, showing promise in reasoning, tool use, and sequential decision-making. While prior benchmarks have evaluated LLM agents in domains such as software engineering and scientific discovery, the finance domain remains underexplored, despite its direct relevance to economic value and high-stakes decision-making. Existing financial benchmarks primarily test static knowledge through question answering, but they fall short of capturing the dynamic and iterative nature of trading. To address this gap, we introduce StockBench, a contamination-free benchmark designed to evaluate LLM agents in realistic, multi-month stock trading environments. Agents receive daily market signals -- including prices, fundamentals, and news -- and must make sequential buy, sell, or hold decisions. Performance is assessed using financial metrics such as cumulative return, maximum drawdown, and the Sortino ratio. Our evaluation of state-of-the-art proprietary (e.g., GPT-5, Claude-4) and open-weight (e.g., Qwen3, Kimi-K2, GLM-4.5) models shows that while most LLM agents struggle to outperform the simple buy-and-hold baseline, several models demonstrate the potential to deliver higher returns and manage risk more effectively. These findings highlight both the challenges and opportunities in developing LLM-powered financial agents, showing that excelling at static financial knowledge tasks does not necessarily translate into successful trading strategies. We release StockBench as an open-source resource to support reproducibility and advance future research in this domain.

Submitted to arXiv on 02 Oct. 2025

Results of the summarizing process for the arXiv paper: 2510.02209v1

Comprehensive Summary

In their recent study titled "StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?", authors Yanxu Chen, Zijun Yao, Yantao Liu, Jin Ye, Jianing Yu, Lei Hou, and Juanzi Li explore the capabilities of Large Language Models (LLMs) as autonomous agents in the finance domain. The study evaluates state-of-the-art proprietary models like GPT-5 and Claude-4 as well as open-weight models such as Qwen3, Kimi-K2, and GLM-4.5 within the StockBench framework.

LLMs have shown promise in areas such as reasoning and tool use. However, their application in high-stakes decision-making areas like finance remains underexplored. Existing financial benchmarks primarily focus on static knowledge assessment through question answering tasks but fail to capture the dynamic and iterative nature of trading. To bridge this gap, the authors introduce StockBench, a contamination-free benchmark specifically designed to evaluate LLM agents in realistic multi-month stock trading environments.

In this setup, agents receive daily market signals encompassing prices, fundamentals, and news updates and are required to make sequential buy, sell, or hold decisions. Performance evaluation is based on financial metrics including cumulative return, maximum drawdown, and the Sortino ratio.

The results indicate that while most LLM agents struggle to outperform a simple buy-and-hold baseline, several models exhibit potential for delivering higher returns and managing risk more effectively. This highlights both the challenges and the opportunities associated with developing LLM-powered financial agents: excelling at static financial knowledge tasks does not necessarily translate into successful trading strategies.

In conclusion, the authors release StockBench as an open-source resource to facilitate reproducibility and further research in LLM-powered financial agent development. The study sheds light on the complexities involved in leveraging language models for real-world stock trading applications while emphasizing the need for innovative approaches to enhance performance in dynamic trading environments.

Summary
- A study looked at how smart computer programs called Large Language Models (LLMs) can trade stocks well in real markets.
- They tested different models like GPT-5 and Claude-4, as well as Qwen3, Kimi-K2, and GLM-4.5 in a special testing system called StockBench.
- StockBench helps see how good these LLMs are at trading stocks over many months.
- The programs decide to buy, sell, or hold stocks based on daily signals like prices, company info, and news.
- They check how well the programs do using money measures like total return and risk levels.

Definitions
- Large Language Models (LLMs): Smart computer programs that understand and use language well.
- StockBench: A special test system for checking how good LLMs are at trading stocks.
- Cumulative return: Total amount of money gained or lost over time from investing.
- Maximum drawdown: Biggest loss in value experienced by an investment before it starts gaining again.
- Sortino ratio: Measure of investment performance that considers only downside risk.
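
The three metrics defined above can be sketched in a few lines of Python. This is a minimal illustration using common textbook definitions, not necessarily the exact formulas used in the paper (only its metadata is available here). The function names, the zero target return, the lack of annualization, and the sample values are all assumptions made for the example.

```python
import math

def cumulative_return(values):
    """Total fractional gain or loss from the first to the last portfolio value."""
    return values[-1] / values[0] - 1.0

def max_drawdown(values):
    """Largest peak-to-trough decline, as a positive fraction of the peak."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst

def sortino_ratio(values, target=0.0):
    """Mean excess daily return divided by downside deviation (not annualized)."""
    returns = [b / a - 1.0 for a, b in zip(values, values[1:])]
    excess = [r - target for r in returns]
    # Only returns below the target contribute to the risk term.
    downside_sq = [min(e, 0.0) ** 2 for e in excess]
    downside_dev = math.sqrt(sum(downside_sq) / len(excess))
    if downside_dev == 0.0:
        # No downside observed: the ratio is undefined, so return a sentinel.
        return float("inf") if sum(excess) > 0 else 0.0
    return (sum(excess) / len(excess)) / downside_dev

# Made-up daily portfolio values, for illustration only.
values = [100.0, 110.0, 99.0, 108.9]
print(cumulative_return(values), max_drawdown(values), sortino_ratio(values))
```

Unlike the Sharpe ratio, the Sortino ratio penalizes only returns that fall below the target, so a strategy is not marked down for large upward swings.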

Introduction

Large Language Models (LLMs) have been making headlines in recent years for their impressive performance in various natural language processing tasks. These models, such as GPT-5 and Claude-4, have shown great potential in areas like reasoning and tool use. However, their application in high-stakes decision-making domains like finance remains relatively unexplored. In the study titled "StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?", authors Yanxu Chen, Zijun Yao, Yantao Liu, Jin Ye, Jianing Yu, Lei Hou, and Juanzi Li delve into the capabilities of LLMs as autonomous agents in the finance domain. The study evaluates both state-of-the-art proprietary models and open-weight models within the StockBench framework to determine their effectiveness in real-world stock trading environments.

The Need for StockBench

Existing financial benchmarks primarily focus on static knowledge assessment through question answering tasks. While these benchmarks are useful for evaluating an agent's understanding of financial concepts and data interpretation skills, they fail to capture the dynamic nature of stock trading. This is where StockBench comes into play – a contamination-free benchmark specifically designed to evaluate LLM agents' performance in realistic multi-month stock trading environments.

The Setup

In this setup, agents receive daily market signals encompassing prices, fundamentals and news updates. They are then required to make sequential buy-sell-hold decisions based on this information. Performance evaluation is based on financial metrics including cumulative return, maximum drawdown and the Sortino ratio.
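
The daily decision loop described above might look roughly like the following sketch. It is not the StockBench implementation: the `Portfolio` class, the `toy_policy` rule (a stand-in for a real LLM call), the signal fields `fair_value` and `news_sentiment`, and the one-share-per-trade rule are all hypothetical simplifications chosen to keep the example self-contained.

```python
from dataclasses import dataclass

@dataclass
class Portfolio:
    cash: float
    shares: int = 0

    def value(self, price):
        """Cash plus the market value of held shares at the given price."""
        return self.cash + self.shares * price

def toy_policy(signal):
    """Hypothetical stand-in for an LLM agent: returns 'buy', 'sell', or 'hold'."""
    if signal["news_sentiment"] > 0 and signal["price"] < signal["fair_value"]:
        return "buy"
    if signal["news_sentiment"] < 0:
        return "sell"
    return "hold"

def run_episode(signals, policy, starting_cash=1000.0):
    """Apply one decision per trading day and record the end-of-day portfolio value."""
    portfolio = Portfolio(cash=starting_cash)
    history = []
    for signal in signals:
        action = policy(signal)
        price = signal["price"]
        # Assumption: each buy or sell moves exactly one share.
        if action == "buy" and portfolio.cash >= price:
            portfolio.cash -= price
            portfolio.shares += 1
        elif action == "sell" and portfolio.shares > 0:
            portfolio.cash += price
            portfolio.shares -= 1
        history.append(portfolio.value(price))
    return history

# Made-up signals for a single stock over three days.
signals = [
    {"price": 100.0, "fair_value": 120.0, "news_sentiment": 1},
    {"price": 105.0, "fair_value": 120.0, "news_sentiment": 0},
    {"price": 110.0, "fair_value": 120.0, "news_sentiment": -1},
]
history = run_episode(signals, toy_policy)
print(history)
```

The resulting list of daily portfolio values is the kind of series from which cumulative return, maximum drawdown, and the Sortino ratio would then be computed.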

The Results

The results of the study indicate that while most LLM agents struggle to outperform a simple buy-and-hold strategy baseline, some show potential for delivering higher returns and managing risk more effectively. This highlights the challenges involved in developing successful LLM-powered financial agents. Simply excelling at static financial knowledge tasks does not necessarily translate into successful trading strategies.
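
To make the comparison against buy-and-hold concrete, here is a small sketch of one way such a baseline could be computed. The source does not say how the paper constructs its baseline, so the equal-weight split, the use of fractional shares, the tickers `AAA` and `BBB`, and every number below are assumptions for illustration, not reported results.

```python
def buy_and_hold(price_paths, starting_cash=1000.0):
    """Split cash equally across all stocks on day 0 and never trade again."""
    names = list(price_paths)
    per_stock = starting_cash / len(names)
    # Fractional shares are allowed here to keep the arithmetic simple.
    shares = {n: per_stock / price_paths[n][0] for n in names}
    days = len(price_paths[names[0]])
    return [sum(shares[n] * price_paths[n][d] for n in names) for d in range(days)]

def excess_return(agent_values, baseline_values):
    """Agent cumulative return minus baseline cumulative return."""
    agent = agent_values[-1] / agent_values[0] - 1.0
    base = baseline_values[-1] / baseline_values[0] - 1.0
    return agent - base

# Made-up price paths for two hypothetical tickers.
prices = {"AAA": [50.0, 55.0, 60.0], "BBB": [100.0, 95.0, 90.0]}
baseline = buy_and_hold(prices)

# Made-up agent portfolio values over the same three days.
agent_values = [1000.0, 1005.0, 1010.0]
print(baseline, excess_return(agent_values, baseline))
```

A negative excess return, as in this toy case, corresponds to the situation the study describes for most agents: the passive baseline ends the period ahead.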

Implications and Future Directions

The study's findings have significant implications for the development of LLM-powered financial agents. They highlight the complexities involved in leveraging language models for real-world stock trading applications and emphasize the need for innovative approaches to enhance performance in dynamic trading environments. Additionally, the authors release StockBench as an open-source resource to facilitate reproducibility and further research in this field. This allows researchers to build on the benchmark and develop more robust LLM agents for stock trading.

Conclusion

In conclusion, "StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?" is a comprehensive study that sheds light on the capabilities of LLMs as autonomous agents in finance. The results highlight both the challenges and opportunities associated with developing LLM-powered financial agents, emphasizing the need for innovative approaches to enhance performance in dynamic trading environments. The release of StockBench as an open-source benchmark also provides a valuable resource for future research advancements in this field.

Created on 06 Oct. 2025

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.