Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens

AI-generated keywords: Large Language Models, Reasoning Effort, Deep-Thinking Tokens, Inference-Time Effort, Think@n

AI-generated Key Points

⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

Authors Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen, Ziqian Lin, Alec Go, and Yu Meng study large language models (LLMs) and their reasoning capabilities.
They introduce a novel approach to quantify inference-time effort through identifying deep-thinking tokens in LLMs.
Deep-thinking tokens represent points in the generated sequence where internal predictions undergo significant revisions in deeper model layers before converging.
The deep-thinking ratio shows a robust and consistently positive correlation with accuracy across various reasoning-focused models and benchmarks.
This metric outperforms traditional length-based and confidence-based baselines.
The authors propose Think@n - a test-time scaling strategy that prioritizes samples with high deep-thinking ratios to enhance LLM reasoning capabilities while reducing inference costs.
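
The key points do not say how "significant revisions in deeper model layers" are detected. As a rough sketch only, one way to make the idea concrete is to assume that, for every generated token, a next-token distribution can be read out at each layer (for example with a logit-lens style projection), and to call a token deep-thinking when that distribution settles close to the final-layer distribution only late in the stack. The divergence measure (Jensen-Shannon), the thresholds `settle_eps` and `depth_frac`, and all function names below are assumptions made for illustration, not the authors' definitions.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) along the last axis."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m), axis=-1)
    kl_qm = np.sum(q * np.log(q / m), axis=-1)
    return 0.5 * (kl_pm + kl_qm)

def settling_layer(layer_probs, settle_eps=0.1):
    """First layer from which the prediction stays within settle_eps of the final layer.

    layer_probs: array of shape (num_layers, vocab_size), one distribution per layer.
    settle_eps is a hypothetical threshold, not a value from the paper.
    """
    div = js_divergence(layer_probs, layer_probs[-1])  # shape (num_layers,)
    unsettled = np.nonzero(div > settle_eps)[0]
    return 0 if unsettled.size == 0 else int(unsettled[-1]) + 1

def deep_thinking_ratio(seq_layer_probs, settle_eps=0.1, depth_frac=0.75):
    """Fraction of tokens whose prediction settles only in the deepest layers.

    seq_layer_probs: array of shape (num_tokens, num_layers, vocab_size).
    depth_frac is a hypothetical cut-off for what counts as a "deep" layer.
    """
    num_tokens, num_layers, _ = seq_layer_probs.shape
    deep = 0
    for t in range(num_tokens):
        layer = settling_layer(seq_layer_probs[t], settle_eps)
        if layer >= depth_frac * (num_layers - 1):
            deep += 1
    return deep / num_tokens

# Toy input: two tokens, four layers, a vocabulary of two entries.
toy = np.array([
    [[0.9, 0.1]] * 4,                 # prediction is stable from the first layer
    [[0.9, 0.1]] * 3 + [[0.1, 0.9]],  # prediction flips only at the last layer
])
print(deep_thinking_ratio(toy))  # 0.5
```

In this toy input only the second token counts as deep-thinking, so the ratio is one half. A real measurement would need access to the model's intermediate hidden states, which this sketch takes as given.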

Authors: Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen, Ziqian Lin, Alec Go, Yu Meng

arXiv: 2602.13517v1 (cs.CL)

Work in progress

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal "overthinking," leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens -- tokens where internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.
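
The abstract describes Think@n only at a high level. The following is a minimal sketch under our own assumptions: each of n samples carries a deep-thinking ratio estimated from a short prefix, only the top-scoring fraction survives (and would be generated to completion), and the final answer is a majority vote among the survivors. The `Sample` record, the `keep_frac` parameter, and the voting rule are hypothetical and are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Sample:
    prefix_dtr: float  # deep-thinking ratio estimated on a short prefix (assumed given)
    answer: str        # final answer; in practice only produced for surviving samples

def think_at_n(samples, keep_frac=0.5):
    """Keep the samples with the highest prefix deep-thinking ratio, then majority-vote."""
    if not samples:
        raise ValueError("need at least one sample")
    k = max(1, int(len(samples) * keep_frac))
    survivors = sorted(samples, key=lambda s: s.prefix_dtr, reverse=True)[:k]
    votes = Counter(s.answer for s in survivors)
    return votes.most_common(1)[0][0]

def self_consistency(samples):
    """Baseline: majority vote over all samples."""
    if not samples:
        raise ValueError("need at least one sample")
    return Counter(s.answer for s in samples).most_common(1)[0][0]

# Toy data only, chosen to show how the two rules can disagree.
samples = [
    Sample(0.30, "42"), Sample(0.28, "42"),
    Sample(0.12, "17"), Sample(0.10, "17"),
    Sample(0.08, "17"), Sample(0.05, "17"),
]
print(self_consistency(samples))  # 17
print(think_at_n(samples))        # 42
```

The cost saving claimed in the abstract would come from the samples that are dropped after the prefix: in this sketch, half of the generations would never be completed.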

Submitted to arXiv on 13 Feb. 2026

Results of the summarizing process for the arXiv paper: 2602.13517v1

Comprehensive Summary

In "Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens," Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen, Ziqian Lin, Alec Go, and Yu Meng examine how to measure the reasoning effort of large language models (LLMs) at inference time. Their starting point is that raw token counts are unreliable indicators of reasoning quality: longer generations do not consistently yield higher accuracy and may instead signal "overthinking." As an alternative, the authors identify deep-thinking tokens, points in the generated sequence where internal predictions undergo significant revisions in deeper model layers before converging, and define the deep-thinking ratio as the proportion of such tokens in a generated sequence. Across four mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and several reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), they report that the deep-thinking ratio shows a robust and consistently positive correlation with accuracy and substantially outperforms length-based and confidence-based baselines. Building on this, they propose Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. According to the abstract, Think@n matches or exceeds standard self-consistency while significantly reducing inference costs, because unpromising generations can be rejected early based on short prefixes. Overall, the study argues for measuring reasoning effort by deep-thinking tokens rather than by generation length alone.
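
To make "positively correlated with accuracy" concrete, one simple check is the Pearson correlation between a per-sample score and a 0/1 correctness label. This is a generic sketch: the paper may aggregate or bin samples differently, and the numbers below are made-up toy values, not results.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)

# Toy values only: per-sample scores and whether the final answer was correct.
dtr     = [0.05, 0.10, 0.20, 0.30]  # hypothetical deep-thinking ratios
length  = [900, 400, 700, 500]      # hypothetical token counts
correct = [0, 0, 1, 1]

print(pearson(dtr, correct) > 0)     # True
print(pearson(length, correct) > 0)  # False
```

A useful proxy for reasoning quality would show a positive coefficient that holds up across models and benchmarks, which is the property the summary attributes to the deep-thinking ratio.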

Layman's Summary

Authors Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen, Ziqian Lin, Alec Go, and Yu Meng study large language models (LLMs) and how well they can reason. They found a new way to measure how hard a model is working while it reasons: they look for special tokens where the model changes its mind substantially in its deeper layers before settling on what to write. The larger the share of these deep-thinking tokens in an answer, the more likely the answer tends to be correct on tests that require reasoning skills. This measure predicts accuracy better than older measures based on answer length or model confidence.

Definitions

- Authors: People who write books or research papers.
- Large Language Models (LLMs): Advanced computer programs that can understand and generate human language.
- Reasoning: Thinking logically to solve problems or make decisions.
- Tokens: Small units of text, such as words or pieces of words, that a language model reads and generates.
- Inference-time effort: The amount of work a model does while it is producing an answer.
- Deep-thinking ratio: The share of tokens in an answer for which the model had to think deeply before settling on its prediction.
- Benchmarks: Standard sets of problems used to compare models in experiments or tests.
- Baselines: Standard values or methods used for comparison with new techniques or results.
- Test-time scaling strategy: A way of spending more computation while the model is answering, for example by generating several answers and choosing among them, to improve its results.

Blog article

Large language models (LLMs) have emerged as powerful tools for natural language processing tasks such as text generation, translation, and question answering. These models are trained on massive amounts of data and can generate fluent, coherent, human-like text. However, there is still room for improvement in their reasoning capabilities, and in how those capabilities are measured.

In their paper "Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens," authors Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen, Ziqian Lin, Alec Go, and Yu Meng ask how much reasoning effort an LLM actually expends at inference time. Reasoning models scale test-time compute by producing long Chain-of-Thought (CoT) outputs, so the number of generated tokens is a natural proxy for effort. According to the abstract, however, raw token counts are unreliable: longer generations do not consistently correlate with accuracy and may instead signal "overthinking," which degrades performance.

To overcome this limitation, the authors propose a new metric, the deep-thinking ratio, which measures the proportion of deep-thinking tokens in a generated sequence. These are tokens where internal predictions undergo significant revisions in deeper model layers before converging. In simpler terms, they mark moments when the model's prediction is still being revised deep in the network rather than being settled early.

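
Written as a formula, "the proportion of deep-thinking tokens in a generated sequence" is simply an average of indicator values. The symbols here are our own notation, introduced for illustration: $T$ is the number of generated tokens and $y_t$ is the token at position $t$. How a token qualifies as deep-thinking is left abstract, since the page only has the paper's metadata.

```latex
\mathrm{DTR}(y_{1:T}) \;=\; \frac{1}{T} \sum_{t=1}^{T}
  \mathbb{1}\!\left[\, y_t \text{ is a deep-thinking token} \,\right]
```

The value lies between 0 and 1 and does not grow just because a generation is longer, which is what distinguishes it from a raw token count.
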
To validate the metric, the authors run experiments on four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) with several reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3). They compare the deep-thinking ratio against length-based and confidence-based baselines and report that it exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both.

Building on this insight, the authors propose Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. Because unpromising generations can be rejected early based on short prefixes, Think@n is reported to match or exceed standard self-consistency while significantly reducing inference costs.

In conclusion, the paper argues that reasoning effort is better measured by how deeply a model thinks than by how long it generates. By introducing a metric for inference-time effort based on deep-thinking tokens and a test-time scaling strategy built on it, the authors offer a way to select better answers at lower cost.

Created on 25 Mar. 2026
