Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking

AI-generated keywords: Large Language Models, Test-time Scaling, Multi-round Thinking, Performance Improvement, Reinforcement Learning

AI-generated Key Points

⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

The paper explores advancements in large language models (LLMs) like OpenAI-o1 and DeepSeek-R1, focusing on test-time scaling to enhance model performance through extended reasoning processes.
Current LLMs face challenges with handling long texts and efficient training with reinforcement learning (RL).
The authors propose a straightforward yet powerful approach called Multi-round Thinking to address these limitations.
Extensive experiments involving models such as QwQ-32B and DeepSeek-R1 consistently showed performance improvements across benchmarks like AIME 2024, MATH-500, GPQA-diamond, and LiveCodeBench.
On AIME 2024, Multi-round Thinking raised the accuracy of QwQ-32B from 80.3% (Round 1) to 82.1% (Round 2) and of DeepSeek-R1 from 79.7% to 82.0%.
Results highlight the effectiveness of Multi-round Thinking in enhancing model performance across various tasks and datasets, showcasing its broad applicability.
Multi-round Thinking offers a promising avenue for achieving stable enhancements in LLM performance by leveraging previous answers to guide subsequent reasoning processes.

Authors: Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yunjie Ji, Yiping Peng, Han Zhao, Xiangang Li

arXiv: 2503.19855v1 (cs.CL)

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: Recent advances in large language models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have demonstrated the effectiveness of test-time scaling, where extended reasoning processes substantially enhance model performance. Despite this, current models are constrained by limitations in handling long texts and reinforcement learning (RL) training efficiency. To address these issues, we propose a simple yet effective test-time scaling approach Multi-round Thinking. This method iteratively refines model reasoning by leveraging previous answers as prompts for subsequent rounds. Extensive experiments across multiple models, including QwQ-32B and DeepSeek-R1, consistently show performance improvements on various benchmarks such as AIME 2024, MATH-500, GPQA-diamond, and LiveCodeBench. For instance, the accuracy of QwQ-32B improved from 80.3% (Round 1) to 82.1% (Round 2) on the AIME 2024 dataset, while DeepSeek-R1 showed a similar increase from 79.7% to 82.0%. These results confirm that Multi-round Thinking is a broadly applicable, straightforward approach to achieving stable enhancements in model performance, underscoring its potential for future developments in test-time scaling techniques. The key prompt: {Original question prompt} The assistant's previous answer is: <answer> {last round answer} </answer>, and please re-answer.
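
The key prompt quoted at the end of the abstract can be expressed as a small template function. The following is a minimal sketch, not the authors' code: the names `PROMPT_TEMPLATE` and `build_round_prompt` are hypothetical, and the newline between the question and the re-answer instruction is an assumption, since the abstract does not show the exact whitespace.

```python
from typing import Optional

# Template wording taken from the abstract; the "\n" separator is an assumption.
PROMPT_TEMPLATE = (
    "{question}\n"
    "The assistant's previous answer is: <answer> {last_answer} </answer>, "
    "and please re-answer."
)


def build_round_prompt(question: str, last_answer: Optional[str] = None) -> str:
    """Return the prompt for one round (hypothetical helper).

    Round 1 has no previous answer, so the original question is sent unchanged.
    Later rounds append the previous answer and the re-answer instruction.
    """
    if last_answer is None:
        return question
    return PROMPT_TEMPLATE.format(question=question, last_answer=last_answer)


if __name__ == "__main__":
    print(build_round_prompt("What is 2 + 2?"))
    print("---")
    print(build_round_prompt("What is 2 + 2?", last_answer="5"))
```

Only the previous round's answer is passed forward in this sketch. How that answer is extracted from the model's full output is not described in the abstract, so it is left to the caller here.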

Submitted to arXiv on 25 Mar. 2025

Results of the summarizing process for the arXiv paper: 2503.19855v1

Comprehensive Summary

The paper "Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking" by Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yunjie Ji, Yiping Peng, Han Zhao, and Xiangang Li builds on recent large language models (LLMs) such as OpenAI-o1 and DeepSeek-R1, which have shown that test-time scaling, in which the model reasons for longer at inference time, substantially improves performance. Current models are nevertheless constrained by limitations in handling long texts and by the efficiency of reinforcement learning (RL) training. To address this, the authors propose a simple test-time scaling approach called Multi-round Thinking, which iteratively refines the model's reasoning by feeding its previous answer back as part of the prompt for the next round. Experiments across multiple models, including QwQ-32B and DeepSeek-R1, consistently show improvements on benchmarks such as AIME 2024, MATH-500, GPQA-diamond, and LiveCodeBench. For example, on AIME 2024 the accuracy of QwQ-32B increased from 80.3% in Round 1 to 82.1% in Round 2, and DeepSeek-R1 showed a similar increase from 79.7% to 82.0%. The authors conclude that Multi-round Thinking is a broadly applicable, straightforward way to obtain stable gains in model performance, and that it has potential for future work on test-time scaling.

Summary
- The paper talks about making big language models better by helping them think longer and smarter.
- Big language models have trouble with long texts and learning efficiently using reinforcement learning.
- The authors suggest a simple but strong way called Multi-round Thinking to fix these problems.
- Tests with different models like QwQ-32B and DeepSeek-R1 showed they got better at answering questions in tests.
- Using Multi-round Thinking made the models more accurate on tests like AIME 2024.

Definitions
- Advancements: Improvements or progress in technology or knowledge.
- Large Language Models (LLMs): Big computer programs that can understand and generate human-like text.
- Test-time scaling: Making a model perform better during testing by allowing it to reason for longer periods.
- Reinforcement Learning (RL): A type of machine learning where the model learns through trial and error, getting rewards for correct actions.

Introduction

The field of natural language processing (NLP) has seen significant advancements in recent years, with large language models (LLMs) like OpenAI-o1 and DeepSeek-R1 showing impressive performance on various tasks. These models demonstrate that reasoning for longer at test time can substantially improve results, but they remain constrained by limitations in handling long texts and by the efficiency of reinforcement learning (RL) training. To improve the reasoning capabilities of LLMs further, researchers Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yunjie Ji, Yiping Peng, Han Zhao, and Xiangang Li have proposed an approach called Multi-round Thinking, which enhances model performance by letting the model think again at test time.

The Need for Test-time Scaling

Test-time scaling refers to spending more computation at inference time, typically by letting the model reason for longer before it commits to an answer. Models such as OpenAI-o1 and DeepSeek-R1 have shown that such extended reasoning can substantially improve performance. According to the abstract, however, current models are constrained in two ways: they have difficulty handling long texts, and training them with reinforcement learning (RL) is inefficient. This motivates a simple approach that works at test time and still delivers stable improvements in model performance.

The Multi-round Thinking Approach

Multi-round Thinking is a simple test-time approach that iteratively refines the model's reasoning by using its previous answer as part of the prompt for the next round. In Round 1, the model answers the original question. In each later round, the model receives the original question together with its previous answer and is asked to answer again. The prompt given in the abstract is: "{Original question prompt} The assistant's previous answer is: <answer> {last round answer} </answer>, and please re-answer." The figures reported in the abstract compare Round 1 with Round 2.
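
The round-by-round procedure can be sketched as a short loop that follows the prompt wording from the abstract. This is an illustrative sketch under stated assumptions, not the authors' implementation: `multi_round_thinking`, `generate`, and `toy_model` are hypothetical names, `generate` stands in for any call that maps a prompt to a final answer, and the fixed number of rounds is an assumption.

```python
from typing import Callable, List


def multi_round_thinking(
    question: str,
    generate: Callable[[str], str],  # hypothetical: prompt in, final answer out
    rounds: int = 2,
) -> List[str]:
    """Run `rounds` rounds and return the answer produced in each round."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    answers: List[str] = []
    for _ in range(rounds):
        if not answers:
            # Round 1: the model sees only the original question.
            prompt = question
        else:
            # Later rounds: original question plus the previous round's answer.
            prompt = (
                f"{question}\n"
                f"The assistant's previous answer is: <answer> {answers[-1]} "
                f"</answer>, and please re-answer."
            )
        answers.append(generate(prompt))
    return answers


def toy_model(prompt: str) -> str:
    """Stand-in for a real LLM call: wrong at first, corrects itself when re-asked."""
    return "4" if "<answer>" in prompt else "5"


if __name__ == "__main__":
    print(multi_round_thinking("What is 2 + 2?", toy_model, rounds=2))
```

Each round in this sketch carries forward only the most recent answer, so the prompt length stays roughly constant as rounds are added.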

Experimental Results

To evaluate Multi-round Thinking, the researchers ran experiments on several LLMs, including QwQ-32B and DeepSeek-R1, using the benchmarks AIME 2024, MATH-500, GPQA-diamond, and LiveCodeBench. The abstract reports consistent improvements across these models and benchmarks. For example, on AIME 2024, QwQ-32B's accuracy increased from 80.3% in Round 1 to 82.1% in Round 2, and DeepSeek-R1 showed a similar increase from 79.7% to 82.0%. The authors take these results as evidence that Multi-round Thinking is broadly applicable.
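
To put the AIME 2024 figures in perspective, the gains can be computed directly from the reported accuracies. The sketch below uses only the four numbers quoted above; the helper names are hypothetical, and the relative error reduction is an extra metric added here for illustration, not one reported in the paper.

```python
# Reported AIME 2024 accuracies in percent: (Round 1, Round 2).
AIME_2024 = {
    "QwQ-32B": (80.3, 82.1),
    "DeepSeek-R1": (79.7, 82.0),
}


def gain_points(round1: float, round2: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(round2 - round1, 1)


def relative_error_reduction(round1: float, round2: float) -> float:
    """Share of Round 1 errors removed in Round 2, in percent (illustrative metric)."""
    return round(100 * (round2 - round1) / (100 - round1), 1)


if __name__ == "__main__":
    for model, (r1, r2) in AIME_2024.items():
        print(
            f"{model}: +{gain_points(r1, r2)} points, "
            f"{relative_error_reduction(r1, r2)}% of Round 1 errors removed"
        )
```

Under these definitions the gains are 1.8 and 2.3 percentage points, which correspond to roughly 9% and 11% of the Round 1 errors.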

Conclusion

In conclusion, the paper "Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking" by Xiaoyu Tian et al. introduces a simple way to improve reasoning in large language models at test time. By feeding the model's previous answer back as part of the next prompt, Multi-round Thinking achieves stable gains across several models and benchmarks. The authors see this as evidence of the method's potential for future developments in test-time scaling techniques.

Created on 06 Oct. 2025

Similar papers summarized with our AI tools

- 82.1%: Crosslingual Reasoning through Test-Time Scaling (cs.CL)
- 82.1%: Scaling Relationship on Learning Mathematical Reasoning with Large Language M… (cs.CL)
- 81.8%: Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling (cs.CL)
- 81.1%: Rethinking the Bounds of LLM Reasoning: Are Multi-Agent Discussions the Key? (cs.CL)
- 81.0%: Challenges and Responses in the Practice of Large Language Models (cs.CL)
- 79.8%: Large language models effectively leverage document-level context for literar… (cs.CL)
- 79.6%: QuALITY: Question Answering with Long Input Texts, Yes! (cs.CL)

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.