Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models

AI-generated keywords: Reasoning models, Adversarial triggers, CatAttack, Vulnerabilities, Robustness

AI-generated Key Points

  • The researchers investigate the robustness of reasoning models trained for step-by-step problem solving
  • They introduce query-agnostic adversarial triggers: short, irrelevant text that misleads models into giving incorrect answers without changing the problem's semantics
  • They present CatAttack, an automated attack pipeline that generates triggers on a weaker proxy model and transfers them to more advanced reasoning target models
  • The transfer results in a greater than 300% increase in the likelihood of the target model producing an incorrect answer
  • Appending a phrase such as "Interesting fact: cats sleep most of their lives" to a math problem more than doubles the chance of an incorrect response
  • The findings expose critical vulnerabilities in reasoning models, including state-of-the-art ones
  • The CatAttack triggers dataset, with model responses, is available for further study
  • State-of-the-art reasoning models are vulnerable to query-agnostic adversarial triggers that significantly raise the probability of incorrect outputs
  • Triggers found on a less powerful model transfer to stronger reasoning models, increasing error rates more than threefold
  • The results highlight a lack of inherent robustness in reasoning models against subtle adversarial manipulations
  • Adversarial triggers not only cause wrong answers but also lead to an unreasonable expansion in response length, which may cause computational inefficiencies
  • The work emphasizes the need for stronger security and reliability measures when deploying reasoning models in domains such as finance, law, and healthcare
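The response-length point in the list above can be made concrete with a small measurement. The following is only a minimal sketch: the function names are hypothetical, and whitespace splitting is an assumed stand-in for whatever tokenizer the paper actually uses to measure length.

```python
def token_count(text: str) -> int:
    """Count whitespace-separated tokens (an assumed proxy for model tokens)."""
    return len(text.split())


def length_expansion(clean_response: str, attacked_response: str) -> float:
    """Ratio of the attacked response length to the clean response length.

    A value above 1.0 means the response grew once the trigger was appended.
    """
    clean = token_count(clean_response)
    if clean == 0:
        raise ValueError("clean response is empty; the ratio is undefined")
    return token_count(attacked_response) / clean
```

For example, `length_expansion("The answer is 4", "Wait, let me reconsider. The answer is 4")` compares a 4-token response with an 8-token one.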

Authors: Meghana Rajeev, Rajkumar Ramamurthy, Prapti Trivedi, Vikas Yadav, Oluwanifemi Bamgbose, Sathwik Tejaswi Madhusudan, James Zou, Nazneen Rajani

License: CC BY-NC-SA 4.0

Abstract: We investigate the robustness of reasoning models trained for step-by-step problem solving by introducing query-agnostic adversarial triggers - short, irrelevant text that, when appended to math problems, systematically mislead models to output incorrect answers without altering the problem's semantics. We propose CatAttack, an automated iterative attack pipeline for generating triggers on a weaker, less expensive proxy model (DeepSeek V3) and successfully transfer them to more advanced reasoning target models like DeepSeek R1 and DeepSeek R1-distilled-Qwen-32B, resulting in greater than 300% increase in the likelihood of the target model generating an incorrect answer. For example, appending, "Interesting fact: cats sleep most of their lives," to any math problem leads to more than doubling the chances of a model getting the answer wrong. Our findings highlight critical vulnerabilities in reasoning models, revealing that even state-of-the-art models remain susceptible to subtle adversarial inputs, raising security and reliability concerns. The CatAttack triggers dataset with model responses is available at https://huggingface.co/datasets/collinear-ai/cat-attack-adversarial-triggers.
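To illustrate what "query-agnostic" means in practice, here is a minimal sketch of appending a fixed trigger to arbitrary problems and comparing error rates. Only the trigger sentence comes from the abstract. The function names, the `toy_model` stub, and the reading of the percentage as a relative increase over the baseline error rate are assumptions for illustration, not the authors' code.

```python
from typing import Callable, Sequence

# Trigger text quoted from the abstract; everything else below is hypothetical.
TRIGGER = "Interesting fact: cats sleep most of their lives."


def append_trigger(problem: str, trigger: str = TRIGGER) -> str:
    """Append the trigger after the problem; the problem text itself is unchanged."""
    return f"{problem.rstrip()} {trigger}"


def error_rate(answer_fn: Callable[[str], str],
               prompts: Sequence[str],
               gold: Sequence[str]) -> float:
    """Fraction of prompts for which answer_fn disagrees with the gold answer."""
    wrong = sum(answer_fn(p) != g for p, g in zip(prompts, gold))
    return wrong / len(prompts)


def relative_increase_pct(base: float, attacked: float) -> float:
    """Relative increase of the attacked error rate over the baseline, in percent."""
    if base <= 0:
        raise ValueError("baseline error rate must be positive")
    return (attacked - base) / base * 100.0


def toy_model(prompt: str) -> str:
    """Stand-in for a real model: answers wrongly whenever 'cats' appears."""
    return "5" if "cats" in prompt else "4"
```

Under this assumed definition, a 300% increase corresponds to an attacked error rate four times the baseline, for example going from 2% to 8%. These figures are illustrative and are not taken from the paper.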

Submitted to arXiv on 03 Mar. 2025

Results of the summarizing process for the arXiv paper: 2503.01781v1

In their preprint, currently under review, Meghana Rajeev, Rajkumar Ramamurthy, Prapti Trivedi, Vikas Yadav, Oluwanifemi Bamgbose, Sathwik Tejaswi Madhusudan, James Zou, and Nazneen Rajani investigate the robustness of reasoning models trained for step-by-step problem solving. They introduce query-agnostic adversarial triggers: short, irrelevant text that, when appended to a math problem, misleads a model into giving an incorrect answer without changing the problem's semantics.

The team presents CatAttack, an automated iterative attack pipeline that generates triggers on a weaker, less expensive proxy model (DeepSeek V3) and transfers them to more advanced reasoning target models such as DeepSeek R1 and DeepSeek R1-distilled-Qwen-32B. The transfer results in a greater than 300% increase in the likelihood of the target model producing an incorrect answer. For instance, appending "Interesting fact: cats sleep most of their lives" to a math problem more than doubles the chance of an incorrect response. The researchers make their CatAttack triggers dataset, with model responses, available for further study.

The authors conclude that state-of-the-art reasoning models are vulnerable to query-agnostic adversarial triggers that significantly raise the probability of incorrect outputs. Triggers identified on a less powerful model transfer to stronger reasoning models such as DeepSeek R1 and increase error rates more than threefold, which highlights the lack of inherent robustness against subtle adversarial manipulations. The triggers also lead to an unreasonable expansion in response length, which could cause computational inefficiencies. The work emphasizes the need for stronger security and reliability measures when deploying reasoning models in domains such as finance, law, and healthcare.
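The proxy-then-transfer idea summarized above can be sketched as a simple loop. This is a simplified illustration under stated assumptions, not the authors' pipeline: the candidate triggers come from a fixed list rather than being generated and refined iteratively, correctness is checked by exact string match, and `search_trigger`, `transfer_error_rate`, `toy_proxy`, and `toy_target` are hypothetical names standing in for real model calls.

```python
from typing import Callable, Iterable, Optional, Sequence


def search_trigger(problem: str,
                   gold: str,
                   candidates: Iterable[str],
                   proxy_answer: Callable[[str], str],
                   max_iters: int = 10) -> Optional[str]:
    """Return the first candidate that makes the proxy model answer incorrectly."""
    for step, candidate in enumerate(candidates):
        if step >= max_iters:
            break
        if proxy_answer(f"{problem} {candidate}") != gold:
            return candidate
    return None


def transfer_error_rate(trigger: str,
                        problems: Sequence[str],
                        golds: Sequence[str],
                        target_answer: Callable[[str], str]) -> float:
    """Error rate of the target model when the same trigger is appended everywhere."""
    wrong = sum(target_answer(f"{p} {trigger}") != g
                for p, g in zip(problems, golds))
    return wrong / len(problems)


def toy_proxy(prompt: str) -> str:
    """Stand-in for the weaker proxy model: fooled by any mention of cats."""
    return "5" if "cats" in prompt else "4"


def toy_target(prompt: str) -> str:
    """Stand-in for the stronger target model: fooled only on some problems."""
    return "5" if ("cats" in prompt and "17" in prompt) else "4"
```

The design point the sketch tries to capture is that the search only queries the cheaper proxy, while the target model is queried once per problem to check whether the trigger transfers.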
Created on 06 Mar. 2025

