o3-mini vs DeepSeek-R1: Which One is Safer?

AI-generated keywords: DeepSeek-R1

AI-generated Key Points

  • Introduction of DeepSeek-R1 as a significant milestone in the AI industry for Large Language Models (LLMs)
  • Exceptional performance of DeepSeek-R1 in tasks such as creative thinking, code generation, mathematics, and automated program repair
  • Importance of prioritizing alignment with safety and human values for LLMs
  • Comparison with key competitor OpenAI's o3-mini model in terms of performance, safety, and cost-effectiveness
  • Systematic assessment using the ASTRAL tool, showing that DeepSeek-R1 exhibited a significantly higher rate of unsafe behavior than o3-mini

Authors: Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, Sergio Segura

arXiv admin note: substantial text overlap with arXiv:2501.17749
License: CC BY 4.0

Abstract: The irruption of DeepSeek-R1 constitutes a turning point for the AI industry in general and the LLMs in particular. Its capabilities have demonstrated outstanding performance in several tasks, including creative thinking, code generation, maths and automated program repair, at apparently lower execution cost. However, LLMs must adhere to an important qualitative property, i.e., their alignment with safety and human values. A clear competitor of DeepSeek-R1 is its American counterpart, OpenAI's o3-mini model, which is expected to set high standards in terms of performance, safety and cost. In this paper we conduct a systematic assessment of the safety level of both, DeepSeek-R1 (70b version) and OpenAI's o3-mini (beta version). To this end, we make use of our recently released automated safety testing tool, named ASTRAL. By leveraging this tool, we automatically and systematically generate and execute a total of 1260 unsafe test inputs on both models. After conducting a semi-automated assessment of the outcomes provided by both LLMs, the results indicate that DeepSeek-R1 is highly unsafe as compared to OpenAI's o3-mini. Based on our evaluation, DeepSeek-R1 answered unsafely to 11.98% of the executed prompts whereas o3-mini only to 1.19%.

Submitted to arXiv on 30 Jan. 2025

Results of the summarizing process for the arXiv paper: 2501.18438v1

The introduction of DeepSeek-R1 has marked a significant milestone in the AI industry, particularly for Large Language Models (LLMs). The model has shown exceptional performance in tasks such as creative thinking, code generation, mathematics, and automated program repair, apparently at a lower execution cost. However, LLMs must also be aligned with safety and human values. A key competitor to DeepSeek-R1 is OpenAI's o3-mini model, which is expected to set high standards in terms of performance, safety, and cost-effectiveness. In this study, the safety levels of DeepSeek-R1 (70b version) and OpenAI's o3-mini (beta version) were systematically assessed using the automated safety testing tool ASTRAL. A total of 1260 unsafe test inputs were generated by combining writing styles (slang, uncommon dialects, technical terms, role-play, misspellings, questions) with persuasion techniques (evidence-based persuasion, expert endorsement, misrepresentation, logical appeal) across the following safety categories: animal abuse; child abuse; controversial topics and politics; discrimination, stereotype, and injustice; drug abuse, weapons, and banned substances; financial crime, property crime, and theft; hate speech and offensive language; misinformation regarding ethics, laws, and safety; non-violent unethical behavior; privacy violation; self-harm; sexually explicit and adult content; terrorism and organized crime; and violence, aiding and abetting, and incitement. The test inputs were executed on both models to evaluate their responses. The results indicated that DeepSeek-R1 exhibited a significantly higher rate of unsafe behavior than o3-mini. Specifically, o3-mini responded unsafely to only 1.19% of the test inputs, while DeepSeek-R1 responded unsafely to 11.98% of them.
Outcomes classified as "unsafe" or "unknown" were also verified manually, taking into account potential cultural biases in judging certain behaviors. Overall, the study highlights the importance of prioritizing safety in LLMs and underscores the need for continuous evaluation and refinement of these models to ensure alignment with ethical standards and human values.
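
To put the reported percentages in concrete terms, the short sketch below converts them back into approximate prompt counts and computes the ratio between the two rates. It assumes that both percentages are taken over the full set of 1260 executed prompts; the function and variable names are illustrative, and the counts are derived by rounding here, not quoted from the paper.

```python
# Figures stated in the abstract.
TOTAL_PROMPTS = 1260  # unsafe test inputs executed on each model
reported_unsafe_pct = {"DeepSeek-R1": 11.98, "o3-mini": 1.19}


def implied_unsafe_count(pct: float, total: int = TOTAL_PROMPTS) -> int:
    """Round a reported percentage back to a whole number of prompts.

    Assumption: the percentage was computed over all `total` prompts.
    """
    return round(pct / 100 * total)


# Derived values (not quoted from the paper).
implied_counts = {
    model: implied_unsafe_count(pct) for model, pct in reported_unsafe_pct.items()
}
rate_ratio = reported_unsafe_pct["DeepSeek-R1"] / reported_unsafe_pct["o3-mini"]

for model, count in implied_counts.items():
    print(f"{model}: about {count} of {TOTAL_PROMPTS} prompts answered unsafely")
print(f"DeepSeek-R1's unsafe rate is roughly {rate_ratio:.1f}x that of o3-mini")
```

Under that assumption, the percentages correspond to roughly 151 unsafe responses for DeepSeek-R1 and 15 for o3-mini, about a tenfold difference in rate.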
Created on 31 Jan. 2025
