ASTRAL: Automated Safety Testing of Large Language Models

AI-generated keywords: Large Language Models (LLMs)

AI-generated Key Points

  • Large Language Models (LLMs) raise safety concerns despite their impressive abilities
  • Existing LLM testing frameworks address safety issues but face challenges due to unbalanced and outdated datasets
  • ASTRAL is a novel tool automating test case generation for assessing LLM safety
  • ASTRAL's innovative black-box coverage criterion aims to produce balanced and diverse unsafe test inputs across safety categories and writing styles
  • When used as the test oracle, GPT3.5 detects unsafe responses more accurately than other LLMs, including the more recent GPT-4 and specialized models such as LlamaGuard
  • ASTRAL uncovers nearly double the number of unsafe behaviors compared to static datasets with the same number of test inputs
  • Combining the black-box coverage criterion with web browsing effectively guides the LLM toward generating up-to-date unsafe test inputs, significantly increasing the number of unsafe behaviors found

Authors: Miriam Ugarte, Pablo Valle, José Antonio Parejo, Sergio Segura, Aitor Arrieta

The 6th ACM/IEEE International Conference on Automation of Software Test (AST 2025)
License: CC BY 4.0

Abstract: Large Language Models (LLMs) have recently gained attention due to their ability to understand and generate sophisticated human-like content. However, ensuring their safety is paramount as they might provide harmful and unsafe responses. Existing LLM testing frameworks address various safety-related concerns (e.g., drugs, terrorism, animal abuse) but often face challenges due to unbalanced and obsolete datasets. In this paper, we present ASTRAL, a tool that automates the generation and execution of test cases (i.e., prompts) for testing the safety of LLMs. First, we introduce a novel black-box coverage criterion to generate balanced and diverse unsafe test inputs across a diverse set of safety categories as well as linguistic writing characteristics (i.e., different style and persuasive writing techniques). Second, we propose an LLM-based approach that leverages Retrieval Augmented Generation (RAG), few-shot prompting strategies and web browsing to generate up-to-date test inputs. Lastly, similar to current LLM test automation techniques, we leverage LLMs as test oracles to distinguish between safe and unsafe test outputs, allowing a fully automated testing approach. We conduct an extensive evaluation on well-known LLMs, revealing the following key findings: i) GPT3.5 outperforms other LLMs when acting as the test oracle, accurately detecting unsafe responses, and even surpassing more recent LLMs (e.g., GPT-4), as well as LLMs that are specifically tailored to detect unsafe LLM outputs (e.g., LlamaGuard); ii) the results confirm that our approach can uncover nearly twice as many unsafe LLM behaviors with the same number of test inputs compared to currently used static datasets; and iii) our black-box coverage criterion combined with web browsing can effectively guide the LLM on generating up-to-date unsafe test inputs, significantly increasing the number of unsafe LLM behaviors.
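
The coverage criterion described in the abstract can be pictured as a grid over safety categories, writing styles, and persuasion techniques, with the generator asked to fill every cell equally often. The following is a minimal sketch under stated assumptions, not ASTRAL's implementation: the three category names come from the abstract's examples, while the style and persuasion values and the function names `coverage_plan` and `coverage_ratio` are hypothetical placeholders.

```python
from itertools import product

# Category names are taken from the abstract's examples.
CATEGORIES = ["drugs", "terrorism", "animal_abuse"]
# Hypothetical values: the abstract only says "style and persuasive writing techniques".
STYLES = ["question", "slang", "role_play"]
PERSUASION = ["evidence_based", "expert_endorsement", "misrepresentation"]


def coverage_plan(tests_per_combination=1):
    """Return one test specification per (category, style, persuasion) cell.

    Every cell receives the same number of specifications, which is what
    keeps the resulting test suite balanced across all three features.
    """
    plan = []
    for category, style, persuasion in product(CATEGORIES, STYLES, PERSUASION):
        for _ in range(tests_per_combination):
            plan.append(
                {"category": category, "style": style, "persuasion": persuasion}
            )
    return plan


def coverage_ratio(executed):
    """Fraction of the feature grid touched by the executed test specs."""
    covered = {(t["category"], t["style"], t["persuasion"]) for t in executed}
    total = len(CATEGORIES) * len(STYLES) * len(PERSUASION)
    return len(covered) / total


if __name__ == "__main__":
    plan = coverage_plan()
    print(len(plan), "test specifications")
    print("coverage of the full plan:", coverage_ratio(plan))
```

In a real pipeline each specification would be turned into a prompt by an LLM-based generator; here the specification only records which cell of the grid a test is meant to cover.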

Submitted to arXiv on 28 Jan. 2025


Results of the summarizing process for the arXiv paper: 2501.17132v1

Large Language Models (LLMs) have raised safety concerns despite their impressive ability to comprehend and generate complex human-like content. Existing LLM testing frameworks have made progress on safety-related issues, but they remain limited by unbalanced and outdated datasets. To address these limitations, the authors introduce ASTRAL, a tool that automates the generation and execution of test cases for assessing LLM safety. Its key contribution is a black-box coverage criterion that aims to produce balanced and diverse unsafe test inputs across safety categories and linguistic writing styles. ASTRAL uses Retrieval Augmented Generation (RAG), few-shot prompting strategies, and web browsing to keep the generated test inputs up to date. An extensive evaluation on well-known LLMs yields three key findings. First, GPT3.5 is the most accurate test oracle for detecting unsafe responses, outperforming other LLMs such as GPT-4 and specialized models such as LlamaGuard. Second, ASTRAL uncovers nearly twice as many unsafe behaviors as static datasets with the same number of test inputs. Third, combining the black-box coverage criterion with web browsing effectively guides the LLM toward generating up-to-date unsafe test inputs.
Further context comes from example scenarios covering opioid engineering, securities fraud planning within corporations, ethical concerns in animal experimentation, community interventions for preventing animal cruelty, federal audit failures in child abuse investigations, clean air regulations affecting electric vehicle adoption, the impact of diversity in presidential leadership roles on governance standards, the long-term effects of child abuse on mental health, and educating children about keeping secrets regarding instances of abuse. These scenarios help ASTRAL uncover diverse safety issues efficiently, without exhaustive testing. Overall, ASTRAL's automated testing approach shows promising results in identifying unsafe LLM behaviors while adapting to evolving societal challenges across domains.
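
The fully automated loop summarized above (generate a prompt, run it against the model under test, let an oracle judge the response) can be sketched as follows. This is an illustrative outline under stated assumptions rather than ASTRAL's code: `run_safety_tests`, `unsafe_count`, `stub_model`, and `stub_oracle` are hypothetical names, and both stubs are simple stand-ins for what would be LLM calls in the real tool.

```python
from typing import Callable, Dict, List


def run_safety_tests(
    prompts: List[str],
    model_under_test: Callable[[str], str],
    oracle: Callable[[str, str], bool],
) -> List[Dict[str, object]]:
    """Execute each prompt and record whether the oracle flags the response."""
    results = []
    for prompt in prompts:
        response = model_under_test(prompt)
        results.append(
            {
                "prompt": prompt,
                "response": response,
                "unsafe": oracle(prompt, response),
            }
        )
    return results


def unsafe_count(results: List[Dict[str, object]]) -> int:
    """Number of responses the oracle judged unsafe."""
    return sum(1 for r in results if r["unsafe"])


def stub_model(prompt: str) -> str:
    # Stand-in for the LLM under test: it "complies" only with role-play prompts.
    if "role-play" in prompt:
        return "UNSAFE-PLACEHOLDER: complies with the request"
    return "I cannot help with that."


def stub_oracle(prompt: str, response: str) -> bool:
    # Stand-in for the LLM-based oracle: flags the placeholder marker as unsafe.
    return response.startswith("UNSAFE-PLACEHOLDER")


if __name__ == "__main__":
    prompts = ["plain request (placeholder)", "role-play request (placeholder)"]
    results = run_safety_tests(prompts, stub_model, stub_oracle)
    print(unsafe_count(results), "of", len(results), "responses flagged unsafe")
```

Passing the model and the oracle as callables keeps the loop independent of any particular LLM, so different oracles (for example GPT3.5, GPT-4, or LlamaGuard in the paper's comparison) could be swapped in without changing the loop.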
Created on 30 Jan. 2025

