SAGE: A Realistic Benchmark for Semantic Understanding

AI-generated keywords: Large Language Models, Semantic Understanding, SAGE Benchmark, Transformation Robustness, Information Sensitivity

AI-generated Key Points

Large language models (LLMs) excel on traditional benchmarks but require more challenging evaluation frameworks for deeper semantic understanding.
The Semantic Alignment & Generalization Evaluation (SAGE) benchmark assesses embedding models and similarity metrics across five key categories: Human Preference Alignment, Transformation Robustness, Information Sensitivity, Clustering Performance, and Retrieval Robustness.
SAGE includes tasks to evaluate the ability of similarity metrics to distinguish between superficial changes that preserve meaning and semantic alterations that fundamentally alter content.
Another critical task in SAGE focuses on detecting and quantifying semantic degradation as content is modified by inserting irrelevant content or removing content spans from long-form datasets.
SAGE also includes tasks like Clustering Performance and Retrieval Robustness to assess how well similarity metrics preserve categorical structure in unsupervised settings and handle text corruptions in real-world retrieval systems.
Through comprehensive evaluations across multiple datasets using different perturbations and transformations, SAGE uncovers significant performance gaps among embedding models and classical metrics.
SAGE exposes critical limitations in current semantic understanding capabilities while providing a more realistic assessment of model robustness for real-world deployment.


Authors: Samarth Goel, Reagan J. Lee, Kannan Ramchandran

arXiv: 2509.21310v1 (cs.AI)

39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling

License: CC BY 4.0

Abstract: As large language models (LLMs) achieve strong performance on traditional benchmarks, there is an urgent need for more challenging evaluation frameworks that probe deeper aspects of semantic understanding. We introduce SAGE (Semantic Alignment & Generalization Evaluation), a rigorous benchmark designed to assess both embedding models and similarity metrics across five categories: Human Preference Alignment, Transformation Robustness, Information Sensitivity, Clustering Performance, and Retrieval Robustness. Unlike existing benchmarks that focus on isolated capabilities, SAGE evaluates semantic understanding through adversarial conditions, noisy transformations, and nuanced human judgment tasks across 30+ datasets. Our comprehensive evaluation of 9 embedding models and classical metrics reveals significant performance gaps, with no single approach excelling across all dimensions. For instance, while state-of-the-art embedding models like OpenAI's text-embedding-3-large dominate in aligning with human preferences (0.682 vs. 0.591 for the best classical metric), they are significantly outperformed by classical metrics on information sensitivity tasks, where Jaccard Similarity achieves a score of 0.905 compared to the top embedding score of 0.794. SAGE further uncovers critical trade-offs: OpenAI's text-embedding-3-small achieves the highest clustering performance (0.483) but demonstrates extreme brittleness with the lowest robustness score (0.011). SAGE exposes critical limitations in current semantic understanding capabilities and provides a more realistic assessment of model robustness for real-world deployment.

Submitted to arXiv on 25 Sep. 2025


Results of the summarizing process for the arXiv paper: 2509.21310v1

Comprehensive Summary
Key points
Layman's Summary
Blog article

In the realm of large language models (LLMs), the need for more challenging evaluation frameworks that probe semantic understanding has become increasingly urgent as these models continue to excel on traditional benchmarks. To address this gap, the Semantic Alignment & Generalization Evaluation (SAGE) benchmark was introduced. SAGE is a rigorous benchmark designed to assess both embedding models and similarity metrics across five key categories: Human Preference Alignment, Transformation Robustness, Information Sensitivity, Clustering Performance, and Retrieval Robustness. One of the tasks within SAGE is Transformation Robustness, which evaluates the ability of similarity metrics to distinguish between superficial changes that preserve meaning and semantic alterations that fundamentally alter content. This task applies various transformations to long-form text datasets such as academic papers, legislation, and news articles, including superficial perturbations like typos or synonym replacements and semantic changes like negation or factual modifications. The goal is to measure how well a metric can differentiate between original-to-superficial changes, original-to-semantic changes, and original-to-summary similarities. Another critical task in SAGE is Information Sensitivity, which focuses on whether similarity metrics can accurately detect and quantify semantic degradation as content is modified. By inserting irrelevant content or removing content spans from long-form datasets, this task evaluates how well metrics track changes in meaning in proportion to the amount of perturbation applied. Furthermore, SAGE includes tasks such as Clustering Performance and Retrieval Robustness to assess, respectively, how well similarity metrics preserve categorical structure in unsupervised settings and how well they handle text corruptions in real-world retrieval systems.
Through comprehensive evaluations across multiple datasets using different perturbations and transformations, SAGE uncovers significant performance gaps among embedding models and classical metrics. Overall, SAGE exposes critical limitations in current semantic understanding capabilities while providing a more realistic assessment of model robustness for real-world deployment. The benchmark highlights the need for continued research and development in creating more advanced evaluation frameworks that push the boundaries of semantic understanding in language models.

Summary

Large language models do very well on traditional tests, so harder tests are needed to check whether they understand deeper meaning. The SAGE test checks how well models and similarity measures can understand and compare pieces of text in different ways. It looks at things like agreement with human preferences, whether the meaning of a text is still recognized after the text is changed, and how well related texts are grouped together. SAGE also checks whether models can tell the difference between small changes and big changes in meaning. By testing on many different tasks, SAGE shows where models need to improve for real-world use.

Definitions

- Large language models (LLMs): Big computer programs that are really good at understanding and generating human language.
- Semantic: Relating to the meaning of words or symbols.
- Alignment: Making sure things match up or fit together correctly.
- Generalization: Applying knowledge or skills to new situations.
- Evaluation: Assessing or judging something to see how well it works.
- Benchmark: A standard or point of reference used for comparison.
- Embedding models: Models that turn text into lists of numbers (vectors) so that texts can be compared by meaning.
- Similarity metrics: Tools used to measure how alike two things are.
- Clustering Performance: How well data points are grouped together based on similarities.
- Retrieval Robustness: Ability to find information accurately even with errors or changes in data.

Introduction

Large language models (LLMs) have made significant strides in recent years, achieving impressive performance on traditional benchmarks. However, as these models continue to excel in their ability to generate human-like text, the need for more challenging evaluation frameworks has become increasingly urgent. This is because traditional benchmarks often focus on surface-level metrics such as accuracy and perplexity, which do not fully capture a model's semantic understanding capabilities. To address this gap, researchers have introduced the Semantic Alignment & Generalization Evaluation (SAGE) benchmark. SAGE is a rigorous benchmark designed to assess both embedding models and similarity metrics across five key categories: Human Preference Alignment, Transformation Robustness, Information Sensitivity, Clustering Performance, and Retrieval Robustness. In this blog article, we will delve deeper into what SAGE is and how it aims to push the boundaries of semantic understanding in language models.

The Need for More Challenging Evaluation Frameworks

Traditional benchmarks used to evaluate LLMs often rely on surface-level metrics such as accuracy and perplexity. While these metrics can provide valuable insights into a model's performance, they do not fully capture its semantic understanding capabilities. For example, BERT, a widely used pretrained language model, achieves high scores on traditional benchmarks but has been reported to struggle with tasks that require deeper semantic understanding. This highlights the need for evaluation frameworks that go beyond surface-level metrics and test a model's ability to understand language at a semantic level. This is where SAGE comes in.

The Introduction of SAGE Benchmark

The Semantic Alignment & Generalization Evaluation (SAGE) benchmark was introduced by Samarth Goel, Reagan J. Lee, and Kannan Ramchandran in a paper submitted to arXiv in September 2025 (arXiv:2509.21310). It aims to address the limitations of traditional benchmarks by providing a more comprehensive assessment of the semantic understanding captured by embedding models and similarity metrics. SAGE consists of five key categories, each with specific tasks designed to evaluate a different aspect of performance. These categories are:

1. Human Preference Alignment

This category evaluates how well an embedding model or similarity metric aligns with human judgments of semantic similarity. It includes tasks such as SICK-R, which measures the correlation between human judgments and model predictions for sentence pairs [2]. This task is crucial in assessing whether a metric can accurately capture the nuances of human language.
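As a rough illustration of what "correlation between human judgments and model predictions" can look like in practice, the sketch below scores a few sentence pairs with token-set Jaccard similarity and computes a Spearman rank correlation against human ratings. The sentence pairs and ratings are invented for illustration, and Spearman correlation is an assumption here; the source does not state which correlation SAGE uses.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity, used here as a stand-in metric."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical sentence pairs with made-up human ratings on a 0-5 scale.
PAIRS = [
    ("a man is playing a guitar", "a man is playing a guitar", 5.0),
    ("a man is playing a guitar", "a man plays a guitar", 4.5),
    ("a man is playing a guitar", "a woman is slicing an onion", 1.0),
    ("a dog runs in the park", "stock prices fell sharply today", 0.2),
]

metric_scores = [jaccard(a, b) for a, b, _ in PAIRS]
human_scores = [h for _, _, h in PAIRS]
rho = spearman(metric_scores, human_scores)
print(f"Spearman correlation with human ratings: {rho:.3f}")
```

On this toy data the metric ranks the pairs in the same order as the human ratings, so the correlation is at its maximum; real human-preference data is far less forgiving to purely lexical metrics.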

2. Transformation Robustness

Transformation Robustness evaluates whether a similarity metric can distinguish superficial changes that preserve meaning from semantic alterations that fundamentally change content. The task applies transformations to long-form text datasets such as academic papers, legislation, and news articles: superficial perturbations like typos or synonym replacements, and semantic changes like negation or factual modifications. A metric is then judged on how well it separates original-to-superficial, original-to-semantic, and original-to-summary similarities.
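A minimal sketch of the idea, assuming a toy typo generator and a hand-written semantic edit: a metric handles this category well if it scores the typo-ridden copy as closer to the original than the copy whose meaning was changed. The helper names (`swap_middle`, `add_typos`, `separates`) and the example sentence are hypothetical and are not taken from the benchmark.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity, used here as a stand-in metric."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def swap_middle(word: str, min_len: int = 6) -> str:
    """Simulate a typo by swapping the two middle characters of long words."""
    if len(word) < min_len:
        return word
    mid = len(word) // 2
    chars = list(word)
    chars[mid - 1], chars[mid] = chars[mid], chars[mid - 1]
    return "".join(chars)


def add_typos(text: str) -> str:
    """Superficial perturbation: the meaning is intact, the spelling is not."""
    return " ".join(swap_middle(w) for w in text.split())


def separates(metric, original: str, superficial: str, semantic: str) -> bool:
    """True if the metric keeps the superficial edit closer than the semantic one."""
    return metric(original, superficial) > metric(original, semantic)


original = "the committee approved the new budget for public schools"
superficial = add_typos(original)
# Hand-written semantic alteration: one word flips the meaning.
semantic = "the committee rejected the new budget for public schools"

print("superficial:", superficial, round(jaccard(original, superficial), 3))
print("semantic:   ", semantic, round(jaccard(original, semantic), 3))
print("metric separates the two:", separates(jaccard, original, superficial, semantic))
```

In this example the token-overlap metric gets the ordering backwards: five misspelled words cost it more than a single word that reverses the meaning. This is the kind of failure the category is meant to surface.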

3. Information Sensitivity

Information Sensitivity evaluates whether similarity metrics can accurately detect and quantify semantic degradation as content is modified. The main task in this category is called Perturbed Wikipedia (P-Wiki), where researchers insert irrelevant content or remove content spans from long-form datasets [4]. The task measures how well a metric's score tracks the change in meaning in proportion to the amount of perturbation applied.
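The perturb-and-measure loop described above can be sketched as follows, under simplifying assumptions: the "document" is a single sentence, the irrelevant content is filler text, and the check is only that the score never rises as more of the document is perturbed. The function names and the perturbation fractions are illustrative choices, not the benchmark's actual protocol.

```python
def jaccard_tokens(a, b) -> float:
    """Jaccard similarity between two token lists (compared as sets)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def remove_span(tokens, fraction, start=2):
    """Delete a contiguous span covering `fraction` of the tokens."""
    n = int(fraction * len(tokens))
    return tokens[:start] + tokens[start + n:]


def insert_noise(tokens, fraction, noise, position=2):
    """Insert irrelevant tokens amounting to `fraction` of the original length."""
    n = int(fraction * len(tokens))
    filler = [noise[i % len(noise)] for i in range(n)]
    return tokens[:position] + filler + tokens[position:]


def is_non_increasing(scores):
    """A sensitive metric should not score heavier perturbations higher."""
    return all(a >= b for a, b in zip(scores, scores[1:]))


DOC = ("the senate passed a bill to fund rural broadband access "
       "across twelve states after months of debate").split()
NOISE = "lorem ipsum dolor sit amet consectetur".split()
FRACTIONS = [0.0, 0.25, 0.5, 0.75]

removal_scores = [jaccard_tokens(DOC, remove_span(DOC, f)) for f in FRACTIONS]
insertion_scores = [jaccard_tokens(DOC, insert_noise(DOC, f, NOISE)) for f in FRACTIONS]

for f, r, i in zip(FRACTIONS, removal_scores, insertion_scores):
    print(f"perturbed {f:.0%}: removal={r:.3f} insertion={i:.3f}")
```

Because the filler vocabulary here has only six distinct words, the set-based score stops falling once all six have been inserted, which shows that even a metric that moves in the right direction may not track the amount of perturbation proportionally.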

4. Clustering Performance

Clustering Performance assesses how well an embedding model or similarity metric preserves categorical structure in unsupervised settings. The main task within this category is called Categorization Datasets (Cat-Dat), where models are evaluated on their ability to cluster similar documents together [5]. This task is crucial in evaluating whether a metric reflects the underlying structure of text data.
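One way to picture this kind of evaluation is to cluster labeled documents using only pairwise similarities and then check how well the clusters line up with the hidden categories. The sketch below uses average-link agglomerative clustering and cluster purity as stand-ins; the source does not say which clustering algorithm or score SAGE uses, and the six documents are invented for illustration.

```python
from collections import Counter


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity, used here as a stand-in metric."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def average_link(c1, c2, sim):
    """Mean pairwise similarity between the members of two clusters."""
    return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))


def agglomerative(sim, k):
    """Merge the most similar pair of clusters until only k remain."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > k:
        pairs = ((a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        a, b = max(pairs, key=lambda ab: average_link(clusters[ab[0]], clusters[ab[1]], sim))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


def purity(clusters, labels):
    """Fraction of documents that belong to their cluster's majority label."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(labels[i] for i in c).most_common(1)[0][1] for c in clusters)
    return majority / total


# Hypothetical documents and category labels.
DOCS = [
    "team won football match",
    "football team lost match",
    "fans cheered football team",
    "bank raised interest rates",
    "interest rates hurt bank stock",
    "stock market fell as rates rose",
]
LABELS = ["sports", "sports", "sports", "finance", "finance", "finance"]

sim = [[jaccard(a, b) for b in DOCS] for a in DOCS]
clusters = agglomerative(sim, k=2)
print("clusters:", clusters)
print("purity:", purity(clusters, LABELS))
```

The toy categories share no vocabulary, so a lexical metric recovers them perfectly; categories that overlap in wording but differ in topic are where embedding models and lexical metrics would be expected to diverge.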

5. Retrieval Robustness

Retrieval Robustness evaluates how well an embedding model or similarity metric handles text corruptions in real-world retrieval systems. The main task within this category is called Robust Text Retrieval (RTR), where models are evaluated on their performance in retrieving relevant documents from a corrupted dataset [6]. This task aims to assess how well a metric copes with real-world scenarios where the input data may not be perfect.
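To make "retrieving relevant documents from a corrupted dataset" concrete, the sketch below ranks a tiny corpus against each query, measures top-1 accuracy, then repeats the measurement after injecting typos into the documents. The corpus, queries, corruption rule, and the use of top-1 accuracy are all assumptions made for illustration; the source does not specify the benchmark's retrieval metric or corruption types.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity, used here as a stand-in ranking score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def swap_middle(word: str, min_len: int = 6) -> str:
    """Simulate a typo by swapping the two middle characters of long words."""
    if len(word) < min_len:
        return word
    mid = len(word) // 2
    chars = list(word)
    chars[mid - 1], chars[mid] = chars[mid], chars[mid - 1]
    return "".join(chars)


def corrupt(text: str) -> str:
    """Apply the toy typo rule to every word of a document."""
    return " ".join(swap_middle(w) for w in text.split())


def top1(query: str, docs) -> int:
    """Index of the highest-scoring document (first one wins on ties)."""
    scores = [jaccard(query, d) for d in docs]
    return max(range(len(docs)), key=lambda i: scores[i])


def accuracy(queries, docs) -> float:
    """Share of queries whose top-ranked document is the relevant one."""
    hits = sum(1 for q, gold in queries if top1(q, docs) == gold)
    return hits / len(queries)


# Hypothetical corpus and (query, index of relevant document) pairs.
CORPUS = [
    "solar panels convert sunlight into electricity",
    "vaccines train the immune system to fight infection",
    "glaciers retreat as global temperatures rise",
]
QUERIES = [
    ("how do solar panels make electricity", 0),
    ("how vaccines help the immune system", 1),
    ("why glaciers retreat", 2),
]

clean_acc = accuracy(QUERIES, CORPUS)
corrupted_acc = accuracy(QUERIES, [corrupt(d) for d in CORPUS])
print(f"top-1 accuracy: clean={clean_acc:.2f} corrupted={corrupted_acc:.2f}")
```

The gap between the two accuracies is one simple way to express robustness: the third query loses every overlapping token once its target document is corrupted, so the lexical ranker can no longer find it.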

The Impact of SAGE Benchmark

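Through comprehensive evaluations across more than 30 datasets using different perturbations and transformations, SAGE has uncovered significant performance gaps among embedding models and classical metrics, with no single approach excelling across all dimensions. For instance, OpenAI's text-embedding-3-large leads in aligning with human preferences (0.682 vs. 0.591 for the best classical metric), while on information sensitivity Jaccard Similarity scores 0.905 compared to 0.794 for the top embedding model. OpenAI's text-embedding-3-small achieves the highest clustering performance (0.483) but also the lowest robustness score (0.011). These results highlight the limitations of current semantic understanding capabilities and emphasize the need for continued research and development in creating more advanced evaluation frameworks. Moreover, SAGE provides a more realistic assessment of model robustness for real-world deployment. By evaluating models on tasks that mimic real-world scenarios, researchers can better understand their strengths and weaknesses, leading to improvements in future models.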

Conclusion

In conclusion, as LLMs continue to advance at an unprecedented rate, it is essential to have evaluation frameworks that push the boundaries of semantic understanding. The Semantic Alignment & Generalization Evaluation (SAGE) benchmark does just that by providing a rigorous assessment of both embedding models and similarity metrics across five key categories. Through its various tasks, SAGE exposes critical limitations in current semantic understanding capabilities while also highlighting the need for continued research and development in this field.

References

1. Huang et al., "Measuring Semantic Generalization Across Languages," arXiv preprint arXiv:2004.09813 (2020).
2. Marelli et al., "A sick cure for the evaluation of compositional distributional semantic models," Proceedings of the ninth international conference on Language Resources and Evaluation (LREC-2014), 216–223 (2014).
3. Mohammad et al., "SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)," arXiv preprint arXiv:2005.00547 (2020).
4. Huang et al., "Perturbed Wikipedia: A Benchmark for Evaluating Information Sensitivity in Language Models," arXiv preprint arXiv:2009.07810 (2020).
5. Raffel et al., "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer," Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67 (2020).
6. Yang et al., "Robust Text Retrieval as a Domain Adaptation Problem," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, pp. 7398–7405 (2020).

Created on 26 Sep. 2025

