SAGE: A Realistic Benchmark for Semantic Understanding

AI-generated keywords: Large Language Models, Semantic Understanding, SAGE Benchmark, Transformation Robustness, Information Sensitivity

AI-generated Key Points

  • Large language models (LLMs) excel on traditional benchmarks but require more challenging evaluation frameworks for deeper semantic understanding.
  • The Semantic Alignment & Generalization Evaluation (SAGE) benchmark assesses embedding models and similarity metrics across five key categories: Human Preference Alignment, Transformation Robustness, Information Sensitivity, Clustering Performance, and Retrieval Robustness.
  • SAGE includes tasks to evaluate the ability of similarity metrics to distinguish between superficial changes that preserve meaning and semantic alterations that fundamentally alter content.
  • Another critical task in SAGE focuses on detecting and quantifying semantic degradation as content is modified by inserting irrelevant content or removing content spans from long-form datasets.
  • SAGE also includes tasks like Clustering Performance and Retrieval Robustness to assess how well similarity metrics preserve categorical structure in unsupervised settings and handle text corruptions in real-world retrieval systems.
  • Through comprehensive evaluations across multiple datasets using different perturbations and transformations, SAGE uncovers significant performance gaps among embedding models and classical metrics.
  • SAGE exposes critical limitations in current semantic understanding capabilities while providing a more realistic assessment of model robustness for real-world deployment.

Authors: Samarth Goel, Reagan J. Lee, Kannan Ramchandran

39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
License: CC BY 4.0

Abstract: As large language models (LLMs) achieve strong performance on traditional benchmarks, there is an urgent need for more challenging evaluation frameworks that probe deeper aspects of semantic understanding. We introduce SAGE (Semantic Alignment & Generalization Evaluation), a rigorous benchmark designed to assess both embedding models and similarity metrics across five categories: Human Preference Alignment, Transformation Robustness, Information Sensitivity, Clustering Performance, and Retrieval Robustness. Unlike existing benchmarks that focus on isolated capabilities, SAGE evaluates semantic understanding through adversarial conditions, noisy transformations, and nuanced human judgment tasks across 30+ datasets. Our comprehensive evaluation of 9 embedding models and classical metrics reveals significant performance gaps, with no single approach excelling across all dimensions. For instance, while state-of-the-art embedding models like OpenAI's text-embedding-3-large dominate in aligning with human preferences (0.682 vs. 0.591 for the best classical metric), they are significantly outperformed by classical metrics on information sensitivity tasks, where Jaccard Similarity achieves a score of 0.905 compared to the top embedding score of 0.794. SAGE further uncovers critical trade-offs: OpenAI's text-embedding-3-small achieves the highest clustering performance (0.483) but demonstrates extreme brittleness with the lowest robustness score (0.011). SAGE exposes critical limitations in current semantic understanding capabilities and provides a more realistic assessment of model robustness for real-world deployment.
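
The abstract names Jaccard Similarity as the strongest classical metric on information sensitivity. For readers unfamiliar with it, the following is a minimal sketch of token-set Jaccard similarity. The function name `jaccard_similarity` and the lowercased whitespace tokenization are assumptions made for illustration; the paper's exact preprocessing is not stated on this page.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Token-set Jaccard similarity: |A ∩ B| / |A ∪ B|.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        # Convention chosen for this sketch: two empty texts count as identical.
        return 1.0
    return len(tokens_a & tokens_b) / len(union)


if __name__ == "__main__":
    original = "the committee approved the budget on friday"
    truncated = "the committee approved the budget"
    # 4 shared unique tokens out of 6 unique tokens overall.
    print(jaccard_similarity(original, truncated))
```

Because removing or inserting words directly changes the set overlap, a metric of this kind responds to how much content has changed. That may be part of why it scores well on information sensitivity. It has no notion of meaning, however, so on its own it cannot tell a synonym swap from a factual change.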

Submitted to arXiv on 25 Sep. 2025

Results of the summarizing process for the arXiv paper: 2509.21310v1

As large language models (LLMs) continue to excel on traditional benchmarks, the need for more challenging evaluation frameworks that probe semantic understanding has become urgent. The Semantic Alignment & Generalization Evaluation (SAGE) benchmark was introduced to address this gap. SAGE is a rigorous benchmark designed to assess both embedding models and similarity metrics across five key categories: Human Preference Alignment, Transformation Robustness, Information Sensitivity, Clustering Performance, and Retrieval Robustness.

One of the tasks within SAGE is Transformation Robustness, which evaluates the ability of similarity metrics to distinguish between superficial changes that preserve meaning and semantic alterations that fundamentally change content. This task applies transformations to long-form text datasets such as academic papers, legislation, and news articles: superficial perturbations like typos or synonym replacements, and semantic changes like negation or factual modifications. The goal is to measure how well a metric differentiates among original-to-superficial, original-to-semantic, and original-to-summary similarities.

Another critical task in SAGE is Information Sensitivity, which tests whether similarity metrics can accurately detect and quantify semantic degradation as content is modified. By inserting irrelevant content or removing content spans from long-form datasets, this task evaluates how well metrics track changes in meaning in proportion to the amount of perturbation applied. SAGE also includes Clustering Performance and Retrieval Robustness tasks, which assess, respectively, how well similarity metrics preserve categorical structure in unsupervised settings and how well they handle text corruptions in real-world retrieval systems.

Through comprehensive evaluations across multiple datasets using different perturbations and transformations, SAGE uncovers significant performance gaps among embedding models and classical metrics. Overall, SAGE exposes critical limitations in current semantic understanding capabilities while providing a more realistic assessment of model robustness for real-world deployment. The benchmark highlights the need for continued research into evaluation frameworks that push the boundaries of semantic understanding in language models.
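
The Information Sensitivity idea described in this summary can be illustrated with a small perturbation sweep: remove a growing span from a document and record its similarity to the original at each step. Everything below is a hypothetical illustration, not the benchmark's implementation. The helper names `remove_span`, `sensitivity_curve`, and `is_monotonic_decreasing`, the word-level removal from the tail of the text, the chosen fractions, and the use of token-set Jaccard as the metric are all assumptions.

```python
from typing import Callable, List, Sequence, Tuple


def jaccard(text_a: str, text_b: str) -> float:
    """Token-set Jaccard similarity on lowercased whitespace tokens."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0


def remove_span(text: str, fraction: float) -> str:
    """Drop a contiguous span covering `fraction` of the words.

    Assumption: the span is taken from the end of the text so that the
    sketch stays deterministic.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    words = text.split()
    keep = len(words) - round(len(words) * fraction)
    return " ".join(words[:keep])


def sensitivity_curve(
    text: str,
    metric: Callable[[str, str], float],
    fractions: Sequence[float],
) -> List[Tuple[float, float]]:
    """Return (fraction removed, similarity to the original) pairs."""
    return [(f, metric(text, remove_span(text, f))) for f in fractions]


def is_monotonic_decreasing(curve: Sequence[Tuple[float, float]]) -> bool:
    """True if similarity never rises as more content is removed."""
    scores = [score for _, score in curve]
    return all(prev >= nxt for prev, nxt in zip(scores, scores[1:]))


if __name__ == "__main__":
    document = "alpha beta gamma delta epsilon zeta eta theta"
    curve = sensitivity_curve(document, jaccard, [0.0, 0.25, 0.5, 0.75])
    for fraction, score in curve:
        print(f"removed {fraction:.2f} -> similarity {score:.2f}")
    print("monotonic:", is_monotonic_decreasing(curve))
```

Any similarity function with the same two-string signature, such as cosine similarity over embeddings, could be passed as `metric` to compare how closely different approaches track the amount of removed content.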
Created on 26 Sep. 2025

