EnDive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models

AI-generated keywords: EnDive, Cross-Dialect Benchmark, Fairness, Performance, Large Language Models

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • The authors address the challenges that the diversity of human language poses for NLP systems
  • Existing benchmarks often overlook intra-language variation, leaving speakers of non-standard dialects underserved
  • EnDive (English Diversity) is introduced as a benchmark that evaluates five widely used LLMs on tasks in language understanding, algorithmic reasoning, mathematics, and logic
  • Standard American English datasets are translated into five underrepresented dialects using few-shot prompting with verified examples from native speakers
  • The translations are compared against rule-based methods through fluency assessments, preference tests, and semantic similarity metrics
  • Human evaluations confirm high translation quality, with average scores of at least 6.02/7 for faithfulness, fluency, and formality
  • After near-identical translations are filtered out, EnDive exposes significant performance disparities: models consistently underperform on dialectal inputs compared to Standard American English, which points to biases in these systems
  • The research aims to promote inclusivity and fairness in language technologies by uncovering model biases
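
The performance disparity mentioned in the key points can be illustrated with a small Python sketch. It is not the authors' evaluation code: the record format, the `"SAE"` label, the `dialect_a` placeholder, and the toy results are all assumptions made up for illustration, since the metadata does not name the dialects or the scoring procedure.

```python
from collections import defaultdict

def accuracy_by_variety(records):
    """Compute accuracy per language variety.

    `records` is an iterable of (variety, is_correct) pairs.
    This record format is hypothetical, not taken from the paper.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for variety, is_correct in records:
        totals[variety] += 1
        correct[variety] += int(is_correct)
    return {v: correct[v] / totals[v] for v in totals}

def gaps_vs_baseline(accuracies, baseline="SAE"):
    """Return baseline accuracy minus each other variety's accuracy."""
    base = accuracies[baseline]
    return {v: base - acc for v, acc in accuracies.items() if v != baseline}

# Toy, made-up results: "dialect_a" is a placeholder, not a dialect from the paper.
records = [
    ("SAE", True), ("SAE", True), ("SAE", True), ("SAE", False),
    ("dialect_a", True), ("dialect_a", False),
    ("dialect_a", True), ("dialect_a", False),
]

acc = accuracy_by_variety(records)
gaps = gaps_vs_baseline(acc)
print(acc)
print(gaps)
```

A positive gap means the model scored lower on the dialectal inputs than on the Standard American English baseline for the same items.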

Authors: Abhay Gupta, Jacob Cheung, Philip Meng, Shayan Sayyed, Austen Liao, Kevin Zhu, Sean O'Brien

Abstract: The diversity of human language, shaped by social, cultural, and regional influences, presents significant challenges for natural language processing (NLP) systems. Existing benchmarks often overlook intra-language variations, leaving speakers of non-standard dialects underserved. To address this gap, we introduce EnDive (English Diversity), a benchmark that evaluates five widely used large language models (LLMs) across tasks in language understanding, algorithmic reasoning, mathematics, and logic. Our framework translates Standard American English datasets into five underrepresented dialects using few-shot prompting with verified examples from native speakers, and compares these translations against rule-based methods via fluency assessments, preference tests, and semantic similarity metrics. Human evaluations confirm high translation quality, with average scores of at least 6.02/7 for faithfulness, fluency, and formality. By filtering out near-identical translations, we create a challenging dataset that reveals significant performance disparities: models consistently underperform on dialectal inputs compared to Standard American English. EnDive thus advances dialect-aware NLP by uncovering model biases and promoting more equitable language technologies.
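
The step of filtering out near-identical translations could look roughly like the following Python sketch. It swaps in a simple token-level ratio from the standard library's `difflib` in place of whatever metric the authors used. The 0.9 threshold, the function names, and the example sentences are all assumptions for illustration, not values or data from the paper.

```python
from difflib import SequenceMatcher

def surface_similarity(source, translation):
    """Token-level similarity in [0, 1] between two sentences.

    Uses difflib's ratio on lowercased whitespace tokens. This is a
    stand-in metric, not the one used by the EnDive authors.
    """
    a = source.lower().split()
    b = translation.lower().split()
    return SequenceMatcher(None, a, b).ratio()

def drop_near_identical(pairs, threshold=0.9):
    """Keep only (source, translation) pairs that differ enough.

    `threshold` is a hypothetical cutoff chosen for this sketch.
    """
    return [(s, t) for s, t in pairs if surface_similarity(s, t) < threshold]

# Made-up placeholder sentences, not real dialect data.
pairs = [
    ("She is going to the store.", "She is going to the store."),
    ("He does not have any money.", "He has got no money at all."),
]

kept = drop_near_identical(pairs)
print(kept)
```

Only the second pair survives here, because the first translation is identical to its source and so would not test a model on anything dialect-specific.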

Submitted to arXiv on 25 Feb. 2025

Results of the summarizing process for the arXiv paper: 2504.07100v1

In "EnDive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models," Abhay Gupta, Jacob Cheung, Philip Meng, Shayan Sayyed, Austen Liao, Kevin Zhu, and Sean O'Brien address the challenges that the diversity of human language poses for natural language processing (NLP) systems. They note that existing benchmarks often fail to account for intra-language variation and thereby neglect speakers of non-standard dialects. To bridge this gap, the authors introduce EnDive (English Diversity), a benchmark that evaluates five widely used large language models (LLMs) on tasks in language understanding, algorithmic reasoning, mathematics, and logic. The framework translates Standard American English datasets into five underrepresented dialects using few-shot prompting with verified examples from native speakers. These translations are then compared against rule-based methods through fluency assessments, preference tests, and semantic similarity metrics. Human evaluations confirm high translation quality, with average scores of at least 6.02 out of 7 for faithfulness, fluency, and formality. After filtering out near-identical translations, the authors obtain a challenging dataset on which models consistently underperform on dialectal inputs compared to Standard American English, which points to biases within these systems. The findings underscore the need for more equitable, dialect-aware NLP technologies.
Created on 28 Jun. 2025
