Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement

AI-generated keywords: LLM Evaluation, BINEVAL, Natural Language Processing, Interpretable Framework, Prompt Optimization

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors address challenges in evaluating outputs from Large Language Models (LLMs) in NLP
  • Introduce BINEVAL, a framework that targets the cost of human evaluation, the weak correlation of lexical metrics with human judgments, and the opacity of holistic LLM judges
  • BINEVAL decomposes evaluation criteria into atomic binary questions and aggregates the verdicts into multi-dimensional scores
  • Given a task prompt, a meta-prompt generates the fine-grained evaluation questions, and an LLM answers them independently for each output
  • Provides transparent feedback at the question level and produces calibrated overall scores
  • Decomposing the criteria makes evaluations easier to inspect and diagnose, and directly usable for prompt improvement
  • Evaluated on the SummEval, Topical-Chat, and QAGS benchmarks
  • Matches or outperforms strong baselines including UniEval and G-Eval, with especially strong results on factual consistency benchmarks such as QAGS
  • Exhibits competitive correlation with human judgments while better aligning with human score distributions
  • Avoids common ceiling effects observed in previous LLM evaluators, leading to improved discrimination between borderline cases and clearly flawed outputs
  • Supports iterative prompt optimization, improving evaluator prompts on summarization and generation prompts on IFBench under both self-update and cross-model update settings

Authors: Sangwoo Cho, Kushal Chawla, Pengshan Cai, Zefang Liu, Chenyang Zhu, Shi-Xiong Zhang, Sambit Sahu

Accepted to the Second Workshop on Compositional Learning at ICML 2026, Seoul, South Korea
License: CC BY-NC-ND 4.0

Abstract: Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores that are hard to debug. We propose BINEVAL, a framework that decomposes evaluation criteria into atomic binary questions and aggregates the resulting verdicts into interpretable, multi-dimensional scores. Given a task prompt, a meta-prompt generates fine-grained evaluation questions, and an LLM answers them independently for each output, yielding transparent question-level feedback together with calibrated overall scores. This decomposition makes evaluation easier to inspect, easier to diagnose, and directly usable for prompt improvement. Across SummEval, Topical-Chat, and QAGS, BINEVAL matches or outperforms strong baselines including UniEval and G-Eval, with especially strong results on factual consistency benchmarks such as QAGS. Beyond competitive correlation with human judgments, BINEVAL better matches human score distributions and avoids the ceiling effects common in prior LLM judges, leading to better discrimination between borderline and clearly flawed outputs. We further show that the same question-level feedback supports iterative prompt optimization, improving evaluator prompts on summarization and generation prompts on IFBench under both self-update and cross-model update settings. Overall, BINEVAL provides a task-agnostic, training-free, and interpretable evaluation framework that combines strong empirical performance with practical diagnostic and optimization value.
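
The abstract describes answering binary questions independently and aggregating the verdicts into per-dimension and overall scores. The sketch below illustrates that idea under stated assumptions: the aggregation rule (a plain mean of yes/no verdicts), the function `score_output`, the example questions, and the `canned_answer` stand-in for an LLM call are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Question = Tuple[str, str]       # (dimension, yes/no question)
Verdict = Tuple[str, str, bool]  # (dimension, question, passed)


def score_output(
    output: str,
    questions: List[Question],
    answer_fn: Callable[[str, str], bool],
) -> Tuple[Dict[str, float], float, List[Verdict]]:
    """Answer each binary question on its own, then average the verdicts.

    Assumption: scores are simple pass rates. The paper may aggregate differently.
    """
    if not questions:
        raise ValueError("at least one question is required")
    # Each question is answered independently of the others.
    verdicts = [(dim, q, bool(answer_fn(output, q))) for dim, q in questions]
    by_dim = defaultdict(list)
    for dim, _, passed in verdicts:
        by_dim[dim].append(passed)
    dim_scores = {dim: sum(v) / len(v) for dim, v in by_dim.items()}
    overall = sum(passed for _, _, passed in verdicts) / len(verdicts)
    return dim_scores, overall, verdicts


# Hypothetical questions; in BINEVAL these would come from a meta-prompt.
QUESTIONS = [
    ("consistency", "Is every number in the summary stated in the source?"),
    ("consistency", "Are all named entities in the summary present in the source?"),
    ("fluency", "Is the summary free of grammatical errors?"),
]


def canned_answer(output: str, question: str) -> bool:
    # Stand-in for an LLM call: only the first question fails.
    return question != QUESTIONS[0][1]


dim_scores, overall, verdicts = score_output("a candidate summary", QUESTIONS, canned_answer)
```

Because the verdict list is returned alongside the scores, a low score on a dimension can be traced back to the specific question that failed, which is the interpretability property the abstract emphasizes.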

Submitted to arXiv on 25 Jun. 2026


Results of the summarizing process for the arXiv paper: 2606.27226v1


In their paper titled "Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement," authors Sangwoo Cho, Kushal Chawla, Pengshan Cai, Zefang Liu, Chenyang Zhu, Shi-Xiong Zhang, and Sambit Sahu address the difficulty of evaluating outputs from Large Language Models (LLMs) in Natural Language Processing (NLP). Human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores that are hard to debug. The authors introduce BINEVAL as a framework to overcome these limitations.

BINEVAL breaks down evaluation criteria into atomic binary questions and aggregates the resulting verdicts into interpretable, multi-dimensional scores. Given a task prompt, a meta-prompt generates fine-grained evaluation questions, and an LLM answers each question independently for every output. This approach provides transparent feedback at the question level and produces calibrated overall scores.

The decomposition makes evaluations easier to inspect and diagnose, and the question-level feedback is directly usable for prompt improvement. The authors evaluate BINEVAL on SummEval, Topical-Chat, and QAGS, where it matches or outperforms strong baselines including UniEval and G-Eval, with especially strong results on factual consistency benchmarks such as QAGS.

Furthermore, BINEVAL exhibits competitive correlation with human judgments while better matching human score distributions. It avoids the ceiling effects common in previous LLM judges, leading to better discrimination between borderline and clearly flawed outputs. The authors also show that the same question-level feedback supports iterative prompt optimization, improving evaluator prompts on summarization and generation prompts on IFBench under both self-update and cross-model update settings.

Overall, BINEVAL is presented as a task-agnostic, training-free, and interpretable evaluation framework that combines strong empirical performance with practical diagnostic and optimization value.
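
The summary states that question-level feedback supports iterative prompt optimization but does not say how the feedback is assembled. As a rough illustration only, the sketch below counts which binary questions fail most often across a batch of outputs and formats them as text that could be handed to a prompt-updating model. The names `most_failed_questions` and `format_feedback`, the verdict tuple layout, and the counting rule are assumptions, not the paper's procedure.

```python
from collections import Counter
from typing import List, Tuple

Verdict = Tuple[str, str, bool]  # (dimension, question, passed); hypothetical layout


def most_failed_questions(
    runs: List[List[Verdict]], top_k: int = 3
) -> List[Tuple[str, int]]:
    """Count failed questions across all evaluated outputs, most frequent first."""
    counts = Counter(
        question
        for verdicts in runs
        for _, question, passed in verdicts
        if not passed
    )
    return counts.most_common(top_k)


def format_feedback(failures: List[Tuple[str, int]]) -> str:
    """Render the failure counts as plain text for a prompt-revision step."""
    if not failures:
        return "No failed questions."
    return "\n".join(f"- failed {n}x: {question}" for question, n in failures)
```

In a self-update setting the same model would receive this text and revise its own prompt, while in a cross-model setting a different model would do the revision; that revision step is deliberately left out here because the metadata does not describe it.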
Created on 30 Jun. 2026

