SpeechJudge: Towards Human-Level Judgment for Speech Naturalness

AI-generated keywords: Speech Synthesis

AI-generated Key Points

  • Aligning large generative models with human feedback is a critical challenge, and it is especially pronounced in speech synthesis because no large-scale human preference dataset has been available.
  • The SpeechJudge suite was introduced to address this issue, consisting of a dataset, benchmark, and reward model focused on naturalness – a key subjective metric in speech synthesis.
  • SpeechJudge-Data is a substantial corpus comprising 99K speech pairs annotated for both intelligibility and naturalness preference, incorporating diverse zero-shot text-to-speech (TTS) models across various speech styles and languages.
  • SpeechJudge-Eval serves as a rigorous benchmark for evaluating speech naturalness judgment and highlighted the shortcomings of existing metrics and AudioLLMs in this task.
  • SpeechJudge-GRM, a generative reward model based on Qwen2.5-Omni-7B, demonstrated superior performance on the SpeechJudge-Eval benchmark through post-training processes involving Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales and Reinforcement Learning (RL) with GRPO on challenging cases.
  • SpeechJudge-GRM achieved an accuracy of 77.2% (and 79.4% after inference-time scaling @10), surpassing a classic Bradley-Terry reward model at 72.7%.
  • SpeechJudge-GRM can also be used as a reward function during the post-training of speech generation models, helping align them with human preferences.
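The key points compare SpeechJudge-GRM against a "classic Bradley-Terry reward model". For readers unfamiliar with that baseline, the following is a minimal sketch of the standard Bradley-Terry preference formulation. It is not taken from the paper's code. The function names and the use of plain scalar scores are assumptions for illustration; in practice the scores would come from a learned model.

```python
import math

def bt_preference_prob(score_a: float, score_b: float) -> float:
    """P(A preferred over B) under Bradley-Terry: sigmoid(score_a - score_b)."""
    d = score_a - score_b
    # Numerically stable sigmoid for both signs of d.
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)

def bt_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred sample wins.

    Equals softplus(-(score_chosen - score_rejected)), computed stably.
    """
    x = -(score_chosen - score_rejected)
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Hypothetical scalar scores for two speech samples in one annotated pair.
print(bt_preference_prob(1.2, 0.4))
print(bt_loss(1.2, 0.4))
```

The loss shrinks as the preferred sample's score rises above the rejected one, which is what pushes a scalar reward model to rank pairs the way annotators did.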

Authors: Xueyao Zhang, Chaoren Wang, Huan Liao, Ziniu Li, Yuancheng Wang, Li Wang, Dongya Jia, Yuanzhe Chen, Xiulin Li, Zhuo Chen, Zhizheng Wu

Dataset, Model, and Code: https://github.com/AmphionTeam/SpeechJudge
License: CC BY-NC-SA 4.0

Abstract: Aligning large generative models with human feedback is a critical challenge. In speech synthesis, this is particularly pronounced due to the lack of a large-scale human preference dataset, which hinders the development of models that truly align with human perception. To address this, we introduce SpeechJudge, a comprehensive suite comprising a dataset, a benchmark, and a reward model centered on naturalness--one of the most fundamental subjective metrics for speech synthesis. First, we present SpeechJudge-Data, a large-scale human feedback corpus of 99K speech pairs. The dataset is constructed using a diverse set of advanced zero-shot text-to-speech (TTS) models across diverse speech styles and multiple languages, with human annotations for both intelligibility and naturalness preference. From this, we establish SpeechJudge-Eval, a challenging benchmark for speech naturalness judgment. Our evaluation reveals that existing metrics and AudioLLMs struggle with this task; the leading model, Gemini-2.5-Flash, achieves less than 70% agreement with human judgment, highlighting a significant gap for improvement. To bridge this gap, we develop SpeechJudge-GRM, a generative reward model (GRM) based on Qwen2.5-Omni-7B. It is trained on SpeechJudge-Data via a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales followed by Reinforcement Learning (RL) with GRPO on challenging cases. On the SpeechJudge-Eval benchmark, the proposed SpeechJudge-GRM demonstrates superior performance, achieving 77.2% accuracy (and 79.4% after inference-time scaling @10) compared to a classic Bradley-Terry reward model (72.7%). Furthermore, SpeechJudge-GRM can also be employed as a reward function during the post-training of speech generation models to facilitate their alignment with human preferences.
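The abstract reports a gain from "inference-time scaling @10" without describing the aggregation on this page. One common way to scale a generative judge at inference time is to sample several judgments and take a majority vote. The sketch below illustrates that idea only; it is an assumption, not the paper's confirmed procedure. `majority_vote` and the stand-in judge are hypothetical names.

```python
from collections import Counter
from typing import Callable

def majority_vote(sample_judgment: Callable[[], str], n_samples: int = 10) -> str:
    """Sample n_samples pairwise judgments ("A" or "B") and return the most common.

    Returns "tie" when the top two labels receive the same number of votes.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    votes = Counter(sample_judgment() for _ in range(n_samples))
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]

# Stand-in judge that replays fixed outputs; a real judge would query the model
# once per call with sampling enabled (assumption for illustration).
fake_outputs = iter(["A", "B", "A", "A", "B", "A", "A", "B", "A", "B"])
print(majority_vote(lambda: next(fake_outputs)))  # prints A (6 votes to 4)
```

How ties are resolved is a design choice; here they are surfaced explicitly rather than broken at random.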

Submitted to arXiv on 11 Nov. 2025


Results of the summarizing process for the arXiv paper: 2511.07931v2

Aligning large generative models with human feedback is a critical challenge in speech synthesis. No large-scale human preference dataset has been available, which hinders the development of models that accurately reflect human perception. To address this, the SpeechJudge suite was introduced. It consists of a dataset, a benchmark, and a reward model, all focused on naturalness, a key subjective metric in speech synthesis.

The foundation of SpeechJudge is SpeechJudge-Data, a corpus of 99K speech pairs annotated for both intelligibility and naturalness preference. The dataset is built from diverse zero-shot text-to-speech (TTS) models across various speech styles and multiple languages. From this data, SpeechJudge-Eval was established as a benchmark for speech naturalness judgment. The evaluation showed that existing metrics and AudioLLMs struggle with this task: even the leading model, Gemini-2.5-Flash, achieved less than 70% agreement with human judgment.

To bridge this gap, SpeechJudge-GRM, a generative reward model based on Qwen2.5-Omni-7B, was developed. It is trained on SpeechJudge-Data through a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales, followed by Reinforcement Learning (RL) with GRPO on challenging cases. On SpeechJudge-Eval it achieved 77.2% accuracy (79.4% after inference-time scaling @10), surpassing a classic Bradley-Terry reward model at 72.7%. SpeechJudge-GRM can also serve as a reward function during the post-training of speech generation models to improve their alignment with human preferences. Together, the dataset, benchmark, and reward model give researchers and developers resources for improving the naturalness of synthesized speech.
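The RL stage mentioned above uses GRPO. As commonly described, GRPO samples a group of outputs for the same prompt and scores each one relative to the group, using the group mean as the baseline instead of a learned value function. The sketch below shows only that normalization step. The reward values and the `eps` constant are assumptions for illustration, and nothing here reflects the paper's specific reward design.

```python
import statistics
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group.

    eps avoids division by zero when every sample in the group has the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of four sampled judgments for one speech pair:
# reward 1.0 if the judgment matched the human label, else 0.0.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

When all samples in a group receive the same reward, every advantage is zero and the group contributes no gradient signal. That property is one plausible reason to focus the RL stage on challenging cases, though this page does not state the authors' motivation.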
Created on 19 Mar. 2026
