SpeechJudge: Towards Human-Level Judgment for Speech Naturalness

AI-generated keywords: Speech Synthesis

AI-generated Key Points

- Aligning large generative models with human feedback is a critical challenge in speech synthesis because no large-scale human preference dataset has been available.
- The SpeechJudge suite was introduced to address this issue. It consists of a dataset, a benchmark, and a reward model centered on naturalness, a key subjective metric in speech synthesis.
- SpeechJudge-Data is a corpus of 99K speech pairs annotated for both intelligibility and naturalness preference, built with diverse zero-shot text-to-speech (TTS) models across various speech styles and languages.
- SpeechJudge-Eval is a challenging benchmark for speech naturalness judgment. It shows that existing metrics and AudioLLMs struggle with this task.
- SpeechJudge-GRM is a generative reward model based on Qwen2.5-Omni-7B, post-trained in two stages: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales, followed by Reinforcement Learning (RL) with GRPO on challenging cases.
- On SpeechJudge-Eval, SpeechJudge-GRM achieved an accuracy of 77.2% (79.4% with inference-time scaling @10), surpassing a classic Bradley-Terry reward model at 72.7%.
- SpeechJudge-GRM can also be used as a reward function during the post-training of speech generation models, to help align them with human preferences.

Authors: Xueyao Zhang, Chaoren Wang, Huan Liao, Ziniu Li, Yuancheng Wang, Li Wang, Dongya Jia, Yuanzhe Chen, Xiulin Li, Zhuo Chen, Zhizheng Wu

arXiv: 2511.07931v2 (cs.SD)

Dataset, Model, and Code: https://github.com/AmphionTeam/SpeechJudge

License: CC BY-NC-SA 4.0

Abstract: Aligning large generative models with human feedback is a critical challenge. In speech synthesis, this is particularly pronounced due to the lack of a large-scale human preference dataset, which hinders the development of models that truly align with human perception. To address this, we introduce SpeechJudge, a comprehensive suite comprising a dataset, a benchmark, and a reward model centered on naturalness--one of the most fundamental subjective metrics for speech synthesis. First, we present SpeechJudge-Data, a large-scale human feedback corpus of 99K speech pairs. The dataset is constructed using a diverse set of advanced zero-shot text-to-speech (TTS) models across diverse speech styles and multiple languages, with human annotations for both intelligibility and naturalness preference. From this, we establish SpeechJudge-Eval, a challenging benchmark for speech naturalness judgment. Our evaluation reveals that existing metrics and AudioLLMs struggle with this task; the leading model, Gemini-2.5-Flash, achieves less than 70% agreement with human judgment, highlighting a significant gap for improvement. To bridge this gap, we develop SpeechJudge-GRM, a generative reward model (GRM) based on Qwen2.5-Omni-7B. It is trained on SpeechJudge-Data via a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales followed by Reinforcement Learning (RL) with GRPO on challenging cases. On the SpeechJudge-Eval benchmark, the proposed SpeechJudge-GRM demonstrates superior performance, achieving 77.2% accuracy (and 79.4% after inference-time scaling @10) compared to a classic Bradley-Terry reward model (72.7%). Furthermore, SpeechJudge-GRM can be also employed as a reward function during the post-training of speech generation models to facilitate their alignment with human preferences.

Submitted to arXiv on 11 Nov. 2025

Results of the summarizing process for the arXiv paper: 2511.07931v2

Comprehensive Summary

In speech synthesis, aligning large generative models with human feedback is a critical challenge. The difficulty is especially pronounced because no large-scale human preference dataset has been available, which hinders the development of models that accurately reflect human perception. To address this issue, the SpeechJudge suite was introduced. It consists of a dataset, a benchmark, and a reward model centered on naturalness, a key subjective metric in speech synthesis.

The foundation of SpeechJudge is SpeechJudge-Data, a corpus of 99K speech pairs annotated for both intelligibility and naturalness preference. The dataset is built with diverse zero-shot text-to-speech (TTS) models across various speech styles and languages. From this data, SpeechJudge-Eval was established as a challenging benchmark for speech naturalness judgment. The evaluation shows that existing metrics and AudioLLMs struggle with this task: the leading model, Gemini-2.5-Flash, achieves less than 70% agreement with human judgment.

To bridge this gap, SpeechJudge-GRM, a generative reward model based on Qwen2.5-Omni-7B, was developed. It is trained on SpeechJudge-Data in a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales, followed by Reinforcement Learning (RL) with GRPO on challenging cases. On the SpeechJudge-Eval benchmark it achieved an accuracy of 77.2% (79.4% with inference-time scaling @10), surpassing a classic Bradley-Terry reward model at 72.7%.

Furthermore, SpeechJudge-GRM can be used as a reward function during the post-training of speech generation models to improve their alignment with human preferences. Together, the dataset, benchmark, and reward model give researchers and developers resources for improving the quality and naturalness of synthesized speech.
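The RL stage mentioned above uses GRPO, whose core idea is to score each sampled response relative to the other responses drawn for the same prompt. The following is a minimal sketch of that group-relative advantage only, not the paper's training code. The binary reward (1.0 when a sampled judgment matches the human label), the group size of four, and the function name `group_relative_advantages` are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same input.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat their siblings get a positive value and the rest a negative one.
    Population standard deviation is used here; that choice is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled judgments for one speech pair,
# rewarded 1.0 when the predicted preference matches the human annotation.
rewards = [1.0, 0.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards)
```

With this toy group, the three correct judgments receive the same positive advantage and the single incorrect one receives a negative advantage, which is the signal the policy update would then use.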

Layman's Summary

Summary
1. It is hard to make big computer programs that talk like humans match what people actually prefer, because there isn't enough data on what people like.
2. A set of tools called SpeechJudge was made to help with this problem. It includes a dataset, a test, and a model, all focused on how natural the speech sounds.
3. SpeechJudge-Data has 99K pairs of recordings with notes on how clear they are and which one sounds more natural, covering different talking styles and languages.
4. SpeechJudge-Eval is a tough test for checking how well a computer can judge whether speech sounds natural. It found problems with current methods.
5. The SpeechJudge-GRM model did well on this test after learning from worked examples and getting rewards for good guesses.

Definitions
- Generative models: Computer programs that create things like text or speech based on patterns they learn.
- Naturalness: How much something sounds like it was spoken by a real person.
- Dataset: A collection of information used for studying or training computer programs.
- Benchmark: A standard test used to measure performance or quality.
- Reward model: A system that gives points or rewards to a program for making good decisions.

Blog article

Introduction

Speech synthesis, also known as text-to-speech (TTS), is a rapidly growing field with applications such as virtual assistants, audiobooks, and accessibility tools for the visually impaired. One of its open challenges is aligning large generative models with human feedback, which is hindered by the lack of a large-scale human preference dataset. To address this, researchers have introduced SpeechJudge, a suite consisting of a dataset, a benchmark, and a reward model centered on naturalness, one of the most fundamental subjective metrics in speech synthesis. This article walks through the paper and what it contributes toward more natural synthesized speech.

The SpeechJudge Suite

The foundation of SpeechJudge is SpeechJudge-Data, a corpus of 99K speech pairs with human annotations for both intelligibility and naturalness preference. The pairs were generated with a diverse set of advanced zero-shot TTS models, across diverse speech styles and multiple languages. From this data, the researchers built SpeechJudge-Eval, a challenging benchmark for speech naturalness judgment. Evaluation on it shows that existing metrics and AudioLLMs (audio-capable large language models) struggle with the task: the leading model, Gemini-2.5-Flash, achieves less than 70% agreement with human judgment.
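To make the "agreement with human judgment" figure concrete, here is a minimal sketch of how pairwise agreement could be computed: for each speech pair, a judge picks the more natural sample, and the score is the fraction of pairs where that pick matches the human label. The label format ("A" or "B"), the toy data, and the function name `agreement_rate` are assumptions for illustration, not the benchmark's actual data format or scoring code.

```python
def agreement_rate(predictions, human_labels):
    """Return the fraction of pairs where the judge's pick matches the human pick."""
    if len(predictions) != len(human_labels):
        raise ValueError("predictions and human_labels must have the same length")
    if not human_labels:
        raise ValueError("at least one labeled pair is required")
    matches = sum(p == h for p, h in zip(predictions, human_labels))
    return matches / len(human_labels)


# Hypothetical labels for five speech pairs: which sample sounds more natural.
human = ["A", "B", "A", "A", "B"]
judge = ["A", "B", "B", "A", "B"]
rate = agreement_rate(judge, human)  # the judge disagrees on the third pair only
```

Because every item is a two-way choice, a judge that guesses at random would land near 50% on this kind of measure, which is useful context for reading the reported numbers.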

The Need for Better Metrics

TTS systems are commonly evaluated with automatic measures such as word error rate (WER), which targets intelligibility, and with mean opinion score (MOS), a subjective rating that is collected from human listeners or estimated by an automatic predictor. Human ratings are costly to collect at scale, so automatic judges are attractive. AudioLLMs are one candidate for that role, since they can be prompted to compare two audio clips directly. However, according to the paper's evaluation, both existing metrics and AudioLLMs still fall short of human judgment on naturalness: even the leading model, Gemini-2.5-Flash, agrees with human annotators less than 70% of the time.

The Role of SpeechJudge-GRM

To bridge this gap, the researchers developed SpeechJudge-GRM, a generative reward model (GRM) based on Qwen2.5-Omni-7B. It was trained on SpeechJudge-Data in a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales, followed by Reinforcement Learning (RL) with GRPO on challenging cases. On the SpeechJudge-Eval benchmark, SpeechJudge-GRM reaches 77.2% accuracy, and 79.4% with inference-time scaling @10. By comparison, a classic Bradley-Terry reward model reaches 72.7%.
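For readers unfamiliar with the Bradley-Terry baseline mentioned above, the sketch below shows the standard formulation: each sample gets a scalar score, the probability that one sample is preferred is the logistic function of the score difference, and training minimizes the negative log-likelihood of the human choice. This is the textbook form, not the paper's implementation. The scores are made-up numbers and the function names are illustrative.

```python
import math


def bt_preference_probability(score_a, score_b):
    """Bradley-Terry probability that sample A is preferred over sample B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def bt_loss(score_preferred, score_rejected):
    """Negative log-likelihood of the human-preferred sample winning."""
    return -math.log(bt_preference_probability(score_preferred, score_rejected))


# Hypothetical scalar naturalness scores for the two samples in one pair.
p_equal = bt_preference_probability(0.0, 0.0)  # equal scores: no preference
loss_right_order = bt_loss(2.0, 1.0)           # model ranks the pair correctly
loss_wrong_order = bt_loss(1.0, 2.0)           # model ranks the pair incorrectly
```

The contrast with a generative reward model is that a Bradley-Terry model outputs only a score per sample, whereas a GRM produces its judgment as generated text, here together with a Chain-of-Thought rationale.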

Enhancing Alignment with Human Preferences

A further implication of this research is that SpeechJudge-GRM can be used as a reward function during the post-training of speech generation models, to facilitate their alignment with human preferences. Incorporating such a reward model into training gives developers a learned signal for naturalness to optimize against.
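One possible way to use a pairwise judge in post-training is to rank several generated candidates for the same text and keep the best and worst as a preference pair. The sketch below illustrates that idea only. It is not the procedure described in the paper, and `build_preference_pair`, `toy_judge`, `toy_scores`, and the file names are hypothetical stand-ins; a real setup would call the reward model where `toy_judge` appears.

```python
from itertools import combinations


def build_preference_pair(candidates, judge):
    """Rank candidates by pairwise wins and return (chosen, rejected).

    `judge(a, b)` must return True when `a` is judged more natural than `b`.
    Candidates must be unique and hashable (for example, file paths).
    """
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        winner = a if judge(a, b) else b
        wins[winner] += 1
    ranked = sorted(candidates, key=lambda c: wins[c], reverse=True)
    return ranked[0], ranked[-1]


# Hypothetical stand-in for a reward model: fixed naturalness values per file.
toy_scores = {"sample_0.wav": 0.2, "sample_1.wav": 0.9, "sample_2.wav": 0.5}


def toy_judge(a, b):
    """Prefer the candidate with the higher made-up naturalness value."""
    return toy_scores[a] >= toy_scores[b]


chosen, rejected = build_preference_pair(list(toy_scores), toy_judge)
```

Comparing every pair costs a number of judge calls that grows quadratically with the number of candidates, so in practice the candidate count per text would likely be kept small.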

Conclusion

In conclusion, the SpeechJudge suite gives researchers and developers in speech synthesis a dataset, a benchmark, and a reward model for working on naturalness. It addresses the lack of large-scale human preference data and shows how far current automatic judges are from human agreement. With a benchmark like SpeechJudge-Eval to measure progress and a reward model like SpeechJudge-GRM to optimize against, the suite offers a concrete path toward closing the gap between machine-generated speech and human perception.

Created on 19 Mar. 2026
