In this paper, the authors introduce WenetSpeech, a large multi-domain Mandarin corpus aimed at improving Mandarin speech recognition. The corpus consists of over 10,000 hours of labeled speech, 2,400+ hours of weakly labeled speech, and approximately 10,000 hours of unlabeled speech, for a total of over 22,400 hours. The data is collected from sources such as YouTube and Podcasts and covers diverse speaking styles, scenarios, domains, topics, and noisy conditions.
To generate audio/text segmentation candidates for the YouTube data, an optical character recognition (OCR) based method is applied to video captions. For the Podcast data, a high-quality automatic speech recognition (ASR) transcription system is used to create audio/text pair candidates. A novel end-to-end label error detection approach is then proposed to validate and filter these candidates.
The authors also provide three manually labeled high-quality test sets along with WenetSpeech for evaluation: Dev for cross-validation in training, Test_Net collected from the Internet for matched testing, and Test_Meeting recorded from real meetings for more challenging mismatched testing. Baseline systems trained with WenetSpeech are provided for popular speech recognition toolkits such as Kaldi, ESPnet, and WeNet. Recognition results on the three test sets serve as benchmarks.
Existing open-source corpora leave a gap. AIShell-2, a Mandarin corpus, includes 1,000 hours of speech recorded in quiet environments with limited domain diversity, while GigaSpeech is a large-scale multi-domain corpus that covers English only. WenetSpeech fills this gap with a comprehensive dataset that can facilitate research on production-level Mandarin speech recognition systems. Its release aims to address the limitations of current open-source corpora by offering a larger-scale dataset with diverse content sourced from the Internet. This resource benefits researchers working on ASR systems and supports the development of more generalized models capable of handling complex real-world scenarios in Mandarin speech recognition.
- Introduction of WenetSpeech, a large multi-domain Mandarin corpus for improving Mandarin speech recognition
- Composition of WenetSpeech: over 10,000 hours of labeled speech data, 2,400+ hours of weakly labeled speech data, and approximately 10,000 hours of unlabeled speech data
- Data collection from sources such as YouTube and Podcasts to cover diverse speaking styles, scenarios, domains, topics, and noisy conditions
- Methods for generating audio/text segmentation candidates: an OCR-based method for YouTube and an ASR transcription system for Podcasts
- Proposal of a novel end-to-end label error detection approach to validate and filter the generated candidates
- Provision of three manually labeled high-quality test sets (Dev, Test_Net, Test_Meeting) along with WenetSpeech for evaluation
- Baseline systems trained with WenetSpeech, provided for popular speech recognition toolkits such as Kaldi, ESPnet, and WeNet
- Comparison with existing corpora, namely AIShell-2 (Mandarin) and GigaSpeech (English), to highlight the gap WenetSpeech fills as a comprehensive dataset for research on production-level Mandarin speech recognition systems
- Aim of releasing WenetSpeech: to address limitations of current open-source corpora by offering a larger-scale dataset with diverse content sourced from the Internet
Summary
- WenetSpeech is a big collection of Mandarin speech data meant to help computers recognize spoken Mandarin more accurately.
- It includes over 10,000 hours of labeled speech, 2,400+ hours of weakly labeled speech, and about 10,000 hours of unlabeled speech.
- The data comes from different places like YouTube and Podcasts to cover many ways people speak in different situations.
- Audio is paired with text in two ways: captions are read from YouTube video frames with OCR, and Podcast audio is transcribed automatically with an ASR system.
- WenetSpeech aims to provide better tools for recognizing Mandarin speech by offering a diverse and large dataset.
Definitions
- Corpus: A large collection of written or spoken texts used for research or study.
- Labeled: When information or data has been marked or identified with specific details.
- Unlabeled: Data that has not been categorized or marked with specific information.
- Segmentation: Dividing something into smaller parts for easier analysis or understanding.
- Validation: Checking if something is correct or accurate.
Introduction
Speech recognition technology has made significant advancements in recent years, with the rise of virtual assistants like Siri and Alexa. However, these systems still face challenges when it comes to accurately recognizing Mandarin speech due to its complex tonal nature and diverse speaking styles. To address this issue, researchers have been working on developing large-scale Mandarin speech corpora that can be used to train more robust and accurate speech recognition models.
In this paper, the authors introduce WenetSpeech, a new multi-domain Mandarin corpus aimed at enhancing Mandarin speech recognition. This article will provide a detailed overview of the research paper, discussing the motivation behind creating WenetSpeech, its data collection process, and how it differs from existing corpora. We will also delve into the proposed methods for generating audio/text segmentation candidates and validating them through an end-to-end label error detection approach. Finally, we will discuss the provided baseline systems and evaluation results on three test sets.
Motivation
The authors highlight two main motivations for creating WenetSpeech: addressing limitations of current open-source corpora and providing a comprehensive dataset for production-level Mandarin speech recognition systems.
Existing open-source corpora such as AIShell-2 only offer 1,000 hours of recorded speech in quiet environments with limited domain diversity. On the other hand, GigaSpeech provides a large-scale multi-domain English corpus but lacks content in Mandarin. This gap in available resources hinders research on production-level Mandarin speech recognition systems.
Additionally, most existing corpora are collected under controlled conditions which do not reflect real-world scenarios where background noise or different speaking styles may affect accuracy. Therefore, there is a need for a larger scale dataset with diverse content sourced from the internet to facilitate research on more generalized models capable of handling complex real-world scenarios in Mandarin speech recognition.
Data Collection Process
WenetSpeech consists of over 10,000 hours of labeled speech data, 2,400+ hours of weakly labeled speech data, and approximately 10,000 hours of unlabeled speech data. The authors collected the data from various sources such as YouTube and Podcasts to ensure diversity in speaking styles, scenarios, domains, topics, and noisy conditions.
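The composition figures above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the "over" and "approximately" figures are treated as round numbers, and the names `SUBSET_HOURS` and `composition` are made up for this example.

```python
# Hours per subset as stated in the text. The "over" and "approximately"
# figures are treated as round numbers for illustration only.
SUBSET_HOURS = {
    "labeled": 10_000,
    "weakly_labeled": 2_400,
    "unlabeled": 10_000,
}


def composition(hours):
    """Return the total number of hours and each subset's share of that total."""
    total = sum(hours.values())
    shares = {name: value / total for name, value in hours.items()}
    return total, shares


total_hours, shares = composition(SUBSET_HOURS)
for name, share in shares.items():
    print(f"{name}: {SUBSET_HOURS[name]:,} h ({share:.1%})")
print(f"total: {total_hours:,} h")
```

Under these rounded figures, the labeled and unlabeled subsets are the same size, and the weakly labeled subset is the smallest part of the corpus.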
For the YouTube data, an optical character recognition (OCR) based method is employed using video captions to generate audio/text segmentation candidates. This approach allows for a large amount of data to be collected quickly and efficiently. For the Podcast data, a high-quality automatic speech recognition (ASR) transcription system is utilized to create audio/text pair candidates.
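One way to picture the OCR-based step is to sample video frames at a fixed interval, read the caption in each frame, and merge consecutive frames that show the same caption into one audio/text candidate. The sketch below assumes the OCR output is already available as `(timestamp, text)` pairs. The `Segment` type, the `captions_to_segments` function, and the fixed `frame_step` are hypothetical names chosen for illustration, not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Segment:
    start: float  # time of the first frame showing this caption (seconds)
    end: float    # end of the last frame showing this caption (seconds)
    text: str     # caption text, used as the candidate transcript


def captions_to_segments(
    frames: Iterable[Tuple[float, str]], frame_step: float
) -> List[Segment]:
    """Merge consecutive frames with identical caption text into one segment.

    `frames` holds (timestamp, ocr_text) pairs sorted by time. An empty
    string means no caption was detected in that frame.
    """
    segments: List[Segment] = []
    current: Optional[Segment] = None
    for timestamp, text in frames:
        text = text.strip()
        if current is not None and text == current.text:
            # Same caption is still on screen: extend the open segment.
            current.end = timestamp + frame_step
            continue
        if current is not None:
            # Caption changed or disappeared: close the open segment.
            segments.append(current)
            current = None
        if text:
            # A new caption appeared: open a new segment.
            current = Segment(timestamp, timestamp + frame_step, text)
    if current is not None:
        segments.append(current)
    return segments


# Hypothetical OCR output sampled every 0.5 seconds.
ocr_frames = [(0.0, ""), (0.5, "你好"), (1.0, "你好"), (1.5, "世界"), (2.0, "")]
candidates = captions_to_segments(ocr_frames, frame_step=0.5)
```

Each resulting segment pairs a time span of audio with the caption shown during that span. Captions are not guaranteed to match what was actually said, which is why such candidates still need to be validated.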
End-to-End Label Error Detection
To validate and filter the audio/text candidates generated from both the YouTube and Podcast sources, the authors propose a novel end-to-end label error detection approach. Because the candidate transcripts come from OCR and ASR rather than from manual annotation (only the three test sets are manually labeled), some candidates carry wrong or misaligned text. The approach uses an end-to-end neural speech recognition model to check each candidate transcript against its audio, so that candidates whose text disagrees with the audio can be filtered out.
This automated check improves label quality and avoids having to inspect every segment by hand.
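The filtering idea can be sketched as a confidence score that compares a candidate transcript with what a recognition model hears in the same audio. This is a minimal sketch under stated assumptions, not the authors' method: the model output is assumed to be available as a plain string, the confidence is taken to be one minus the normalized character edit distance, and the 0.95 and 0.6 thresholds and the function names are illustrative choices.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def label_confidence(candidate: str, hypothesis: str) -> float:
    """Return 1.0 for identical strings and lower values as they diverge."""
    longest = max(len(candidate), len(hypothesis))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(candidate, hypothesis) / longest


def bucket(candidate: str, hypothesis: str,
           strong: float = 0.95, weak: float = 0.6) -> str:
    """Assign a candidate to a subset based on illustrative thresholds."""
    confidence = label_confidence(candidate, hypothesis)
    if confidence >= strong:
        return "labeled"
    if confidence >= weak:
        return "weakly_labeled"
    return "rejected"


# Hypothetical candidate transcript and two hypothetical model outputs.
exact = bucket("今天天气很好", "今天天气很好")
one_error = bucket("今天天气很好", "今天天汽很好")
```

With these thresholds, a transcript that matches the model output exactly is kept as labeled data. One with a single differing character out of six falls into the weakly labeled bucket. A transcript that mostly disagrees with the audio is rejected.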
Baseline Systems
The authors provide baseline systems trained with WenetSpeech for popular speech recognition toolkits like Kaldi, ESPnet, and WeNet. These systems serve as benchmarks for evaluating performance on WenetSpeech test sets: Dev for cross-validation in training; Test_Net collected from the internet for matched testing; and Test_Meeting recorded from real meetings for more challenging mismatched testing.
Evaluation Results
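The authors report recognition results for the baseline systems on all three test sets, and these results serve as benchmarks for future work. WenetSpeech is a training corpus rather than a recognition system, so the comparison with AIShell-2 and GigaSpeech concerns scale and domain coverage, not error rate. GigaSpeech in particular is an English corpus and is not a point of comparison on Mandarin test sets. Test_Net represents the matched condition, since it is collected from the Internet like the training data, while Test_Meeting represents the more challenging mismatched condition. Having Kaldi, ESPnet, and WeNet baselines evaluated on the same test sets gives a reference point against which more advanced models can be measured.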
Conclusion
In conclusion, WenetSpeech offers a comprehensive dataset that addresses the limitations of current open-source corpora and provides a valuable resource for research on production-level Mandarin speech recognition systems. Its diverse content sourced from the internet allows for more realistic training and evaluation of models in real-world scenarios. The proposed methods for generating audio/text segmentation candidates and end-to-end label error detection approach ensure high-quality labeling while reducing manual effort. The provided baseline systems and evaluation results serve as benchmarks for future research in this field. Overall, WenetSpeech contributes to advancing the development of more robust and accurate Mandarin speech recognition technology.