WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

AI-generated keywords: WenetSpeech

AI-generated Key Points

Introduction of WenetSpeech, a large multi-domain Mandarin corpus for enhancing Mandarin speech recognition
Composition of WenetSpeech: over 10,000 hours of labeled speech data, 2,400+ hours of weakly labeled speech data, and approximately 10,000 hours of unlabeled speech data
Data collection from various sources such as YouTube and Podcasts to encompass diverse speaking styles, scenarios, domains, topics, and noisy conditions
Methods used for generating audio/text segmentation candidates for YouTube and Podcast data: OCR-based method for YouTube and ASR transcription system for Podcasts
Proposal of a novel end-to-end label error detection approach to validate and filter the generated candidates
Provision of three manually labeled high-quality test sets (Dev, Test_Net, Test_Meeting) along with WenetSpeech for evaluation purposes
Training baseline systems with WenetSpeech provided for popular speech recognition toolkits like Kaldi, ESPnet, and WeNet
Comparison with existing Mandarin corpora like AIShell-2 and GigaSpeech to highlight the gap filled by WenetSpeech in providing a comprehensive dataset for research on production-level Mandarin speech recognition systems
Aim of releasing WenetSpeech to address limitations of current open-source corpora by offering a larger scale dataset with diverse content sourced from the internet


Authors: Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, Di Wu, Zhendong Peng

arXiv: 2110.03370v1 (cs.SD)

License: CC BY 4.0

Abstract: In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours high-quality labeled speech, 2400+ hours weakly labeled speech, and about 10000 hours unlabeled speech, with 22400+ hours in total. We collect the data from YouTube and Podcast, which covers a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) based method is introduced to generate the audio/text segmentation candidates for the YouTube data on its corresponding video captions, while a high-quality ASR transcription system is used to generate audio/text pair candidates for the Podcast data. Then we propose a novel end-to-end label error detection approach to further validate and filter the candidates. We also provide three manually labelled high-quality test sets along with WenetSpeech for evaluation -- Dev for cross-validation purpose in training, Test_Net, collected from Internet for matched test, and Test_Meeting, recorded from real meetings for more challenging mismatched test. Baseline systems trained with WenetSpeech are provided for three popular speech recognition toolkits, namely Kaldi, ESPnet, and WeNet, and recognition results on the three test sets are also provided as benchmarks. To the best of our knowledge, WenetSpeech is the current largest open-sourced Mandarin speech corpus with transcriptions, which benefits research on production-level speech recognition.

Submitted to arXiv on 07 Oct. 2021


Results of the summarizing process for the arXiv paper: 2110.03370v1


In this paper, the authors introduce WenetSpeech, a large multi-domain Mandarin corpus aimed at enhancing Mandarin speech recognition. The corpus consists of over 10,000 hours of high-quality labeled speech, 2,400+ hours of weakly labeled speech, and approximately 10,000 hours of unlabeled speech, totaling over 22,400 hours. The data is collected from YouTube and Podcasts and covers diverse speaking styles, scenarios, domains, topics, and noisy conditions.

To generate audio/text segmentation candidates for the YouTube data, an optical character recognition (OCR) based method is applied to the video captions. For the Podcast data, a high-quality automatic speech recognition (ASR) transcription system is used to create audio/text pair candidates. A novel end-to-end label error detection approach is then proposed to validate and filter these candidates. The authors also provide three manually labeled high-quality test sets for evaluation: Dev for cross-validation in training, Test_Net collected from the Internet for matched testing, and Test_Meeting recorded from real meetings for more challenging mismatched testing. Baseline systems trained with WenetSpeech are provided for three popular speech recognition toolkits (Kaldi, ESPnet, and WeNet), and their recognition results on the three test sets serve as benchmarks.

Existing open-source corpora leave a gap: AIShell-2, a Mandarin corpus, includes 1,000 hours of speech recorded in quiet environments with limited domain diversity, while GigaSpeech is a large-scale multi-domain corpus but covers English rather than Mandarin. WenetSpeech fills this gap with a larger dataset of diverse content sourced from the Internet, which can support research on production-level Mandarin speech recognition and on more general models that handle complex real-world conditions.


Summary

- WenetSpeech is a big collection of Mandarin speech data to help improve understanding spoken Mandarin.
- It includes over 10,000 hours of labeled speech, 2,400+ hours of weakly labeled speech, and about 10,000 hours of unlabeled speech.
- The data comes from different places like YouTube and Podcasts to cover many ways people speak in different situations.
- Different methods are used to break down the audio/text from YouTube and Podcasts for analysis.
- WenetSpeech aims to provide better tools for recognizing Mandarin speech by offering a diverse and large dataset.

Definitions

- Corpus: A large collection of written or spoken texts used for research or study.
- Labeled: When information or data has been marked or identified with specific details.
- Unlabeled: Data that has not been categorized or marked with specific information.
- Segmentation: Dividing something into smaller parts for easier analysis or understanding.
- Validation: Checking if something is correct or accurate.

Introduction

Speech recognition technology has advanced considerably in recent years, but open-source training data for Mandarin has not kept pace: the available corpora are comparatively small and cover few domains and recording conditions. To address this, researchers have been building large-scale Mandarin speech corpora that can be used to train more robust and accurate speech recognition models. In this paper, the authors introduce WenetSpeech, a new multi-domain Mandarin corpus aimed at enhancing Mandarin speech recognition. This article gives an overview of the paper: the motivation behind WenetSpeech, its data collection process, and how it differs from existing corpora. It then covers the methods for generating audio/text segmentation candidates and for validating them with an end-to-end label error detection approach, and closes with the baseline systems and the evaluation on three test sets.

Motivation

The authors highlight two main motivations for creating WenetSpeech: addressing limitations of current open-source corpora and providing a comprehensive dataset for production-level Mandarin speech recognition systems. Existing open-source corpora such as AIShell-2 only offer 1,000 hours of recorded speech in quiet environments with limited domain diversity. On the other hand, GigaSpeech provides a large-scale multi-domain English corpus but lacks content in Mandarin. This gap in available resources hinders research on production-level Mandarin speech recognition systems. Additionally, most existing corpora are collected under controlled conditions which do not reflect real-world scenarios where background noise or different speaking styles may affect accuracy. Therefore, there is a need for a larger scale dataset with diverse content sourced from the internet to facilitate research on more generalized models capable of handling complex real-world scenarios in Mandarin speech recognition.

Data Collection Process

WenetSpeech consists of over 10,000 hours of labeled speech data, 2,400+ hours of weakly labeled speech data, and approximately 10,000 hours of unlabeled speech data. The authors collected the data from various sources such as YouTube and Podcasts to ensure diversity in speaking styles, scenarios, domains, topics, and noisy conditions. For the YouTube data, an optical character recognition (OCR) based method is employed using video captions to generate audio/text segmentation candidates. This approach allows for a large amount of data to be collected quickly and efficiently. For the Podcast data, a high-quality automatic speech recognition (ASR) transcription system is utilized to create audio/text pair candidates.
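The caption-based segmentation can be illustrated with a minimal sketch. It assumes that OCR has already been run on frames sampled at a fixed interval and has returned one caption string per frame; the function name `merge_caption_frames`, the `Segment` type, and the sampling interval are hypothetical and are not taken from the paper. The only idea shown is that consecutive frames displaying the same caption are merged into one segment, whose start and end times then delimit the matching audio.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str     # caption text shown during [start, end)


def merge_caption_frames(
    frames: Iterable[Tuple[float, str]], frame_step: float
) -> List[Segment]:
    """Merge consecutive frames with identical caption text into segments.

    frames: (timestamp, ocr_text) pairs sorted by time; empty text means
            that no caption was detected in that frame.
    frame_step: spacing between sampled frames, in seconds (an assumption).
    """
    segments: List[Segment] = []
    current: Optional[Segment] = None
    for timestamp, text in frames:
        text = text.strip()
        if current is not None and text == current.text:
            # Same caption still on screen: extend the open segment.
            current.end = timestamp + frame_step
            continue
        if current is not None:
            # Caption changed or disappeared: close the open segment.
            segments.append(current)
            current = None
        if text:
            current = Segment(timestamp, timestamp + frame_step, text)
    if current is not None:
        segments.append(current)
    return segments


# Hypothetical OCR output for frames sampled every 0.5 s.
example_frames = [(0.0, ""), (0.5, "你好"), (1.0, "你好"), (1.5, ""), (2.0, "谢谢")]
candidates = merge_caption_frames(example_frames, frame_step=0.5)
```

Real caption OCR is noisy, so a practical version would likely compare captions with a similarity threshold instead of exact equality; that refinement is left out here to keep the sketch short.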

End-to-End Label Error Detection

To validate and filter the audio/text candidates generated from both the YouTube and Podcast sources, the authors propose a novel end-to-end label error detection approach. Because the candidate transcripts come from OCR and ASR rather than from human annotators, some of them do not match the audio. The approach checks each candidate transcript against its audio with an end-to-end model (the paper describes a CTC-based forced alignment that tolerates insertions, deletions, and substitutions) and assigns the segment a confidence score. Segments with high confidence go into the high-quality labeled portion of the corpus, and segments with intermediate confidence go into the weakly labeled portion. This keeps labeling quality high without requiring manual inspection of every segment.
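The filtering step can be sketched as a confidence score followed by a threshold rule. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes that some recognizer or aligner has already produced a decoded string for each candidate, it scores agreement as one minus the normalized edit distance, and the thresholds 0.95 and 0.6 as well as the bucket names are illustrative values chosen for this example.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def confidence(candidate: str, decoded: str) -> float:
    """Agreement between a candidate transcript and the decoded text.

    Scored per character; 1.0 means identical, 0.0 means nothing in common.
    """
    longest = max(len(candidate), len(decoded))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(candidate, decoded) / longest


def bucket(conf: float, strong: float = 0.95, weak: float = 0.6) -> str:
    """Assign a segment to a corpus portion (thresholds are assumptions)."""
    if conf >= strong:
        return "labeled"
    if conf >= weak:
        return "weakly_labeled"
    return "unlabeled"


# Hypothetical candidate whose transcript has one wrong character.
score = confidence("今天天气很好", "今天天汽很好")
portion = bucket(score)
```

Scoring per character suits Mandarin, where transcripts have no word boundaries; the same functions accept lists of word tokens if a word-level score is preferred.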

Baseline Systems

The authors provide baseline systems trained with WenetSpeech for popular speech recognition toolkits like Kaldi, ESPnet, and WeNet. These systems serve as benchmarks for evaluating performance on WenetSpeech test sets: Dev for cross-validation in training; Test_Net collected from the internet for matched testing; and Test_Meeting recorded from real meetings for more challenging mismatched testing.

Evaluation Results

Recognition results for the Kaldi, ESPnet, and WeNet baselines are reported on all three test sets (Dev, Test_Net, and Test_Meeting) and are intended as benchmarks for future work. Test_Meeting is mismatched with the Internet-sourced training data, which makes it the more challenging of the two test sets. Error rates for Mandarin are conventionally measured over characters rather than words. AIShell-2 and GigaSpeech appear in the paper as points of comparison for corpus scale and domain coverage, not as competing systems.
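A per-test-set error rate is obtained by summing the edit errors over all utterances in the set and dividing by the total reference length. The following minimal sketch scores per character, as is common for Mandarin; the function names and the example utterances are hypothetical, and the official scoring scripts of the toolkits may normalize text differently.

```python
from typing import Dict, Iterable, Tuple


def char_errors(ref: str, hyp: str) -> int:
    """Minimum number of character edits that turn ref into hyp."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        diagonal, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            diagonal, row[j] = row[j], min(
                row[j] + 1,           # deletion
                row[j - 1] + 1,       # insertion
                diagonal + (r != h),  # substitution, or match at no cost
            )
    return row[-1]


def error_rate_by_set(
    results: Iterable[Tuple[str, str, str]]
) -> Dict[str, float]:
    """Aggregate character error rate per test set.

    results: (test_set_name, reference, hypothesis) triples.
    """
    errors: Dict[str, int] = {}
    lengths: Dict[str, int] = {}
    for name, ref, hyp in results:
        # Ignore spaces so that formatting differences are not counted.
        ref, hyp = ref.replace(" ", ""), hyp.replace(" ", "")
        errors[name] = errors.get(name, 0) + char_errors(ref, hyp)
        lengths[name] = lengths.get(name, 0) + len(ref)
    return {n: errors[n] / lengths[n] for n in errors if lengths[n] > 0}


# Hypothetical recognition output, not taken from the paper.
example_results = [
    ("Dev", "今天天气很好", "今天天气很好"),
    ("Dev", "我们开会", "我们开回"),
    ("Test_Meeting", "请大家发言", "请大家"),
]
rates = error_rate_by_set(example_results)
```

Aggregating errors before dividing weights long utterances more heavily than averaging per-utterance rates would, which is the usual convention for reporting a single figure per test set.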

Conclusion

In conclusion, WenetSpeech offers a comprehensive dataset that addresses the limitations of current open-source corpora and provides a valuable resource for research on production-level Mandarin speech recognition systems. Its diverse content sourced from the internet allows for more realistic training and evaluation of models in real-world scenarios. The proposed methods for generating audio/text segmentation candidates and end-to-end label error detection approach ensure high-quality labeling while reducing manual effort. The provided baseline systems and evaluation results serve as benchmarks for future research in this field. Overall, WenetSpeech contributes to advancing the development of more robust and accurate Mandarin speech recognition technology.

Created on 16 Nov. 2024
