WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

AI-generated keywords: WenetSpeech

AI-generated Key Points

  • Introduction of WenetSpeech, a large multi-domain Mandarin corpus for enhancing Mandarin speech recognition
  • Composition of WenetSpeech: over 10,000 hours of labeled speech data, 2,400+ hours of weakly labeled speech data, and approximately 10,000 hours of unlabeled speech data
  • Data collection from various sources such as YouTube and Podcasts to encompass diverse speaking styles, scenarios, domains, topics, and noisy conditions
  • Methods used for generating audio/text segmentation candidates for YouTube and Podcast data: OCR-based method for YouTube and ASR transcription system for Podcasts
  • Proposal of a novel end-to-end label error detection approach to validate and filter the generated candidates
  • Provision of three manually labeled high-quality test sets (Dev, Test_Net, Test_Meeting) along with WenetSpeech for evaluation purposes
  • Baseline systems trained on WenetSpeech provided for three popular speech recognition toolkits: Kaldi, ESPnet, and WeNet
  • Comparison with existing open corpora such as AIShell-2 (Mandarin, recorded in quiet environments) and GigaSpeech (multi-domain, but English) to highlight the gap WenetSpeech fills as a comprehensive dataset for research on production-level Mandarin speech recognition systems
  • Aim of releasing WenetSpeech to address limitations of current open-source corpora by offering a larger scale dataset with diverse content sourced from the internet
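
The key points mention recognition results on the three test sets but this summary does not name the metric. As a minimal sketch, assuming a character-level error rate (a common choice for Mandarin, not confirmed by this page), a hypothesis could be scored against a reference transcript as follows. The function names `edit_distance` and `cer` are illustrative and do not come from any of the toolkits above.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference character
                cur[j - 1] + 1,          # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution, or match at zero cost
            ))
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One inserted character against a two-character reference.
    print(cer("你好", "你们好"))
```

Scoring per character avoids the need for word segmentation, which is why it is often preferred for Chinese text; a corpus with mixed-language transcripts might need a different unit.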

Authors: Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, Di Wu, Zhendong Peng

License: CC BY 4.0

Abstract: In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours high-quality labeled speech, 2400+ hours weakly labeled speech, and about 10000 hours unlabeled speech, with 22400+ hours in total. We collect the data from YouTube and Podcast, which covers a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) based method is introduced to generate the audio/text segmentation candidates for the YouTube data on its corresponding video captions, while a high-quality ASR transcription system is used to generate audio/text pair candidates for the Podcast data. Then we propose a novel end-to-end label error detection approach to further validate and filter the candidates. We also provide three manually labelled high-quality test sets along with WenetSpeech for evaluation -- Dev for cross-validation purpose in training, Test_Net, collected from Internet for matched test, and Test_Meeting, recorded from real meetings for more challenging mismatched test. Baseline systems trained with WenetSpeech are provided for three popular speech recognition toolkits, namely Kaldi, ESPnet, and WeNet, and recognition results on the three test sets are also provided as benchmarks. To the best of our knowledge, WenetSpeech is the current largest open-sourced Mandarin speech corpus with transcriptions, which benefits research on production-level speech recognition.

Submitted to arXiv on 07 Oct. 2021


Results of the summarizing process for the arXiv paper: 2110.03370v1

In this paper, the authors introduce WenetSpeech, a large multi-domain Mandarin corpus aimed at enhancing Mandarin speech recognition. The corpus consists of over 10,000 hours of labeled speech data, 2,400+ hours of weakly labeled speech data, and approximately 10,000 hours of unlabeled speech data, totaling over 22,400 hours. The data is collected from YouTube and Podcasts and covers diverse speaking styles, scenarios, domains, topics, and noisy conditions.

To generate audio/text segmentation candidates for the YouTube data, an optical character recognition (OCR) based method is applied to the video captions. For the Podcast data, a high-quality automatic speech recognition (ASR) transcription system is used to create audio/text pair candidates. A novel end-to-end label error detection approach is then proposed to validate and filter these candidates.

The authors also provide three manually labeled high-quality test sets along with WenetSpeech for evaluation: Dev for cross-validation in training, Test_Net collected from the Internet for matched testing, and Test_Meeting recorded from real meetings for more challenging mismatched testing. Baseline systems trained with WenetSpeech are provided for three popular speech recognition toolkits (Kaldi, ESPnet, and WeNet), and their recognition results on the three test sets serve as benchmarks.

Existing open-source corpora leave a gap. AIShell-2 includes 1,000 hours of Mandarin speech, but it was recorded in quiet environments and has limited domain diversity. GigaSpeech is a large-scale multi-domain corpus, but it covers English rather than Mandarin. WenetSpeech fills this gap with a comprehensive dataset that can support research on production-level Mandarin speech recognition systems. Its release aims to address the limitations of current open-source corpora by offering a larger dataset with diverse content sourced from the internet.
This new resource not only benefits researchers working on ASR systems but also contributes to advancing the development of more generalized models capable of handling complex real-world scenarios in Mandarin speech recognition.
Created on 16 Nov. 2024

