Vec-Tok Speech: speech vectorization and tokenization for neural speech generation

AI-generated keywords: Language Models, Speech Generative Models, Vec-Tok Speech, Neural Speech Generation, High-Fidelity Output

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Language models (LMs) have advanced in natural language processing and computer vision, producing high-quality texts and images.
  • Speech generative models have faced challenges in achieving comparable speech quality and task generalization.
  • Vec-Tok Speech is introduced by a team of researchers to support multiple speech generation tasks with expressive and high-fidelity speech.
  • The framework integrates speech vectors for acoustic details and semantic tokens for linguistic content to enhance language modeling capabilities.
  • Vec-Tok Speech uses an LM as the core of speech generation and applies Byte-Pair Encoding (BPE) to reduce token length and bit rate.
  • The versatility of Vec-Tok Speech extends to applications like voice conversion, speaking style transfer, translation, denoising, and speaker de-identification/anonymization.
  • Experimental results show that Vec-Tok Speech, built on 50k hours of speech data, performs better than other state-of-the-art models.
  • The code for Vec-Tok Speech will be available on GitHub at https://github.com/BakerBunker/VecTok.

Authors: Xinfa Zhu, Yuanjun Lv, Yi Lei, Tao Li, Wendi He, Hongbin Zhou, Heng Lu, Lei Xie

15 pages, 2 figures

Abstract: Language models (LMs) have recently flourished in natural language processing and computer vision, generating high-fidelity texts or images in various tasks. In contrast, the current speech generative models are still struggling regarding speech quality and task generalization. This paper presents Vec-Tok Speech, an extensible framework that resembles multiple speech generation tasks, generating expressive and high-fidelity speech. Specifically, we propose a novel speech codec based on speech vectors and semantic tokens. Speech vectors contain acoustic details contributing to high-fidelity speech reconstruction, while semantic tokens focus on the linguistic content of speech, facilitating language modeling. Based on the proposed speech codec, Vec-Tok Speech leverages an LM to undertake the core of speech generation. Moreover, Byte-Pair Encoding (BPE) is introduced to reduce the token length and bit rate for lower exposure bias and longer context coverage, improving the performance of LMs. Vec-Tok Speech can be used for intra- and cross-lingual zero-shot voice conversion (VC), zero-shot speaking style transfer text-to-speech (TTS), speech-to-speech translation (S2ST), speech denoising, and speaker de-identification and anonymization. Experiments show that Vec-Tok Speech, built on 50k hours of speech, performs better than other SOTA models. Code will be available at https://github.com/BakerBunker/VecTok .
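As a rough illustration of how Byte-Pair Encoding can shorten a discrete token sequence, the sketch below applies greedy pair merging to a toy list of integer token IDs. It is a minimal sketch under stated assumptions, not the authors' implementation: the token values, the number of merges, the starting ID for merged tokens, and all function names (`learn_bpe`, `merge_pair`, `decode`) are hypothetical and chosen only for this example.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most common adjacent pair, or None if no pair occurs twice."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return None
    pair, count = pairs.most_common(1)[0]
    return pair if count > 1 else None


def merge_pair(seq, pair, new_id):
    """Replace non-overlapping occurrences of `pair` with `new_id`, left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(seq, num_merges, first_new_id):
    """Greedily learn up to `num_merges` merges on a single sequence."""
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        new_id = first_new_id + step
        merges[pair] = new_id
        seq = merge_pair(seq, pair, new_id)
    return seq, merges


def decode(seq, merges):
    """Expand merged IDs back into the original token sequence."""
    inverse = {new_id: pair for pair, new_id in merges.items()}
    out, stack = [], list(reversed(seq))
    while stack:
        tok = stack.pop()
        if tok in inverse:
            left, right = inverse[tok]
            stack.append(right)  # pushed first so `left` is popped first
            stack.append(left)
        else:
            out.append(tok)
    return out


# Toy, made-up token IDs standing in for semantic tokens (assumption).
tokens = [7, 7, 3, 3, 7, 7, 3, 3, 5, 7, 7, 3, 3]
compressed, merges = learn_bpe(tokens, num_merges=3, first_new_id=100)
print(len(tokens), "->", len(compressed))  # prints: 13 -> 4
```

The merge table makes the compression lossless: `decode(compressed, merges)` returns the original list. A shorter sequence means the LM takes fewer autoregressive steps per utterance and can fit more speech into a fixed context window, which is the effect the abstract attributes to BPE.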

Submitted to arXiv on 11 Oct. 2023

Results of the summarizing process for the arXiv paper: 2310.07246v2

In recent years, language models (LMs) have advanced significantly in natural language processing and computer vision, producing high-quality text and images across a range of tasks. Speech generative models, however, still lag behind in speech quality and task generalization. To address this gap, a team of researchers (Xinfa Zhu, Yuanjun Lv, Yi Lei, Tao Li, Wendi He, Hongbin Zhou, Heng Lu, and Lei Xie) has introduced Vec-Tok Speech, a framework designed to support multiple speech generation tasks while producing expressive, high-fidelity speech.

Vec-Tok Speech introduces a novel speech codec that combines speech vectors and semantic tokens. Speech vectors capture the acoustic details needed to reconstruct high-fidelity speech, while semantic tokens focus on the linguistic content of the speech, which facilitates language modeling. Building on this codec, Vec-Tok Speech uses an LM as the core of speech generation. The researchers also apply Byte-Pair Encoding (BPE) to reduce token length and bit rate; this lowers exposure bias and allows longer context coverage, which improves LM performance.

Vec-Tok Speech can be applied to intra- and cross-lingual zero-shot voice conversion (VC), zero-shot speaking style transfer text-to-speech (TTS), speech-to-speech translation (S2ST), speech denoising, and speaker de-identification and anonymization. Experimental results show that Vec-Tok Speech, built on 50k hours of speech data, performs better than other state-of-the-art models. The research team plans to make the code available on GitHub at https://github.com/BakerBunker/VecTok.

Overall, Vec-Tok Speech represents a significant advancement in neural speech generation, with its focus on high-fidelity output and flexibility across a range of tasks.
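To make the link between token length and bit rate concrete, the following sketch computes the bit rate of a discrete token stream as tokens per second times bits per token. It assumes a fixed-length code of log2(vocabulary size) bits per token, and every number in it (token rate, vocabulary sizes, average merge length) is a hypothetical value chosen for illustration, not a figure from the paper. It also covers only a token stream, not the speech vectors.

```python
import math


def token_bit_rate(tokens_per_second, vocab_size):
    """Bits per second for a token stream under a fixed-length code (assumption)."""
    return tokens_per_second * math.log2(vocab_size)


# Hypothetical baseline: 50 tokens per second from a 1024-entry vocabulary.
base_rate = token_bit_rate(50.0, 1024)

# Hypothetical merged stream: each merged token covers 2.5 base tokens on
# average, while the vocabulary grows to 16384 entries to hold the merges.
merged_rate = token_bit_rate(50.0 / 2.5, 16384)

print(base_rate, merged_rate)  # prints: 500.0 280.0
```

Under these assumed numbers, the cost per token rises from 10 to 14 bits, but the token rate falls from 50 to 20 per second, so the overall bit rate still drops. Whether the same holds for a real system depends on how much the sequence actually shrinks relative to vocabulary growth.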
Created on 16 Jan. 2025
