WESPER: Zero-shot and Realtime Whisper to Normal Voice Conversion for Whisper-based Speech Interactions

AI-generated keywords: Speech interaction

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Recognizing whispered speech and converting it to normal speech opens up possibilities for semi-silent communication in public settings without disturbing others
  • Traditional methods of speech conversion often fall short in quality or rely on speaker-dependent datasets
  • WESPER is a zero-shot, real-time whisper-to-normal speech conversion mechanism using self-supervised learning techniques
  • WESPER comprises a speech-to-unit (STU) encoder, which generates hidden speech units common to whispered and normal speech, and a unit-to-speech (UTS) decoder, which reconstructs speech from those units
  • WESPER eliminates the need for paired datasets specific to individual speakers, making the process user-independent
  • WESPER can reconstruct converted speech in any target speaker's voice using only unlabeled data from that speaker
  • The author reports that converted speech improves in quality while preserving natural prosody, and that the approach is effective for speech reconstruction for people with speech or hearing disabilities

Authors: Jun Rekimoto

Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23), April 23–28, 2023
ACM CHI 2023 paper

Abstract: Recognizing whispered speech and converting it to normal speech creates many possibilities for speech interaction. Because the sound pressure of whispered speech is significantly lower than that of normal speech, it can be used as a semi-silent speech interaction in public places without being audible to others. Converting whispers to normal speech also improves the speech quality for people with speech or hearing impairments. However, conventional speech conversion techniques do not provide sufficient conversion quality or require speaker-dependent datasets consisting of pairs of whispered and normal speech utterances. To address these problems, we propose WESPER, a zero-shot, real-time whisper-to-normal speech conversion mechanism based on self-supervised learning. WESPER consists of a speech-to-unit (STU) encoder, which generates hidden speech units common to both whispered and normal speech, and a unit-to-speech (UTS) decoder, which reconstructs speech from the encoded speech units. Unlike the existing methods, this conversion is user-independent and does not require a paired dataset for whispered and normal speech. The UTS decoder can reconstruct speech in any target speaker's voice from speech units, and it requires only an unlabeled target speaker's speech data. We confirmed that the quality of the speech converted from a whisper was improved while preserving its natural prosody. Additionally, we confirmed the effectiveness of the proposed approach to perform speech reconstruction for people with speech or hearing disabilities. (project page: http://lab.rekimoto.org/projects/wesper )

Submitted to arXiv on 03 Mar. 2023

Results of the summarizing process for the arXiv paper: 2303.01639v1

Recognizing whispered speech and converting it to normal speech opens up new possibilities for speech interaction. Because the sound pressure of whispered speech is significantly lower than that of normal speech, whispering can serve as a semi-silent form of interaction in public places without being audible to others. Converting whispers to normal speech also improves speech quality for people with speech or hearing impairments. Conventional speech conversion techniques, however, either fall short in conversion quality or require speaker-dependent datasets of paired whispered and normal utterances.

To overcome these limitations, the paper proposes WESPER, a zero-shot, real-time whisper-to-normal speech conversion mechanism based on self-supervised learning. WESPER consists of a speech-to-unit (STU) encoder and a unit-to-speech (UTS) decoder. The STU encoder generates hidden speech units common to both whispered and normal speech, and the UTS decoder reconstructs speech from these encoded units. The conversion is therefore user-independent and needs no paired dataset of whispered and normal speech. The UTS decoder can reconstruct speech in any target speaker's voice and requires only unlabeled speech data from that speaker.

The author reports that speech converted from a whisper improved in quality while preserving its natural prosody, and that the approach was effective for speech reconstruction for people with speech or hearing disabilities.
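
The two-stage structure described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the nearest-centroid codebook, the per-unit lookup table, and every name below (ToySTUEncoder, ToyUTSDecoder, convert) are hypothetical stand-ins chosen only to show how a speaker-independent encoder and a target-speaker decoder might be composed.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

# Assumption: each audio frame is already a small feature vector.
Frame = Sequence[float]


def squared_distance(a: Frame, b: Frame) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


@dataclass
class ToySTUEncoder:
    """Speaker-independent stage: maps each frame to a discrete unit ID.

    Hypothetical stand-in: units are picked by nearest centroid in a
    fixed codebook, not by a learned self-supervised model.
    """
    codebook: List[Frame]

    def encode(self, frames: Sequence[Frame]) -> List[int]:
        return [
            min(range(len(self.codebook)),
                key=lambda k: squared_distance(frame, self.codebook[k]))
            for frame in frames
        ]


@dataclass
class ToyUTSDecoder:
    """Target-speaker stage: maps unit IDs back to output frames.

    Hypothetical stand-in: a lookup table built from one speaker's data.
    """
    unit_to_frame: Dict[int, Frame]

    def decode(self, units: Sequence[int]) -> List[List[float]]:
        return [list(self.unit_to_frame[u]) for u in units]


def convert(frames: Sequence[Frame],
            encoder: ToySTUEncoder,
            decoder: ToyUTSDecoder) -> List[List[float]]:
    """Compose the two stages: frames -> units -> target-voice frames."""
    return decoder.decode(encoder.encode(frames))


# Made-up numbers purely for illustration.
encoder = ToySTUEncoder(codebook=[[0.0, 0.0], [1.0, 1.0]])
decoder = ToyUTSDecoder(unit_to_frame={0: [0.0, 0.5], 1: [2.0, 2.5]})
whisper_frames = [[0.1, -0.1], [0.9, 1.2], [0.8, 0.7]]

units = encoder.encode(whisper_frames)
converted = convert(whisper_frames, encoder, decoder)
```

In this sketch only the decoder holds speaker-specific data, so swapping in a different lookup table changes the output voice without touching the encoder, which mirrors the user-independent design the abstract describes.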
Created on 16 Jan. 2025
