LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems

AI-generated keywords: Spoken Dialogue Systems, Full-Duplex Communication, Semantic Voice Activity Detection, Language Model, Real-Time Decision-Making

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Challenges of achieving full-duplex communication in spoken dialogue systems (SDS)
  • Proposal of a semantic voice activity detection (VAD) module as a dialogue manager (DM)
  • Need for real-time coordination between listening, speaking, and thinking in full-duplex communication
  • Implementation of a lightweight LLM fine-tuned on full-duplex conversation data for the semantic VAD module
  • Prediction of four control tokens to regulate turn-switching and turn-keeping during conversations
  • Ability to distinguish between intentional and unintentional interruptions while detecting query completion
  • Processing input speech in short intervals for real-time decision-making without activating the core dialogue engine until response generation is required
  • Reduction of computational overhead and independent optimization of the dialogue manager without retraining the core dialogue engine
  • Striking a balance between interaction accuracy and inference efficiency in full-duplex SDS
  • Goal of scalable, next-generation full-duplex SDS that manage turn-taking efficiently for seamless communication between users and machines
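
The turn-taking logic in the key points above can be pictured as a small state machine. The sketch below is only an illustration: the metadata says there are four control tokens for turn-switching and turn-keeping but does not name them here, so the token names, the two-state model, and the `next_state` helper are all assumptions rather than the authors' implementation.

```python
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


class ControlToken(Enum):
    # Hypothetical names: the source only states that four tokens exist.
    CONTINUE_LISTENING = "continue_listening"  # turn-keeping: user query not complete yet
    START_SPEAKING = "start_speaking"          # turn-switching: query complete, respond
    CONTINUE_SPEAKING = "continue_speaking"    # turn-keeping: unintentional barge-in, ignore
    START_LISTENING = "start_listening"        # turn-switching: intentional barge-in, yield


# Tokens that are meaningful in each state, and the state they lead to.
TRANSITIONS = {
    (State.LISTENING, ControlToken.CONTINUE_LISTENING): State.LISTENING,
    (State.LISTENING, ControlToken.START_SPEAKING): State.SPEAKING,
    (State.SPEAKING, ControlToken.CONTINUE_SPEAKING): State.SPEAKING,
    (State.SPEAKING, ControlToken.START_LISTENING): State.LISTENING,
}


def next_state(state: State, token: ControlToken) -> State:
    """Apply a control token; a token that does not fit the state leaves it unchanged."""
    return TRANSITIONS.get((state, token), state)


def is_turn_switch(state: State, token: ControlToken) -> bool:
    """True when the token hands the floor to the other party."""
    return next_state(state, token) is not state
```

In this toy model, two tokens keep the current turn and two switch it, which matches the "turn-switching and turn-keeping" wording but is otherwise a guess at the design.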

Authors: Hao Zhang, Weiwei Li, Rilin Chen, Vinay Kothapally, Meng Yu, Dong Yu

In submission to INTERSPEECH 2025

Abstract: Achieving full-duplex communication in spoken dialogue systems (SDS) requires real-time coordination between listening, speaking, and thinking. This paper proposes a semantic voice activity detection (VAD) module as a dialogue manager (DM) to efficiently manage turn-taking in full-duplex SDS. Implemented as a lightweight (0.5B) LLM fine-tuned on full-duplex conversation data, the semantic VAD predicts four control tokens to regulate turn-switching and turn-keeping, distinguishing between intentional and unintentional barge-ins while detecting query completion for handling user pauses and hesitations. By processing input speech in short intervals, the semantic VAD enables real-time decision-making, while the core dialogue engine (CDE) is only activated for response generation, reducing computational overhead. This design allows independent DM optimization without retraining the CDE, balancing interaction accuracy and inference efficiency for scalable, next-generation full-duplex SDS.

Submitted to arXiv on 19 Feb. 2025


Results of the summarizing process for the arXiv paper: 2502.14145v2


The paper "LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems" by Hao Zhang, Weiwei Li, Rilin Chen, Vinay Kothapally, Meng Yu, and Dong Yu addresses the challenge of achieving full-duplex communication in spoken dialogue systems (SDS), which requires real-time coordination between listening, speaking, and thinking. It proposes a semantic voice activity detection (VAD) module that acts as the dialogue manager (DM).

The semantic VAD is implemented as a lightweight (0.5B) large language model (LLM) fine-tuned on full-duplex conversation data. It predicts four control tokens that regulate turn-switching and turn-keeping during a conversation. With these tokens it distinguishes intentional from unintentional interruptions (barge-ins) and detects query completion in order to handle user pauses and hesitations.

The semantic VAD processes input speech in short intervals, which enables real-time decision-making. The core dialogue engine (CDE) is activated only when a response has to be generated, which reduces computational overhead. Because the DM is separate from the CDE, it can be optimized independently without retraining the CDE.

According to the abstract, this design balances interaction accuracy and inference efficiency in full-duplex SDS, with the goal of scalable next-generation systems that manage turn-taking efficiently and support seamless communication between users and machines.
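
The division of labour described above (a cheap decision on every short interval, with the CDE called only when a response is due) can be sketched as a simple loop. Everything in this example is an assumption made for illustration: the chunks are short text strings rather than speech, `toy_predict` is a keyword rule standing in for the fine-tuned 0.5B LLM, `toy_cde` stands in for the core dialogue engine, and the token names are hypothetical.

```python
from typing import Callable, Iterable, List


def run_dialogue_manager(
    chunks: Iterable[str],
    predict_token: Callable[[str, List[str]], str],
    core_dialogue_engine: Callable[[str], str],
) -> List[str]:
    """Run the semantic VAD on every chunk; call the CDE only on 'start_speaking'."""
    state = "listening"
    heard: List[str] = []      # user input buffered since the last decision point
    responses: List[str] = []
    for chunk in chunks:
        heard.append(chunk)
        token = predict_token(state, heard)
        if state == "listening" and token == "start_speaking":
            # Query judged complete: this is the only place the CDE runs.
            responses.append(core_dialogue_engine(" ".join(heard)))
            heard = []
            state = "speaking"
        elif state == "speaking" and token == "start_listening":
            # Intentional barge-in: yield the floor, keep the chunk as new input.
            state = "listening"
        elif state == "speaking":
            # Unintentional barge-in (e.g. a backchannel): discard and keep talking.
            heard = []
    return responses


def toy_predict(state: str, heard: List[str]) -> str:
    """Rule-based stand-in for the semantic VAD (NOT the paper's model)."""
    last = heard[-1]
    if state == "listening":
        done = last.endswith(("?", "."))
        return "start_speaking" if done else "continue_listening"
    interrupt = "wait" in last.lower().split()
    return "start_listening" if interrupt else "continue_speaking"


def toy_cde(query: str) -> str:
    """Stand-in for the core dialogue engine."""
    return f"response to: {query}"


chunks = ["what is", "um", "the weather today?", "uh-huh", "wait", "and tomorrow?"]
responses = run_dialogue_manager(chunks, toy_predict, toy_cde)
```

Here the predictor runs on all six chunks, while the stand-in CDE runs only twice. Because the predictor and the CDE are passed in as separate callables, either one can be swapped without touching the other, which mirrors the independent-optimization point in a very loose way.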
Created on 11 Nov. 2025

