A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems

AI-generated keywords: Audio source separation

AI-generated Key Points

Significant progress in audio source separation, particularly in separating vocals, drums, bass, and other (VDBO) stems
Most existing systems are limited to a four-stem setup and rely on inflexible decoder configurations
Introduction of Banquet as a stem-agnostic single-decoder system for effectively separating multiple stems while maintaining computational feasibility
Banquet approached the performance of the significantly more complex 6-stem Hybrid Transformer Demucs on VDBO stems and outperformed it on guitar and piano
Ability of Banquet to successfully extract narrow instrument classes such as clean acoustic guitars and less common stems like reeds and organs with only 24.9 million trainable parameters
Experiments on the MoisesDB dataset showed that Banquet's query-based setup enables fine-level stem separations beyond traditional VDBO categories
Availability of Banquet's implementation for further exploration and application in audio processing tasks at https://github.com/kwatcharasupat/query-bandit

Authors: Karn N. Watcharasupat, Alexander Lerch

arXiv: 2406.18747v1 (cs.SD)

Submitted to the 25th International Society for Music Information Retrieval Conference (ISMIR 2024)

License: CC BY-NC-SA 4.0

Abstract: Despite significant recent progress across multiple subtasks of audio source separation, few music source separation systems support separation beyond the four-stem vocals, drums, bass, and other (VDBO) setup. Of the very few current systems that support source separation beyond this setup, most continue to rely on an inflexible decoder setup that can only support a fixed pre-defined set of stems. Increasing stem support in these inflexible systems correspondingly requires increasing computational complexity, rendering extensions of these systems computationally infeasible for long-tail instruments. In this work, we propose Banquet, a system that allows source separation of multiple stems using just one decoder. A bandsplit source separation model is extended to work in a query-based setup in tandem with a music instrument recognition PaSST model. On the MoisesDB dataset, Banquet, at only 24.9 M trainable parameters, approached the performance level of the significantly more complex 6-stem Hybrid Transformer Demucs on VDBO stems and outperformed it on guitar and piano. The query-based setup allows for the separation of narrow instrument classes such as clean acoustic guitars, and can be successfully applied to the extraction of less common stems such as reeds and organs. Implementation is available at https://github.com/kwatcharasupat/query-bandit.

Submitted to arXiv on 26 Jun. 2024

Results of the summarizing process for the arXiv paper: 2406.18747v1

In recent years, audio source separation has advanced considerably across several subtasks. In music, much of the focus has been on separating vocals, drums, bass, and other (VDBO) stems. Few systems go beyond this four-stem setup, and most of those that do rely on an inflexible decoder configuration that supports only a fixed, pre-defined set of stems, so adding stems increases computational complexity. To address this limitation, a new system called Banquet has been proposed. Banquet is a stem-agnostic single-decoder system that can separate multiple stems while remaining computationally feasible. It extends a bandsplit source separation model to work in a query-based setup in tandem with a music instrument recognition PaSST model. With only 24.9 million trainable parameters, Banquet approached the performance of the significantly more complex 6-stem Hybrid Transformer Demucs on VDBO stems and outperformed it on guitar and piano. The query-based setup also allows Banquet to extract narrow instrument classes such as clean acoustic guitars and less common stems such as reeds and organs. Experiments on the MoisesDB dataset showed that this setup enables fine-level stem separation beyond the traditional VDBO categories. The implementation is publicly available at https://github.com/kwatcharasupat/query-bandit.

Summary

1. People have made big improvements in separating different parts of music like singing, drums, and bass.
2. Current systems can only separate music into four parts and are not very flexible.
3. Banquet is a new system that can separate many different parts of music well without being too complicated.
4. Banquet works as well as complex systems for some parts of music and even better for guitar and piano.
5. Banquet can find specific instruments in music with fewer settings than other systems.

Definitions

- Audio source separation: The process of isolating different sounds or instruments in a piece of music.
- Stems: Individual tracks or layers that make up a complete piece of music.
- Computational feasibility: The ability to perform tasks using computer resources effectively.
- Parameters: Settings or variables used to control how a system works.
- Dataset: A collection of data used for testing and training algorithms.

Introduction

Audio source separation is the task of extracting individual audio sources from a mixed signal, and it has been an active area of research in audio processing for many years. In music, much of the focus has been on separating vocals, drums, bass, and other (VDBO) stems from recordings. However, most existing systems are limited to this four-stem setup and rely on inflexible decoder configurations that support only a fixed, pre-defined set of stems. To address this limitation, Karn N. Watcharasupat and Alexander Lerch have proposed a new system called Banquet [1]. This stem-agnostic single-decoder system can separate multiple stems while remaining computationally feasible. In this article, we look at how Banquet works and at its potential impact on the field of audio source separation.

The Need for Stem-Agnostic Separation Systems

Systems built around the VDBO setup typically use a fixed decoder configuration that supports only a pre-defined set of stems. Any instrument outside the vocals, drums, and bass categories ends up in the catch-all "other" stem, so it cannot be extracted on its own. Supporting more stems in such a system means increasing its computational complexity, which quickly becomes infeasible for long-tail instruments. This limits the applications of source separation in fields such as music production and remixing. For example, a producer may want to isolate a specific instrument from a mixed track for sampling or remixing; with a four-stem system, this is not possible if the desired instrument falls outside the supported stems.
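To make the scaling argument concrete, here is a minimal sketch comparing how the parameter count of a one-decoder-per-stem design grows against a single shared decoder steered by a query. All sizes are hypothetical values chosen only for illustration; they are not taken from the paper or from any real system.

```python
def fixed_decoder_params(encoder: int, per_decoder: int, n_stems: int) -> int:
    """Total parameters when every supported stem has its own decoder."""
    return encoder + per_decoder * n_stems


def single_decoder_params(encoder: int, decoder: int, conditioning: int) -> int:
    """Total parameters when one decoder is shared and steered by a query."""
    return encoder + decoder + conditioning


# Hypothetical sizes in millions of parameters (assumptions, not measured values).
ENCODER, DECODER, CONDITIONING = 10, 5, 1

for n_stems in (4, 6, 12, 24):
    fixed = fixed_decoder_params(ENCODER, DECODER, n_stems)
    shared = single_decoder_params(ENCODER, DECODER, CONDITIONING)
    print(f"{n_stems:>2} stems: fixed={fixed} M, single-decoder={shared} M")
```

In this toy model the fixed design grows linearly with the number of stems, while the single-decoder design stays constant no matter how many stems are queried.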

Banquet: A Stem-Agnostic Single-Decoder System

Banquet addresses these limitations by combining two models in a query-based setup: a bandsplit source separation model and a music instrument recognition PaSST model. The bandsplit model, which processes the mixture in frequency subbands, performs the separation itself. The PaSST model works on the query side: it characterizes the instrument being asked for, so the separator knows which stem to extract. Because the target stem is specified by the query rather than by a dedicated output branch, a single decoder can serve every stem, and supporting additional stems does not require additional decoders. This allows fine-level stem separation beyond the traditional VDBO categories.
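The idea of one decoder steered by a query can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses a feature-wise scale and shift derived from the query embedding as one plausible conditioning mechanism, since this summary does not say how Banquet injects the query. The class name, dimensions, and random weights are all hypothetical stand-ins, not the authors' implementation.

```python
import random


def linear(weights, x):
    """Apply y = W x for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of random weights (stand-in for trained ones)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class QueryConditionedDecoder:
    """Toy single decoder whose output depends on a query embedding."""

    def __init__(self, feature_dim, query_dim, seed=0):
        rng = random.Random(seed)
        # Hypothetical projections from the query to a per-feature scale and shift.
        self.w_scale = random_matrix(feature_dim, query_dim, rng)
        self.w_shift = random_matrix(feature_dim, query_dim, rng)

    def __call__(self, mixture_features, query_embedding):
        scale = linear(self.w_scale, query_embedding)
        shift = linear(self.w_shift, query_embedding)
        # The same weights serve every stem; only the query changes the result.
        return [s * f + b for s, f, b in zip(scale, mixture_features, shift)]


decoder = QueryConditionedDecoder(feature_dim=4, query_dim=3)
mixture = [0.5, -0.2, 0.1, 0.9]   # stand-in for encoded mixture features
guitar_query = [1.0, 0.0, 0.0]    # stand-in for an instrument embedding
organ_query = [0.0, 1.0, 0.0]     # a different instrument, same decoder

guitar_out = decoder(mixture, guitar_query)
organ_out = decoder(mixture, organ_query)
print(guitar_out != organ_out)
```

The point of the design is that adding a new stem only requires a new query embedding, not a new decoder, so the model size stays fixed as the set of supported instruments grows.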

Performance Evaluation

To evaluate Banquet, experiments were conducted on the MoisesDB dataset [2]. With only 24.9 million trainable parameters, Banquet approached the performance of the significantly more complex 6-stem Hybrid Transformer Demucs on VDBO stems and outperformed it on guitar and piano. The query-based setup also allowed Banquet to extract narrow instrument classes such as clean acoustic guitars and less common stems such as reeds and organs.
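This summary does not state which metric was used to compare the systems. As a purely illustrative assumption, the sketch below computes a signal-to-noise ratio in decibels, one common way to score a separated stem against its ground-truth reference. The function name and the toy signals are hypothetical.

```python
import math


def snr_db(reference, estimate, eps=1e-12):
    """Signal-to-noise ratio in dB between a reference stem and an estimate."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps guards against division by zero and log of zero.
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


reference = [1.0, 0.0, -1.0, 0.0]       # toy ground-truth stem
good_estimate = [0.9, 0.0, -0.9, 0.0]   # small error
poor_estimate = [0.5, 0.3, -0.5, -0.3]  # larger error

print(round(snr_db(reference, good_estimate), 2))
print(round(snr_db(reference, poor_estimate), 2))
```

A higher value means the estimate is closer to the reference, so the first estimate scores better than the second.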

Future Applications

Banquet's query-based setup opens up possibilities for separating a wide range of instruments without a matching growth in computational cost. This has potential applications in music production, remixing, and automatic transcription, where separation of individual instruments is crucial. Banquet's implementation is publicly available at https://github.com/kwatcharasupat/query-bandit, allowing further exploration and application in audio processing tasks.

Conclusion

In conclusion, Banquet is a stem-agnostic single-decoder system that separates multiple stems while remaining computationally feasible. Its ability to support stems beyond a fixed, pre-defined set and to extract fine-level stems sets it apart from most existing systems. Given its results on guitar, piano, and less common stems such as reeds and organs, Banquet points toward music source separation that is no longer tied to the four-stem setup.

Created on 15 Mar. 2025
