Music Source Separation with Band-Split RoPE Transformer

Bibliographic Details
Main Authors:	Lu, Wei-Tsung; Wang, Ju-Chiang; Kong, Qiuqiang; Hung, Yun-Ning
Format:	Text
Language:	unknown
Published:	2023
Subjects:	Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing; sami
Online Access:	http://arxiv.org/abs/2309.02612

collection	ArXiv.org (Cornell University Library)
description	Music source separation (MSS) aims to separate a music recording into multiple musically distinct stems, such as vocals, bass, drums, and more. Recently, deep learning approaches such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been used, but the improvement is still limited. In this paper, we propose a novel frequency-domain approach based on a Band-Split RoPE Transformer (called BS-RoFormer). BS-RoFormer relies on a band-split module to project the input complex spectrogram into subband-level representations, and then arranges a stack of hierarchical Transformers to model the inner-band as well as inter-band sequences for multi-band mask estimation. To facilitate training the model for MSS, we propose to use the Rotary Position Embedding (RoPE). The BS-RoFormer system trained on MUSDB18HQ and 500 extra songs ranked first in the MSS track of the Sound Demixing Challenge (SDX23). Benchmarking a smaller version of BS-RoFormer on MUSDB18HQ, we achieve a state-of-the-art result without extra training data, with 9.80 dB of average SDR. Comment: This paper explains the SAMI-ByteDance MSS system submitted to the Sound Demixing Challenge (SDX23) Music Separation Track. Version 2 of the paper fixes some typos.
id	ftarxivpreprints:oai:arXiv.org:2309.02612
institution	Open Polar
date	2023-09-05
