Music Source Separation with Band-Split RoPE Transformer ...
Music source separation (MSS) aims to separate a music recording into multiple musically distinct stems, such as vocals, bass, drums, and more. Recently, deep learning approaches such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been used, but the improvement is...
Main Authors: | Lu, Wei-Tsung; Wang, Ju-Chiang; Kong, Qiuqiang; Hung, Yun-Ning |
---|---|
Format: | Article in Journal/Newspaper |
Language: | unknown |
Published: | arXiv, 2023 |
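Subjects: | Sound; cs.SD; Audio and Speech Processing; eess.AS; FOS Computer and information sciences; FOS Electrical engineering, electronic engineering, information engineering |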
Online Access: | https://dx.doi.org/10.48550/arxiv.2309.02612 https://arxiv.org/abs/2309.02612 |
id |
ftdatacite:10.48550/arxiv.2309.02612 |
---|---|
record_format |
openpolar |
spelling |
ftdatacite:10.48550/arxiv.2309.02612 2023-11-05T03:44:50+01:00 Music Source Separation with Band-Split RoPE Transformer ... Lu, Wei-Tsung Wang, Ju-Chiang Kong, Qiuqiang Hung, Yun-Ning 2023 https://dx.doi.org/10.48550/arxiv.2309.02612 https://arxiv.org/abs/2309.02612 unknown arXiv Creative Commons Attribution 4.0 International https://creativecommons.org/licenses/by/4.0/legalcode cc-by-4.0 Sound cs.SD Audio and Speech Processing eess.AS FOS Computer and information sciences FOS Electrical engineering, electronic engineering, information engineering Article article CreativeWork Preprint 2023 ftdatacite https://doi.org/10.48550/arxiv.2309.02612 2023-10-09T10:53:02Z Article in Journal/Newspaper sami DataCite Metadata Store (German National Library of Science and Technology) |
institution |
Open Polar |
collection |
DataCite Metadata Store (German National Library of Science and Technology) |
op_collection_id |
ftdatacite |
language |
unknown |
topic |
Sound cs.SD Audio and Speech Processing eess.AS FOS Computer and information sciences FOS Electrical engineering, electronic engineering, information engineering |
description |
Music source separation (MSS) aims to separate a music recording into multiple musically distinct stems, such as vocals, bass, drums, and more. Recently, deep learning approaches such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been used, but the improvement is still limited. In this paper, we propose a novel frequency-domain approach based on a Band-Split RoPE Transformer (called BS-RoFormer). BS-RoFormer relies on a band-split module to project the input complex spectrogram into subband-level representations, and then arranges a stack of hierarchical Transformers to model the inner-band as well as inter-band sequences for multi-band mask estimation. To facilitate training the model for MSS, we propose to use the Rotary Position Embedding (RoPE). The BS-RoFormer system trained on MUSDB18HQ and 500 extra songs ranked first in the MSS track of the Sound Demixing Challenge (SDX23). Benchmarking a smaller version of BS-RoFormer on MUSDB18HQ, we achieve ... : This paper explains the SAMI-ByteDance MSS system submitted to the Sound Demixing Challenge (SDX23) Music Separation Track. Version 2 of the paper fixed some typos ... |
format |
Article in Journal/Newspaper |
author |
Lu, Wei-Tsung Wang, Ju-Chiang Kong, Qiuqiang Hung, Yun-Ning |
title |
Music Source Separation with Band-Split RoPE Transformer ... |
publisher |
arXiv |
publishDate |
2023 |
url |
https://dx.doi.org/10.48550/arxiv.2309.02612 https://arxiv.org/abs/2309.02612 |
genre |
sami |
op_rights |
Creative Commons Attribution 4.0 International https://creativecommons.org/licenses/by/4.0/legalcode cc-by-4.0 |
op_doi |
https://doi.org/10.48550/arxiv.2309.02612 |
_version_ |
1781705743553003520 |