Paper Title

Hybrid Transformers for Music Source Separation

Paper Authors

Simon Rouard, Francisco Massa, Alexandre Défossez

Paper Abstract

A natural question arising in Music Source Separation (MSS) is whether long-range contextual information is useful, or whether local acoustic features are sufficient. In other fields, attention-based Transformers have shown their ability to integrate information over long sequences. In this work, we introduce Hybrid Transformer Demucs (HT Demucs), a hybrid temporal/spectral bi-U-Net based on Hybrid Demucs, where the innermost layers are replaced by a cross-domain Transformer Encoder, using self-attention within one domain and cross-attention across domains. While it performs poorly when trained only on MUSDB, we show that it outperforms Hybrid Demucs (trained on the same data) by 0.45 dB of SDR when using 800 extra training songs. Using sparse attention kernels to extend its receptive field, and per-source fine-tuning, we achieve state-of-the-art results on MUSDB with extra training data, reaching 9.20 dB of SDR.
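To make the attention pattern described in the abstract concrete, below is a minimal PyTorch sketch of one cross-domain Transformer layer: self-attention inside each of the time and spectral branches, followed by cross-attention where each branch attends to the other. All names here (CrossDomainLayer, dim, num_heads) are hypothetical, and this simplification omits details of the actual HT Demucs implementation, such as positional embeddings, layer ordering, and the sparse attention kernels mentioned above.

import torch
import torch.nn as nn

class CrossDomainLayer(nn.Module):
    # Hypothetical cross-domain Transformer layer: self-attention
    # within each domain, then cross-attention across domains.
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Independent self-attention blocks for the time and spectral branches.
        self.self_time = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.self_spec = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        # Cross-attention: each branch queries the other branch's tokens.
        self.cross_time = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_spec = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_time = nn.LayerNorm(dim)
        self.norm_spec = nn.LayerNorm(dim)

    def forward(self, xt: torch.Tensor, xs: torch.Tensor):
        # xt: (batch, time_tokens, dim); xs: (batch, spec_tokens, dim)
        xt = self.self_time(xt)
        xs = self.self_spec(xs)
        # The time branch attends to spectral tokens, then the spectral
        # branch attends to the updated time tokens, each with a residual
        # connection and layer norm.
        xt = self.norm_time(xt + self.cross_time(xt, xs, xs)[0])
        xs = self.norm_spec(xs + self.cross_spec(xs, xt, xt)[0])
        return xt, xs

# Example usage with arbitrary token counts and model width.
layer = CrossDomainLayer(dim=384)
xt = torch.randn(1, 1024, 384)  # time-branch tokens
xs = torch.randn(1, 512, 384)   # spectral-branch tokens
xt, xs = layer(xt, xs)

The real model stacks several such layers inside the bi-U-Net and flattens the 2-D spectral representation into a token sequence before attention; this sketch only illustrates the self-/cross-attention pattern.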
