Multi-Instrumentalist Net: Unsupervised Generation of Music from Body Movements

Paper Title

Multi-Instrumentalist Net: Unsupervised Generation of Music from Body Movements

Authors

Kun Su, Xiulong Liu, Eli Shlizerman

Abstract

We propose a novel system that takes as input the body movements of a musician playing a musical instrument and generates music in an unsupervised setting. Learning to generate multi-instrumental music from videos without labeling the instruments is a challenging problem. To achieve the transformation, we built a pipeline named 'Multi-Instrumentalist Net' (MI Net). At its base, the pipeline learns a discrete latent representation of the music of various instruments from log-spectrograms using a Vector Quantized Variational Autoencoder (VQ-VAE) with multi-band residual blocks. The pipeline is then trained along with an autoregressive prior conditioned on the musician's body keypoint movements, which are encoded by a recurrent neural network. Joint training of the prior with the body movement encoder succeeds in disentangling the music into latent features that indicate the musical components and the instrumental features. The resulting latent space contains distributions clustered by instrument, from which new music can be generated. Furthermore, the VQ-VAE architecture supports detailed music generation with additional conditioning. We show that MIDI can further condition the latent space so that the pipeline generates the exact content of the music being played by the instrument in the video. We evaluate MI Net on two datasets containing videos of 13 instruments and obtain generated music of reasonable audio quality that is easily associated with the corresponding instrument and consistent with the music audio content.
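The discrete latent representation mentioned in the abstract rests on the quantization step common to VQ-VAE models: each continuous encoder output is replaced by its nearest entry in a learned codebook. The following is a minimal NumPy sketch of that lookup only, not the MI Net implementation. The function name `quantize`, the array shapes, and the toy codebook values are all illustrative assumptions.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (squared L2).

    latents:  (T, D) array of continuous encoder outputs (hypothetical shape).
    codebook: (K, D) array of code vectors (learned in a real model).
    Returns (indices, quantized) with shapes (T,) and (T, D).
    """
    # Pairwise differences between every latent and every code: (T, K, D)
    diffs = latents[:, None, :] - codebook[None, :, :]
    # Squared distances summed over the feature axis: (T, K)
    dists = np.sum(diffs ** 2, axis=-1)
    # Index of the closest code for each latent vector
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]

# Toy values chosen for illustration only.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
latents = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 1.9]])
indices, quantized = quantize(latents, codebook)
```

In a full VQ-VAE the resulting index sequence is what an autoregressive prior models; here, per the abstract, that prior would additionally be conditioned on encoded body keypoint movements.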
