Paper Title
Exploring speaker enrolment for few-shot personalisation in emotional vocalisation prediction
Paper Authors
Paper Abstract
In this work, we explore a novel few-shot personalisation architecture for emotional vocalisation prediction. The core contribution is an `enrolment' encoder which utilises two unlabelled samples of the target speaker to adjust the output of the emotion encoder; the adjustment is based on dot-product attention, thus effectively functioning as a form of `soft' feature selection. The emotion and enrolment encoders are based on two standard audio architectures: CNN14 and CNN10. The two encoders are further guided to forget or learn auxiliary emotion and/or speaker information. Our best approach achieves a CCC of $.650$ on the ExVo Few-Shot dev set, a $2.5\%$ increase over our baseline CNN14 CCC of $.634$.
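The adjustment mechanism the abstract describes — scaled dot-product attention from the emotion embedding to the two enrolment embeddings, whose result re-weights the emotion features element-wise — could be sketched as follows. This is a minimal illustration under assumed dimensions; the function names, the scaling, and the exact form of the element-wise gating are assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def personalise(emotion_feat, enrol_feats):
    """Hypothetical sketch of the enrolment-based adjustment.

    emotion_feat: (d,) embedding from the emotion encoder.
    enrol_feats:  (2, d) embeddings of the two unlabelled
                  enrolment samples of the target speaker.
    """
    d = emotion_feat.shape[-1]
    # Scaled dot-product attention: emotion embedding as query,
    # enrolment embeddings as keys and values.
    scores = enrol_feats @ emotion_feat / np.sqrt(d)   # (2,)
    weights = softmax(scores)
    speaker_context = weights @ enrol_feats            # (d,)
    # Element-wise re-weighting of the emotion features,
    # acting as a form of 'soft' feature selection.
    return speaker_context * emotion_feat

d = 8
emotion_feat = rng.standard_normal(d)
enrol_feats = rng.standard_normal((2, d))  # two enrolment samples
adjusted = personalise(emotion_feat, enrol_feats)
assert adjusted.shape == (d,)
```

In this reading, the attention pools the speaker information from the two enrolment samples into a single context vector, which then scales each emotion-feature dimension up or down rather than hard-selecting features.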