Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention Guided Heterogeneous Translator

Paper Title

Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention Guided Heterogeneous Translator

Authors

Xiaofeng Liu, Fangxu Xing, Jerry L. Prince, Jiachen Zhuo, Maureen Stone, Georges El Fakhri, Jonghye Woo

Abstract

Understanding the underlying relationship between tongue and oropharyngeal muscle deformation seen in tagged-MRI and intelligible speech plays an important role in advancing speech motor control theories and the treatment of speech-related disorders. Because of their heterogeneous representations, however, direct mapping between the two modalities, i.e., a two-dimensional (mid-sagittal slice) plus time tagged-MRI sequence and its corresponding one-dimensional waveform, is not straightforward. Instead, we resort to two-dimensional spectrograms, which contain both pitch and resonance, as an intermediate representation, and develop an end-to-end deep learning framework that translates a sequence of tagged-MRI into its corresponding audio waveform with limited dataset size. Our framework is based on a novel fully convolutional asymmetry translator guided by a self residual attention strategy that specifically exploits the moving muscular structures during speech. In addition, we leverage the pairwise correlation of samples with the same utterances using a latent space representation disentanglement strategy. Furthermore, we incorporate an adversarial training approach with generative adversarial networks to improve the realism of the generated spectrograms. Our experimental results, carried out with a total of 63 tagged-MRI sequences alongside speech acoustics, showed that our framework enabled the generation of clear audio waveforms from a sequence of tagged-MRI, surpassing competing methods. Thus, our framework offers great potential to help better understand the relationship between the two modalities.
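As a rough illustration of the intermediate representation mentioned in the abstract, the sketch below turns a one-dimensional waveform into a two-dimensional time-frequency magnitude array using a short-time Fourier transform. It is not the authors' pipeline: the function name `magnitude_spectrogram`, the frame length, the hop size, the sample rate, and the test tone are all assumptions chosen for readability, and a plain DFT is used so that the example needs only the standard library.

```python
import cmath
import math


def magnitude_spectrogram(waveform, frame_len=64, hop=32):
    """Return a list of frames; each frame holds DFT magnitudes for bins 0..frame_len//2.

    Hypothetical helper for illustration only (not taken from the paper).
    """
    if len(waveform) < frame_len:
        raise ValueError("waveform is shorter than one frame")
    # A Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    # Precomputed DFT basis; a practical pipeline would call an FFT routine instead.
    basis = [[cmath.exp(-2j * math.pi * k * n / frame_len) for n in range(frame_len)]
             for k in range(n_bins)]
    spectrogram = []
    for start in range(0, len(waveform) - frame_len + 1, hop):
        frame = [waveform[start + n] * window[n] for n in range(frame_len)]
        spectrogram.append([abs(sum(x * b for x, b in zip(frame, row)))
                            for row in basis])
    return spectrogram


SAMPLE_RATE = 1600  # Hz, assumed value for the toy signal
TONE_HZ = 200       # assumed test tone; falls exactly on a DFT bin (25 Hz spacing)

# One second of a pure sine tone stands in for a speech waveform.
tone = [math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
spec = magnitude_spectrogram(tone)

# Index of the strongest frequency bin in every frame.
peak_bins = {max(range(len(frame)), key=frame.__getitem__) for frame in spec}
print(len(spec), len(spec[0]), peak_bins)
```

Each row of `spec` is one time frame and each column one frequency bin, so the waveform becomes an image-like array that a convolutional translator can target. Speech models typically add a mel or logarithmic frequency scaling on top of this, which is omitted here.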
