Title

Generative Adversarial Phonology: Modeling unsupervised phonetic and phonological learning with neural networks

Authors

Beguš, Gašper

Abstract

Training deep neural networks on well-understood dependencies in speech data can provide new insights into how they learn internal representations. This paper argues that acquisition of speech can be modeled as a dependency between random space and generated speech data in the Generative Adversarial Network architecture and proposes a methodology to uncover the network's internal representations that correspond to phonetic and phonological properties. The Generative Adversarial architecture is uniquely appropriate for modeling phonetic and phonological learning because the network is trained on unannotated raw acoustic data and learning is unsupervised without any language-specific assumptions or pre-assumed levels of abstraction. A Generative Adversarial Network was trained on an allophonic distribution in English. The network successfully learns the allophonic alternation: the network's generated speech signal contains the conditional distribution of aspiration duration. The paper proposes a technique for establishing the network's internal representations that identifies latent variables that correspond to, for example, presence of [s] and its spectral properties. By manipulating these variables, we actively control the presence of [s] and its frication amplitude in the generated outputs. This suggests that the network learns to use latent variables as an approximation of phonetic and phonological representations. Crucially, we observe that the dependencies learned in training extend beyond the training interval, which allows for additional exploration of learning representations. The paper also discusses how the network's architecture and innovative outputs resemble and differ from linguistic behavior in language acquisition, speech disorders, and speech errors, and how well-understood dependencies in speech data can help us interpret how neural networks learn their representations.
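The latent-variable probing described in the abstract can be illustrated with a short sketch. The code below is not the paper's implementation: the `Generator` class, the latent dimensionality of 100, the uniform [-1, 1] latent distribution, and the probed index `probe_dim` are all assumptions made for illustration. It shows the core idea of holding a latent vector fixed, sweeping a single latent variable (including values outside the training interval), and measuring an acoustic property of the generated waveform as a crude proxy for frication amplitude.

```python
import torch

# Hypothetical generator with a WaveGAN-style interface: it maps a latent
# vector z (assumed to be sampled uniformly from [-1, 1]^100 during training)
# to a raw audio waveform. The architecture here is a toy MLP stand-in.
class Generator(torch.nn.Module):
    def __init__(self, latent_dim=100, n_samples=16384):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(latent_dim, 1024),
            torch.nn.ReLU(),
            torch.nn.Linear(1024, n_samples),
            torch.nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

generator = Generator()  # in practice, load a trained model instead

# Probe one latent dimension: keep the rest of z fixed and sweep a single
# variable, including values beyond the [-1, 1] training interval, to see
# how strongly it controls a phonetic property of the output.
torch.manual_seed(0)
z_base = torch.rand(1, 100) * 2 - 1   # uniform on [-1, 1]
probe_dim = 7                         # assumed index of the variable of interest

for value in [-4.0, -1.0, 0.0, 1.0, 4.0]:   # values beyond +/-1 test extrapolation
    z = z_base.clone()
    z[0, probe_dim] = value
    with torch.no_grad():
        waveform = generator(z)              # shape: (1, n_samples)
    rms = waveform.pow(2).mean().sqrt()      # crude proxy for frication amplitude
    print(f"z[{probe_dim}] = {value:+.1f}  ->  output RMS {rms.item():.4f}")
```

With a trained generator, a monotonic relationship between the probed variable and an acoustic measurement (here, RMS amplitude over the [s] interval) is the kind of evidence the paper uses to argue that individual latent variables approximate phonetic and phonological representations.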
