Paper Title
Unsupervised Spoken Term Discovery Based on Re-clustering of Hypothesized Speech Segments with Siamese and Triplet Networks
Paper Authors
Paper Abstract
Spoken term discovery from untranscribed speech audio can be achieved via a two-stage process. In the first stage, the unlabelled speech is decoded into a sequence of subword units that are learned and modelled in an unsupervised manner. In the second stage, partial sequence matching and clustering are performed on the decoded subword sequences, resulting in a set of discovered words or phrases. A limitation of this approach is that the results of subword decoding can be erroneous, and the errors propagate to the subsequent steps. While Siamese and Triplet networks are one approach to learning segment representations that can improve the discovery process, the challenge in spoken term discovery under a completely unsupervised scenario is that training examples are unavailable. In this paper, we propose to generate training examples from initial hypothesized sequence clusters. The Siamese/Triplet network is trained on the hypothesized examples to measure the similarity between two speech segments and thereby perform re-clustering of all hypothesized subword sequences to achieve spoken term discovery. Experimental results show that the proposed approach is effective in obtaining training examples for Siamese and Triplet networks, improving the efficacy of spoken term discovery compared with the original two-stage method.
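The idea of generating training examples from initial hypothesized clusters can be illustrated with a minimal sketch. The abstract does not specify the sampling strategy, distance, or loss, so everything here is an assumption made for illustration: the function names `make_triplets`, `sq_dist`, and `triplet_loss` are hypothetical, negatives are drawn uniformly at random from other clusters, and the loss is the standard margin-based triplet loss on squared Euclidean distance.

```python
import itertools
import random


def make_triplets(clusters, seed=0):
    """Build (anchor, positive, negative) triplets from hypothesized clusters.

    clusters: dict mapping a cluster id to a list of segment ids.
    Anchor and positive share a cluster; the negative is drawn at random
    from any other cluster (an assumed sampling strategy, not the paper's).
    """
    rng = random.Random(seed)
    triplets = []
    for cid, members in clusters.items():
        # Pool of segments belonging to every other hypothesized cluster.
        others = [s for oid, segs in clusters.items() if oid != cid for s in segs]
        if len(members) < 2 or not others:
            continue  # need at least one positive pair and one negative
        for anchor, positive in itertools.combinations(members, 2):
            triplets.append((anchor, positive, rng.choice(others)))
    return triplets


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss: zero once the negative is far enough away."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


# Toy hypothesized clusters: segment ids grouped by the initial two-stage system.
clusters = {"c0": ["s0", "s1", "s2"], "c1": ["s3", "s4"]}
triplets = make_triplets(clusters)
```

In a real system the segment ids would map to learned embeddings of variable-length speech segments, and the loss would be minimized with a neural network; a Siamese variant would use same-cluster and different-cluster pairs instead of triplets.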