Title
Three-Dimensional Lip Motion Network for Text-Independent Speaker Recognition
Authors
Abstract
Lip motion reflects the behavioral characteristics of speakers, and thus can be used as a new kind of biometrics in speaker recognition. In the literature, many works have used two-dimensional (2D) lip images to recognize speakers in a text-dependent context. However, 2D lip images easily suffer from variations in face orientation. To this end, in this work, we present a novel end-to-end 3D Lip Motion Network (3LMNet) that utilizes sentence-level 3D lip motion (S3DLM) to recognize speakers in both text-independent and text-dependent contexts. A new regional feedback module (RFM) is proposed to obtain attention over different lip regions. Besides, prior knowledge of lip motion is investigated to complement the RFM, where landmark-level and frame-level features are merged to form a better feature representation. Moreover, we present two methods, i.e., coordinate transformation and face posture correction, to pre-process the LSD-AV dataset, which contains 68 speakers and 146 sentences per speaker. The evaluation results on this dataset demonstrate that our proposed 3LMNet is superior to the baseline models, i.e., LSTM, VGG-16 and ResNet-34, and outperforms state-of-the-art methods using 2D lip images as well as 3D faces. The code of this work is released at https://github.com/wutong18/Three-Dimensional-Lip-Motion-Network-for-Text-Independent-Speaker-Recognition.
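The abstract names two pre-processing steps, coordinate transformation and face posture correction, without detailing the algorithms used. As an illustration only, a common generic way to normalize head pose in sequences of 3D facial landmarks is rigid alignment of each frame to a reference frame (the Kabsch/Procrustes method), followed by re-expressing coordinates relative to the lip region. The function names, the choice of Kabsch alignment, and the lip-centering scheme below are assumptions for this sketch, not the paper's actual method.

```python
import numpy as np

def correct_posture(frame_pts, ref_pts):
    """Hypothetical posture correction: rigidly align one frame of 3D
    landmarks (N, 3) to a reference frame via the Kabsch algorithm."""
    # Center both point sets at their centroids.
    mu_f, mu_r = frame_pts.mean(axis=0), ref_pts.mean(axis=0)
    A, B = frame_pts - mu_f, ref_pts - mu_r
    # Optimal rotation from the SVD of the cross-covariance matrix.
    U, _, Vt = np.linalg.svd(A.T @ B)
    d = np.sign(np.linalg.det(U @ Vt))  # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    # Rotate into the reference orientation, keep the reference centroid.
    return A @ R + mu_r

def to_lip_coordinates(pts, lip_idx):
    """Hypothetical coordinate transformation: shift the origin to the
    centroid of the lip landmarks, removing global head translation."""
    return pts - pts[lip_idx].mean(axis=0)
```

Applied per frame before feature extraction, such a step would make the landmark sequence invariant to head rotation and translation, which is one plausible reading of why the abstract reports 3D lip motion outperforming 2D lip images under varying face orientations.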