Sharp Multiple Instance Learning for DeepFake Video Detection

Paper title

Sharp Multiple Instance Learning for DeepFake Video Detection

Authors

Xiaodan Li, Yining Lang, Yuefeng Chen, Xiaofeng Mao, Yuan He, Shuhui Wang, Hui Xue, Quan Lu

Abstract

With the rapid development of facial manipulation techniques, face forgery has received considerable attention in the multimedia and computer vision communities due to security concerns. Existing methods are mostly designed either for single-frame detection trained with precise image-level labels, or for video-level prediction that models only the inter-frame inconsistency, leaving potentially high risks from DeepFake attackers. In this paper, we introduce a new problem of partial face attack in DeepFake video, where only video-level labels are provided but not all the faces in the fake videos are manipulated. We address this problem with a multiple instance learning (MIL) framework, treating faces and the input video as instances and bag, respectively. A sharp MIL (S-MIL) is proposed, which builds a direct mapping from instance embeddings to bag prediction, rather than from instance embeddings to instance prediction and then to bag prediction as in traditional MIL. Theoretical analysis proves that the gradient vanishing in traditional MIL is relieved in S-MIL. To generate instances that can accurately incorporate the partially manipulated faces, a spatial-temporal encoded instance is designed to fully model the intra-frame and inter-frame inconsistency, which further helps to improve detection performance. We also construct a new dataset, FFPMS, for partially attacked DeepFake video detection, which can benefit the evaluation of different methods at both the frame and video levels. Experiments on FFPMS and the widely used DFDC dataset verify that S-MIL is superior to other counterparts for partially attacked DeepFake video detection. In addition, S-MIL can also be adapted to traditional DeepFake image detection tasks and achieves state-of-the-art performance on single-frame datasets.
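The contrast the abstract draws between the two mappings can be illustrated with a small sketch. This is not the paper's S-MIL formulation, which the abstract does not spell out. It is a generic comparison, under our own assumptions, of instance-level pooling (score each face, then pool the scores) against embedding-level pooling (pool the face embeddings, then score the bag once). The function names, the linear classifier `w`, the attention vector `attn`, and the toy embeddings are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def instance_level_bag_score(embeddings, w):
    """Traditional-MIL style: predict per instance, then max-pool to the bag."""
    return max(sigmoid(dot(w, h)) for h in embeddings)


def embedding_level_bag_score(embeddings, w, attn):
    """Direct mapping: attention-pool the embeddings, then classify the bag once.

    Softmax attention is an illustrative choice, not the paper's aggregation.
    """
    logits = [dot(attn, h) for h in embeddings]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(embeddings[0])
    pooled = [sum(wt * h[k] for wt, h in zip(weights, embeddings))
              for k in range(dim)]
    return sigmoid(dot(w, pooled))


# A toy "video" (bag) with three face embeddings (instances); values are made up.
bag = [[0.1, -0.2], [2.0, 1.5], [-0.3, 0.4]]
w = [1.0, 0.5]     # hypothetical linear classifier
attn = [0.8, 0.2]  # hypothetical attention vector

print(instance_level_bag_score(bag, w))
print(embedding_level_bag_score(bag, w, attn))
```

In the second function the bag score depends on every embedding through the attention weights, whereas max pooling passes gradient only through the single highest-scoring instance. That is one intuition for why a direct embedding-to-bag mapping may train more easily, although the paper's own theoretical analysis should be consulted for the actual argument.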
