Paper Title

UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

Authors

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, Yu Qiao

Abstract

Learning discriminative spatiotemporal representation is the key problem of video understanding. Recently, Vision Transformers (ViTs) have shown their power in learning long-term video dependency with self-attention. Unfortunately, they exhibit limitations in tackling local video redundancy, due to the blind global comparison among tokens. UniFormer has successfully alleviated this issue by unifying convolution and self-attention as a relation aggregator in the transformer format. However, this model requires a tiresome and complicated image-pretraining phase before being finetuned on videos, which blocks its wide usage in practice. In contrast, open-sourced ViTs are readily available and well pretrained with rich image supervision. Based on these observations, we propose a generic paradigm to build a powerful family of video networks by arming the pretrained ViTs with efficient UniFormer designs. We call this family UniFormerV2, since it inherits the concise style of the UniFormer block, but it contains brand-new local and global relation aggregators, which allow for a preferable accuracy-computation balance by seamlessly integrating the advantages of both ViTs and UniFormer. Without any bells and whistles, our UniFormerV2 achieves state-of-the-art recognition performance on 8 popular video benchmarks, including the scene-related Kinetics-400/600/700 and Moments in Time, the temporal-related Something-Something V1/V2, and the untrimmed ActivityNet and HACS. To the best of our knowledge, it is the first model to achieve 90% top-1 accuracy on Kinetics-400. Code will be available at https://github.com/OpenGVLab/UniFormerV2.
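
The general paradigm described in the abstract, reusing an image-pretrained ViT and arming it with local and global relation aggregators, can be pictured with a minimal PyTorch-style sketch. This is an illustrative assumption, not the authors' implementation: the module names (LocalRelationAggregator, GlobalRelationAggregator, ArmedViTBlock), the depthwise temporal convolution, the learnable video query, and all shapes and hyperparameters are placeholders chosen for clarity; the official repository linked above contains the actual code.

```python
# Minimal sketch (assumed design, not the official UniFormerV2 code):
# an image-pretrained ViT block is "armed" with a local relation aggregator
# (depthwise temporal convolution over tokens) and a global relation aggregator
# (cross-attention from a learnable query to all spatiotemporal tokens).

import torch
import torch.nn as nn


class LocalRelationAggregator(nn.Module):
    """Cheap temporal modeling via a depthwise 1D convolution (assumed design)."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.dwconv = nn.Conv1d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x):  # x: (B, T, N, C) video tokens
        b, t, n, c = x.shape
        y = self.norm(x)
        y = y.permute(0, 2, 3, 1).reshape(b * n, c, t)      # convolve along time
        y = self.dwconv(y).reshape(b, n, c, t).permute(0, 3, 1, 2)
        return x + y                                         # residual keeps ViT features intact


class GlobalRelationAggregator(nn.Module):
    """Video-level fusion via cross-attention to a learnable query (assumed design)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, T, N, C)
        b, t, n, c = x.shape
        tokens = self.norm(x.reshape(b, t * n, c))
        q = self.query.expand(b, -1, -1)
        out, _ = self.attn(q, tokens, tokens)                # (B, 1, C) video token
        return out.squeeze(1)


class ArmedViTBlock(nn.Module):
    """Wraps an image-pretrained ViT block with the two aggregators."""

    def __init__(self, vit_block, dim):
        super().__init__()
        self.local = LocalRelationAggregator(dim)
        self.vit_block = vit_block                           # reused image-pretrained weights
        self.global_agg = GlobalRelationAggregator(dim)

    def forward(self, x):  # x: (B, T, N, C)
        x = self.local(x)
        b, t, n, c = x.shape
        x = self.vit_block(x.reshape(b * t, n, c)).reshape(b, t, n, c)  # per-frame spatial ViT
        video_token = self.global_agg(x)                     # fused video-level representation
        return x, video_token
```

In this sketch the pretrained ViT block handles per-frame spatial modeling unchanged, the local aggregator adds a lightweight temporal residual before it, and the global aggregator pools all spatiotemporal tokens into a single video-level token, which is one plausible way to realize the accuracy-computation balance the abstract describes.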
