Paper Title
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
Paper Authors
Paper Abstract
This paper surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years. We group these approaches into three categories: ($i$) VLP for image-text tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding; ($ii$) VLP for core computer vision tasks, such as (open-set) image classification, object detection, and segmentation; and ($iii$) VLP for video-text tasks, such as video captioning, video-text retrieval, and video question answering. For each category, we present a comprehensive review of state-of-the-art methods, and discuss the progress that has been made and challenges still being faced, using specific systems and models as case studies. In addition, for each category, we discuss advanced topics being actively explored in the research community, such as big foundation models, unified modeling, in-context few-shot learning, knowledge, robustness, and computer vision in the wild, to name a few.