Title
XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding
Authors
Abstract
Transformer-based models are widely used in natural language understanding (NLU) tasks, and multimodal transformers have been effective in visual-language tasks. This study explores distilling visual information from pretrained multimodal transformers into pretrained language encoders. Our framework is inspired by the success of cross-modal encoders on visual-language tasks, while we alter the learning objective to cater to the language-heavy characteristics of NLU. After a small number of extra adaptation steps followed by finetuning, the proposed XDBERT (cross-modal distilled BERT) outperforms pretrained BERT on the General Language Understanding Evaluation (GLUE) and Situations With Adversarial Generations (SWAG) benchmarks, as well as on readability benchmarks. We analyze the performance of XDBERT on GLUE to show that the improvement is likely visually grounded.
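The abstract describes distilling a pretrained multimodal teacher's representations into a language-only student. A minimal sketch of that idea, under assumptions not stated in the abstract (toy vectors in place of real encoders, and a simple MSE alignment loss between teacher and student hidden states as the distillation objective):

```python
import numpy as np

# Hypothetical toy illustration (not the paper's actual code): a student
# text-encoder representation is nudged toward the hidden state of a frozen
# multimodal teacher on the same input, via an MSE alignment loss.

rng = np.random.default_rng(0)

hidden = 8                              # toy hidden size
teacher_h = rng.normal(size=hidden)     # frozen teacher representation
student_h = rng.normal(size=hidden)     # student representation (trainable)

def mse(a, b):
    """Mean-squared-error alignment loss between two hidden states."""
    return float(np.mean((a - b) ** 2))

lr = 0.5
losses = [mse(student_h, teacher_h)]
for _ in range(20):
    # gradient of MSE w.r.t. student_h is 2 * (student - teacher) / hidden
    student_h -= lr * 2.0 * (student_h - teacher_h) / hidden
    losses.append(mse(student_h, teacher_h))

print(losses[-1] < losses[0])  # the student drifts toward the teacher
```

In the actual method the student would subsequently be finetuned on a downstream NLU task; the sketch only shows the adaptation step in which visual grounding is transferred.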