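On the Robustness of Dialogue History Representation in Conversational Question Answering: A Comprehensive Study and a New Prompt-based Method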

Paper title

On the Robustness of Dialogue History Representation in Conversational Question Answering: A Comprehensive Study and a New Prompt-based Method

Authors

Zorik Gekhman, Nadav Oved, Orgad Keller, Idan Szpektor, Roi Reichart

Abstract

Most works on modeling the conversation history in Conversational Question Answering (CQA) report a single main result on a common CQA benchmark. While existing models show impressive results on CQA leaderboards, it remains unclear whether they are robust to shifts in setting (sometimes to more realistic ones), training data size (e.g. from large to small sets) and domain. In this work, we design and conduct the first large-scale robustness study of history modeling approaches for CQA. We find that high benchmark scores do not necessarily translate to strong robustness, and that various methods can perform extremely differently under different settings. Equipped with the insights from our study, we design a novel prompt-based history modeling approach, and demonstrate its strong robustness across various settings. Our approach is inspired by existing methods that highlight historic answers in the passage. However, instead of highlighting by modifying the passage token embeddings, we add textual prompts directly in the passage text. Our approach is simple, easy-to-plug into practically any model, and highly effective, thus we recommend it as a starting point for future model developers. We also hope that our study and insights will raise awareness to the importance of robustness-focused evaluation, in addition to obtaining high leaderboard scores, leading to better CQA systems.
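
The core idea in the abstract, highlighting historic answers by adding textual prompts to the passage text rather than modifying token embeddings, can be illustrated with a minimal sketch. The abstract does not specify the prompt format, so the `<k>` and `</k>` markers, the function name `add_history_prompts`, and the character-offset span representation are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple


def add_history_prompts(passage: str, answer_spans: List[Tuple[int, int]]) -> str:
    """Wrap each previous answer span in plain-text markers.

    answer_spans[i] is the (start, end) character span, end exclusive, of the
    answer given at history turn i + 1. The "<k>...</k>" marker format is a
    hypothetical choice; the abstract does not describe the actual prompts.
    """
    insertions = []
    for turn, (start, end) in enumerate(answer_spans, start=1):
        if not 0 <= start < end <= len(passage):
            raise ValueError(f"invalid span for turn {turn}: {(start, end)}")
        # The middle value breaks ties so that a closing marker lands before
        # an opening marker inserted at the same position.
        insertions.append((start, 1, f"<{turn}>"))
        insertions.append((end, 0, f"</{turn}>"))

    # Insert from right to left so earlier offsets remain valid.
    for position, _, marker in sorted(insertions, reverse=True):
        passage = passage[:position] + marker + passage[position:]
    return passage


if __name__ == "__main__":
    text = "Paris is the capital of France."
    # Turn 1 answered "Paris", turn 2 answered "France".
    print(add_history_prompts(text, [(0, 5), (24, 30)]))
```

Because the markers are ordinary text, the marked passage can be passed to a question answering model through its usual tokenizer, which matches the abstract's claim that the approach plugs into practically any model.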
