Paper Title

IDK-MRC: Unanswerable Questions for Indonesian Machine Reading Comprehension

Authors

Putri, Rifki Afina; Oh, Alice

Abstract

Machine Reading Comprehension (MRC) has become one of the essential tasks in Natural Language Understanding (NLU), as it is often included in NLU benchmarks (Liang et al., 2020; Wilie et al., 2020). However, most MRC datasets contain only answerable questions, overlooking the importance of unanswerable questions. MRC models trained only on answerable questions will select the span that is most likely to be the answer, even when the answer does not actually exist in the given passage (Rajpurkar et al., 2018). This problem persists especially in medium- to low-resource languages like Indonesian. Existing Indonesian MRC datasets (Purwarianti et al., 2007; Clark et al., 2020) are still inadequate because of their small size and limited question types, i.e., they only cover answerable questions. To fill this gap, we build a new Indonesian MRC dataset called I(n)don'tKnow-MRC (IDK-MRC) by combining automatic and manual unanswerable question generation to minimize the cost of manual dataset construction while maintaining dataset quality. Combined with the existing answerable questions, IDK-MRC consists of more than 10K questions in total. Our analysis shows that our dataset significantly improves the performance of Indonesian MRC models, showing a large improvement for unanswerable questions.
