Paper Title
Correcting the Autocorrect: Context-Aware Typographical Error Correction via Training Data Augmentation
Paper Authors
Paper Abstract
In this paper, we explore the artificial generation of typographical errors based on real-world statistics. We first draw on a small set of annotated data to compute spelling error statistics. These statistics are then used to introduce errors into substantially larger corpora. This generation methodology lets us produce particularly challenging errors that require context-aware error detection. We use it to create a set of English-language error detection and correction datasets. Finally, we examine the effectiveness of machine learning models for detecting and correcting errors based on this data. The datasets are available at http://typo.nlproc.org.
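The two-stage idea in the abstract (estimate error statistics from a small annotated set, then use them to inject errors into a larger corpus) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `substitution_stats` and `inject_errors`, the `rate` parameter, and the sample pairs are all assumptions made for illustration, and the sketch handles only single-character substitutions, without the context-aware selection of challenging errors that the abstract mentions.

```python
import random
from collections import Counter, defaultdict


def substitution_stats(pairs):
    """Count character substitutions in equal-length (typo, correct) pairs."""
    stats = defaultdict(Counter)
    for typo, correct in pairs:
        if len(typo) != len(correct):
            continue  # this sketch ignores insertions and deletions
        for wrong, right in zip(typo, correct):
            if wrong != right:
                stats[right][wrong] += 1
    return stats


def inject_errors(text, stats, rate, rng):
    """Replace each character that has observed substitutions with probability `rate`."""
    out = []
    for ch in text:
        options = stats.get(ch)
        if options and rng.random() < rate:
            chars = list(options)
            weights = [options[c] for c in chars]
            # Sample a replacement in proportion to its observed frequency.
            out.append(rng.choices(chars, weights=weights, k=1)[0])
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical annotated pairs, invented for illustration only.
annotated = [("teh", "the"), ("wprd", "word"), ("thw", "the"), ("cst", "cat")]
stats = substitution_stats(annotated)
clean = "the cat read the word"
noisy = inject_errors(clean, stats, rate=0.3, rng=random.Random(0))
print(noisy)
```

Passing a seeded `random.Random` instance keeps the generated corpus reproducible, which matters when the noisy text is later used as training data paired with the clean original.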