Paper Title

Iterative Data Programming for Expanding Text Classification Corpora

Authors

Neil Mallinar, Abhishek Shah, Tin Kam Ho, Rajendra Ugrani, Ayush Gupta

Abstract

Real-world text classification tasks often require many labeled training examples that are expensive to obtain. Recent advancements in machine teaching, specifically the data programming paradigm, facilitate the creation of training data sets quickly via a general framework for building weak models, also known as labeling functions, and denoising them through ensemble learning techniques. We present a fast, simple data programming method for augmenting text data sets by generating neighborhood-based weak models with minimal supervision. Furthermore, our method employs an iterative procedure to identify sparsely distributed examples from large volumes of unlabeled data. The iterative data programming techniques improve newer weak models as more labeled data is confirmed with human-in-loop. We show empirical results on sentence classification tasks, including those from a task of improving intent recognition in conversational agents.
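
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea: neighborhood-based weak models built from a few labeled seeds, applied iteratively with human confirmation. Everything concrete in it is an assumption made for illustration and not the authors' method: the bag-of-words cosine similarity, the `threshold` value, the plain majority vote standing in for ensemble-based denoising, and the hypothetical `confirm` callback standing in for the human reviewer.

```python
from collections import Counter
import math

ABSTAIN = None  # a weak model returns this when it has no opinion


def bow(text):
    """Bag-of-words vector as a Counter of lowercase tokens (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def make_neighborhood_lf(seed_text, label, threshold=0.5):
    """Weak model: vote `label` for texts inside the seed's neighborhood, else abstain."""
    seed_vec = bow(seed_text)

    def lf(text):
        return label if cosine(seed_vec, bow(text)) >= threshold else ABSTAIN

    return lf


def majority_vote(lfs, text):
    """Combine weak-model votes; a simple stand-in for a learned denoising model."""
    votes = Counter(v for v in (lf(text) for lf in lfs) if v is not ABSTAIN)
    return votes.most_common(1)[0][0] if votes else ABSTAIN


def iterate(seeds, unlabeled, confirm, rounds=2, threshold=0.5):
    """Each round rebuilds the weak models from all confirmed examples so far."""
    labeled, pool = list(seeds), list(unlabeled)
    for _ in range(rounds):
        lfs = [make_neighborhood_lf(t, y, threshold) for t, y in labeled]
        remaining = []
        for text in pool:
            guess = majority_vote(lfs, text)
            # Only keep a proposed label if the (simulated) human accepts it.
            if guess is not ABSTAIN and confirm(text, guess):
                labeled.append((text, guess))
            else:
                remaining.append(text)
        pool = remaining
    return labeled, pool


seeds = [("book a flight to paris", "travel"), ("what is my account balance", "banking")]
unlabeled = ["book a flight to rome", "book a hotel in rome",
             "show my account balance", "tell me a joke"]
labeled, pool = iterate(seeds, unlabeled, confirm=lambda text, label: True)
print(labeled)
print(pool)
```

In this toy run, a sentence that is too far from any original seed can still be picked up in a later round once a confirmed neighbor has been added, which is the intuition behind iterating. A real system would likely use stronger sentence representations and a proper label model.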
