Paper Title
Seeded Hierarchical Clustering for Expert-Crafted Taxonomies
Authors
Abstract
Practitioners from many disciplines (e.g., political science) use expert-crafted taxonomies to make sense of large, unlabeled corpora. In this work, we study Seeded Hierarchical Clustering (SHC): the task of automatically fitting unlabeled data to such taxonomies using only a small set of labeled examples. We propose HierSeed, a novel weakly supervised algorithm for this task that uses only a small set of labeled seed examples. It is both data-efficient and computationally efficient. HierSeed assigns documents to topics by weighing document density against topic hierarchical structure. It outperforms both unsupervised and supervised baselines for the SHC task on three real-world datasets.