Paper Title
VSEC-LDA: Boosting Topic Modeling with Embedded Vocabulary Selection
Paper Authors
Paper Abstract
Topic modeling has found wide application in many problems where latent structures of the data are crucial for typical inference tasks. When applying a topic model, a relatively standard pre-processing step is to first build a vocabulary of frequent words. Such a general pre-processing step is often independent of the topic modeling stage, and thus there is no guarantee that the pre-generated vocabulary can support the inference of an optimal (or even meaningful) topic model appropriate for a given task, especially for computer vision applications involving "visual words". In this paper, we propose a new approach to topic modeling, termed Vocabulary-Selection-Embedded Correspondence-LDA (VSEC-LDA), which learns the latent model while simultaneously selecting the most relevant words. The selection of words is driven by an entropy-based metric that measures the relative contribution of the words to the underlying model, and is done dynamically while the model is learned. We present three variants of VSEC-LDA and evaluate the proposed approach with experiments on both synthetic and real databases from different applications. The results demonstrate the effectiveness of built-in vocabulary selection and its importance in improving the performance of topic modeling.
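To make the idea of an entropy-driven word-relevance score concrete, here is a minimal illustrative sketch. It is not the paper's exact metric (which is not specified in the abstract); it assumes a simple score based on the entropy of p(topic | word): words whose topic association is concentrated on a few topics contribute more to the model than words spread uniformly across all topics. The function names, the uniform topic prior, and the frequency weighting are all assumptions for illustration.

```python
import numpy as np

def word_relevance_scores(phi, word_freq):
    """Score each word by how topic-discriminative it is.

    phi: (K, V) array, phi[k, w] = p(word w | topic k).
    word_freq: (V,) empirical word frequencies.

    Returns a (V,) array of scores in [0, max(word_freq)];
    higher means more topic-specific. (Illustrative entropy-style
    score, NOT the exact VSEC-LDA metric.)
    """
    # p(topic | word) via Bayes' rule with a uniform topic prior
    p_topic_given_word = phi / phi.sum(axis=0, keepdims=True)  # (K, V)
    # Entropy of p(topic | word): low entropy => word is topic-specific
    ent = -np.sum(p_topic_given_word * np.log(p_topic_given_word + 1e-12),
                  axis=0)                                       # (V,)
    n_topics = phi.shape[0]
    # Normalize entropy to [0, 1] and weight by word frequency
    return word_freq * (1.0 - ent / np.log(n_topics))

def select_vocabulary(phi, word_freq, keep_ratio=0.8):
    """Keep the top fraction of words by relevance score,
    mimicking a dynamic vocabulary-pruning step."""
    scores = word_relevance_scores(phi, word_freq)
    n_keep = max(1, int(keep_ratio * len(scores)))
    return np.argsort(scores)[::-1][:n_keep]
```

In a dynamic scheme like the one the abstract describes, a pruning step of this kind would be interleaved with model updates: after each learning iteration, low-scoring words are dropped and the topic-word distributions are re-estimated over the reduced vocabulary.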