Paper Title
A Self-supervised Representation Learning of Sentence Structure for Authorship Attribution
Paper Authors
Paper Abstract
The syntactic structure of sentences in a document is substantially informative about its author's writing style. Sentence representation learning has been widely explored in recent years, and it has been shown to improve the generalization of various downstream tasks across many domains. Even though probing studies suggest that these learned contextual representations implicitly encode some amount of syntax, explicit syntactic information further improves the performance of deep neural models in the domain of authorship attribution. These observations motivate us to investigate explicit representation learning of the syntactic structure of sentences. In this paper, we propose a self-supervised framework for learning structural representations of sentences. The self-supervised network contains two components: a lexical sub-network and a syntactic sub-network, which take as input the sequence of words and the sequence of their corresponding structural labels, respectively. Because of the n-to-1 mapping of words to their structural labels, each word is embedded into a vector representation that mainly carries structural information. We evaluate the learned structural representations of sentences with different probing tasks and subsequently use them in the authorship attribution task. Our experimental results indicate that the structural embeddings significantly improve the classification tasks when concatenated with existing pre-trained word embeddings.
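The n-to-1 mapping mentioned in the abstract can be made concrete with a small sketch. This is only an illustration, not the paper's implementation: it assumes the structural labels are simplified part-of-speech tags, and shows that lexically different sentences can share one structural-label sequence, which is why a representation driven by those labels carries structural rather than lexical information.

```python
# Hypothetical illustration of the n-to-1 word-to-structural-label mapping:
# many distinct words project onto the same structural label (here, a
# simplified POS tag), so two lexically different sentences can have an
# identical structural-label sequence.

def structural_labels(tagged_sentence):
    """Project a POS-tagged sentence onto its structural-label sequence."""
    return [tag for _, tag in tagged_sentence]

sent_a = [("the", "DT"), ("cat", "NN"), ("sat", "VBD")]
sent_b = [("a", "DT"), ("dog", "NN"), ("ran", "VBD")]

# Different words, identical structural-label sequences:
assert structural_labels(sent_a) == structural_labels(sent_b)
assert structural_labels(sent_a) == ["DT", "NN", "VBD"]
```

Under this mapping, a sub-network that only sees the label sequence cannot distinguish `sent_a` from `sent_b`, which is the intuition behind embedding words into vectors that mainly encode structure.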