Paper Title

Exploring Transfer Learning For End-to-End Spoken Language Understanding

Authors

Subendhu Rongali, Beiye Liu, Liwei Cai, Konstantine Arkoudas, Chengwei Su, Wael Hamza

Abstract

Voice assistants such as Alexa, Siri, and Google Assistant typically use a two-stage Spoken Language Understanding pipeline: first, an Automatic Speech Recognition (ASR) component processes customer speech and generates text transcriptions, followed by a Natural Language Understanding (NLU) component that maps transcriptions to an actionable hypothesis. An end-to-end (E2E) system that goes directly from speech to a hypothesis is a more attractive option. Such systems have been shown to be smaller, faster, and better optimized. However, they require massive amounts of end-to-end training data and, in addition, do not take advantage of already available ASR and NLU training data. In this work, we propose an E2E system designed to jointly train on multiple speech-to-text tasks, such as ASR (speech-transcription) and SLU (speech-hypothesis), and text-to-text tasks, such as NLU (text-hypothesis). We call this the Audio-Text All-Task (AT-AT) model, and we show that it outperforms E2E models trained on individual tasks, especially those trained on limited data. We show this result on an internal music dataset and two public datasets, FluentSpeech and SNIPS Audio, where we achieve state-of-the-art results. Since our model can process both speech and text input sequences and learn to predict a target sequence, it also allows us to perform zero-shot E2E SLU by training on only text-hypothesis data (without any speech) from a new domain. We evaluate this ability of our model on the Facebook TOP dataset and set a new benchmark for zero-shot E2E performance. We will soon release the audio data collected for the TOP dataset for future research.
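The abstract describes the AT-AT architecture only at a high level: a single sequence-to-sequence model that accepts either audio or text as input and emits a target token sequence, trained jointly on ASR, SLU, and NLU pairs. Below is a minimal PyTorch sketch of that idea, with modality-specific encoders feeding one shared decoder. The class name ATATSketch, all layer sizes, the omitted positional encodings, and the toy mixed batch are illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class ATATSketch(nn.Module):
    """Illustrative multi-task audio/text seq2seq model in the spirit of
    AT-AT. Hyperparameters are placeholders; positional encodings are
    omitted for brevity."""

    def __init__(self, vocab_size=1000, d_model=256, n_heads=4,
                 n_layers=2, audio_feat_dim=80):
        super().__init__()
        # Audio front end: project acoustic frames (e.g. log-filterbanks)
        # into the shared model dimension.
        self.audio_proj = nn.Linear(audio_feat_dim, d_model)
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Text encoder over token embeddings; the embedding table is shared
        # with the decoder so text inputs and targets live in one space.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # One decoder shared by every task (ASR, SLU, NLU): it only ever
        # attends to encoder memory, so the input modality is interchangeable.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt, modality):
        if modality == "audio":                      # src: (B, frames, feat)
            memory = self.audio_encoder(self.audio_proj(src))
        else:                                        # src: (B, tokens)
            memory = self.text_encoder(self.embed(src))
        t = tgt.size(1)
        # Standard additive causal mask for autoregressive decoding.
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        dec = self.decoder(self.embed(tgt), memory, tgt_mask=causal)
        return self.out(dec)                         # (B, T, vocab)


# One joint training step over a mixed batch: ASR (audio -> transcript),
# SLU (audio -> hypothesis), and NLU (text -> hypothesis) share one loss.
model = ATATSketch()
loss_fn = nn.CrossEntropyLoss()
tasks = [
    ("audio", torch.randn(2, 50, 80), torch.randint(0, 1000, (2, 12))),
    ("text", torch.randint(0, 1000, (2, 10)), torch.randint(0, 1000, (2, 12))),
]
loss = sum(
    loss_fn(model(src, tgt[:, :-1], mod).reshape(-1, 1000),
            tgt[:, 1:].reshape(-1))
    for mod, src, tgt in tasks)
loss.backward()
```

Because the shared decoder only ever attends to encoder memory, a decoder trained on text-hypothesis pairs from a new domain can still consume audio-encoder outputs at inference time. This is one plausible reading of how the zero-shot E2E SLU setup in the abstract works, not a confirmed implementation detail.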
