Paper Title

Vernacular Search Query Translation with Unsupervised Domain Adaptation

Paper Authors

Mandar Kulkarni, Nikesh Garera

Paper Abstract

With the democratization of e-commerce platforms, an increasingly diversified user base is opting to shop online. To provide a comfortable and reliable shopping experience, it is important to enable users to interact with the platform in the language of their choice. Accurate query translation is essential for Cross-Lingual Information Retrieval (CLIR) with vernacular queries. Due to internet-scale operations, e-commerce platforms receive millions of search queries every day. However, creating a parallel training set to train an in-domain translation model is cumbersome. This paper proposes an unsupervised domain adaptation approach to translate search queries without using any parallel corpus. We use an open-domain translation model (trained on a public corpus) and adapt it to the query data using only monolingual queries from the two languages. In addition, fine-tuning with a small labeled set further improves the results. For demonstration, we show results for Hindi-to-English query translation and use the mBART-large-50 model as the baseline to improve upon. Experimental results show that, without using any parallel corpus, we obtain an improvement of more than 20 BLEU points over the baseline, while fine-tuning with a small 50k labeled set provides an improvement of more than 27 BLEU points over the baseline.
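
The abstract does not spell out how the model is adapted beyond "using only monolingual queries"; a common way to realize this is back-translation, where the current model turns monolingual queries into synthetic pairs for further fine-tuning. Below is a minimal sketch under that assumption, using the public Hugging Face mbart-large-50 checkpoint as the open-domain baseline; the model name, language codes, and sample queries are illustrative and not taken from the paper.

```python
# A minimal sketch of unsupervised adaptation via back-translation.
# Assumptions: the public mBART-50 many-to-many checkpoint as the
# open-domain baseline; back-translation as the adaptation mechanism.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

def translate(queries, src_lang, tgt_lang):
    """Translate a batch of monolingual queries with the current model."""
    tokenizer.src_lang = src_lang
    batch = tokenizer(queries, return_tensors="pt", padding=True)
    generated = model.generate(
        **batch,
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
        max_length=32,  # search queries are short
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# Back-translate monolingual Hindi queries into synthetic English.
hindi_queries = ["सस्ते जूते", "लाल साड़ी"]  # illustrative vernacular queries
synthetic_english = translate(hindi_queries, "hi_IN", "en_XX")

# The resulting (synthetic_english -> hindi_queries) pairs can serve as
# pseudo-parallel data to fine-tune the model on the query domain, and
# the process can be repeated in both directions over several rounds.
```

The same fine-tuning loop would then be reused for the supervised stage the abstract mentions, simply swapping the pseudo-parallel pairs for the small 50k labeled set.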
