Paper Title

Vernacular Search Query Translation with Unsupervised Domain Adaptation

Paper Authors

Mandar Kulkarni, Nikesh Garera

Paper Abstract

With the democratization of e-commerce platforms, an increasingly diversified user base is opting to shop online. To provide a comfortable and reliable shopping experience, it is important to enable users to interact with the platform in the language of their choice. Accurate query translation is essential for Cross-Lingual Information Retrieval (CLIR) with vernacular queries. Due to internet-scale operations, e-commerce platforms receive millions of search queries every day. However, creating a parallel training set to train an in-domain translation model is cumbersome. This paper proposes an unsupervised domain adaptation approach to translate search queries without using any parallel corpus. We use an open-domain translation model (trained on a public corpus) and adapt it to the query data using only monolingual queries from the two languages. In addition, fine-tuning with a small labeled set further improves the results. For demonstration, we show results for Hindi-to-English query translation and use the mBART-large-50 model as the baseline to improve upon. Experimental results show that, without using any parallel corpus, we obtain an improvement of more than 20 BLEU points over the baseline, while fine-tuning with a small 50k labeled set provides an improvement of more than 27 BLEU points over the baseline.
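
The abstract does not spell out how the model is adapted beyond "using only monolingual queries"; a common way to realize this is back-translation, where the current model turns monolingual queries into synthetic pairs for further fine-tuning. Below is a minimal sketch under that assumption, using the public Hugging Face mbart-large-50 checkpoint as the open-domain baseline; the model name, language codes, and sample queries are illustrative and not taken from the paper.

```python
# A minimal sketch of unsupervised adaptation via back-translation.
# Assumptions: the public mBART-50 many-to-many checkpoint as the
# open-domain baseline; back-translation as the adaptation mechanism.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

def translate(queries, src_lang, tgt_lang):
    """Translate a batch of monolingual queries with the current model."""
    tokenizer.src_lang = src_lang
    batch = tokenizer(queries, return_tensors="pt", padding=True)
    generated = model.generate(
        **batch,
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
        max_length=32,  # search queries are short
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# Back-translate monolingual Hindi queries into synthetic English.
hindi_queries = ["सस्ते जूते", "लाल साड़ी"]  # illustrative vernacular queries
synthetic_english = translate(hindi_queries, "hi_IN", "en_XX")

# The resulting (synthetic_english -> hindi_queries) pairs can serve as
# pseudo-parallel data to fine-tune the model on the query domain, and
# the process can be repeated in both directions over several rounds.
```

The same fine-tuning loop would then be reused for the supervised stage the abstract mentions, simply swapping the pseudo-parallel pairs for the small 50k labeled set.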
