Paper Title

Deep Contextual Embeddings for Address Classification in E-commerce

Authors

Mangalgi, Shreyas; Kumar, Lakshya; Tallamraju, Ravindra Babu

Abstract

E-commerce customers in developing nations like India tend to follow no fixed format when entering shipping addresses. Parsing such addresses is challenging because they lack inherent structure or hierarchy. It is imperative to understand the language of addresses so that shipments can be routed without delays. In this paper, we propose a novel approach to understanding customer addresses that draws on recent advances in Natural Language Processing (NLP). We also formulate different pre-processing steps for addresses using a combination of edit distance and phonetic algorithms. We then approach the task of creating vector representations for addresses using Word2Vec with TF-IDF, Bi-LSTM, and BERT-based approaches. We compare these approaches on a sub-region classification task for North and South Indian cities. Through experiments, we demonstrate the effectiveness of a generalized RoBERTa model pre-trained over a large address corpus on a language modelling task. Our proposed RoBERTa model achieves a classification accuracy of around 90% on the sub-region classification task with minimal text preprocessing, outperforming all other approaches. Once pre-trained, the RoBERTa model can be fine-tuned for various downstream supply chain tasks such as pincode suggestion and geo-coding. The model generalizes well to such tasks even with limited labelled data. To the best of our knowledge, this is the first research of its kind to propose understanding customer addresses in the e-commerce domain by pre-training language models and fine-tuning them for different purposes.
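The abstract mentions pre-processing that combines edit distance with phonetic algorithms but does not say which phonetic algorithm is used or how the two signals are combined. The following is a minimal sketch under stated assumptions: `phonetic_key` is a simplified Soundex-style key (not the full standard algorithm), and `is_spelling_variant` with its `max_distance` threshold is a hypothetical helper for illustration, not the authors' method.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


# Consonant classes borrowed from Soundex; vowels, h, w and y get no code.
_CODES = {}
for _letters, _digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
    for _ch in _letters:
        _CODES[_ch] = _digit


def phonetic_key(token: str) -> str:
    """Simplified Soundex-style key (an assumption, not standard Soundex):
    first letter, then consonant class digits with adjacent repeats collapsed."""
    token = "".join(ch for ch in token.lower() if ch.isalpha())
    if not token:
        return ""
    key = token[0]
    last = _CODES.get(token[0], "")
    for ch in token[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != last:
            key += digit
        last = digit
    return key


def is_spelling_variant(a: str, b: str, max_distance: int = 2) -> bool:
    """Hypothetical rule: same phonetic key AND small edit distance."""
    return (phonetic_key(a) == phonetic_key(b)
            and edit_distance(a.lower(), b.lower()) <= max_distance)
```

Under this rule, tokens such as "nagar" and "nager" would be treated as variants of one another, while "nagar" and "road" would not. Requiring both conditions is one possible design choice: the phonetic key filters out words that merely look similar, and the edit distance bound filters out words that merely sound similar.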
