Paper Title

Lambda Learner: Fast Incremental Learning on Data Streams

Authors

Rohan Ramanath, Konstantin Salomatin, Jeffrey D. Gee, Kirill Talanine, Onkar Dalal, Gungor Polatkan, Sara Smoot, Deepak Kumar

Abstract

One of the most well-established applications of machine learning is in deciding what content to show website visitors. When observation data comes from high-velocity, user-generated data streams, machine learning methods perform a balancing act between model complexity, training time, and computational costs. Furthermore, when model freshness is critical, the training of models becomes time-constrained. Parallelized batch offline training, although horizontally scalable, is often not time-considerate or cost-effective. In this paper, we propose Lambda Learner, a new framework for training models by incremental updates in response to mini-batches from data streams. We show that the resulting model of our framework closely estimates a periodically updated model trained on offline data and outperforms it when model updates are time-sensitive. We provide theoretical proof that the incremental learning updates improve the loss function over a stale batch model. We present a large-scale deployment on the sponsored content platform for a large social network, serving hundreds of millions of users across different channels (e.g., desktop, mobile). We address challenges and complexities from both algorithms and infrastructure perspectives, and illustrate the system details for computation, storage, and streaming production of training data.
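To give a concrete picture of the training pattern the abstract describes, the following is a minimal sketch: a model warm-started from stale, batch-trained weights and then refreshed with incremental updates as mini-batches arrive from a stream. It assumes plain logistic regression with vanilla mini-batch gradient steps; the class name, hyperparameters, and synthetic data are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sigmoid(z):
    """Logistic function used both to score and to synthesize labels."""
    return 1.0 / (1.0 + np.exp(-z))

class IncrementalLogisticModel:
    """Logistic-regression scorer warm-started from batch-trained weights
    and refreshed with gradient steps on streaming mini-batches.
    Illustrative sketch only; names and update rule are assumptions."""

    def __init__(self, batch_weights, lr=0.05, l2=1e-4):
        self.w = np.asarray(batch_weights, dtype=float)  # the "stale" batch model
        self.lr = lr   # step size for incremental updates
        self.l2 = l2   # L2 regularization strength

    def predict(self, X):
        return sigmoid(X @ self.w)

    def update(self, X, y):
        """Apply one incremental update from a mini-batch (X, y)."""
        p = self.predict(X)
        grad = X.T @ (p - y) / len(y) + self.l2 * self.w
        self.w -= self.lr * grad

# Usage: warm-start from offline weights, then consume the stream mini-batch
# by mini-batch, so serving never waits for a full offline retrain.
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)                        # hypothetical "true" model
w_batch = w_true + rng.normal(scale=0.5, size=5)   # stale offline estimate
model = IncrementalLogisticModel(w_batch)
for _ in range(200):                               # stand-in for a live stream
    X = rng.normal(size=(32, 5))
    y = (rng.random(32) < sigmoid(X @ w_true)).astype(float)
    model.update(X, y)
```

The design point mirrored here is that stream updates start from, rather than replace, the offline model, so each mini-batch only needs to correct for drift accumulated since the last batch train.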
