Paper Title
Multi-Task Fusion via Reinforcement Learning for Long-Term User Satisfaction in Recommender Systems
Paper Authors
Paper Abstract
A Recommender System (RS) is an important online application that affects billions of users every day. The mainstream RS ranking framework is composed of two parts: a Multi-Task Learning (MTL) model that predicts various kinds of user feedback, e.g., clicks, likes, and shares, and a Multi-Task Fusion (MTF) model that combines the multi-task outputs into one final ranking score with respect to user satisfaction. Although the fusion model is the last crucial step of ranking and has a great impact on the final recommendation, it has received little research attention. To optimize long-term user satisfaction rather than greedily chase instant returns, we formulate the MTF task as a Markov Decision Process (MDP) within a recommendation session and propose a Batch Reinforcement Learning (RL) based Multi-Task Fusion framework (BatchRL-MTF) that consists of a Batch RL framework and an online exploration module. The former exploits Batch RL to learn an optimal recommendation policy offline from fixed batch data for long-term user satisfaction, while the latter explores potentially high-value actions online to escape local optima. Based on a comprehensive investigation of user behaviors, we model the user satisfaction reward with subtle heuristics from two aspects: user stickiness and user activeness. Finally, we conduct extensive experiments on a billion-sample real-world dataset to show the effectiveness of our model. We propose a conservative offline policy estimator (Conservative-OPEstimator) to test our model offline. Furthermore, we conduct online experiments in a real recommendation environment to compare the performance of different models. As one of the few Batch RL approaches applied successfully to the MTF task, our model has also been deployed on a large-scale industrial short-video platform, serving hundreds of millions of users.
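To make the formulation concrete, below is a minimal, hypothetical sketch of the MTF-as-MDP idea described in the abstract: the policy's action is a set of fusion weights that combine MTL outputs into one ranking score, and the session reward mixes user stickiness and user activeness signals. The weighted-sum fusion form, the task names, the `alpha` mixing parameter, and all numbers are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical MTL outputs for one candidate item: predicted probabilities
# of several feedback types (task names are illustrative, not the paper's exact heads).
mtl_preds = {"click": 0.31, "like": 0.05, "share": 0.01}

def fuse(preds, weights):
    """Combine multi-task predictions into a single ranking score.

    A simple weighted sum is assumed here; the abstract only states that MTF
    maps the multi-task outputs to one final score with respect to user satisfaction.
    """
    return sum(weights[task] * preds[task] for task in preds)

# In the RL view, the fusion weights are the *action* chosen by the policy for the
# current session state; consecutive requests within one session form the MDP horizon.
action = {"click": 1.0, "like": 2.5, "share": 4.0}  # hypothetical fusion weights
ranking_score = fuse(mtl_preds, action)

def session_reward(stickiness, activeness, alpha=0.7):
    """Illustrative long-term reward mixing user stickiness and user activeness,
    the two aspects named in the abstract; the actual heuristics are not shown here."""
    return alpha * stickiness + (1.0 - alpha) * activeness

print(f"ranking score: {ranking_score:.3f}")
print(f"session reward: {session_reward(stickiness=0.8, activeness=0.4):.3f}")
```

Under this reading, a greedy MTF tunes the weights to maximize the immediate engagement of each request, whereas BatchRL-MTF learns (offline, from logged batch data) a policy whose weight choices maximize the cumulative session reward, with online exploration perturbing the actions to avoid settling in a local optimum.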