Paper Title


Speculative Container Scheduling for Deep Learning Applications in a Kubernetes Cluster

Authors

Ying Mao, Yuqi Fu, Wenjia Zheng, Long Cheng, Qingzhi Liu, Dingwen Tao

Abstract


In the past decade, we have witnessed a dramatic increase in the volume of data collected from varied sources. The explosion of data has transformed the world, as more information is available for collection and analysis than ever before. To maximize its utilization, various machine and deep learning models, e.g., CNN [1] and RNN [2], have been developed to study data and extract valuable information from different perspectives. While data-driven applications improve countless products, training models for hyperparameter tuning is still a time-consuming and resource-intensive process. Cloud computing provides infrastructure support for training deep learning applications. Cloud service providers, such as Amazon Web Services [3], create isolated virtual environments (virtual machines and containers) for clients, who share physical resources, e.g., CPU and memory. On the cloud, resource management schemes are implemented to enable better sharing among users and boost system-wide performance. However, general scheduling approaches, such as spread-priority and balanced-resource schedulers, do not work well with deep learning workloads. In this project, we propose SpeCon, a novel container scheduler optimized for short-lived deep learning applications. Building on virtualized container platforms, such as Kubernetes [4] and Docker [5], SpeCon analyzes the common characteristics of training processes. We design a suite of algorithms to monitor the progress of training and speculatively migrate slow-growing models to release resources for fast-growing ones. Extensive experiments demonstrate that SpeCon improves the completion time of an individual job by up to 41.5%, system-wide performance by 14.8%, and makespan by 24.7%.
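The core idea in the abstract, monitoring training progress and flagging slow-growing models whose resources could be released for fast-growing ones, can be sketched as follows. This is a minimal illustration only: the function names, the loss-based growth metric, and the threshold are assumptions for this sketch, not the paper's actual algorithms.

```python
# Hedged sketch of the speculative-scheduling idea: jobs whose loss has
# stopped improving are candidates to be migrated so their resources can
# go to jobs that are still improving quickly. The metric and threshold
# below are illustrative assumptions, not SpeCon's published algorithm.

def growth_rate(loss_history, window=3):
    """Average per-epoch loss decrease over the last `window` epochs."""
    recent = loss_history[-(window + 1):]
    if len(recent) < 2:
        return 0.0
    drops = [recent[i] - recent[i + 1] for i in range(len(recent) - 1)]
    return sum(drops) / len(drops)

def pick_migration_candidates(jobs, threshold=0.01):
    """Return names of slow-growing jobs whose resources could be released."""
    return [name for name, history in jobs.items()
            if growth_rate(history) < threshold]

# Illustrative job names and loss curves (assumptions, not paper data).
jobs = {
    "cnn-fast": [2.0, 1.5, 1.1, 0.8, 0.6],            # loss still dropping quickly
    "rnn-slow": [1.2, 1.19, 1.185, 1.184, 1.1835],    # training has plateaued
}
candidates = pick_migration_candidates(jobs)  # → ["rnn-slow"]
```

In a real cluster, a scheduler plugin or controller would periodically collect such progress signals from running containers and trigger migration through the orchestrator, rather than operating on in-memory dictionaries.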
