论文标题
具有冗余的分布式系统中的多样性/并行性权衡
Diversity/Parallelism Trade-off in Distributed Systems with Redundancy
论文作者
论文摘要
随着许多机器学习和其他算法的复杂性和数据需求的增加,分布式计算对于满足不断增长的计算和存储需求的必要条件是必要的,因为它可以平行执行构成大型计算工作的较小任务。但是,任务服务时间的随机波动会导致长时间执行时间的散布任务。冗余,以任务复制和擦除编码的形式提供了多样性,可以在执行冗余任务的子集时完成工作,从而消除对散乱任务的依赖性。在约束资源的情况下(这里有固定数量的并行服务器),增加冗余可减少并行性的可用资源。在本文中,我们表征了多样性与并行性权衡的折衷,并确定了在复制,编码和分裂中的最佳策略,从而最大程度地减少了预期的工作完成时间。我们考虑三个通用的服务时间分布,并建立三个模型,以描述这些分布的比例用任务大小。我们发现,具有不同缩放模型的不同分布在不同级别的冗余水平上最佳运行,因此可能需要非常不同的代码速率。
As numerous machine learning and other algorithms increase in complexity and data requirements, distributed computing becomes necessary to satisfy the growing computational and storage demands, because it enables parallel execution of smaller tasks that make up a large computing job. However, random fluctuations in task service times lead to straggling tasks with long execution times. Redundancy, in the form of task replication and erasure coding, provides diversity that allows a job to be completed when only a subset of redundant tasks is executed, thus removing the dependency on the straggling tasks. In situations of constrained resources (here a fixed number of parallel servers), increasing redundancy reduces the available resources for parallelism. In this paper, we characterize the diversity vs. parallelism trade-off and identify the optimal strategy, among replication, coding and splitting, which minimizes the expected job completion time. We consider three common service time distributions and establish three models that describe scaling of these distributions with the task size. We find that different distributions with different scaling models operate optimally at different levels of redundancy, and thus may require very different code rates.