Paper Title

DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs

Paper Authors

Da Zheng, Chao Ma, Minjie Wang, Jinjing Zhou, Qidong Su, Xiang Song, Quan Gan, Zheng Zhang, George Karypis

Paper Abstract

Graph neural networks (GNN) have shown great success in learning from graph-structured data. They are widely used in various applications, such as recommendation, fraud detection, and search. In these domains, the graphs are typically large, containing hundreds of millions of nodes and several billions of edges. To tackle this challenge, we develop DistDGL, a system for training GNNs in a mini-batch fashion on a cluster of machines. DistDGL is based on the Deep Graph Library (DGL), a popular GNN development framework. DistDGL distributes the graph and its associated data (initial features and embeddings) across the machines and uses this distribution to derive a computational decomposition by following an owner-compute rule. DistDGL follows a synchronous training approach and allows ego-networks forming the mini-batches to include non-local nodes. To minimize the overheads associated with distributed computations, DistDGL uses a high-quality and light-weight min-cut graph partitioning algorithm along with multiple balancing constraints. This allows it to reduce communication overheads and statically balance the computations. It further reduces the communication by replicating halo nodes and by using sparse embedding updates. The combination of these design choices allows DistDGL to train high-quality models while achieving high parallel efficiency and memory scalability. We demonstrate our optimizations on both inductive and transductive GNN models. Our results show that DistDGL achieves linear speedup without compromising model accuracy and requires only 13 seconds to complete a training epoch for a graph with 100 million nodes and 3 billion edges on a cluster with 16 machines. DistDGL is now publicly available as part of DGL: https://github.com/dmlc/dgl/tree/master/python/dgl/distributed.
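
To make the workflow described in the abstract concrete, below is a minimal sketch of distributed mini-batch training using the `dgl.distributed` module linked above. This is not code from the paper: the graph name `my_graph`, file paths, fan-outs, class count, and hyperparameters are placeholders, and class names such as `DistNodeDataLoader` follow recent DGL releases and may differ across versions.

```python
import dgl
import dgl.nn as dglnn
import torch
import torch.nn as nn
import torch.nn.functional as F

# Offline step (run once, on the full graph `g`): METIS-based min-cut partitioning
# into 16 parts, balancing training nodes and edges across partitions.
# dgl.distributed.partition_graph(
#     g, graph_name='my_graph', num_parts=16, out_path='parts',
#     balance_ntypes=g.ndata['train_mask'], balance_edges=True)

class SAGE(nn.Module):
    """A small two-layer GraphSAGE model used only for illustration."""
    def __init__(self, in_feats, n_hidden, n_classes):
        super().__init__()
        self.layers = nn.ModuleList([
            dglnn.SAGEConv(in_feats, n_hidden, 'mean'),
            dglnn.SAGEConv(n_hidden, n_classes, 'mean'),
        ])

    def forward(self, blocks, x):
        for i, (layer, block) in enumerate(zip(self.layers, blocks)):
            x = layer(block, x)
            if i != len(self.layers) - 1:
                x = F.relu(x)
        return x

# Per-trainer script (one process per trainer, normally started by DGL's launch
# tool, which also sets the environment variables expected by torch.distributed).
dgl.distributed.initialize(ip_config='ip_config.txt')
torch.distributed.init_process_group(backend='gloo')

g = dgl.distributed.DistGraph('my_graph', part_config='parts/my_graph.json')

# Split the training node IDs so each trainer mostly draws from its local partition.
train_nids = dgl.distributed.node_split(g.ndata['train_mask'])

# Neighbor sampling builds the ego-network of each seed node; sampled neighbors
# may be non-local, in which case their features are fetched remotely.
sampler = dgl.dataloading.MultiLayerNeighborSampler([25, 10])
dataloader = dgl.dataloading.DistNodeDataLoader(
    g, train_nids, sampler, batch_size=1024, shuffle=True, drop_last=False)

# n_classes=47 is a placeholder; set it to the label count of the actual dataset.
model = SAGE(in_feats=g.ndata['feat'].shape[1], n_hidden=256, n_classes=47)
model = nn.parallel.DistributedDataParallel(model)
optimizer = torch.optim.Adam(model.parameters(), lr=0.003)

for input_nodes, output_nodes, blocks in dataloader:
    batch_feats = g.ndata['feat'][input_nodes]      # pulled from the distributed KVStore
    batch_labels = g.ndata['label'][output_nodes].long()
    logits = model(blocks, batch_feats)
    loss = F.cross_entropy(logits, batch_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In this sketch, each trainer samples mini-batches mostly from its own partition; features of non-local (halo) nodes in the sampled ego-networks are fetched over the network, which is exactly the traffic that the min-cut partitioning and halo-node replication described in the abstract are designed to reduce.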
