Optimizing Graph Neural Network Training on Large Graphs in a Distributed Setting
Author(s)
Murzynowski, Philip
Advisor
Leiserson, Charles E.
Kaler, Tim
Iliopoulos, Alexandros-Stavros
Abstract
Graph neural networks (GNNs) are an important class of methods for leveraging the information present in graph structures to perform various learning tasks. Distributed GNN systems can improve the performance of GNN execution by dividing computation among multiple machines, and they can scale to large graphs by partitioning the graph structure and its features. Although distributed GNN systems achieve self-relative speedup, they are often slower than well-optimized code running on a single machine. For example, evaluation of the prevalent Distributed DGL system on graphs from the Open Graph Benchmark shows that Distributed DGL can achieve a speedup of over 2× when moving from one node to four, yet its execution on four nodes is still 2× slower than a well-optimized GNN system, such as the SALIENT system, running on a single machine.
In my thesis, I argue that it is possible for a distributed GNN system to be both fast and scalable. Specifically, I show that it is possible to match the performance of well-optimized, non-distributed codes for GNN training and also achieve good scalability when running in the distributed setting. I present a system called Distributed SALIENT and motivate its design through profiling and identifying bottlenecks that arise in the distributed setting. Key components of Distributed SALIENT include the use of well-optimized code for local computations, pipelining of inter-machine communication, and a careful trade-off between data partitioning and partial replication.
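The pipelining idea above can be illustrated with a minimal sketch: a background thread "fetches" remote node features for the next mini-batch while the trainer computes on the current one, so inter-machine communication overlaps with local computation. All names here (`fetch_remote_features`, `train_step`, `pipelined_training`) are illustrative stand-ins, not Distributed SALIENT's actual API.

```python
import queue
import threading

def fetch_remote_features(batch_id):
    # Stand-in for an inter-machine feature gather for one mini-batch.
    return [batch_id * 10 + i for i in range(3)]

def train_step(features):
    # Stand-in for local, well-optimized GNN computation on fetched features.
    return sum(features)

def pipelined_training(num_batches, prefetch=2):
    # Bounded queue: the producer may run at most `prefetch` batches ahead,
    # overlapping feature communication with training on earlier batches.
    q = queue.Queue(maxsize=prefetch)

    def producer():
        for b in range(num_batches):
            q.put(fetch_remote_features(b))  # runs concurrently with train_step
        q.put(None)  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while (features := q.get()) is not None:
        results.append(train_step(features))
    return results
```

The bounded queue depth plays the role of the pipeline: communication for batch b+1 proceeds while computation for batch b runs, hiding fetch latency without unbounded buffering.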
I evaluate Distributed SALIENT on the Open Graph Benchmark (OGB) and show that Distributed SALIENT achieves good speedup compared to SALIENT's well-optimized single-node code while using replication factors of only roughly 5%. In fact, in experiments training a 3-layer GraphSAGE model on the large OGB papers100M dataset, Distributed SALIENT on 8 nodes is 8.6× faster than SALIENT on 1 node.
Date issued
2022-09
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology