
Optimizing Graph Neural Network Training on Large Graphs in A Distributed Setting

Author(s)
Murzynowski, Philip
Download
Thesis PDF (4.285 MB)
Advisor
Leiserson, Charles E.
Kaler, Tim
Iliopoulos, Alexandros-Stavros
Terms of use
In Copyright - Educational Use Permitted. Copyright MIT. http://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Graph neural networks (GNNs) are an important class of methods for leveraging the information present in graph structures to perform various learning tasks. Distributed GNNs can improve the performance of GNN execution by dividing computation among multiple machines, and they can scale to large graphs by partitioning graph features and the graph structure. Although distributed GNNs are able to achieve self-relative speedup, they are often slower than well-optimized code running on a single machine. For example, evaluation of the prevalent Distributed DGL system on graphs in the Open Graph Benchmark shows that Distributed DGL can achieve a speedup of over 2× when moving from one to four nodes, but execution of Distributed DGL on 4 nodes is 2× slower than a well-optimized GNN system, such as the SALIENT system, on a single machine. In my thesis, I argue that it is possible for a distributed GNN system to be both fast and scalable. Specifically, I show that it is possible to match the performance of well-optimized, non-distributed codes for GNN training and also achieve good scalability when running in the distributed setting. I present a system called Distributed SALIENT and motivate its design through profiling and identifying bottlenecks that arise in the distributed setting. Key components of Distributed SALIENT include the use of well-optimized code for local computations, pipelining of inter-machine communication, and a careful trade-off between data partitioning and partial replication. I evaluate Distributed SALIENT on the Open Graph Benchmark (OGB) and show that it achieves good speedup compared to SALIENT’s well-optimized single-node code while using replication factors of only roughly 5%. In fact, in experiments training a 3-layer GraphSAGE model on the large OGB papers100M dataset, Distributed SALIENT on 8 nodes is 8.6× faster than SALIENT on 1 node.
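
The communication-pipelining component mentioned above can be illustrated with a minimal Python sketch. This is not code from the thesis: fetch_remote_features and train_step are hypothetical placeholders for the inter-machine communication and local-compute stages, and a bounded queue fed by a background thread is just one simple way to overlap the two so that feature transfers for upcoming mini-batches hide behind computation on the current one.

import queue
import threading

def pipelined_training(batches, fetch_remote_features, train_step, depth=2):
    # Bounded queue limits how many fetches may be in flight at once.
    ready = queue.Queue(maxsize=depth)

    def prefetch():
        # Communication stage: gather remote features for upcoming batches
        # while the main thread is still training on earlier ones.
        for batch in batches:
            ready.put((batch, fetch_remote_features(batch)))
        ready.put(None)  # sentinel: no more batches

    threading.Thread(target=prefetch, daemon=True).start()
    while (item := ready.get()) is not None:
        batch, features = item
        train_step(batch, features)  # compute overlaps the next fetch

if __name__ == "__main__":
    # Toy stand-ins: lambdas in place of real network I/O and training.
    pipelined_training(
        range(4),
        fetch_remote_features=lambda b: f"features[{b}]",
        train_step=lambda b, f: print("trained batch", b, "with", f),
    )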
Date issued
2022-09
URI
https://hdl.handle.net/1721.1/164487
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
