| dc.description.abstract | Graph representation learning has gained significant traction in critical domains including finance, social networks, and transportation systems due to its successful application to graph-structured data. Graph neural networks (GNNs), which combine the power of deep learning with graph structures, have emerged as the leading methods in this field, delivering superior performance across diverse graph-related tasks. However, training GNNs on large-scale datasets faces scalability challenges on current system architectures. First, the sparse, non-localized structure of real-world graphs leads to inefficiencies in data sampling and movement. This characteristic heavily stresses system input/output (I/O), particularly burdening the peripheral buses during the sampling phase of GNN training. Second, the suboptimal mapping of the training procedure to GPU kernels causes compute inefficiencies, including substantial kernel orchestration overhead and redundant computation. Addressing these challenges requires a comprehensive, full-stack optimization approach that fully exploits hardware capabilities. This thesis presents two complementary works toward that goal. The first work, Hanoi, removes the data-loading bottleneck in out-of-core GNN training by co-designing the sampling algorithms to align with the hierarchical memory organization of commodity hardware. Hanoi drastically reduces I/O traffic to external storage, delivering up to 4.2× speedup over strong baselines with negligible impact on model quality. Notably, Hanoi achieves performance competitive with in-memory training while requiring only a fraction of the memory. Building on this foundation, the second work, Joestar, introduces a unified framework for optimized GNN training on GPUs. Joestar adapts the multi-stage sampling approach from Hanoi to in-memory training, freeing CPUs from heavy data-loading workloads. Joestar also identifies novel kernel fusion opportunities and formulates better execution schedules by jointly considering the sampling and compute stages. Combined with the compiler infrastructure in PyTorch, Joestar achieves state-of-the-art GNN training throughput for billion-edge graph datasets on a single GPU. | |