DSpace@MIT
A Reconfigurable, Distributed-Memory Accelerator for Sparse Applications

Author(s)
Golden, Courtney K.
Advisor
Sanchez, Daniel
Emer, Joel S.
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Iterative sparse matrix computations lie at the heart of many scientific computing and graph analytics algorithms. On conventional systems, their irregular memory accesses and low arithmetic intensity create challenging memory bandwidth bottlenecks. To overcome such bottlenecks, distributed-SRAM architectures use tiled arrays of high-bandwidth local storage to achieve very high aggregate memory bandwidth. However, current distributed-SRAM architectures suffer from either poor programmability due to over-specialization or poor compute performance due to inefficient general-purpose hardware. This thesis proposes Quartz, a new architecture that uses short dataflow tasks and reconfigurable compute in a distributed-SRAM system to deliver both high performance and high programmability. Unlike traditional sparse CGRAs or on-die reconfigurable engines, Quartz allows reconfigurable compute to be highly utilized and scaled by (1) providing high memory bandwidth to each processing element and (2) introducing a task-level dataflow execution model that fits this new setting. Our execution model dynamically reconfigures tile hardware based on inter-tile messages to execute tasks on local data with fine-grained data partitioning across tiles. To make execution efficient, we explore novel data partitioning techniques that use graph and hypergraph partitioning to minimize network traffic and balance load. This is especially challenging for computations where one operand’s sparsity pattern (i.e., distribution of nonzeros) exhibits dynamic behavior across iterations, and we are the first to provide techniques to address this case. To ensure programmability, we show how a wide range of computations (expressed in an extended version of tensor algebra’s Einsum notation) and flexible data distributions can be systematically captured in small tasks for execution on Quartz. We evaluate Quartz in simulation, using an 8-chiplet design with 2,048 tiles and 824 MB of SRAM per chiplet, running six different iterative sparse applications from scientific computing and graph analytics. Quartz’s architecture, data partitioning techniques, and programming model together achieve a gmean 26.2× speedup over the prior state-of-the-art programmable distributed-SRAM architecture.
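To make the abstract's ideas concrete, the following minimal sketch (hypothetical, not Quartz's actual notation, partitioner, or API) shows an iterative sparse matrix-vector computation written as an Einsum ("ij,j->i") and broken into short per-block tasks. The function names and the contiguous row-block split are illustrative placeholders; Quartz instead derives partitions via graph/hypergraph partitioning and executes tasks on distributed SRAM tiles.

```python
# Hypothetical sketch: iterative SpMV as an Einsum over partitioned row blocks.
import numpy as np
import scipy.sparse as sp

def partition_rows(n, num_tiles):
    """Toy static partitioning into contiguous row blocks (a placeholder for
    the graph/hypergraph partitioning Quartz uses to balance load and
    minimize inter-tile network traffic)."""
    bounds = np.linspace(0, n, num_tiles + 1, dtype=int)
    return [(bounds[t], bounds[t + 1]) for t in range(num_tiles)]

def iterate(A, x, num_tiles=4, steps=10):
    """Power-iteration-style loop; each 'task' computes y[i] = sum_j A[i,j] x[j]
    (the Einsum 'ij,j->i') on its local row block."""
    blocks = partition_rows(A.shape[0], num_tiles)
    for _ in range(steps):
        y = np.zeros_like(x)
        for lo, hi in blocks:
            # One task: partial product over local rows [lo, hi).
            y[lo:hi] = np.einsum("ij,j->i", A[lo:hi, :].toarray(), x)
        x = y / np.linalg.norm(y)  # normalize, as in power iteration
    return x

if __name__ == "__main__":
    A = sp.random(64, 64, density=0.05, format="csr", random_state=0)
    print(iterate(A, np.ones(64))[:5])
```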
Date issued
2025-05
URI
https://hdl.handle.net/1721.1/163707
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses

Content created by the MIT Libraries, CC BY-NC unless otherwise noted.