DSpace@MIT
A Reconfigurable, Distributed-Memory Accelerator for Sparse Applications

Author(s)
Golden, Courtney K.
Advisor
Sanchez, Daniel
Emer, Joel S.
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Iterative sparse matrix computations lie at the heart of many scientific computing and graph analytics algorithms. On conventional systems, their irregular memory accesses and low arithmetic intensity create challenging memory bandwidth bottlenecks. To overcome such bottlenecks, distributed-SRAM architectures use tiled arrays of high-bandwidth local storage to achieve very high aggregate memory bandwidth. However, current distributed-SRAM architectures suffer from either poor programmability due to over-specialization or poor compute performance due to inefficient general-purpose hardware. This thesis proposes Quartz, a new architecture that uses short dataflow tasks and reconfigurable compute in a distributed-SRAM system to deliver both high performance and high programmability. Unlike traditional sparse CGRAs or on-die reconfigurable engines, Quartz allows reconfigurable compute to be highly utilized and scaled by (1) providing high memory bandwidth to each processing element and (2) introducing a task-level dataflow execution model that fits this new setting. Our execution model dynamically reconfigures tile hardware based on inter-tile messages to execute tasks on local data with fine-grained data partitioning across tiles. To make execution efficient, we explore novel data partitioning techniques that use graph and hypergraph partitioning to minimize network traffic and balance load. This is especially challenging for computations where one operand’s sparsity pattern (i.e., distribution of nonzeros) exhibits dynamic behavior across iterations, and we are the first to provide techniques to address this case. To ensure programmability, we show how a wide range of computations (expressed in an extended version of tensor algebra’s Einsum notation) and flexible data distributions can be systematically captured in small tasks for execution on Quartz. We evaluate Quartz in simulation, using an 8-chiplet design with 2,048 tiles and 824 MB of SRAM per chiplet, running six different iterative sparse applications from scientific computing and graph analytics. Quartz’s architecture, data partitioning techniques, and programming model together achieve a gmean 26.2× speedup over the prior state-of-the-art programmable distributed-SRAM architecture.
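To make the abstract's ideas concrete, the following minimal sketch (hypothetical, not Quartz's actual notation, partitioner, or API) shows an iterative sparse matrix-vector computation written as an Einsum ("ij,j->i") and broken into short per-block tasks. The function names and the contiguous row-block split are illustrative placeholders; Quartz instead derives partitions via graph/hypergraph partitioning and executes tasks on distributed SRAM tiles.

```python
# Hypothetical sketch: iterative SpMV as an Einsum over partitioned row blocks.
import numpy as np
import scipy.sparse as sp

def partition_rows(n, num_tiles):
    """Toy static partitioning into contiguous row blocks (a placeholder for
    the graph/hypergraph partitioning Quartz uses to balance load and
    minimize inter-tile network traffic)."""
    bounds = np.linspace(0, n, num_tiles + 1, dtype=int)
    return [(bounds[t], bounds[t + 1]) for t in range(num_tiles)]

def iterate(A, x, num_tiles=4, steps=10):
    """Power-iteration-style loop; each 'task' computes y[i] = sum_j A[i,j] x[j]
    (the Einsum 'ij,j->i') on its local row block."""
    blocks = partition_rows(A.shape[0], num_tiles)
    for _ in range(steps):
        y = np.zeros_like(x)
        for lo, hi in blocks:
            # One task: partial product over local rows [lo, hi).
            y[lo:hi] = np.einsum("ij,j->i", A[lo:hi, :].toarray(), x)
        x = y / np.linalg.norm(y)  # normalize, as in power iteration
    return x

if __name__ == "__main__":
    A = sp.random(64, 64, density=0.05, format="csr", random_state=0)
    print(iterate(A, np.ones(64))[:5])
```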
Date issued
2025-05
URI
https://hdl.handle.net/1721.1/163707
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses

Content created by the MIT Libraries, CC BY-NC unless otherwise noted.