Task Scheduling Techniques to Accelerate RTL Simulation

Author(s)

Sheikhha, Shabnam

Advisor

Sanchez, Daniel

Terms of use

In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/

Abstract

Fast simulation of digital circuits is crucial to building modern chips. Slow simulation lengthens chip design time and makes bugs more frequent. While simulation can happen at different levels of abstraction, Register-Transfer-Level (RTL) simulation is the usual bottleneck in chip design, as it is needed for ongoing debugging and evaluation. Current simulators scale poorly across CPU cores because they cannot exploit the fine-grained parallelism inherent in simulation workloads. We present ASH, a parallel architecture tailored to simulation workloads. ASH consists of a tightly codesigned hardware architecture and compiler for RTL simulation, and it exploits two key opportunities. First, it performs dataflow execution of small tasks to leverage the fine-grained parallelism in simulation workloads. Dataflow execution exposes abundant parallelism, as each task can run as soon as its inputs are available. Second, it performs selective event-driven execution to run only the fraction of the design exercised each cycle, skipping ineffectual tasks. Selective execution introduces dynamic data dependences, since skipped tasks do not communicate data; ASH employs speculative execution to handle these dependences. ASH's hardware provides a novel combination of dataflow and speculative execution, and ASH's compiler features several new techniques to leverage this hardware automatically. The key compiler techniques are a partitioning scheme that minimizes data communication while maintaining load balance, and a strategic coarsening mechanism that reduces the overheads of fine-grained tasks. We evaluate ASH in simulation using large Verilog designs. With 256 simple cores, ASH is 1,485× faster (geometric mean) than single-core Verilator, and it is 32× faster than Verilator on a server CPU with 32 complex cores while using 3× less area.
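As an illustration only, the two execution ideas named in the abstract can be sketched in a few lines of Python. This is a sequential software model, not ASH's hardware or compiler: the function name `simulate_cycle`, the example task graph, and the signal names are all hypothetical. The sketch also sidesteps the dynamic dependences the abstract mentions by letting a skipped task still notify its consumers, whereas the abstract states that ASH handles them with speculative execution.

```python
from collections import deque

def simulate_cycle(tasks, deps, values, changed_inputs):
    """Run one simulated cycle over a task graph (hypothetical model).

    tasks:          task name -> function(values) returning the task's output
    deps:           task name -> names it reads (other tasks or primary inputs)
    values:         signal name -> current value (updated in place)
    changed_inputs: primary inputs whose values changed this cycle
    Returns the list of tasks that actually executed.
    """
    # Dataflow bookkeeping: count the producer tasks each task still waits on.
    pending = {t: sum(1 for d in deps[t] if d in tasks) for t in tasks}
    consumers = {t: [] for t in tasks}
    for t in tasks:
        for d in deps[t]:
            if d in tasks:
                consumers[d].append(t)

    changed = set(changed_inputs)
    ready = deque(t for t in tasks if pending[t] == 0)
    executed = []
    while ready:
        t = ready.popleft()
        # Selective execution: run the task only if one of its inputs changed.
        if any(d in changed for d in deps[t]):
            new_value = tasks[t](values)
            executed.append(t)
            if new_value != values.get(t):
                values[t] = new_value
                changed.add(t)
        # A task becomes ready once all of its producers have been handled.
        # Assumption: skipped tasks still notify consumers, unlike in ASH.
        for c in consumers[t]:
            pending[c] -= 1
            if pending[c] == 0:
                ready.append(c)
    return executed

# Hypothetical four-task design: out = (a + b) * 2 + (c + 1)
tasks = {
    "s": lambda v: v["a"] + v["b"],
    "p": lambda v: v["s"] * 2,
    "q": lambda v: v["c"] + 1,
    "out": lambda v: v["p"] + v["q"],
}
deps = {"s": ["a", "b"], "p": ["s"], "q": ["c"], "out": ["p", "q"]}

values = {"a": 5, "b": 2, "c": 3, "s": 3, "p": 6, "q": 4, "out": 10}
ran = simulate_cycle(tasks, deps, values, changed_inputs={"a"})
print(ran, values["out"])
```

In this example only input `a` changes, so task `q` (which reads only `c`) is skipped while `s`, `p`, and `out` are recomputed. A real parallel implementation would dispatch every task in the ready queue concurrently instead of popping them one at a time.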

Date issued

2023-06

URI

https://hdl.handle.net/1721.1/164507

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Graduate Theses