A Formalism of DNN Accelerator Flexibility

June 06, 2022

Architecture, DNN Accelerator, Mapping/Dataflow, Design Space Exploration, SIGMETRICS'22, Atlanta, GA


The high efficiency of domain-specific hardware accelerators for machine learning (ML) has come from specialization, with the trade-off of less configurability/flexibility. There is growing interest in developing flexible ML accelerators to make them future-proof against the rapid evolution of Deep Neural Networks (DNNs). However, the notion of accelerator flexibility has always been used informally, which prevents computer architects from conducting systematic apples-to-apples design-space exploration (DSE) across trillions of choices. In this work, we formally define accelerator flexibility and show how it can be integrated into DSE. Specifically, we capture DNN accelerator flexibility across four axes: tiling, ordering, parallelization, and array shape. We categorize existing accelerators into 16 classes based on the axes of flexibility they support, and define a precise quantification of the degree of flexibility of an accelerator along each axis. We leverage these to develop a novel flexibility-aware DSE framework. We demonstrate how it can be used to perform first-of-their-kind evaluations, including an isolation study that identifies the individual impact of each flexibility axis. We demonstrate that adding flexibility features to a hypothetical DNN accelerator designed in 2014 improves runtime on future (i.e., present-day) DNNs by 11.8x geomean.
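The count of 16 classes is consistent with each of the four axes being either supported or not (2^4 = 16). The following is a minimal sketch under that assumption; the function name and the tuple representation are illustrative and are not the paper's own taxonomy labels.

```python
from itertools import product

# The four flexibility axes named in the abstract.
AXES = ("tiling", "ordering", "parallelization", "array shape")


def flexibility_classes(axes=AXES):
    """Enumerate every combination of axes an accelerator could support.

    Assumption: support along each axis is treated as a yes/no property,
    so a class is simply the subset of axes that are flexible.
    """
    classes = []
    for flags in product((False, True), repeat=len(axes)):
        classes.append(tuple(a for a, on in zip(axes, flags) if on))
    return classes


classes = flexibility_classes()
print(len(classes))  # 16
```

The empty tuple corresponds to a fully rigid accelerator, and the tuple containing all four axes to a fully flexible one.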

MAGMA: An Optimization Framework for Mapping Multiple DNNs on Multiple Accelerator Cores

April 28, 2022

Heterogeneous, Scheduling, Multi-core, Multi-tenancy, Genetic Algorithm, Performance modeling, HPCA'22, Atlanta, GA


As Deep Learning continues to drive a variety of applications in edge and cloud data centers, there is a growing trend towards building large accelerators with several sub-accelerator cores/chiplets. This work looks at the problem of supporting multi-tenancy on such accelerators. In particular, we focus on the problem of mapping jobs from several DNNs simultaneously on an accelerator. Given the extremely large search space, we formulate the search as an optimization problem and develop an optimization framework called M3E. In addition, we develop a specialized optimization algorithm called MAGMA with custom operators to enable structured, sample-efficient exploration. We quantitatively compare MAGMA with several state-of-the-art methods, including black-box optimization and reinforcement learning methods, across different accelerator settings (large/small accelerators) and different sub-accelerator configurations (homogeneous/heterogeneous), and observe that MAGMA consistently finds better mappings.
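To make the search problem concrete, here is a minimal sketch of how one candidate mapping could be scored. It assumes a simplified encoding in which each job is assigned to exactly one core and cores run their jobs back to back; the `makespan` function and the latency table are hypothetical and are not MAGMA's actual encoding or cost model.

```python
def makespan(assignment, latency):
    """Return the finish time of the busiest core.

    assignment[j] is the core index chosen for job j.
    latency[j][c] is the (hypothetical) runtime of job j on core c.
    """
    num_cores = len(latency[0])
    busy = [0.0] * num_cores
    for job, core in enumerate(assignment):
        busy[core] += latency[job][core]
    return max(busy)


# Hypothetical runtimes for 3 jobs on 2 heterogeneous cores.
latency = [[4.0, 2.0], [3.0, 6.0], [1.0, 1.0]]

print(makespan([0, 0, 0], latency))  # 8.0: everything on core 0
print(makespan([1, 0, 0], latency))  # 4.0: job 0 moved to core 1
```

With J jobs and C cores this encoding alone yields C^J candidates, which is one reason an optimizer is needed rather than enumeration.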

DNNFuser: Generative Pre-Trained Transformer as a Generalized Mapper for Layer Fusion in DNN Accelerators

January 02, 2022

Transformers, RL, GPT, Teacher-Student model, Knowledge Generalization, Performance Modeling, PyTorch, arXiv'22, Atlanta, GA


Dataflow/mapping decides the compute and energy efficiency of DNN accelerators. Many mappers have been proposed to tackle the intra-layer map-space. However, mappers for the inter-layer map-space (aka layer-fusion map-space) have rarely been discussed. In this work, we propose a mapper, DNNFuser, that focuses specifically on this layer-fusion map-space. While existing SOTA DNN mapping explorations rely on search-based mappers, this is the first work, to the best of our knowledge, to propose a one-shot inference-based mapper. We use a Transformer as our DNN architecture to learn layer-fusion optimization as a sequence modeling problem. Further, the trained DNNFuser can generalize its knowledge and infer new solutions for unseen conditions. Within one inference pass, DNNFuser can infer solutions with performance comparable to those found by a highly optimized search-based mapper while being 66x-127x faster.

DiGamma: Domain-aware Genetic Algorithm for HW-Mapping Co-optimization for DNN Accelerators

January 02, 2022

Hardware-Mapping Co-optimization, Domain-specific optimization, Genetic Algorithm, DATE'22, Atlanta, GA


The design of DNN accelerators includes two key parts: HW resource configuration and mapping strategy. Intensive research has been conducted to optimize each of them independently. Unfortunately, optimizing both together is extremely challenging because of the very large cross-coupled search space. To address this, we propose a HW-Mapping co-optimization framework, an efficient encoding of the immense design space constructed by HW and Mapping, and a domain-aware genetic algorithm, named DiGamma, with specialized operators for improving search efficiency. We evaluate DiGamma with seven popular DNN models with different properties. Our evaluations show that DiGamma achieves (geomean) 3.0x and 10.0x speedup, compared to the best-performing baseline optimization algorithms, in edge and cloud settings.

FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks

September 02, 2021

Transformer, Long-sequence Attention, NLP, Mapping/Dataflow, Performance modeling, Roofline Analysis, arXiv'21, Atlanta, GA


Attention mechanisms, primarily designed to capture pairwise correlations between words, have become the backbone of machine learning, expanding beyond natural language processing into other domains. This growth in adoption comes at the cost of prohibitively large memory requirements and computational complexity, especially at higher numbers of input elements. This limitation is due to inherently limited data reuse opportunities and quadratic growth in memory footprint, leading to severe memory-boundedness and limited scalability in the number of input elements. This work addresses these challenges by devising a tailored dataflow optimization, called FLAT, for attention mechanisms without altering their functionality. This dataflow processes costly attention operations through a unique fusion mechanism, reducing the quadratic growth of the memory footprint to a linear one. To realize the full potential of this bespoke mechanism, we propose a tiling approach that enhances data reuse across attention operations. Our method both mitigates the off-chip bandwidth bottleneck and reduces the on-chip memory requirement. Across a diverse range of models, FLAT delivers 1.94x (1.76x) speedup and 49% (42%) energy savings compared to state-of-the-art edge (cloud) accelerators with no customized dataflow optimization. Our evaluations demonstrate that state-of-the-art DNN dataflows applied to attention operations reach their efficiency limit for inputs above 512 elements. In contrast, FLAT unblocks transformer models for inputs with up to 64 K elements in edge and cloud accelerators.
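The quadratic-versus-linear claim can be illustrated with simple element counts. This is a rough sketch that assumes the intermediate attention matrix is the dominant term and that a fused dataflow keeps only a fixed number of its rows live at a time; the `row_tile` parameter and both function names are hypothetical and do not describe FLAT's actual tiling scheme.

```python
def logit_elems_unfused(seq_len: int) -> int:
    """Elements in the full N x N intermediate attention matrix."""
    return seq_len * seq_len


def logit_elems_fused(seq_len: int, row_tile: int) -> int:
    """Elements live at once if only `row_tile` rows are materialized.

    Assumption: the rows are consumed by the next operator before the
    following tile is produced, so the footprint is row_tile x N.
    """
    return min(row_tile, seq_len) * seq_len


for n in (512, 4096, 65536):
    print(n, logit_elems_unfused(n), logit_elems_fused(n, row_tile=64))
```

Doubling the sequence length quadruples the unfused count but only doubles the fused one, which is the linear growth the abstract refers to.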

FRAME: Fast Roofline Analytical Modeling and Estimation

August 20, 2021

Performance modeling, Architecture, System modeling, DNN Workloads, Roofline Analysis, Open-source, Atlanta, GA


FRAME is a roofline cost model for DNN accelerators. It supports CNN, MLP, and Transformer workloads. What it does:

  • Takes DNN accelerator system information via the System class in src/, where you can specify the PE array shape (mxu_shape), on-chip BWs, off-chip BWs, etc.
  • Given a DNN workload (e.g., model='vgg16'), generates a table of layer-wise latency and memory usage information as well as a roofline figure.
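For readers unfamiliar with roofline modeling, the core estimate can be sketched in a few lines. This is a generic illustration of the roofline idea, not FRAME's API: the function name, its parameters, and the example numbers are all assumptions made for this sketch.

```python
def roofline_latency(flops, bytes_moved, peak_flops_per_s, bandwidth_bytes_per_s):
    """Estimate layer latency as the slower of compute and memory time.

    Returns (latency_in_seconds, bound), where bound is "compute" or
    "memory" depending on which term dominates.
    """
    compute_time = flops / peak_flops_per_s
    memory_time = bytes_moved / bandwidth_bytes_per_s
    bound = "compute" if compute_time >= memory_time else "memory"
    return max(compute_time, memory_time), bound


# Hypothetical layer: 2 GFLOPs of work and 1 MB of traffic on a
# 1 TFLOP/s, 1 GB/s system.
latency, bound = roofline_latency(2e9, 1e6, 1e12, 1e9)
print(latency, bound)
```

A layer is compute-bound when its operational intensity (FLOPs per byte) exceeds the ratio of peak compute to bandwidth, and memory-bound otherwise.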

E3: A HW/SW Co-design Neuroevolution Platform for Autonomous Learning in Edge Device

May 02, 2021

FPGA, Algorithm-HW co-design, Evolution Strategy, Neural Architecture Search, Edge ML, ISPASS'21, Atlanta, GA


The true potential of AI can be realized once we move beyond supervised training using labelled datasets on the cloud to autonomous learning on edge devices. While techniques like Reinforcement Learning are promising for their autonomous learning ability, they exhibit high compute and memory requirements due to gradient computations, making them prohibitive for edge deployment. In this paper, we propose E3, a HW/SW co-designed edge learning system on an FPGA. E3 uses a gradient-free approach called neuro-evolution (NE) to evolve the neural network (NN) topology and weights dynamically. The NNs evolved using NE are highly irregular, and a population of such NNs needs to be evaluated quickly for the NE algorithm to make progress. To address this, we develop INAX, a specialized accelerator inside E3 for efficient irregular network computation. INAX leverages multiple avenues of parallelism both within and across the evolved NNs. E3 achieves an average 30× speedup over a CPU-based solution across a suite of OpenAI environments.

GAMMA: Automating the HW Mapping of DNN Models on Accelerators via Genetic Algorithm

July 02, 2020

Genetic Algorithm, Blackbox Optimization, ML, Mapping/Dataflow, Architecture, Design Space Exploration, Open-source release, ICCAD'20, Atlanta, GA


DNN layers are multi-dimensional loops that can be ordered, tiled, and scheduled in myriad ways across space and time on DNN accelerators. Each of these choices is called a mapping. It has been shown that the mapping plays a crucial role in overall performance and efficiency, as it directly determines the amount of reuse that the accelerator can leverage from the DNN. Moreover, instead of using a fixed mapping for every DNN layer, research has revealed the benefit of optimizing per-layer mappings. However, determining the right mapping, given an accelerator and a layer, is still an open question. The immense space of mappings (or map-space) makes brute-force exhaustive search impractical. In this paper, we propose a domain-specific genetic algorithm-based method, GAMMA, which is specially designed for this HW-mapping problem. In contrast to prior works that either target simple rigid accelerators with a limited map-space or choose from a restricted set of mappings, we construct an extremely flexible map-space and show that GAMMA can explore the space and determine an optimized mapping with high sample efficiency. We quantitatively compare GAMMA with many popular optimization methods and observe that GAMMA consistently finds better solutions.

ConfuciuX: Autonomous Hardware Resource Assignment for DNN Accelerators using Reinforcement Learning

May 01, 2020

RL, ML, Architecture, Bayesian Optimization, Simulated Annealing, Genetic Algorithm, PyTorch, Tensorflow, Open-source release, MICRO'20, Atlanta, GA


DNN accelerators provide efficiency by leveraging reuse of activations/weights/outputs during the DNN computations to reduce data movement from DRAM to the chip. The reuse is captured by the accelerator’s dataflow. While there has been significant prior work in exploring and comparing various dataflows, the strategy for assigning on-chip hardware resources (i.e., compute and memory) given a dataflow that can optimize for performance/energy while meeting platform constraints of area/power for DNN(s) of interest is still relatively unexplored. The design space of choices for balancing compute and memory explodes combinatorially, as we show in this work (e.g., as large as O(10^72) choices for running MobileNet), making manual tuning via exhaustive search infeasible. It is also difficult to come up with a specific heuristic given that different DNNs and layer types exhibit different amounts of reuse. In this paper, we propose an autonomous strategy called ConfuciuX to find optimized HW resource assignments for a given model and dataflow style. ConfuciuX leverages a reinforcement learning method, REINFORCE, to guide the search process, using a detailed HW performance cost model within the training loop to estimate rewards. We also augment the RL approach with a genetic algorithm for further fine-tuning. ConfuciuX demonstrates the highest sample efficiency for training compared to other techniques such as Bayesian optimization, genetic algorithm, simulated annealing, and other RL methods. It converges to the optimized hardware configuration 4.7 to 24 times faster than alternate techniques.