
Survey Efficiency | Full Stack Optimization of Transformer

  • Related Project: Private
  • Category: Paper Review
  • Date: 2023-12-12

Full Stack Optimization of Transformer Inference: a Survey

  • url: https://arxiv.org/abs/2302.14017
  • pdf: https://arxiv.org/pdf/2302.14017
  • abstract: Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications. This trend has been consistent over the past several years since Transformer models were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate, and this has made their deployment in latency-sensitive applications challenging. As such, there has been an increased focus on making Transformer models more efficient, with methods that range from changing the architecture design, all the way to developing dedicated domain-specific accelerators. In this work, we survey different approaches for efficient Transformer inference, including: (i) analysis and profiling of the bottlenecks in existing Transformer architectures and their similarities and differences with previous convolutional models; (ii) implications of Transformer architecture on hardware, including the impact of non-linear operations such as Layer Normalization, Softmax, and GELU, as well as linear operations, on hardware design; (iii) approaches for optimizing a fixed Transformer architecture; (iv) challenges in finding the right mapping and scheduling of operations for Transformer models; and (v) approaches for optimizing Transformer models by adapting the architecture using neural architecture search. Finally, we perform a case study by applying the surveyed optimizations on Gemini, the open-source, full-stack DNN accelerator generator, and we show how each of these approaches can yield improvements, compared to previous benchmark results on Gemini. Among other things, we find that a full-stack co-design approach with the aforementioned methods can result in up to 88.7x speedup with a minimal performance degradation for Transformer inference.


TL;DR


Hardware-software optimization methods for efficient Transformer inference

  1. Efficient computation of deep learning models and deployment of compute- and memory-intensive workloads on edge devices.
  2. Analysis of the runtime characteristics of Transformer models and a survey of optimization methods.
  3. Hardware-software co-optimization demonstrated through a Gemini-based case study.

1. Introduction

Deep learning models now involve billions of parameters and billions of Multiply-Accumulate (MAC) operations. These workloads are increasingly deployed on resource-constrained edge devices, where they must also meet real-time latency constraints. CPUs and GPUs are widely used as flexible computing platforms, but they are inefficient at exploiting the specific computation patterns of deep learning models. For this reason, dedicated hardware accelerators for deep learning have been developed.

With the recent rise in popularity of Transformers and large language models, hardware design and optimization for efficient inference of these models has become increasingly important. The paper analyzes the runtime characteristics of Transformer models, reviews a range of optimization methods, and presents the results of applying them in a Gemini-based case study.


2. Transformer Model Architecture and Performance Bottlenecks

2.1 Overview of the Transformer Model

A Transformer is built from Multi-Head Attention (MHA) modules and Feed-Forward Network (FFN) modules. The input sequence is processed as a matrix in which each token is represented as a vector. The MHA module computes attention scores from query, key, and value matrices, and the FFN module transforms the input vectors through two linear layers.

2.2 Model Analysis

2.2.1 Workload Analysis

Bottlenecks are identified by analyzing the floating-point operations (FLOPs) and memory operations (MOPs) of Transformer models as a function of sequence length, which are combined into the arithmetic intensity:

\[\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{MOPs}}\]

This metric is used to analyze the computational behavior of BERT-Base, BERT-Large, and GPT-2 across different sequence lengths.
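As a rough illustration of how these quantities relate, here is a minimal Python sketch that counts FLOPs and MOPs for a single matmul, assuming ideal reuse, 8-bit operands, and BERT-Base-like dimensions (the exact per-layer numbers in the paper's tables differ):

```python
def matmul_stats(m, n, k, bytes_per_elem=1):
    """FLOPs, MOPs, and arithmetic intensity of an (m x n) @ (n x k) matmul.

    Multiplies and adds are counted separately (2 FLOPs per MAC), MOPs assume
    each operand and the output are moved exactly once (ideal reuse), and
    8-bit operands mean one byte per element.
    """
    flops = 2 * m * n * k
    mops = bytes_per_elem * (m * n + n * k + m * k)
    return flops, mops, flops / mops

# BERT-Base-like dimensions: d = 768, h = 12 heads.
d, h = 768, 12
for l in (128, 512, 4096):
    _, _, ai_proj = matmul_stats(l, d, d)        # e.g. the W_Q projection
    _, _, ai_attn = matmul_stats(l, d // h, l)   # query x key for one head
    print(f"l={l}: projection AI ~{ai_proj:.0f}, act-to-act AI ~{ai_attn:.0f}")
```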

2.2.2 Profiling

Profiling Transformer models on an Intel CPU reveals where the bottlenecks lie: for short sequences the projection layers account for most of the computation, whereas for long sequences the act-to-act matmuls become the dominant bottleneck. Profiling is also used to evaluate the overall latency of each model.


3. Hardware Design

3.1 Overview of Typical DNN Accelerators

A typical DNN accelerator consists of off-chip DRAM, an on-chip global buffer, a PE array, and a NoC. Spatial and temporal dataflows are adopted to maximize data reuse.

3.2 Adapting DNN Accelerators for Transformers

Transformer models have different memory hierarchy and bandwidth requirements than CNN models, so accelerators must be redesigned accordingly. In particular, computing the nonlinear functions requires dedicated hardware support.

3.3 Analytical Modeling

Analytical modeling is used to predict the runtime characteristics of Transformer inference. A Gemini-based accelerator is modeled to compute the latency and non-ideal arithmetic intensity of BERT-Base and GPT-2 as a function of sequence length.

\[\text{Non-ideal Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{MOPs (Non-ideal)}}\]

This yields performance predictions that reflect the constraints of real hardware.

3.4 Case Study: A Transformer Accelerator

To optimize the Gemini-based CNN accelerator for Transformer workloads, the memory hierarchy is adjusted and the design is modified so that nonlinear operations can be handled efficiently. These changes improve BERT inference performance by 39.6x.


4. Model Optimization

4.1 Quantization

Quantization converts a model's parameters and activations to lower-bit precision, reducing memory consumption and compute cost. This shrinks ALU size, latency, and energy consumption, and enables deployment on integer-only hardware.

\[Q(r) = \text{Int}\left(\frac{r}{S}\right) + Z\]
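As an illustration of the mapping above, a minimal sketch of uniform quantization and dequantization; the min/max calibration of the scale and zero point is an assumption added for the example:

```python
import numpy as np

def quantize(r, scale, zero_point, num_bits=8):
    """Uniform quantization Q(r) = round(r / S) + Z, clipped to the int range."""
    qmin, qmax = 0, 2 ** num_bits - 1
    q = np.round(r / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int32)

def dequantize(q, scale, zero_point):
    """Approximate reconstruction r ~= S * (Q(r) - Z)."""
    return scale * (q.astype(np.float32) - zero_point)

# Calibrate the scale and zero point from an observed value range (illustrative).
x = np.random.randn(4, 8).astype(np.float32)
scale = (x.max() - x.min()) / 255.0
zero_point = int(round(-x.min() / scale))
x_q = quantize(x, scale, zero_point)
print(np.abs(x - dequantize(x_q, scale, zero_point)).max())  # quantization error
```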

4.2 Sparsity

Sparsity reduces a model's compute cost by removing unnecessary parameters. Structured sparsity removes parameters in specific patterns, which improves both memory and compute efficiency.

4.3 Transformer-Specific Optimization

Inference performance is further improved by optimizing the Transformer's attention mechanism and nonlinear operations. The main approaches are attention optimization that exploits dynamic sparsity patterns and approximation of the nonlinear functions.


5. Mapping Transformers to Hardware

  • Problem definition: the performance-optimization problem that arises when mapping a Transformer block onto a specific hardware architecture.
  • Key decisions: mapping decisions at the graph level and the operation level, including dataflow and tiling.
  • Performance modeling: models for predicting the performance of different mappings, comparing analytical and data-driven approaches.

Executing a Transformer block on a target hardware architecture requires mapping it to hardware instructions that carry out the corresponding computation and communication. The choices made during mapping have a significant impact on performance, and finding an optimal mapping requires careful search, heuristics, or learning-based approaches. The paper introduces the mapping problem, discusses the key mapping decisions for efficient Transformer execution, surveys existing mapping techniques and techniques for modeling mapping performance, and presents mapping considerations specific to Transformers.

5.1 What is a Mapping?

A mapping, or schedule, is defined as a sequence of hardware instructions that executes a set of operations on a specific hardware architecture. On a systolic-array accelerator such as Gemini, for example, this may include dense matrix multiplications with a particular dataflow and data movement between DRAM and local SRAM. A mapping enumerates the complete ordering of compute and memory instructions needed to produce source code or a compiled binary that can run on the hardware.

Different mapping decisions for the same problem can differ dramatically in performance. The goal of a mapping or scheduling framework is therefore to obtain valid, Pareto-optimal mappings for a given software operation and the desired performance metrics.

5.2 Key Mapping Decisions

Mapping proceeds in two steps. First, at the graph level, the computation graph is converted into a set of tensor operations. Second, each tensor operation is scheduled into appropriately sized hardware instructions.

5.2.1 Graph Level

Graph-level scheduling involves decisions that change not only the execution schedule of operations but also the structure of the computation graph itself. The key decisions are:

  • Layer fusion: combining multiple layers into a single tensor operation. This reduces inter-layer communication by keeping results on-chip instead of writing them to and reading them back from main memory.
  • Dynamic sparsification: making pruning decisions based on the activation maps, for example by zeroing out dot products that are likely to be small in order to reduce computation.
  • Static sparsification: making pruning decisions independently of the activations; structured sparsification yields larger speedups but can degrade accuracy.

5.2.2 Operation Level

Operation-level scheduling decomposes each tensor operation into a set of tasks to be executed on the given architecture.

  • Tiling: dividing the operation into tiles that fit into the different levels of the memory hierarchy; the tile sizes are a design choice.
  • Dataflow decisions: determining the order in which tiles are executed and the order in which tensors are moved across the processors.
  • Spatio-temporal mapping: deciding which axes are parallelized and which are executed sequentially.
  • Overlapping communication and computation: minimizing latency through techniques such as double buffering.
  • Mapping arithmetic instructions to hardware instructions: on some architectures this can be as simple as replacing an appropriately sized matrix multiplication with a single ISA instruction call.

These choices have a large impact on performance and must be selected to minimize cost on the specific hardware target, as the tiling sketch below illustrates.
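To make the tiling and dataflow choices above concrete, a small NumPy sketch of a tiled matmul, where the tile sizes stand in for on-chip buffer capacity and the loop order stands in for the dataflow (sizes are illustrative, not tuned for any particular accelerator):

```python
import numpy as np

def tiled_matmul(A, B, tile_m=32, tile_n=32, tile_k=32):
    """C = A @ B computed tile by tile.

    The tile sizes model how much of each operand fits on chip, and the loop
    order models the dataflow: partial sums for one output tile stay resident
    (output-stationary) while input tiles are streamed in along k.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for m0 in range(0, M, tile_m):
        for n0 in range(0, N, tile_n):
            acc = np.zeros((min(tile_m, M - m0), min(tile_n, N - n0)), dtype=A.dtype)
            for k0 in range(0, K, tile_k):
                a = A[m0:m0 + tile_m, k0:k0 + tile_k]   # streamed activation tile
                b = B[k0:k0 + tile_k, n0:n0 + tile_n]   # streamed weight tile
                acc += a @ b                            # on-chip accumulation
            C[m0:m0 + tile_m, n0:n0 + tile_n] = acc
    return C

A, B = np.random.randn(128, 768), np.random.randn(768, 768)
assert np.allclose(tiled_matmul(A, B), A @ B)
```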

5.3 Finding Performant Mappings

A variety of mapping techniques and compiler frameworks have been developed to cope with the size of the search space. These techniques cover only part of the search space and use a variety of search strategies.

5.3.1 Mapping Strategies

Mapping algorithms generally fall into three categories:

  • Brute-force methods: sampling the mapping space, exhaustively or at random, to find the best mapping.
  • Feedback-based methods: using ML algorithms or statistical methods to improve the accuracy of a cost model or to search for solutions directly.
  • Constrained-optimization methods: formulating the scheduling problem as an optimization problem in which variables are assigned subject to constraints and an objective function.

5.4 Performance Modeling of Mappings

Performance models provide feedback on different mappings without running them on real hardware. This reduces evaluation cost and can serve as a performance proxy for mapping optimization.

For Transformers, domain-specific polynomial and analytical models provide fast predictions for comparing mappings, while data-driven ML models trained on measured mapping performance provide more accurate predictions.

5.5 Differences Between Mapping Transformers and CNNs

Existing mapping strategies focus mainly on CNNs, but since Transformers are built around similar operations (matrix multiplications), CNN mapping strategies can largely be applied to them. However, Transformer blocks also include additional operations such as LayerNorm and Softmax, which make scheduling optimization more complex.

5.5.1 Characterizing the Transformer Mapspace

Comparing the mapspaces of BERT and ResNet50 shows that the two distributions are similar, and that finding good Transformer mappings is just as hard as finding good CNN mappings. This suggests that brute-force or random search struggles equally on Transformer mappings.

5.5.2 Scheduling Complexity of LayerNorm and Softmax

The latency of LayerNorm and Softmax in Transformers can be hidden by fusing them with adjacent high-arithmetic-intensity operations (matmuls), but this requires hardware support and appropriate mapping constraints. Experiments show that fusing the MHA matmuls with Softmax can reduce latency by up to 78%, whereas fusing the FFN matmuls with LayerNorm often degrades performance.

In short, mapping Transformers is as hard as mapping CNNs, and nonlinear operations such as LayerNorm and Softmax add further complexity. Efficient Transformer mapping requires careful strategies that account for the characteristics of both the hardware and the operations.


6. Applying NAS to Transformers

6.1 Overview of Neural Architecture Search (NAS)

NAS has become an important technique for optimizing the structure of deep neural networks (DNNs). It is a framework for automatically designing an optimal DNN architecture, taking into account both accuracy on the target task and hardware performance requirements. The process generally consists of three main components: the search space, the search method, and the evaluation method. The search space defines the possible DNN architectures, the search method explores that space effectively, and the evaluation method assesses the performance of candidate architectures.

6.1.2 Search Methods

Because the NAS search space is enormous, efficient search methods are essential. Early NAS work relied mainly on reinforcement learning (RL): an RL agent samples architectures and uses their post-training accuracy as a reward to improve its policy. This approach is computationally expensive, however. To address this, alternatives such as evolutionary algorithms were introduced, and more recently gradient-based methods such as DARTS have been proposed, which search efficiently over a continuous search space.

6.1.3 Weight Sharing and Supernets

One of the main challenges of NAS is the long training time and high computational cost. One approach proposed to overcome this is weight sharing. ENAS treats each candidate DNN as a subnetwork of a large over-parameterized supernet and shares the supernet's weights across all subnetworks. A single supernet is trained once, and different subnetworks can then be sampled and evaluated, greatly reducing the computational cost.

6.2 Hardware-Aware NAS

Hardware-aware NAS optimizes not only a DNN's accuracy but also its performance on a specific hardware platform (e.g., latency, energy consumption, memory usage). This is done either by measuring hardware performance directly or by training a model that predicts it. For example, MnasNet extends an RL-based NAS framework to solve a multi-objective optimization problem that maximizes accuracy while keeping latency below a given target.
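As an illustration of such a multi-objective formulation, a small sketch of an MnasNet-style soft latency-penalized reward; the accuracy/latency numbers and the exponent value are illustrative placeholders:

```python
def nas_reward(accuracy, latency_ms, target_ms, w=-0.07):
    """MnasNet-style multi-objective reward: accuracy scaled by a soft latency
    penalty, ACC * (LAT / T)^w, so exceeding the target latency lowers the
    reward instead of rejecting the candidate outright."""
    return accuracy * (latency_ms / target_ms) ** w

# Two hypothetical candidate architectures under a 10 ms latency target.
print(nas_reward(0.760, latency_ms=8.0, target_ms=10.0))   # fast, slightly less accurate
print(nas_reward(0.765, latency_ms=14.0, target_ms=10.0))  # more accurate but too slow
```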

6.3 Transformer-Specific NAS

Transformer architectures were initially used mainly for NLP tasks, but they have more recently proven effective in computer vision and other domains. Early NAS work for Transformers, such as the Evolved Transformer, focused on using evolutionary algorithms to discover better Transformer architectures. These studies found a variety of Transformer structures that improve accuracy while also optimizing for hardware performance.

6.4 Case Study: Optimizing Transformers with NAS and Co-Design

In the case study, a NAS method is applied to optimize Transformer inference on the Gemini hardware. Various architectures are explored, and experiments identify Pareto-optimal architectures with respect to hardware cost metrics such as the energy-delay product (EDP). The study shows how effective NAS can be at finding the best balance between Transformer accuracy and hardware cost.

Overall, the paper presents hardware-software co-optimization methods for efficient Transformer inference and validates them through a Gemini-based case study. Bottlenecks are identified via analytical modeling and profiling on real hardware, and performance is improved by applying a range of optimization techniques.


[Reference 1] DARTS

DARTS (Differentiable Architecture Search) is a gradient-based method for optimizing neural network architectures. It parameterizes the architecture as a continuous, directly learnable space so that the optimal structure can be searched for efficiently.

DARTS is regarded as a technique that substantially reduces the complexity of architecture search while still producing effective neural architectures.


Basic Idea of DARTS

DARTS turns the choice of operation into a continuous one, converting architecture optimization into a differentiable problem. The weight of each candidate operation is normalized with a softmax, which makes it possible to determine which of the candidate operations is best.


1. Parameterizing the Operation Choice

When there are several candidate operations \(o\) to choose from at each layer of the network, a weight \(\alpha\) is introduced for each operation. \(\alpha\) is a continuous variable that represents the importance of each operation. The operation choice is then parameterized with a softmax:

\[o^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})} o(x)\]

Here \(\mathcal{O}\) is the set of candidate operations, \(i\) and \(j\) are the indices of the input and output nodes, \(x\) is the input, and \(\alpha_o^{(i,j)}\) is the weight of operation \(o\) on the edge from node \(i\) to node \(j\).
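A minimal sketch of this continuous relaxation, using a toy operation set (identity, scaling, and a zero "none" operation) in place of the convolution and pooling candidates used in practice:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Toy candidate operation set O for one edge (i, j).
candidate_ops = [
    lambda x: x,                  # identity
    lambda x: 2.0 * x,            # a "heavier" operation (illustrative)
    lambda x: np.zeros_like(x),   # the "none" operation
]

def mixed_op(x, alpha):
    """Weighted sum of all candidate ops, with softmax(alpha) as the weights."""
    w = softmax(alpha)
    return sum(w_k * op(x) for w_k, op in zip(w, candidate_ops))

alpha = np.array([0.1, 1.5, -0.3])   # architecture parameters for this edge
x = np.random.randn(4)
print(mixed_op(x, alpha))
# After the search, the op with the largest alpha (here the second) is kept.
```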


2. Objective and Gradient Computation

The objective of DARTS is to maximize performance on a validation set. Writing the objective as \(L_{\text{val}}(\theta, \alpha)\), \(\theta\) denotes the network weights and \(\alpha\) the operation-selection weights. To minimize the objective, gradients with respect to \(\alpha\) and \(\theta\) are computed and the parameters are updated accordingly (\(L_{\text{train}}\) is the loss on the training set):

\[\theta^* = \arg\min_{\theta} L_{\text{train}}(\theta, \alpha)\] \[\alpha^* = \arg\min_{\alpha} L_{\text{val}}(\theta, \alpha)\]


3. Bilevel Optimization

DARTS alternates between optimizing \(\theta\) and \(\alpha\): \(\theta\) is updated with ordinary backpropagation on the training loss, while \(\alpha\) is updated based on performance on the validation set. This can be written as follows (\(\eta_{\theta}\) and \(\eta_{\alpha}\) are the learning rates for \(\theta\) and \(\alpha\), respectively):

\[\theta \leftarrow \theta - \eta_{\theta} \nabla_{\theta} L_{\text{train}}(\theta, \alpha)\] \[\alpha \leftarrow \alpha - \eta_{\alpha} \nabla_{\alpha} L_{\text{val}}(\theta, \alpha)\]
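A toy sketch of the alternating (first-order) update schedule above; the loss functions and gradients are placeholders, not a real architecture search problem:

```python
# Placeholder losses: L_train(theta, alpha) = (theta - alpha)^2,
#                     L_val(theta, alpha)   = (theta - 1)^2 + (alpha - theta)^2.
def grad_theta_train(theta, alpha):
    return 2.0 * (theta - alpha)

def grad_alpha_val(theta, alpha):
    return 2.0 * (alpha - theta)

theta, alpha = 0.0, 5.0
eta_theta, eta_alpha = 0.1, 0.1
for _ in range(100):
    # (1) update the network weights theta on the training loss
    theta -= eta_theta * grad_theta_train(theta, alpha)
    # (2) update the architecture parameters alpha on the validation loss
    # (first-order DARTS ignores the dependence of theta on alpha here)
    alpha -= eta_alpha * grad_alpha_val(theta, alpha)
print(theta, alpha)
```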

By performing operation selection in a continuous space, DARTS greatly reduces the complexity of architecture search while enabling gradient-based optimization.


1 INTRODUCTION

Deep learning models have scaled up to billions of parameters and billions of Multiply-Accumulate (MAC) operations during both training and inference. As a result, there has been a growing interest in computing these models efficiently and in deploying these compute and memory-intensive workloads on resource-constrained edge devices. These edge devices have tight energy and memory constraints, and the corresponding applications that leverage deep learning models also often have real-time latency constraints.

CPUs and GPUs are both commonly used in general-performance computing platforms, and they have the advantage of being both ubiquitous and capable of supporting a wide variety of workloads and operations. However, this flexibility comes at a cost of reduced efficiency. Deep learning models are composed of a small number of distinct operations that are repeated millions or billions of times, and therefore they often do not require a high level of flexibility. Additionally, while modern CPUs and GPUs can perform several operations in parallel, they lack the ability to leverage the massive data reuse opportunities in deep learning models.

The combination of a need for fast, efficient computation, the use of a small number of distinct operations, and the opportunities for data reuse have all led to the use of hardware accelerators for deep learning. A multitude of enterprise deep learning accelerators, such as [1, 4, 62, 91, 100, 115, 134, 137, 171, 196, 208], have been developed and integrated into commodity hardware by industry in the past decade. This parallels many research accelerators developed in academia [34, 37, 39, 40, 59, 69, 70, 81, 169]. Together with hardware accelerator development, the software frameworks [3, 32, 98, 167] and compilers [33, 161, 185] for deploying various deep learning algorithms have also enhanced and matured. These tools enable the execution of deep learning algorithms on accelerators, and they perform mapping optimizations to improve the performance and efficiency of the full deep learning pipeline. Nonetheless, the fast-evolving deep learning algorithms still keep introducing new demands for hardware and software support, as well as their co-optimization, to satisfy various deployment constraints.

The recent rise in popularity of Transformers and large language models [22, 44, 52, 58, 86, 173–175, 177, 190, 198] for solving various natural language processing (NLP) tasks presents a brand new set of challenges in the design of accelerators as well as frameworks. There has also been an increased focus on making Transformer inference more efficient, especially due to their growing size and run-time complexity. However, there is still a lack of understanding regarding the workload characteristics of Transformer architectures, and thus of the design principles necessary for effectively running these models, when compared to the more well-known convolutional neural network (CNN) architectures. For instance, compared to the conventional CNN-focused design, Transformers are mostly composed of matrix multiplications (matmuls) together with memory-intensive nonlinear operations. In addition, the computational graph and dataflow of Transformer models are more complex than that of CNNs, with more types of operation nodes, as well as more dataflow splits and concatenations. All these challenges require us to undertake a comprehensive analysis of the current hardware and software solutions as well as the various design trade-offs for Transformer inference. Performing such an analysis will enable us to build a holistic and comprehensive understanding of the requirements for efficiently running Transformers. The contribution of this work is two-fold: (1) to analyze the run-time characteristics of Transformers and to survey different approaches for efficient Transformer inference; and (2) to perform a case study by applying the surveyed methodologies on Gemini [70], the full-stack deep neural network (DNN) accelerator generator. The longer-term goal of this work is to characterize different factors across the hardware and software stack in order to optimize Transformer inference.

Regarding our first contribution, this paper contains a survey and analysis covering different hierarchies in end-to-end deep learning inference, with a particular focus on Transformers. This includes:

  • Analysis and profiling of the runtime characteristics and bottlenecks of the Transformer architecture (Sec. 2).
  • Hardware architectures for Transformer inference, including the impact of the non-linear operations of the Transformer architecture on their design (Sec. 3).
  • Optimization strategies such as pruning and quantization for further improving the performance of a fixed Transformer architecture (Sec. 4).
  • Mapping and scheduling of operations in the Transformer architecture and the associated challenges (Sec. 5).
  • Designing and adapting Transformer architectures to be more hardware efficient through an automated neural architecture search process (Sec. 6).

Regarding our second contribution, our case study of applying the surveyed methodologies on deploying Transformers yields several key findings, including the following:

  • Gemini, which was originally designed for CNN workloads, does not yield hardware accelerator architectures that are well-suited for Transformer inference. The primary bottleneck for running Transformers on CNN domain-specific accelerators is not necessarily linear operations, but rather it is the time spent on floating-point non-linear operations, as well as quantization and dequantization operations. Unless those operations are addressed properly, this can result in less than 1% hardware utilization (Sec. 3.4 and Fig. 14).
  • For Transformer accelerators, it is often better to have a larger accumulator size and smaller scratchpad size, while the opposite is often more optimal for CNN accelerators. Changing accelerator architecture to incorporate this observation can result in a 36% latency improvement over the baseline optimized for CNN benchmarks (Sec. 3.4.3).
  • Despite the fact that scheduling matmuls in Transformers only requires 3 loops, as compared to 6 for convolutions in CNNs, we found that it is as challenging to find performant schedules for Transformers as it is for CNNs. The selection of appropriate scheduling decisions for Transformers involves a large number of decisions, with the best and worst solutions exhibiting performance differences of up to four orders of magnitude (Sec. 5.5.1 and Fig. 18, 19, 20).
  • Fusing Batch Normalization with the neighboring convolution in CNN models is straightforward. However, when fusing Layer Normalization with the preceding matmul in the Transformer architecture, constraints are imposed on the mapping, particularly related to tile sizes. This requires further consideration since the runtime cost due to the mapping constraints could outweigh the gains from operation fusion in certain circumstances (Sec. 5.5.2 and Fig. 21, 22).

2 TRANSFORMER MODEL ARCHITECTURE AND PERFORMANCE BOTTLENECKS

In this section, we start with a high level introduction to the building blocks of the Transformer architecture. We first discuss the multi-head attention and feed-forward modules, the non-linear operations used in Transformers, and the difference between Encoder/Decoder models, in Sec. 2.1. We then analyze the impact of these different blocks on hardware performance using arithmetic intensity, as well as the analytical modeling and direct profiling of each component, in Sec. 2.2.

2.1 High-Level Overview of Transformer Architecture

A Transformer architecture [217] typically consists of multiple Transformer blocks, each of which includes a multi-head attention (MHA) module and a feed-forward (FFN) module, and each of which is followed by a Layer Normalization (LayerNorm) operation and a residual connection. The detailed computations of MHA and FFN are illustrated in Fig. 1, and the configuration parameters for Transformer architectures (along with the values used by BERT-Base and BERT-Large) are provided in Tab. 1. An input sequence to the Transformer block is composed of 𝑙 tokens, each represented by a vector of 𝑑 dimension, forming a 𝑑 × 𝑙 matrix. A token is a segment of an input sequence. For example, when the input is a sentence, a token may be a word or a sentence fragment.

The MHA module (see Fig. 1, Left) first projects this sequence by multiplying it with three different weight matrices: 𝑊𝑄 , 𝑊𝐾 , and 𝑊𝑉 (the so-called query, key, and value matrices). This yields three different activations, namely the query, key, and value activations. The query, key, and value activations are then split into ℎ chunks, with each chunk having a hidden dimension of 𝑑/ℎ. These chunks are then forwarded to ℎ different attention heads, where the query and key chunks are multiplied along the hidden dimension, generating an activation matrix of size 𝑙 × 𝑙. This activation matrix is then passed through the Softmax operation (the output of which is often referred to as an attention score) and multiplied with the value chunk, resulting in an activation of hidden dimension 𝑑/ℎ. Subsequently, all of the activations from the attention heads are concatenated along the hidden dimension to generate a single activation of hidden dimension 𝑑, which is then projected into the same dimension by the last linear layer with the weight matrix 𝑊out. Finally, the output from the last linear layer in the MHA module is passed through the LayerNorm operator before being added to a residual connection to get the MHA module output.

Figure 1: Map of the computations performed in (Left) the multi-head attention (MHA) module and (Right) the feed-forward network (FFN) module in the Transformer encoder block.

Table 1: Configuration parameters for Transformer architectures. Parameters for BERT-Base, BERT-Large, and GPT-2 (smallest) are given as examples. Note that GPT-2 has the same parameters as BERT-Base. Sequence length can be any number, as long as it doesn’t exceed the maximum possible sequence length.

Table 2: Linear operations in Transformer models. The last column is the matrix multiplication dimensions, i.e., 𝑚 × 𝑛 × 𝑘 means the input dimensions of 𝑚 × 𝑛 and 𝑛 × 𝑘, and the output dimension of 𝑚 × 𝑘. Note that act-to-act matmuls are both repeated ℎ times in the multi-headed scheme. The entire computation graphs of MHA and FFN are illustrated in detail in Fig. 1.

In summary, an MHA module consists of six linear operations, four of which are identical weight-to-activation matmuls (i.e., the 𝑊𝑄, 𝑊𝐾, 𝑊𝑉, and 𝑊out projections), and the remaining two of which are activation-to-activation matmuls (i.e., query × key and attention score × value). Throughout this paper, we refer to the first type of matmuls as projections and the second type of matmuls as activation-to-activation matmuls (act-to-act matmuls for short), as they have different run-time behaviors.

The FFN module (see Fig. 1, Right) is a relatively simple block consisting of two linear layers. The input sequence is first projected from the hidden dimension 𝑑 to a higher FFN dimension 𝑑FFN via the first linear layer with the weight matrix 𝑊1. Subsequently, the projected sequence is projected back to the original dimension 𝑑 through the second linear layer with the weight matrix 𝑊2. Generally, the dimension 𝑑FFN is chosen to be 4× larger than 𝑑, resulting in the 4:1 aspect ratio of 𝑊1 and 𝑊2 (e.g., in BERT-Base [52]). In between these two linear layers is a non-linear layer. Typically, GELU [85] is used for this [22, 52, 143, 173, 174]. Tab. 2 summarizes all types of linear layers in a Transformer block in both MHA and FFN modules.
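A compact NumPy sketch of one encoder block as described above can help make the operation mix concrete; the 1/sqrt(d/h) scaling, the tanh approximation of GELU, and the omission of biases and learned LayerNorm parameters are simplifications:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layernorm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def encoder_block(x, W_q, W_k, W_v, W_out, W_1, W_2, h):
    """One Transformer encoder block (MHA + FFN); x has shape (l, d)."""
    l, d = x.shape
    # MHA: four projections plus two act-to-act matmuls per head
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    q, k, v = (t.reshape(l, h, d // h).transpose(1, 0, 2) for t in (q, k, v))
    scores = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d // h))   # (h, l, l)
    attn = (scores @ v).transpose(1, 0, 2).reshape(l, d)
    x = layernorm(x + attn @ W_out)
    # FFN: project d -> d_FFN -> d with a GELU-like nonlinearity (tanh approx.)
    hdn = x @ W_1
    gelu = 0.5 * hdn * (1 + np.tanh(np.sqrt(2 / np.pi) * (hdn + 0.044715 * hdn ** 3)))
    return layernorm(x + gelu @ W_2)

# BERT-Base-like sizes: d=768, h=12, d_FFN=3072, sequence length l=128.
l, d, h, d_ffn = 128, 768, 12, 3072
rng = np.random.default_rng(0)
Ws = [rng.standard_normal(s) * 0.02 for s in
      [(d, d), (d, d), (d, d), (d, d), (d, d_ffn), (d_ffn, d)]]
print(encoder_block(rng.standard_normal((l, d)), *Ws, h=h).shape)  # (128, 768)
```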

2.1.1 Nonlinear Operations.

There are several nonlinear operations such as Softmax, LayerNorm, and GELU that require specialized support or off-chip computation. These nonlinear operations account for a relatively smaller portion of the overall operations, when inferring with Transformer networks, as compared to the linear operations (Sec. 2.2.2). However, they are more challenging to compute on typical hardware than matmuls, and they can incur significant overhead if not handled appropriately.

Figure 2: Diagrams outlining the Softmax, LayerNorm, and BatchNorm operations. Since they rely on runtime statistics, LayerNorm and Softmax both require multiple passes over the input in order to compute the nonlinear operation. In the case of Softmax, a first pass over the inputs is required to compute the denominator. For LayerNorm, three passes are required over the inputs: one to compute the mean; one to compute the standard deviation; and one to apply the normalization. Unlike LayerNorm and Softmax, BatchNorm only uses statistics which are learned during training, and therefore it only requires one pass over the inputs.

Figure 3: Variants of Transformer networks. (a) An encoder-only model, which performs inference for all tokens in parallel. (b) A decoder-only model, which performs inference in an auto-regressive manner. (c) An encoder-decoder model, which uses the output of the encoded sequence as input to a cross-attention module.

The nonlinear operations present challenges in terms of efficient utilization of temporal memory as well as efficient computation. This is because they require multiple passes over all input values, which requires those values to be held in temporal memory. As depicted in Fig. 2 (a), the Softmax operation involves (1) exponential operations, (2) summing up the results across the sequence length dimension, and (3) normalizing the input by dividing it by the summation result. It is also well known that the exponential function is prone to numerical overflow, prompting the use of the maximum subtraction trick [151] that transforms the expression

\[\text{exp}(x_i) / \sum_j \text{exp}(x_j)\]

into

\[\text{exp}(x_i - x_{\text{max}}) / \sum_j \text{exp}(x_j - x_{\text{max}}),\]

where \(x_{\text{max}}\) is the maximum of the \(x_j\)’s. This, however, requires an additional pass over the inputs, resulting in a three-pass numerically stable implementation. Computing the LayerNorm function also requires multiple passes over the entire input values across the hidden dimension, as illustrated in Fig. 2 (b). In the first pass, the mean must be computed. In the second pass, this is then used to compute the standard deviation. Finally, in the third pass, where the normalization is actually applied, one division per input value is required.
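A minimal sketch of this multi-pass structure, which is what makes Softmax and LayerNorm awkward to fuse and to keep on-chip:

```python
import numpy as np

def softmax_3pass(x):
    """Numerically stable Softmax written as the three passes described above:
    (1) find the max, (2) sum the shifted exponentials, (3) normalize."""
    x_max = x.max()                     # pass 1
    denom = np.exp(x - x_max).sum()     # pass 2
    return np.exp(x - x_max) / denom    # pass 3

def layernorm_3pass(x, eps=1e-5):
    """LayerNorm as three passes: mean, standard deviation, normalization."""
    mean = x.mean()                                 # pass 1
    std = np.sqrt(((x - mean) ** 2).mean() + eps)   # pass 2
    return (x - mean) / std                         # pass 3

x = np.random.randn(512)
print(softmax_3pass(x).sum(), layernorm_3pass(x).mean())
```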

Furthermore, the nonlinear operations entail challenges in operation fusing, which is a common technique to reduce inter-layer communications by combining multiple operations into a single operation (Sec. 5.2.1). Unlike Batch Normalization (BatchNorm) in many CNN architectures that can be seamlessly subsumed into preceding or succeeding linear operations [97], LayerNorm requires computing the mean and variance of the inputs at runtime. Therefore, to fuse this operation with the preceding matmul operation, the entire output matrix must be accumulated in place across the reduction dimension (i.e., the dimension in which the mean and variance are computed) before writing out results. This leads to irregular tiling dimensions and lower data reuse. As a result, there is a nontrivial tradeoff between fusing these operations with previous layers versus using better tiling dimensions for maximizing reuse. A detailed analysis of this tradeoff will be provided later in Sec. 5.5.2.

2.1.2 Encoder and Decoder Architectures.

The Transformer architecture was originally introduced as an encoder-decoder model for machine translation tasks [217]. In this setting, the encoder takes the entire source language sentence as input and passes it through multiple Transformer encoder blocks, extracting the high-level features of the input sentence. These extracted features are then fed into the decoder, which is responsible for generating the tokens in the target language, one after another. This is based on the source language features from the encoder as well as the tokens it has previously generated [217]. In subsequent works, encoder-only and decoder-only architectures were introduced, taking only the encoder and the decoder components, respectively, from the original encoder-decoder architecture [46, 174] (Fig. 3).

Encoder Block.

In encoder-only Transformer models [52, 143, 240], the input sequence is passed through the repeated encoder blocks all at once. For this reason, the encoder-only structure is suitable for natural language understanding tasks [52, 143], such as sentiment analysis [202] or sentence similarity analysis [28, 53, 96], where the entire input sequences are fed into the model.

In the encoder block, the inference is composed of matrix-matrix multiplications as well as element-wise additions and nonlinear operations. The cost of the projection layers in the MHA module and FFN module scales linearly with the input sequence length 𝑙. However, the act-to-act matmuls in the MHA module scale quadratically with sequence length (as demonstrated in query × key and attn. score × value rows in Tab. 2). In Sec. 2.2.2, we demonstrate via profiling that this depends on the sequence length: with short sequence lengths, the projection layers dominate, making the overall complexity of the encoder block 𝑂(𝑙); with long sequence lengths, however, the act-to-act matmuls dominate, making the overall complexity 𝑂(𝑙²).

Decoder Block. In contrast to encoder-only models, the decoder-only models [22, 173, 174] that consist of repeated decoder blocks are auto-regressive in nature. This means that the output at a given time step is based on the outputs in the previous time steps. In other words, the model predicts a token in a sentence based on the previous tokens it has generated so far, and the inference must therefore be performed sequentially and iteratively, once for each output token. For instance, if the previously generated sequence is “I am a”, the model takes this as input and may predict the next token “student”. Then, in the next time step, the input to the model becomes “I am a student”. Therefore, the decoder-only structure is suitable for natural language generation tasks. It is important to note that, in decoder-only models, the input prompt tokens can be consumed in parallel before the model begins to generate subsequent tokens. For this work, we only consider open-ended generation (i.e., assuming no input prompt).

Unlike the encoder block, which operates on the entire input sequence, the decoder block is inferred one token at a time. This results in a sequence length of one for each time step. In the case of the projection layers, each token is independent of the previously generated token. Thus, the projection operations are solely applied to the input token, resulting in a matrix-vector multiplication and a constant cost. However, this does not hold for the act-to-act matmuls, as the input token is not independent of the previously generated tokens. Instead, it is required to attend to all of them.

Consequently, these operations scale linearly with sequence length, implying that more compute is required to process a token in a larger time step than a token in a smaller time step. A key detail to note is that the full key and value activations must be present for the input token to attend to all previously generated tokens. A common optimization technique for token generation is to cache and reuse the intermediate key and value of the previously generated tokens in subsequent iterations, thus avoiding the need to recompute them for every iteration. Taken together, the end-to-end complexity of generating the full sequence scales linearly for the projection layers and quadratically for the other two act-to-act matmuls. The end-to-end computation graph of the Transformer decoder block is also provided in Fig. 27 of Appendix A.6.
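A toy single-head sketch of the key/value caching described above; the projection matrices and the use of the attention output as the next token's hidden state are placeholders for the rest of the block:

```python
import numpy as np

def attend(q, K_cache, V_cache):
    """Single-token attention against all cached keys/values (one head)."""
    scores = q @ K_cache.T / np.sqrt(q.shape[-1])   # one score per cached token
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V_cache

d_head, steps = 64, 8
rng = np.random.default_rng(0)
W_k, W_v, W_q = (rng.standard_normal((d_head, d_head)) * 0.02 for _ in range(3))

K_cache = np.zeros((0, d_head))
V_cache = np.zeros((0, d_head))
x = rng.standard_normal(d_head)               # current token's hidden state
for t in range(steps):
    # Only the new token's key/value are computed; earlier ones are reused.
    K_cache = np.vstack([K_cache, x @ W_k])
    V_cache = np.vstack([V_cache, x @ W_v])
    out = attend(x @ W_q, K_cache, V_cache)
    x = out                                    # stand-in for the rest of the block
print(K_cache.shape)                           # (8, 64): one cached k/v per step
```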

Summary (Sec. 2.1. Transformer Overview) Transformers are composed of several Transformer blocks, each of which has an MHA (multi-head attention) module and an FFN (feed-forward network) module (along with LayerNorm and residual addition after each module). The MHA module contains projection layers as well as act-to-act matmuls and Softmax operations. The FFN module consists of two projection layers with a nonlinear function between them. There are two types of Transformer blocks: encoder blocks and decoder blocks. Encoder blocks process the entire input sequence in parallel, making them suitable for natural language understanding tasks. Decoder blocks are autoregressive, meaning that inference must be performed once per generated output token, and are therefore typically used in generative tasks.

2.2 Model Analysis

2.2.1 Workload Analysis.

In order to evaluate bottlenecks in Transformer, we first modelled the number of floating-point operations (FLOPs) required to compute the Transformer encoder-only and decoder-only models, as well as the arithmetic intensity of these networks. Arithmetic intensity is the number of floating point operations that can be performed per byte loaded from memory. It can be computed by dividing the total number of FLOPs by the total number of bytes accessed (also referred to as MOPs, or memory operations) [227]:

\[\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{MOPs}}\]

Here, we are assuming that the local memories are large enough to hold both matrices entirely in memory for a given operation, and that the computed arithmetic intensity values therefore serve as an upper bound for the achievable data reuse. We are also counting the multiplication and addition from a MAC operation separately when computing FLOPs.

End-to-end FLOPs and MOPs. For the encoder analysis, we used the 12-layer BERT-Base model and the 24-layer BERT-Large network (see Tab. 1 for model configurations); and for the decoder, we used the 12-layer GPT-2 model architecture which has the same model configuration parameters as BERT-Base. For the purposes of analysis, we ignored the maximum input sequence lengths of 512 for standard BERT models throughout this section. We then computed MOPs, the number of bytes that had to be accessed, when inferring these models. We assumed 8-bit precision for all operations, meaning that loading one parameter or activation would require loading one byte. For the decoder model, we measured the FLOPs and MOPs as the total amount of floating point operations and memory operations needed to iteratively generate the full sequence of the given length. The FLOPs and MOPs for these networks for a range of sequence lengths are plotted in Fig. 4 and 5, respectively. As one can see, FLOPs and MOPs scale super-linearly for all models, especially in the long sequence length regime, due to the quadratic complexity with respect to sequence length in the act-to-act matmuls.

Figure 4: Plot of the FLOPs for the BERT-Base and BERT-Large encoders and the GPT-2 decoder across different sequence lengths. The FLOPs scales quadratically with sequence length due to quadratic scaling in the act-to-act matmuls as well as the Softmax function. Additionally, inferring the BERT-Base encoder and the GPT-2 decoder (which have the same model architecture) requires a similar number of FLOPs for processing the same sequence length.

Figure 5: Plot of the MOPs for the BERT-Base and BERT-Large encoders and the GPT-2 decoder across different sequence lengths. The MOPs scale quadratically with sequence length for the encoder-only models due to quadratic scaling in the act-to-act matmuls as well as the Softmax function. Additionally, the GPT-2 decoder requires a much greater number of MOPs than the BERT-Base encoder (which have the same model architecture) for processing the same sequence length as it loads weights per every token generation.

End-to-end Arithmetic Intensity. We then modeled the arithmetic intensity by dividing the number of FLOPs required when inferring these models by the number of MOPs. The arithmetic intensity for BERT-Base, BERT-Large, and GPT-2 versus sequence length is shown in Fig. 6. For both BERT-Base and BERT-Large, the arithmetic intensity initially increases with sequence length until 512 and then decreases afterwards for larger sequence lengths. The reason for this is that, as will be analyzed in more detail in Sec. 2.2.2, the FFN module that has higher arithmetic intensity than the MHA module (Tab. 3) dominates the total FLOPs for small sequences (Fig. 7). However, this trend reverses for larger sequence lengths, as the cost of the act-to-act matmuls in the MHA module grows quadratically with the increase in sequence length, leading to a reduction in arithmetic intensity for the end-to-end model inference.

In comparison to encoder-only BERT inference, decoder-only GPT-2 inference exhibits significantly lower arithmetic intensity. This is due to the fact that the decoder is composed solely of matrix-vector operations, which limits the opportunities for data reuse. That said, for a single matrix-vector operation, we perform roughly one multiplication and addition per parameter loaded since the loads cannot be shared across tokens. This leads to performing roughly 2 operations per parameter loaded. It is important to note that GPT-2 has fewer FLOPs than BERT-Base and BERT-Large as the sequence length is increased. However, it is typically more challenging to run its inference efficiently due to its low arithmetic intensity. This makes its performance memory bandwidth-bound, as compared to encoder-only BERT models. This behavior is also characterized in depth by [166].

Figure 6: Plot of the arithmetic intensity of the BERT-Base and BERT-Large encoders and the GPT-2 decoder across different sequence lengths. The arithmetic intensity initially increases since the larger matrix dimensions allow for more computations to be performed per parameter loaded. However, at higher sequence lengths the arithmetic intensity decreases. This is because, for the long sequence length, the act-to-act matmuls, and Softmax computations of the MHA module begin to dominate. These have relatively lower arithmetic intensity compared to the projection layers in the FFN module.

Table 3: Per-Layer FLOPs, memory operations (MOPs), and arithmetic intensity for the BERT-Base encoder with sequence lengths of 128, 512, and 4096 tokens. At low sequence lengths, the main contributors to both FLOPs and MOPs are the MHA and FFN projections. For longer sequence lengths, the act-to-act matmuls consume a greater proportion of FLOPs, and these operations along with Softmax consume the majority of MOPs. The act-to-act matmuls also have lower arithmetic intensity than the projection layers in the MHA and FFN for each sequence length.

Table 4: Per-Layer FLOPs, memory operations (MOPs), and arithmetic intensity for ResNet50. Convolutions consume the dominant proportion of FLOPs, but BatchNorm, ReLU, and the other operations contribute a significant proportion of MOPs.

Per-Layer FLOPs, MOPs, and Arithmetic Intensity. We then assessed the per-layer FLOPs, MOPs, and arithmetic intensity versus sequence length for the BERT-Base encoder (Tab. 3). As shown in Tab. 3, the proportion of FLOPs and MOPs consumed by the act-to-act matmuls increases with sequence length, and these operations have lower arithmetic intensity compared to the projection layers in the FFN and MHA modules. This explains the decrease in overall arithmetic intensity of encoder-only models for long sequence lengths, as observed in Fig. 6.

The low arithmetic intensity of the act-to-act matmuls relative to the projection layers is because the 𝑑/ℎ dimension in these two operations is small relative to the dimensions for the projection layers (𝑑 and 𝑑FFN) and also relative to 𝑙, as the sequence length is increased. Small matrix dimensions lead to lower arithmetic intensity, as there are fewer operations to perform per element in the matrix, leading to reduced reuse. The low arithmetic intensity is further exacerbated with large activation sizes that must be loaded and stored for the act-to-act matmuls. This activation size not only grows quadratically with the sequence length 𝑙, but it is further multiplied by the number of heads ℎ since each head has its own activation (attention score) in the multi-head scheme. Therefore, as shown in Tab. 10 in Appendix A.3, a hypothetical BERT model with a smaller number of heads (thus with a larger 𝑑/ℎ dimension) would reduce the number of MOPs and improve the arithmetic intensity of the act-to-act attentions in the MHA module. This suggests that, when designing a Transformer architecture, the number of heads can entail a trade-off between accuracy versus performance metrics on hardware.

Additionally, Tab. 3 illustrates that while the nonlinear operations (classified as “Other” in the table) consume a small number of overall FLOPs, they consume a significant proportion of MOPs, especially for longer sequence lengths. Similar to the case of the act-to-act matmuls, the large number of MOPs in the Softmax operations for long sequence lengths is primarily due to several 𝑙 × 𝑙 matrices which must be either written out or loaded per attention head. This also indicates that the nonlinear activations, when handled poorly, can become a noticeable contributor to the overall performance, even though they might be overlooked due to their insignificant contribution to total FLOPs. We provide a similar per-layer analysis on the GPT-2 decoder in Tab. 11 of Appendix A.3, which demonstrates the significantly reduced arithmetic intensity across all layers, compared to the encoder-only model, resulting from a large number of memory operations.

Figure 7: Plot of the computation breakdown in the BERT-Base encoder versus sequence length on a CPU. For smaller sequence lengths, the projection layers in the MHA and FFN modules dominate the model latency. However, for longer sequence lengths the act-to-act matmuls begin to dominate.

Figure 8: Plot of the computation breakdown in the GPT-2 decoder versus sequence length on a CPU. The projection layers in the MHA and FFN modules dominate latency for shorter sequence lengths, but for longer sequence lengths the act-to-act matmuls become more significant. Note that nonlinear operations consume a more significant portion of latency than in encoder inference.

Comparison with ResNet50. To provide a baseline in terms of the FLOPs, MOPs, and arithmetic intensity for a typical CNN, we also included a corresponding analysis of ResNet50 (architectural details can be found in Appendix A.2). Tab. 4 provides a breakdown of the FLOPs, MOPs, and arithmetic intensity for ResNet50. Compared to the BERT-Base encoder with a sequence length of 128 (Tab. 3), ResNet50 without any operator fusion consumes 3.07 times fewer FLOPs and 1.28 times fewer MOPs, resulting in lower end-to-end arithmetic intensity than that of BERT-Base across all sequence lengths in Tab. 3. The low arithmetic intensity is partially due to the nonlinear operations in ResNet50 that consume a negligible proportion of FLOPs yet a significant proportion of MOPs, similar to the BERT-Base encoder. However, unlike the nonlinear operations in Transformers, these operations in ResNet50 can be fused with the preceding matmuls in a straightforward manner for inference. In particular, the ReLU operations can be applied directly to the accumulated outputs, and the BatchNorm operations can actually be folded into the prior convolutions. Fusing ReLU eliminates the MOPs for this operation, and folding BatchNorm eliminates both the required FLOPs and MOPs for this operation. Broadly speaking, operation fusion refers to a methodology in which the output values from one operation (e.g., a matmul or convolution) are directly used as input to the subsequent operation (e.g., a ReLU or BatchNorm) without first writing the output values to off-chip memory. Operator fusion eliminates the need for unnecessary memory loads and stores for the nonlinear operations, and therefore it further improves the end-to-end arithmetic intensity. As shown in Tab. 4, fusing these operations with the prior convolutions improves the overall arithmetic intensity for the ResNet-50 network from 66.9 to 121.4. In Tab. 12 of Appendix A.4, we provide more detailed numbers for the FLOPs, MOPs, and arithmetic intensity of several convolutional layers in ResNet50 as a reference.
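A small sketch of folding BatchNorm into the preceding linear operation, shown on a dense layer as a stand-in for a convolution (per-output-channel statistics assumed):

```python
import numpy as np

def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm(y) = gamma * (y - mean) / sqrt(var + eps) + beta into a
    preceding linear layer y = W @ x + b (one statistic per output channel)."""
    scale = gamma / np.sqrt(var + eps)     # one scaling factor per output channel
    return W * scale[:, None], (b - mean) * scale + beta

# Check that the folded layer matches the unfused linear + BatchNorm pair.
rng = np.random.default_rng(0)
cout, cin, eps = 8, 16, 1e-5
W, b = rng.standard_normal((cout, cin)), rng.standard_normal(cout)
gamma, beta = rng.standard_normal(cout), rng.standard_normal(cout)
mean, var = rng.standard_normal(cout), rng.random(cout) + 0.5
x = rng.standard_normal(cin)

y_unfused = gamma * ((W @ x + b) - mean) / np.sqrt(var + eps) + beta
W_f, b_f = fold_batchnorm(W, b, gamma, beta, mean, var, eps)
assert np.allclose(y_unfused, W_f @ x + b_f)   # same result, one layer fewer
```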

Note that arithmetic intensity provides a rough estimate of how much data reuse is possible for different models and operations in the ideal case. Later in Sec. 3.3, we will discuss that analytical modeling can provide a more accurate, non-ideal estimate by taking account of the hardware details.

Figure 9: Plot of the normalized latency of the BERT-Base and BERT-Large encoders and the GPT-2 decoder versus sequence length on a CPU, normalized to the latency of BERT-Base with a sequence length of 128. The latency scales quadratically with respect to sequence length for both encoder-only and decoder-only networks. Additionally, for encoder-only and decoder-only networks with the same model architecture, the latency is significantly longer for the decoder-only network due to its reduced arithmetic intensity.

2.2.2 Profiling.

To analyze the bottlenecks in Transformer workloads on commodity hardware, we profiled Transformer inference on an Intel Gold 6242 CPU. We profiled the workload latency breakdown for both encoder-only BERT-Base and decoder-only GPT-2.

Latency breakdown. Fig. 7 and 8 demonstrate how the latency breakdown changes with respect to sequence length on a CPU for BERT-Base and GPT-2, respectively. These breakdowns illustrate that for short sequence lengths (e.g., 128-512), the majority of computations are in the projection layers of the FFN module, and that the majority of the MHA computation is in the projection layers. However, as sequence length increases, the act-to-act matmuls begin to dominate, as they both scale quadratically with sequence length.

End-to-end Latency. Fig. 9 shows the normalized latency for different sequence lengths for BERT-Base, BERT-Large, and GPT-2. It is evident that the GPT-2 latency is far longer than the latency for either BERT-Base or BERT-Large for each sequence length, even though BERT-Base and GPT-2 have largely the same model configuration and end-to-end FLOPs (as was depicted in Fig. 4). This is mostly due to the lower arithmetic intensity of matrix-vector operations, which was highlighted in Fig. 6. A model with higher arithmetic intensity can run faster with the same (or possibly even more) FLOPs than a model with lower arithmetic intensity. These observations confirm our findings that decoder inference is a memory-bound problem and not a compute-bound problem. We revisit this issue in Sec. 4.3.3 to discuss some of the existing methodologies to speed up the decoding process.

Summary (Sec. 2.2. Model Analysis): Here are the high-level takeaways from this section.

  • Both FLOPs and normalized latency scale super-linearly with sequence length for all Transformer models due to the quadratic complexity of the act-to-act matmuls. However, this trend is less obvious with small sequence lengths, where the main contributor to the overall computation is the FFN, which scales linearly with sequence length, rather than the MHA module (Fig. 4 and 9).
  • For encoder-only models, arithmetic intensity initially increases as the sequence length increases. However, it decreases for larger sequences since the MHA module (in particular, the act-to-act matmuls with lower arithmetic intensity) becomes the dominant contributor to total compute (Fig. 6).
  • The arithmetic intensity of decoder-only models is significantly lower than that of encoder-only models, leading to significantly longer end-to-end latency for the same sequence length. This is due to the fact that decoder models involve matrix-vector operations with limited data reuse, making them memory bandwidth-bound rather than compute-bound (Fig. 6 and 9).
  • Matmuls consume over 99% of the FLOPs in both encoder-only and decoder-only models, and nonlinear operations are a relatively small portion of overall FLOPs. However, the nonlinear operations have extremely low arithmetic intensity, especially for the large sequence length, due to the large volume of activations they need to load and store.

3 HARDWARE DESIGN

So far, in Sec. 2, we have conducted an analysis of the run-time characteristics and bottlenecks of Transformer architectures. We now shift our focus to full-stack solutions for efficient Transformer inference, beginning with the design of efficient hardware.

3.1 Overview of Typical DNN Accelerators

A typical deep learning accelerator has a few key components, as outlined in [27]:

  • Off-chip DRAM used for holding the weights and activations of the full network, which needs to be large enough to hold all model weights and activations;
  • Smaller on-chip memory, referred to here as the global buffer, which needs to be large enough to hold a subset of the weights and inputs in order to feed the processing elements (PEs);
  • An array of PEs, each of which has the capability to perform MAC operations, and which often contains one or more small local memories called register files (RFs) that can store data with lower per-access energy than the global buffer; and
  • An internal network-on-chip (NoC) that transfers data between the PEs and the global buffer.

Fig. 10 shows the structure of a typical DNN accelerator. The global buffer is designed to be able to hold a sufficient number of the weights and activations in order to allow for data reuse and limit the number of transfers to and from the off-chip DRAM. The local memories in the PEs are used to provide local data reuse in order to reduce global buffer accesses whenever possible. Without reuse, each MAC operation requires loading three parameters, the two input values that are being multiplied as well as the current partial sum (which is the partially accumulated value for a given location in the output matrix), and then storing the output value back to memory. This is important because memory reads and writes are orders of magnitude more expensive from an energy perspective [87]. For example, for one particular technology, reads from a local buffer are roughly 6 times as expensive as a single MAC operation, and reads from external DRAM are roughly 200 times as expensive [206]. Leveraging reuse opportunities is therefore critical in order to reduce the number of expensive memory accesses performed.

Figure 10: Basic structure of a DNN accelerator, assuming a spatial dataflow between the Processing Elements (PEs) (inspired by [206]).

To maximize data reuse, there are two broad classes of dataflows that are widely adopted, which are referred to as temporal and spatial dataflows [27, 38, 206]. Temporal dataflows contain an array of centrally-controlled PEs that load data from the global buffer and perform the requested ALU (Arithmetic Logic Unit) operations before writing the data back to the global buffer. These PEs do not contain local memories, and there is no communication or data movement between PEs. Data reuse in this type of dataflow is only attainable through weight or partial sum reuse in the global buffer. Examples of temporal dataflows include Single-Instruction Multiple Data (SIMD) and Single-Instruction Multiple Thread (SIMT) execution units. These types of units are commonly used in both CPUs and GPUs for vector processing. In temporal architectures, fully-connected and convolutional layers are both mapped as matrix-matrix multiplication operations.

In spatial dataflows, the PEs can communicate and data can be moved between PEs to leverage data reuse, without repeated reads from the global buffer. The PEs themselves often contain RFs to hold weights or partial sums locally to improve data reuse, and additional reuse can be attained through passing data between adjacent PEs. Spatial dataflows are commonly used in FPGA and ASIC-based accelerators, especially for convolutional networks [38]. These dataflows allow for data reuse across multiple dimensions in order to drastically reduce the required memory accesses. In order to maximize reuse in spatial dataflows, several different reuse schemes have been employed:

  • Weight stationary dataflows minimize the number of reads required for weight matrices by keeping weights in the local RFs in PEs and streaming through inputs [72];
  • Output stationary dataflows minimize energy from reading and writing partial sums by accumulating the outputs in the local RFs in the PEs [59];
  • No local reuse dataflows have no RF in each PE, and use the area savings from having no RFs to allocate a larger global buffer [34]; and
  • Row stationary dataflows maximize reuse for both partial sums and weights by holding a row of the weights stationary in a row of the PEs, streaming in inputs, and streaming out partial sums [35].

Note that since DNNs consist of sequences of layers, it is also possible to fuse operations in order to further leverage data reuse across multiple layers. We encourage the reader to refer to Section V of [206], Sections IV-A and IV-B of [49], and Sections III-A to III-C of [27] for more comprehensive surveys and comparisons of the basic architecture of a DNN accelerator and typical accelerator dataflows.

Summary (Sec. 3.1. Accelerating Neural Networks): Typical DNN accelerators consist of on-chip memory for holding a subset of model weights and inputs and an array of processing elements (PEs) which can perform MAC operations. Off-chip DRAM is used for holding weights and activations for the full network, and an internal network-on-chip (NoC) can be used for transferring data between PEs. DNN accelerators typically aim to leverage either temporal dataflows (by performing the same operation in parallel on several datapoints) or spatial dataflows (where data can be transferred between PEs to leverage additional reuse opportunities). Spatial dataflow reuse schemes include weight stationary dataflows, which hold weights in local memories in the PEs to improve reuse.

3.2 Adapting DNN Accelerators for Transformers

There are several key considerations when designing DNN accelerators for Transformers or adapting existing CNN accelerators. One difference between accelerators for CNNs and for Transformers is that due to differences in terms of arithmetic intensity and matrix dimensions, these models have different optimal sizes for each level of the memory hierarchy as well as different memory bandwidth requirements.

Another consideration is how the nonlinear functions are computed during inference, which imposes an additional challenge in hardware design. These operations require either specialized support for on-chip computation, or else they must be offloaded to the CPU. In Sec. 3.4, we will outline how the nonlinear operations can bottleneck inference even though they compose a small proportion of model FLOPs, especially if they must be offloaded to the CPU. Several accelerators for Transformer inference contain specialized post-processing units for nonlinear functions [107, 148, 166, 209]. However, adding an additional unit to support these operations also increases the area of the accelerator. This tradeoff between supporting these operations on-chip and accelerator area will be also explored in Sec. 3.4. Additionally, it can be challenging to design the hardware both to efficiently support the required nonlinear operations (e.g., Softmax and LayerNorm) and to support new nonlinear operations in future DNNs.

There are also considerations around datapath design, depending on whether the accelerator is being designed for the MHA module or for end-to-end Transformer inference. Accelerators specialized for the MHA module are designed to match the dataflow of the MHA module, where all the operations are “fused” together, thus having less flexibility but better performance by reducing the number of required memory accesses [64, 79, 80, 223, 237, 253]. Recall that operation fusion refers to using the output values from one operation (e.g., a matmul) directly as input to the following operation (e.g., a Softmax layer) without writing the intermediate values to off-chip memory. Several accelerators for the MHA module develop dedicated datapaths with separate units for the query × key, Softmax, and attention score × value operations in order to better leverage operator-level fusion. In contrast, accelerators for end-to-end Transformer inference typically employ a similar structure to Gemini [70] (which is outlined in more detail in Sec. 3.4) where they are designed to be more flexible by performing individual operations separately in a more general matmul engine [107, 148, 166, 209]. These accelerators also aim to fuse operations whenever possible to improve performance (for example, by applying Softmax directly to the accumulated outputs of a matmul before writing them out). However, the entire graph-level dataflow is not hardcoded in hardware as in MHA-specific accelerators.

In both cases, there are dataflow considerations for nonlinear function unit placement. This is because, as we have demonstrated in Sec. 2.2, non-linear operations generally have a high number of MOPs despite their small FLOPs count, and therefore the overall arithmetic intensity can be improved upon through operator fusion (as in the ResNet50 case). In the case of accelerators for the MHA module, in order to leverage operator-level fusion in the MHA module, the Softmax unit must be placed appropriately such that it can be computed after the query × key multiplication and before the attention score × value multiplication. For example, [64] places the Softmax unit in between specialized units for the query × key and attention score × value multiplications, and it computes LayerNorm in a separate hardware module. Placing functional units to support operator fusion provides higher efficiency, but this comes at a cost of less flexibility since the architecture now makes assumptions about the operator-level dataflow.

Summary (Sec. 3.2. Adapting DNN Accelerators for Transformers): Accelerators for Transformers and CNNs have different optimal sizes for the memory hierarchy as well as different memory bandwidth requirements. Accelerators for the MHA module tend to design hardened datapaths to exploit operator fusion. End-to-end Transformer accelerators tend not to design their datapath around the graph-level dataflow in the MHA module. Transformer accelerators tend to incorporate a post-processing unit to compute nonlinear functions efficiently on-chip.

3.3 Analytical Modeling

Analytical modeling is a useful tool for identifying bottlenecks when inferring DNN benchmarks, as it provides a quick estimate of runtime behaviors on the target hardware platform. At design time, it can be quite difficult to analyze the runtime behaviors of benchmark workloads as well as the potential impacts of hardware architectural changes on performance. This contrasts with the case in Sec. 2.2.2, where profiling can be conducted directly on actual hardware (e.g., CPUs). In cases where profiling is difficult or infeasible, analytical modeling can provide estimates to quickly guide design decisions. Here, we developed an analytical model to demonstrate how it can be useful in understanding the performance breakdown of Transformer inference on hardware accelerators. Our analytical model is based on the Gemini-driven architecture [70], which will be outlined in more detail in Section 3.4.1. Its structure is illustrated in Fig. 11, along with the tunable parameters. The model includes local memories, a PE array for computing tiled matrix-matrix multiplications, and it relies on external memory for storing all model parameters. The performance estimates assume that compute time and memory operation time can be overlapped perfectly, and that the total time for each operation is the maximum of these two. Double buffering was assumed in the scratchpad to ensure that compute could be overlapped with memory reads/writes wherever possible. The model structure is comparable to typical DNN accelerators, with the notable assumption that the included special function unit (SFU) is able to compute all required nonlinear operations, and thus none of these operations have to be computed off-chip. The model also assumes a \(W\)-cycle latency for the PE array, where \(W\) is the width of the PE array, and a 1-cycle latency per vector for the SFU.
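
As an illustration of this style of modeling, the following Python sketch estimates per-operation latency as the maximum of compute time and memory time, mirroring the perfect-overlap assumption above. The parameter names and default values (e.g., `pe_width`, `dram_bw_bytes_per_cycle`) are assumptions made for this sketch, not the exact parameters of the model described here.

```python
import math
from dataclasses import dataclass

@dataclass
class AcceleratorModel:
    """Illustrative analytical model of a Gemini-like accelerator.
    Parameter names and default values are assumptions for this sketch."""
    pe_width: int = 16                    # W x W systolic PE array
    dram_bw_bytes_per_cycle: float = 16.0
    sfu_vectors_per_cycle: float = 1.0    # 1-cycle latency per vector for the SFU

    def matmul_cycles(self, M, K, N, in_bytes=1, out_bytes=4):
        W = self.pe_width
        # Compute time: each W x W output tile takes roughly K + W cycles
        # (K accumulation steps plus the W-cycle systolic fill/drain latency).
        tiles = math.ceil(M / W) * math.ceil(N / W)
        compute = tiles * (K + W)
        # Memory time: read both operands once, write the (wider) outputs once.
        memory = ((M * K + K * N) * in_bytes + M * N * out_bytes) / self.dram_bw_bytes_per_cycle
        # Compute and memory are assumed to overlap perfectly -> take the max.
        return max(compute, memory)

    def nonlinear_cycles(self, num_vectors, bytes_per_vector):
        compute = num_vectors / self.sfu_vectors_per_cycle
        memory = 2 * num_vectors * bytes_per_vector / self.dram_bw_bytes_per_cycle
        return max(compute, memory)

# Example: one W_Q projection of BERT-Base (d = 768) at sequence length 512.
model = AcceleratorModel()
print(model.matmul_cycles(M=512, K=768, N=768))
```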

Figure 11: Diagram of the structure of the basic analytical performance model, as well as the parameters that were varied in this analysis.

Latency Breakdown and End-to-end Latency. One useful application of analytical modeling is obtaining the estimated latency breakdown and end-to-end runtime latency. As an example, we applied analytical modeling to the BERT-Base and BERT-Large encoders as well as the GPT-2 decoder, under the assumption of square tiling for all matrix operations and no operation fusion (i.e., each operation requires inputs to be read from external memory and outputs to be flushed out). In Appendix A.5, we provide the latency breakdowns for BERT-Base and GPT-2 (Fig. 30 and 31, respectively) as well as the end-to-end runtime latency of all models with different sequence lengths (Fig. 32). In general, the results of the analytical model show similar trends in runtime latency scaling and breakdowns as the profiling results on the CPU in Sec. 2.2.2, with only slight differences in the details. Note that the analytical model was designed assuming a hardware architecture that is different from the CPU architecture, and therefore the runtime behaviors would not necessarily be identical across hardware platforms. More details can be found in Appendix A.5, including a comparison with the analytical modeling results on ResNet50.

Non-ideal Arithmetic Intensity. As with the analysis in Sec. 2.2, arithmetic intensity provides a rough estimate of how much data reuse is possible for different operations in the ideal case. However, in real-world scenarios, such as when tiling operations are required due to the size of the matrices exceeding the capacity of the local scratchpad, the arithmetic intensity will be further reduced. In such a case, analytic modeling can provide a more accurate estimate, namely non-ideal arithmetic intensity, by taking into account the hardware details. To take the tiling effect into account, we counted DRAM to L2 memory traffic in our analytical modeling, but not L2 to Scratchpad, in order to avoid double counting. Furthermore, we assume 32-bit output precision before the nonlinear operations, since it is known that low input precision (e.g., 8-bit) to those operations can result in a considerable accuracy degradation [111].
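
The following sketch illustrates, under simplified assumptions, how tiling and 32-bit outputs pull the achieved arithmetic intensity below the ideal FLOPs/MOPs ratio. The tiling model (each operand tile re-fetched once per tile of the opposing output dimension, with partial sums accumulated on-chip) is a simplification for illustration and is not the exact traffic model used for Tab. 5.

```python
import math

def ideal_arithmetic_intensity(M, K, N, in_bytes=1, out_bytes=1):
    """Ideal case: every operand byte is read once, every output byte written once."""
    flops = 2 * M * K * N
    mops = (M * K + K * N) * in_bytes + M * N * out_bytes
    return flops / mops

def tiled_arithmetic_intensity(M, K, N, tile, in_bytes=1, out_bytes=4):
    """Simplified non-ideal case: with square tiles of size `tile`, each A-tile is
    re-read once per column tile and each B-tile once per row tile, while partial
    sums are accumulated on-chip; outputs feeding nonlinear ops use 32 bits."""
    m_tiles = math.ceil(M / tile)
    n_tiles = math.ceil(N / tile)
    flops = 2 * M * K * N
    mops = (M * K * n_tiles + K * N * m_tiles) * in_bytes + M * N * out_bytes
    return flops / mops

# A W_out-style projection of BERT-Base (d = 768) at sequence length 512,
# with an assumed 128x128 tile size:
print(ideal_arithmetic_intensity(512, 768, 768),
      tiled_arithmetic_intensity(512, 768, 768, tile=128))
```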

Table 5: Non-ideal arithmetic intensity for the BERT-Base encoder with sequence lengths of 128, 512, and 4096 tokens. The non-ideal arithmetic intensity is lower than the ideal arithmetic intensities (provided in Tab. 3) due to using 32-bit output precision before nonlinear operations as well as constraints from the memory sizes. Note that the differences in non-ideal arithmetic intensity between the \(W_Q\), \(W_K\), \(W_V\) projections and the \(W_{\text{out}}\) projection with the same operation dimensions are due to differences in output precision – \(W_{\text{out}}\) is followed by nonlinear operations, and therefore it uses 32-bit instead of 8-bit.

| Sequence Length | \(W_Q\), \(W_K\), \(W_V\) projections | \(Q \times K\) | Attn. score \(\times V\) | \(W_{\text{out}}\) projection | \(W_1\) projection | \(W_2\) projection |
|---|---|---|---|---|---|---|
| 128 | – | – | – | – | – | – |
| 512 | – | – | – | – | – | – |
| 4096 | – | – | – | – | – | – |

Compared to the ideal arithmetic intensity discussed in Sec. 2.2.1 (Fig. 6), which is 160, 231, and 118 for the respective sequence lengths, we observe significant reductions in the non-ideal arithmetic intensity. This is due to the effects of tiling as well as the large 32-bit output activations which must be loaded and stored before the nonlinear operations. The gap becomes even larger for long sequence lengths (up to a 2.5× reduction at sequence length 4096), where the effect of loading and storing intermediate values is more pronounced. This differs from the case of ResNet50, whose non-ideal arithmetic intensity of 121.312 does not diverge much from its ideal arithmetic intensity of 122.172. This also demonstrates that, even though the ideal arithmetic intensity of Transformers is generally higher than that of ResNet50, the overall achieved arithmetic intensity is lower for Transformers across all sequence lengths.

Summary (Sec. 3.3. Analytical Modeling): Analytical modeling is a useful tool for identifying bottlenecks and runtime characteristics of DNN inference on a target hardware platform. This technique can be especially useful during the design phase, where profiling on actual hardware can be difficult, yet the analysis is necessary in order to make design decisions. We provided examples of using analytic modeling to obtain latency breakdown and non-ideal arithmetic intensity. In detail, we have demonstrated that the non-ideal arithmetic intensity of the Transformer can be further reduced (up to 2.5×) compared to the ideal case when the hardware constraints and implementation details are taken into account.

3.4 Case Study: Building a Transformer Accelerator

We now illustrate with a more “hands-on” example how architects familiar with mainstream accelerators for convolutional, vision-based workloads can design state-of-the-art Transformer accelerators. Although the analytical model in Sec. 3.3 presents ideal latency and runtime predictions for Transformer inference, approaching the ideal performance and efficiency in real-world hardware accelerators can take considerable engineering effort, which we explore here. We start with a fairly typical CNN accelerator generated by the Gemini [70] accelerator-generator, optimized primarily for ResNet50-like workloads.

Figure 12: Baseline accelerator’s hardware architectural overview.

We then discuss the changes we made to this accelerator and its software stack to efficiently support Transformer workloads such as BERT. Several accelerators for end-to-end Transformer inference employ a similar structure to Gemini and to our analytical model, and they also contain specialized post-processing units for nonlinear functions [107, 148, 166, 209].

3.4.1 Baseline Accelerator.

We first generate a fairly typical CNN accelerator, illustrated in Fig. 12, using the Gemini accelerator-generator. The accelerator performs matmuls using a 16×16 systolic array, which implements the weight-stationary dataflow. When performing convolutions, the output-channel and input-channel dimensions are spatially unrolled. The 8-bit integer weights and inputs are stored in a 256 kB local scratchpad memory, and the 32-bit partial sums are stored in a dual-ported 64 kB accumulator SRAM which performs matrix additions. When DNN layers are too large to fit into the local scratchpad, they fall back onto an external L2 cache and DRAM which are shared with CPUs and other accelerators on the system-on-chip (SoC). A host CPU tiles such layers to compute the full outputs.
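
For reference, the baseline configuration described above can be summarized as follows; the field names are illustrative, while the values are the ones stated in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GeminiBaselineConfig:
    """Baseline CNN-oriented configuration described in the text.
    Field names are illustrative; values come from the survey."""
    systolic_array_dim: int = 16          # 16x16 weight-stationary PE array
    input_weight_bits: int = 8            # INT8 weights and activations
    partial_sum_bits: int = 32            # partial sums accumulated in 32-bit
    scratchpad_kib: int = 256             # shared input/weight scratchpad
    accumulator_kib: int = 64             # dual-ported accumulator SRAM
    on_chip_nonlinear_ops: tuple = ("relu", "max_pool")  # no LayerNorm/Softmax/GELU

print(GeminiBaselineConfig())
```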

Although most of a CNN’s FLOPs are used to compute matmuls or convolutions, our baseline Gemini-generated accelerator also contains peripheral circuitry to execute ReLU and max-pool operations, as well as integer-float multipliers to scale 32-bit partial sums to 8-bit inputs that can be fed into the next layer in a CNN. Native support for these operations is important, as it eliminates the need for costly transfers back and forth between DRAM or outer caches (where the CPU can perform these operations) and the local scratchpad (where Gemini stores its matrix operands). Finally, note that this baseline CNN accelerator does not include any Transformer-specific features. In particular, there is no support for non-linear normalization operations such as LayerNorm or Softmax. Neither is there support for GELU, which is a relatively expensive non-linear activation function often implemented with costly lookup tables. Instead, this baseline design is a typical example of an accelerator designed and optimized for quantized integer CNN inference. It achieves real-time or near-real-time performance on end-to-end CNN workloads such as ResNet50 [82], SqueezeNet [93], or MobileNetV2 [188], but (as we will see) the performance on Transformer workloads such as BERT is severely limited due to the need to perform operations such as GELU, LayerNorm, and Softmax on the CPU.

3.4.3 Memory Hierarchy.

Transformer matmuls (in particular, the act-to-act matmuls) often have very different shapes and arithmetic intensities than the convolutional layers in CNNs, as also illustrated in Tab. 3 and 4. As illustrated in Fig. 13, simply adjusting the sizes of the input/weight scratchpad and 32-bit partial accumulator significantly improves the performance of BERT’s matmul operations. Larger accumulators enable higher output-reuse, which is more suited for several of the matmuls in Transformers. The query × key matmuls in particular have \(l \times l\) output activation matrices, which for long sequence lengths are much larger than the \(l \times d/h\) input query and key matrices. Increasing the accumulation buffer size therefore allows for improved output reuse with these operations.

Given this observation, we reduce the size of our baseline accelerator’s shared input/weight scratchpad from 256 kB to 64 kB, and we increase the size of the partial-sum accumulator from 64 kB to 256 kB. This involves no increase in the total SRAM capacity and virtually no change to the total area of our accelerator. However, these changes yield a much more substantial 36% reduction in total matmul latency.
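
A quick back-of-the-envelope calculation helps explain why a larger accumulator pays off for the act-to-act matmuls. Assuming BERT-Base dimensions (\(d = 768\), \(h = 12\) heads), the \(l \times l\) attention-score output of a per-head query × key matmul grows much faster with sequence length than its \(l \times d/h\) inputs:

```python
d, h = 768, 12          # BERT-Base hidden dimension and number of heads
d_h = d // h            # per-head dimension (64)

for l in (128, 512, 4096):
    qk_inputs = 2 * l * d_h     # per-head query and key operands
    qk_outputs = l * l          # l x l attention-score matrix (accumulated output)
    print(f"l={l:5d}: inputs={qk_inputs:>9,}  outputs={qk_outputs:>12,}  "
          f"output/input ratio={qk_outputs / qk_inputs:5.1f}x")
```

For long sequences the accumulated outputs dominate, so dedicating more SRAM to the accumulator (output reuse) is more valuable than a large input/weight scratchpad.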

3.4.4 Hardware-Software Co-Design.

As described in Sec. 3.3, matmuls are the dominant kernel in Transformer workloads, but even after maximizing matmul performance, our baseline CNN accelerator still fails to achieve above 1% utilization. This is due to the overhead of the non-linear operations offloaded to the CPU. Fig. 14 shows that only 1% of the time is actually spent on matmuls; the rest is spent on floating-point non-linear activations and normalizations, or on quantization and dequantization operations, since these are offloaded to the CPU.

To alleviate the overhead of runtime quantization and dequantization, we switched our baseline Transformer workload from a naive BERT implementation, where only matmuls are quantized, to an integer-only BERT variant called I-BERT [111]. More details on quantization and I-BERT will be revisited in Sec. 4.1 and 4.3, but the main idea of I-BERT is to replace floating-point nonlinear operations such as GELU and Softmax with integer polynomial approximations such that they are both faster and cheaper to implement in specialized hardware accelerators.

To incorporate I-BERT, we added new integer implementations of I-BERT’s GELU, LayerNorm, and Softmax variants to our baseline CNN accelerator. The 32-bit matmul results resident in the accumulator are fed into a newly added “normalization unit” which computes sums, sums-of-squares, maxes, and other reductions which are used by LayerNorm and Softmax. Multiple passes of accumulator-reads are required to compute all the reductions in these operations. For example, a sum is computed first before a variance is computed using that sum. Afterwards, the matmul results in the accumulators are read one final time to be fed into a set of 16 activation units which compute I-BERT’s GELU, LayerNorm, or Softmax variants in parallel.
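
The multi-pass reduction flow described above can be sketched as follows for one row of accumulator results. This is only an illustration of the passes (sum, then variance, then normalization); the actual unit operates on integers throughout and uses I-BERT-style polynomial and integer square-root approximations rather than the floating-point math shown here.

```python
import numpy as np

def layernorm_reduction_passes(acc_row: np.ndarray):
    """Sketch of the reductions over one row of 32-bit accumulator outputs."""
    x = acc_row.astype(np.int64)
    n = x.size
    # Pass 1: read the accumulator once to compute the sum (-> mean).
    mean = int(x.sum()) // n
    # Pass 2: read again to compute the sum of squared deviations (-> variance).
    var = int(((x - mean) ** 2).sum()) // n
    # Final pass: values stream through the activation units to be normalized
    # (floating point here only for readability).
    return (x - mean) / np.sqrt(var + 1e-5)

row = np.random.randint(-2**15, 2**15, size=768, dtype=np.int32)
print(layernorm_reduction_passes(row)[:4])
```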

With these new features, overall end-to-end BERT inference performance improved by 39.6× over the baseline accelerator’s initial performance. As Fig. 14 illustrates, the computational bottleneck once again became the matmuls, rather than the normalization or activation functions, and this trend persists across different sequence lengths. Quantization and dequantization are no longer necessary, since the non-linear floating-point operations are replaced with I-BERT’s integer polynomial approximations. Also note that GELU operations can now be trivially fused with the preceding matmuls, so that they become one pipelined operation. When synthesized with the ASAP7 PDK [45], the new hardware units increased the total area consumption of the accelerator by only 14%, and the GELU, LayerNorm, and Softmax operations increased the power consumption of a BERT inference by only 9.3%.2

Figure 13: The matmul utilization while performing a BERT-base inference on our baseline CNN accelerator, with different scratchpad and accumulator sizes.

3.4.2 Performance Bottlenecks.

Our baseline CNN accelerator achieves far less than 1% utilization of its functional units when performing BERT inferences. Although individual matmuls achieve 74% utilization, operations that the accelerator doesn’t support natively, such as LayerNorm, significantly reduce performance as they must be performed by the CPU instead. In fact, Fig. 14 shows that 96% of execution is spent on non-matmul operations. Note that over 99% of FLOPs in our Transformer inference are MACs for matmuls, so the time consumed by each operation in the baseline accelerator’s run is far from the theoretical ideal, unless further optimizations are made.

Furthermore, our baseline accelerator offloads GELU and Softmax operations to the host CPU, which performs them with floating-point units. As shown in Fig. 15, floating-point adders and multipliers consume orders of magnitude more energy than their integer counterparts. In our baseline CNN accelerator, matmuls are performed with INT8 inputs, but these must be dequantized and requantized in between matmul operations so that the floating-point activations can be computed on the CPU, further contributing to the energy and latency overhead.

Finally, a specialized hardware accelerator’s memory hierarchy must often be carefully tuned based on the workloads running on it. CNNs primarily perform convolutions,1 which have very high arithmetic intensity, while Transformers primarily perform matmuls, often with small and/or rectangular matrices, which have significantly lower arithmetic intensities and different optimal tiling strategies. For example, we observe the low arithmetic intensities of the MHA module in Tab. 3. This suggests that the memory hierarchy and memory bandwidth of our baseline CNN accelerator should be re-tuned for more efficient Transformer inference.

1 Note that some CNN operations, such as “depthwise convolutions” in models such as MobileNet [188], may also suffer from lower arithmetic intensities, but these operations are found in only a subset of state-of-the-art CNNs, and often constitute only a small portion of the total runtime of a vision model.

Figure 14: The time spent on different operations during a BERT inference with a sequence-length of 512, when running on (Left) the baseline CNN accelerator, and (Middle) the accelerator after it has been extended with I-BERT’s hardware/software features for Transformers. Note that with I-BERT support, quantization and dequantization operations are no longer required, because all operations happen in the integer format. (Right) The time spent on different operations with different sequence lengths after the change. For all sequence lengths, the total execution time is dominated by matmuls.

To summarize, as shown in Sec. 3.3, the nonlinear operations do not necessarily add much to the total FLOPs, area, or power consumption of Transformer accelerators in the ideal case. However, this may not be the case in practice, especially if their computation is offloaded to a CPU, leading to a non-trivial latency impact. We demonstrated that this can be addressed using the I-BERT implementations of LayerNorm, Softmax, and GELU, which only increase the area of a Transformer accelerator by 5-15% and add 8% to the total latency.

Summary (Sec. 3.4. Accelerator for Transformers Case Study): The baseline Gemini accelerator designed for CNN architectures uses 8-bit integer weights, and it has an accumulator for partial sum storage as well as a small scratchpad that overflows to an external L2 cache. The performance of running Transformers on the baseline accelerator suffers for a number of reasons.

  • The bottleneck non-matmul operations running on the CPU take 96% of the total runtime;
  • Activation functions performed in floating point require repeated dequantization and requantization; and
  • The lower arithmetic intensity of Transformer inference makes it more sensitive to a non-optimized memory hierarchy.

2 Note that the ASAP7 PDK does not include energy models for SRAMs, so to derive the total energy consumption of our accelerator, we used the SRAM energy estimates in Accelergy’s [231] CACTI [159] plugin, and scaled them for 7nm.

To address these issues, we:

  • Reduced scratchpad capacity in favor of an increase in accumulator size, which enabled higher output reuse and improved memory efficiency;
  • Switched to I-BERT, an integer version of BERT that approximates floating point activations, eliminating quantization overhead; and
  • Added special normalizer units and activation units that offload GELU, LayerNorm and Softmax computations from the CPU.

These changes mitigate the bottleneck on non-matmul operations, and they trade a 14% area increase for a 39.6× performance improvement. An important takeaway is that even though the nonlinear operations in Transformers have little contribution to the overall FLOPs, area, or power, they can still have a non-trivial impact on latency.

4 MODEL OPTIMIZATION

Given a DNN model that has already been designed and trained, one important question is whether it is still possible to algorithmically improve the efficiency of the model on the target hardware platform by adapting the model into a more hardware-friendly format. In this section, we discuss popular off-the-shelf model optimization methods, quantization and sparsity (i.e., pruning), in Sec. 4.1 and 4.2, respectively. Then, in Sec. 4.3, we outline Transformer-specific optimization methods that improve the performance of Transformer-specific features such as attention and nonlinear operations.

4.1 Quantization

DNN models are typically trained using high-precision floating-point computations. However, high-precision arithmetic is often unnecessary for inference. Quantization is a procedure for compressing DNN models by representing parameters and/or activations with a lower-bit, typically (but not necessarily) fixed-point representation such as 8-bit integer (INT8), instead of 32-bit or 16-bit floating point (FP32 or FP16).

Figure 15: (Left) Comparison between peak throughput for different bit-precision logic on Titan RTX and A100 GPU. (Right) Comparison of the corresponding energy cost and relative area cost for different precision for 45nm technology [87]. As one can see, lower precision provides exponentially better energy efficiency and higher throughput.

Quantization offers multiple benefits for efficient DNN inference. One obvious advantage of reduced precision is the reduction in memory consumption. For example, quantizing model weights from FP32 to INT8 leads to a 4× smaller model size. This leads to reduced off-chip storage and bandwidth without any modifications to a DNN accelerator. Additionally, quantizing activations further allows for reduced memory traffic and storage for intermediate partial results. The memory hierarchy can also be restructured accounting for the precision difference, either by allowing for greater local reuse by storing a larger number of parameters (since each parameter now consumes less storage space), or else by using smaller internal buffers while maintaining the same amount of local data reuse.

A second advantage of quantizing model weights and activations is the reduced size, latency, and energy consumption of the ALUs and the corresponding PEs. In general, floating point ALUs tend to be less efficient than integer ALUs in terms of area, latency, and energy consumption. This is because floating-point PEs need to multiply mantissas, add exponents, and perform a left shift using the exponent to get the final result when performing a single multiplication operation, whereas fixed-point PEs only require a multiplication unit. For this reason, modern GPUs and TPUs often contain INT8 processing paths [43, 100], which can significantly benefit from quantization. For example, as illustrated in Fig. 15, performing INT8 addition can be ∼30× more energy efficient and ∼120× more area efficient, as compared to the FP32 counterpart.

Another critical application for quantization is model deployment on integer-only hardware. Some edge processors for low-cost and power-efficient embedded devices, such as ARM Cortex-M cores [12] and GAP-8 [66], do not include dedicated floating-point units. When deploying models on these processors, not only must the model weights and activations be quantized, but all computations must also be conducted using only integer arithmetic. Otherwise, deployment is impossible or results in considerable overhead due to the need to process non-integer operations off-chip. This would lead to additional latency and energy consumption for data transfer to a general-purpose host processor. This quantization technique of carrying out the entire inference using integer arithmetic is known as integer-only quantization [97, 110, 111, 132, 241]. As discussed in Sec. 3.4.4, integer-only quantization reduces the end-to-end inference latency by 39.6× on Gemini.

Quantization methods can broadly be categorized into uniform and non-uniform quantization, depending on how they map the values. Uniform quantization splits the floating-point domain into evenly spaced intervals and maps each interval to a single fixed-point value. This can be obtained from a simple arithmetic rule:

\[Q(r) = \mathrm{Int}\left(\frac{r}{S}\right) + Z,\]

where \(Q\) is the quantization operation, \(r\) is the floating-point value, \(S\) is a scaling factor, and \(Z\) is an integer zero point (shift factor).
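
A minimal sketch of this rule, assuming asymmetric 8-bit quantization with the scale derived from the observed min/max range (function and variable names are illustrative):

```python
import numpy as np

def quantize_uniform(r: np.ndarray, num_bits: int = 8):
    """Uniform quantization Q(r) = Int(r / S) + Z over the observed range."""
    qmin, qmax = 0, 2 ** num_bits - 1
    r_min, r_max = float(r.min()), float(r.max())
    S = (r_max - r_min) / (qmax - qmin)          # scaling factor
    Z = int(round(qmin - r_min / S))             # integer zero point (shift)
    q = np.clip(np.round(r / S) + Z, qmin, qmax).astype(np.uint8)
    return q, S, Z

def dequantize_uniform(q, S, Z):
    return (q.astype(np.float32) - Z) * S

x = np.random.randn(4, 4).astype(np.float32)
q, S, Z = quantize_uniform(x)
print(np.abs(x - dequantize_uniform(q, S, Z)).max())  # error roughly bounded by S/2
```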

Non-uniform quantization, on the other hand, does not require the intervals to be evenly spaced. By assigning more quantization bins to important regions, non-uniform quantization can capture the original floating-point data distribution more accurately than uniform quantization, generally resulting in improved compression rates. However, it is typically more challenging to efficiently deploy non-uniformly quantized models on general computation hardware. As a result, uniform quantization is currently the de-facto method due to its simplicity and efficient mapping to hardware.

While lower bit quantization can lead to a better compression rate, reducing the precision too aggressively can significantly degrade the model accuracy. It is therefore crucial to achieve a balance between performance gains through reduced precision and maintaining model accuracy. One promising strategy for alleviating this issue is mixed-precision quantization. It is known from previous work [55, 195, 224, 229] that different layers in a model exhibit different sensitivity to quantization, and that it is critical to assign higher bit precision to the more sensitive layers. Notable works for quantizing Transformers with mixed-precision include Q-BERT [195] that uses the Hessian information (i.e., curvature) as a proxy for sensitivity, and HAT [222] that applies reinforcement learning (RL) to learn the appropriate bit precision per layer.

Another challenge with quantizing pre-trained Transformer models is the presence of outliers in activations [117]. Uniform quantization, which attempts to divide the range from the minimum possible value to the maximum possible value into multiple bins, can result in significant performance degradation. This is because more values are mapped to the same quantized value (i.e., resolution degradation) due to the outliers that extend the interval of each quantization bin. While non-uniform quantization can be a solution to circumvent the outlier issue [246], a uniform quantization scheme that assigns larger bit precisions to activations containing outliers has been proposed as well [51]. Furthermore, the recently introduced FP8 precision [156], which provides extra degrees of freedom in setting the exponent bit precision, has been found to be a suitable solution for quantizing models whose integer quantization results in reduced accuracy due to the presence of outliers [121].

For more information about this topic, see Section III-F of [27], Section IV-C-3 of [49], and Section 3.1 of [13], as well as [71] for a more comprehensive survey of software-level approaches.

Summary (Sec. 4.1. Quantization): Quantization is a way of compressing DNN models by reducing the precision of model parameters and/or activations. The immediate benefit of quantization is reduced memory consumption, which allows reduced off-chip storage and bandwidth, and a more efficient memory hierarchy design. Furthermore, quantization can reduce the size, latency, and energy consumption of the ALUs and the corresponding PE via low-bit precision arithmetic. In some cases, quantization also makes it possible to deploy DNN models in integer-only hardware units, which otherwise may be impossible or may incur considerable overhead for offloading non-integer operations off-chip. While many DNN models are robust to a certain level of quantization noise, certain algorithm-level advancements are necessary to prevent accuracy degradation with lower-bit precision (e.g., INT4 or even less). In particular, special considerations must be taken for quantizing pretrained Transformers without accuracy degradation as they are known to have outlier activations.

4.2 Sparsity

Another common avenue for reducing the overall number of computations required for deep learning inference is to introduce sparsity. Sparsity (also known as pruning) is a procedure for making DNN models sparse by removing redundant or insensitive parameters. While it has been observed that a dense model may be necessary to successfully train a model, it is also possible to remove many of the parameters after the model has been trained, without any quality degradation. It is known that training large models and then compressing them via pruning achieves better accuracy than training a compressed model from scratch [133]. This may be because having redundant parameters from the beginning of training makes the loss landscape easier to optimize [139], or it may be related to the increased likelihood of obtaining a “lottery ticket” [67].

Broadly speaking, pruning can be categorized into two branches: unstructured pruning; and structured pruning. Unstructured pruning allows arbitrary patterns of sparsification for parameters and feature maps. It can, in theory, produce significant computational savings without accuracy degradation [136]. However, unstructured pruning can be challenging to leverage effectively in hardware. In order to store the data effectively without storing the null (i.e., zero) parameters, a compressed memory format is necessary. Additionally, the computation units must be adjusted to be able to operate directly on the compressed data. Otherwise, the parameters must be decompressed before computations and then re-compressed afterward, leading to additional overhead. For these reasons, commodity DNN accelerators might not efficiently exploit unstructured sparsity patterns.

Structured pruning circumvents these limitations by strictly removing structured sets of parameters. For instance, in Transformers, rows and columns in linear layers, attention heads [155], or even entire layers [63, 186] can be structurally pruned. Recent work has further integrated the structured pruning of different architectural components into a single framework (e.g., pruning attention heads in MHA modules and filters in FFN modules together) [88, 123, 145, 233]. Such structured pruning methodologies immediately lead to dense matmuls that are smaller than the original, eliminating the need for a compressed memory format or special hardware support to gain memory reduction and latency improvement. However, the compression rate might not be as good as with unstructured pruning. It has been shown that a state-of-the-art unstructured pruning method [120] can prune up to 90% of the parameters in BERT [52] without any performance drop on the MNLI benchmark [226], whereas the same performance can only be achieved with a state-of-the-art structured pruning method [233] by pruning up to 70% of the parameters.

While the aforementioned pruning methods belong to weight pruning, activation pruning (i.e., dynamic pruning) can also be applied to dynamically detect and zero out unimportant activations at run-time. In Transformer inference, a popular branch of activation pruning is token pruning [74, 108, 113, 150, 223], which detects and drops less important tokens in each Transformer layer from the rest of the inference. The underlying rationale is that not all tokens (e.g., words in NLP tasks) are necessary to understand the meaning of the input sequence. By reducing the sequence length that Transformers need to process, these methods have demonstrated a reduction of up to ∼ 30−50% in the total number of computations required, without causing a noticeable drop in accuracy in NLP benchmarks [181, 220]. However, accelerating such dynamic sparsity patterns can be a challenge, as it requires detection logic to determine the location of nonzeros on-the-fly. Therefore, in many cases, dynamic sparsity requires designing algorithm and hardware together.

Regardless of the pruning methodologies used, the primary concern is determining which weights should be preserved and which should be removed in order to improve the efficiency of the neural network without sacrificing its performance. Common methodologies for pruning Transformers include the following:

  • Magnitude pruning [68] is a technique that uses the absolute value of each weight as a proxy for its importance. It prunes the weights with the smallest magnitudes during the training process. The rationale behind this approach is that smaller weights contribute less to the model’s final outcome (a minimal sketch of this criterion is shown after this list).
  • Movement pruning [124, 189] is a technique that takes into account the changes in weights during fine-tuning, assigning a larger importance score to the weights that move further away from zero as the fine-tuning process progresses. This technique has been found to be more effective than magnitude pruning for models that are trained using the pre-training and fine-tuning scheme (e.g., BERT [52]), as it better captures the importance of weights as the fine-tuning process progresses.
  • First-order pruning [155] uses the gradients of the loss with respect to the weights, or to a group of weights, as a proxy for their importance to the model accuracy. This approach treats the gradient as an indicator of the impact of zeroing out a parameter on the loss. This scheme was further improved in [88], where the product of the weight magnitude and its gradient was used as the importance proxy, as it may be a more accurate estimate of the impact of zeroing out weights.
  • Second-order pruning [120, 123, 245] uses the Hessian of the loss with respect to the weights, or to a group of weights, as a proxy importance metric. Compared to first-order information, second-order information is generally a more accurate indicator of the effect of removing weights. However, due to the large size of the Hessian matrix, which grows quadratically with the number of weights, it is necessary to employ an appropriate and scalable approximation, often with algorithms from randomized numerical linear algebra [57, 242].
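
As referenced above, the following is a minimal sketch of the magnitude criterion only, under the assumption of one-shot pruning to a target sparsity; practical pipelines typically prune gradually during training and re-train afterwards.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float):
    """Zero out the fraction `sparsity` of weights with the smallest |w|."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy(), np.ones_like(weights, dtype=bool)
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

W = np.random.randn(768, 768).astype(np.float32)
W_pruned, mask = magnitude_prune(W, sparsity=0.7)
print(f"kept {mask.mean():.1%} of weights")
```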

One of the main advantages of pruning is the reduction in memory footprint. The gain in memory efficiency is straightforward with structured pruning, which directly reduces the size and/or number of matrix multiplications. In contrast, unstructured pruning often requires the use of sparse encodings (also known as sparse storage formats) to compress and store sparse data. These methods use less memory by employing metadata to encode the positions of the nonzero entries in the matrices [27, 47]. Sparse encodings can reduce off-chip memory consumption and the corresponding required memory traffic. They can also reduce the required storage size on chip, thereby allowing for smaller buffers or, alternatively, increased reuse. This is because, although the same amount of data can be stored in a buffer, the encoded data corresponds to a greater proportion of the full-sized input tensor.
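
As an example of such a sparse encoding, the compressed sparse row (CSR) format below stores only the nonzero values plus index metadata; the threshold used to sparsify the example matrix is arbitrary and only for illustration.

```python
import numpy as np

def to_csr(dense: np.ndarray):
    """Compressed Sparse Row encoding: values + column indices + row pointers."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx), np.array(row_ptr)

W = np.random.randn(768, 768)
W[np.abs(W) < 2.0] = 0.0                      # ~95% zeros for this example
vals, cols, ptrs = to_csr(W)
# Storage estimate assuming 4-byte values and indices.
dense_bytes = W.size * 4
csr_bytes = (vals.size + cols.size + ptrs.size) * 4
print(f"dense: {dense_bytes / 1e6:.2f} MB, CSR: {csr_bytes / 1e6:.2f} MB")
```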

Pruning can also lead to reduced energy consumption and latency due to the elimination of unnecessary computations. Similar to what we described above, this is relatively straightforward to achieve through structured pruning, but unstructured pruning requires special techniques for identifying and bypassing calculations involving null elements [8, 9, 35, 164, 248, 251]. This can involve identifying and skipping individual elements or entire null vectors. Some detection and skipping methods only save energy by not performing the operation involving the null element; that is, the PE is not used for the null computation, which avoids the associated energy consumption. Other methods additionally seek to reduce latency by assigning a different effectual computation to the skipped PE, rather than leaving it idle for the ineffectual compute cycles. Furthermore, in order to maintain PE utilization with unstructured sparse matmuls, it may also be necessary to perform load balancing. Since the distribution of zeros can be unbalanced between PEs, some PEs may require a longer execution time than others, resulting in idle waiting periods for the others. Several works have used load balancing for accelerating neural networks with unstructured sparsity [81, 119, 147].

We refer interested readers to Section V of [47] and Section III of [27] for a more comprehensive overview of sparse encoding methods. Additionally, a general overview of hardware architectures that leverage various sparsity patterns is provided in [47].

Summary (Sec. 4.2. Sparsity): Sparsity (or pruning) is another widely used method of reducing the inference cost of overparameterized DNN models by removing redundant or less important weights and activations. Similar to quantization, pruning helps to reduce off-chip memory consumption and the corresponding memory traffic, as well as energy consumption and latency. Pruning can be broadly divided into weight pruning and activation pruning. Weight pruning can be further divided into unstructured pruning, which allows any sparsity pattern, and structured pruning, which imposes an additional constraint on the sparsity pattern. While structured pruning can provide benefits in terms of memory, energy consumption, and latency without additional hardware support, it is known to achieve a lower compression rate than unstructured pruning. Activation pruning prunes redundant activations during inference, which can be especially effective for Transformer models. However, this requires support for dynamically detecting and zeroing out unimportant activations at run-time.

4.3 Transformer-specific Optimization Methods

The use of off-the-shelf optimization methods such as quantization and pruning can lead to significant performance advantages. Nevertheless, there are other optimization strategies that are tailored specifically for the Transformer architecture, e.g., by taking advantage of the features within it. Here, we review the significant Transformer-specific optimization techniques that can further optimize Transformer inference.

4.3.1 Accelerating Attention.

Several works aim to optimize the attention mechanism in the MHA module. Recall that the time spent performing the matrix multiplications in the MHA module grows quadratically with sequence length, as outlined in Sec. 2.2.2. Therefore, for long sequences, computing attention becomes the dominant portion of the overall runtime. One common route to computing the attention network more efficiently is token pruning, which removes unimportant tokens so as to reduce the effective sequence length, as was discussed in Sec. 4.2. The need to efficiently identify and drop unimportant tokens on-the-fly has led to several hardware-software co-design approaches. In SpAtten [223], tokens are ranked based on the amount of attention they receive from other tokens in the input sentence, and the tokens receiving less attention are pruned out. This approach is based on the simple rationale that the more a word is attended to, the more important it is in the inference process. To make this efficient, a top-\(k\) hardware engine is employed to filter out the low-importance tokens based on their attention scores. DTA-Trans [237] goes a step further by introducing a two-tiered scheme: in the first round, it determines which tokens should be pruned, and in the second round, it further determines the bit precision to be allocated to each of the remaining tokens based on their significance.

Another approach to accelerate attention is to leverage the dynamic sparsity patterns of the attention score activations [79, 80, 131, 147, 172, 194, 204, 253]. It is reasonable to assume that many combinations of query and key tokens are not semantically meaningful, and thus the attention score associated with this combination will be close to zero. By taking advantage of this sparsity, the inference accuracy can be preserved while the computational cost is reduced by avoiding the associated act-to-act matmuls (i.e., query × key or attention score × value). However, this requires specialized hardware logic to detect and accelerate those dynamic sparsity patterns on-the-fly. For instance, ELSA [80] proposes a datapath that approximates the angular similarity between key and query vectors, thus allowing the prediction of whether their dot product is likely to be zero. This approach enables the pruning of less important key vectors with respect to a given query vector in advance. The Sanger framework [147] suggests quantizing the query and key values prior to computing the attention score, as this will zero out insignificant entries in the resulting attention score that would have been close to zero if those values were not quantized. Similarly, DOTA [172] proposes to approximate the attention score entries to be zeroed out by employing the matrix multiplication of low-rank (and hence smaller) projections of the query and key values as a proxy. LeOPArd [131] uses bit-serial computing for the query × key multiplications in order to terminate computation early if it will not reach the pre-determined threshold.

It is worth noting that hardware support is essential for accelerating attention mechanisms, as it enables operations such as top-\(k\) [223, 237], angle approximation [80], clustering [204], and multi-precision computation [13, 147, 172, 253] that are necessary to detect the dynamic sparsity patterns of attention scores. Furthermore, specialized hardware support is needed to take advantage of the (mostly unstructured) dynamic sparsity for skipping computations. For example, Sanger [147] uses load rebalancing through splitting and packing, and it is equipped with a custom datapath that provides support for both sampled dense-dense matmuls and sparse matmuls.

4.3.2 Nonlinear Operations.

As discussed in Sec. 2.1.1, the Transformer architecture contains several nonlinear functions that pose challenges for efficient hardware design. Incorporating a hardware module specialized for computing these operations may be a viable solution. However, this may incur a considerable overhead for hardware design, particularly when targeting low-end devices. Therefore, various solutions have been proposed to circumvent this issue without constructing a dedicated hardware module. One popular solution is function approximation [107, 111, 132, 203, 223], which seeks to approximate the exact value of the nonlinear function in order to obtain a good yet computationally efficient approximation. For instance, Keller et al. [107] use the Softermax [203] function, which uses a base-2 approximation that switches the base used in the exponential calculation of the Softmax operation from \(e\) to 2, allowing for simplified hardware implementations. Softermax [203] also incorporates online normalization [157], thus reducing the number of passes required for the numerically stable Softmax computation from 3 to 2. I-BERT [111] provides a more general approximation algorithm that approximates the nonlinear operations with 2nd-order polynomials. This not only simplifies the operations, but also enables them to be performed using only integer arithmetic. SpAtten [223] takes a similar approach, using a 5th-order Taylor approximation for computing Softmax, as described in [160]. I-ViT [132] further extends this idea to use hardware-friendly bit shifting operations to efficiently compute the nonlinear operations for vision Transformer inference. While the major focus has been approximating the exponential operation in the Softmax, other works [19, 209, 225] have also exploited the log-sum-exp trick to avoid the division operation, another operation that can be complicated to implement in hardware [60].
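
To illustrate the idea (not the exact Softermax hardware algorithm), the sketch below computes a base-2 Softmax while tracking the maximum and the normalizer in a single online pass, so that only one more pass is needed to produce the outputs:

```python
import numpy as np

def softmax_base2_online(x: np.ndarray):
    """Base-2 softmax with a running (online) max and sum: one pass for the
    statistics, one pass for the outputs. 2**z is cheap in hardware (a shift
    plus a small table/polynomial for the fractional part)."""
    running_max = -np.inf
    running_sum = 0.0
    for v in x:                       # pass 1: online max and rescaled sum
        new_max = max(running_max, v)
        running_sum = running_sum * 2.0 ** (running_max - new_max) + 2.0 ** (v - new_max)
        running_max = new_max
    return np.array([2.0 ** (v - running_max) / running_sum for v in x])  # pass 2

scores = np.array([1.0, 2.5, -0.5, 3.0])
out = softmax_base2_online(scores)
print(out, out.sum())   # outputs still sum to 1, but use base-2 exponentials
```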

Another widely adopted approach is lookup tables, which store pre-calculated output values for a given range of inputs. If the input falls in that range, the corresponding value stored in the lookup table is output, eliminating the need to evaluate the function. The use of lookup tables to accelerate nonlinear functions is by no means a new concept, with roots predating the advent of Transformer or DNN architectures [50, 212]. Recent approaches have therefore focused more on reducing the size of the lookup table to save area and latency. For instance, \(A^3\) [79] decomposes the exponential operation into a multiplication of two smaller-precision exponential operations, allowing one larger lookup table to be replaced with two smaller ones. NN-LUT [244] approximates the nonlinear operation using a single-hidden-layer network and stores a numerical approximation of the network in a lookup table, thereby avoiding the need to execute the network.
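
A minimal sketch of the basic lookup-table approach for GELU, assuming a 256-entry table over a clamped input range (the table size and range here are arbitrary illustrations of the area/accuracy trade-off that the works above try to improve):

```python
import numpy as np

def build_gelu_lut(num_entries=256, x_min=-8.0, x_max=8.0):
    """Precompute GELU (tanh approximation) over [x_min, x_max]."""
    xs = np.linspace(x_min, x_max, num_entries)
    table = 0.5 * xs * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (xs + 0.044715 * xs**3)))
    return xs, table

def gelu_from_lut(x, xs, table):
    # Clamp to the table range and pick the nearest precomputed entry.
    step = xs[1] - xs[0]
    idx = np.clip(np.round((x - xs[0]) / step).astype(int), 0, len(xs) - 1)
    return table[idx]

xs, table = build_gelu_lut()
print(gelu_from_lut(np.array([-2.0, -0.5, 0.0, 1.5]), xs, table))
```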

4.3.3 Accelerating Decoding.

As discussed in Sec. 2.2, Transformer decoding for generative inference can entail significant inference latency due to its low hardware utilization and arithmetic intensity. With the growing interest in generative tasks driven by recent advances in Large Language Models [2, 22, 44, 213], it is critical to optimize the latency of the decoding process. One avenue for reducing inference latency is to skip unnecessary computation through early exiting. This method dynamically adjusts the depth of the decoder for each token generation by terminating the inference at a mid-layer and making a prediction using the intermediate hidden states, rather than waiting until the final layer. While early exiting is a well-explored technique for encoder models [192, 234], CALM [191] only recently extended it to decoder models. A major challenge in decoding tasks is that, unlike in encoding tasks, the generation of one token relies on the activations of all previous tokens, due to the attention mechanism. If a previous token exited early, then there is nothing to attend to in the skipped layers. To address this issue, CALM proposes “state propagation,” which copies the activations of the final layer before exiting to all the skipped layers; this has a minimal impact on generation quality. Another recent approach is to collaboratively use multiple models of different sizes [31, 112]. The underlying motivation is that the majority of simple word generations can be offloaded to a faster, less accurate model of a smaller size. Occasionally, when the small model is unable to predict a word accurately, control switches to the larger model for a more accurate prediction. This approach not only allows the large model to be executed less frequently, it also enables its non-autoregressive (i.e., token-level parallel) execution, since the large model can consume all tokens generated by the small model and process them in parallel, thereby utilizing hardware more efficiently. Big Little Decoder [112] has shown ∼2× inference latency reduction across various models and generative tasks without compromising generation quality.

4.3.4 Selecting Which Optimization Methods to Use.

So far, we have discussed various optimization techniques that can be applied to the Transformer architecture. It is important to note that a significant portion of these optimization methods depends upon the underlying hardware support. Thus, when selecting which optimization techniques to employ, it is essential to adopt a holistic view of both the hardware and software stack, taking into account the characteristics of the underlying hardware. In particular, whether the accelerator supports MHA and FFN modules in the same datapath versus containing separate datapaths for each of these modules can have a significant impact on the optimizations that can be performed.

Accelerators with a unified datapath tend to pursue more general optimizations that can either be applied to both MHA and FFN modules, or at least those that do not require altering the datapath such that it can no longer compute the other modules. For example, several accelerators that support both MHA and FFN modules employ general static pruning methods for weight matrices [65, 166, 209], but do not aim to exploit attention-specific pruning methods such as dynamic sparsity. However, more exotic optimizations can be pursued separately for the MHA and FFN modules, if they are computed in separate datapaths or if the PEs can be reconfigured. For instance, FABNet [64] exploits static sparsity patterns that can only be applied to the FFN module by adopting separate datapaths for the MHA and FFN modules. FTRANS [127] also applies different optimization methods for the MHA and FFN modules by incorporating reconfigurable PEs that can handle both workloads without having two separate datapaths. However, employing separate datapaths or reconfigurable PEs can incur an additional overhead, as compared to using a general, unified datapath. Consequently, there is a tradeoff to consider between the area overhead and the performance gain derived from the use of more aggressive optimizations.

Summary (Sec. 4.3. Transformer-specific Optimizations): While general off-the-shelf optimization methods can also benefit efficient Transformer inference, a great deal of research has been conducted to devise optimization strategies that take advantage of Transformer-specific characteristics. One opportunity is to optimize the attention mechanism in the MHA module, whose runtime cost increases quadratically with sequence length. For instance, dynamic pruning has been widely applied to take advantage of the sparse nature of attention score activations. Additionally, the efficient computation of the nonlinear operations should also be taken into account. To reduce the hardware costs associated with dedicated hardware units for nonlinear operations, function approximation and lookup-table methods have been proposed as viable alternatives. Finally, model optimization methods should also be aware of the underlying hardware architectures and datapaths. The use of separate datapaths for the MHA and FFN modules can have higher area overhead, but it can enable more aggressive optimization as compared to using a single datapath for both modules.

5 MAPPING TRANSFORMERS TO HARDWARE

In order to execute a Transformer block on a target hardware architecture, we must map it into hardware instructions that carry out both the required computation and communication. The choices made during the mapping process play a significant role in performance. However, the size of the space of possible mappings makes finding the optimal mappings difficult, and it requires the use of carefully considered exploration, heuristic, or learning-based approaches.

In this section, we provide an introduction to the mapping problem in Sec. 5.1; and we discuss key mapping decisions for efficient execution of Transformers in Sec. 5.2. We overview the taxonomy of existing mapping techniques in Sec. 5.3 and similarly for techniques to model the performance of different mappings in Sec. 5.4. Finally, we end with Transformer-specific considerations for mappers in Sec. 5.5.

5.1 What are Mappings?

A mapping or schedule is defined as a sequence of hardware instructions to execute a set of operations on the specific target hardware architecture. In the case of a systolic array accelerator such as Gemini, such hardware instructions might include dense matmuls under a specific dataflow and load/stores to move data between off-chip DRAM and local SRAMs. A mapping will list the complete sequence of data and memory instructions, with the end goal of producing source code or compiled binaries that can be executed on hardware.

For some operations, there may exist multiple valid mappings, where the validity refers to the guarantee of correctness from executing each mapping. Specifically, different sets of mapping decisions applied to the same problem may result in valid yet different mappings. We refer to the total space of mapping decisions and their resulting mappings as the mapspace. Details about individual mapping decisions are discussed in the following Sec. 5.2.

It is not surprising that two different valid mappings may exhibit differences in end-to-end performance, when measured with respect to latency, bandwidth, and energy consumption. Hence, it is often the goal of a mapping or scheduling framework to obtain Pareto-optimal, valid mappings for a given software operation and desired performance metrics. For some operations, finding a good mapping is unnecessary, either because the problem is trivial, as the mapspace is small, or because the operation itself is not a performance bottleneck and does not warrant the effort of judicious scheduling. However, in the case of core computational operators in DNNs, including Transformers, the mapping problem is both challenging due to large mapspace and rewarding due to the potential gains in overall model execution speedup.

Figure 16: Visualization of mapping for the convolution operation in CNNs onto a typical DNN accelerator. Convolution is represented as a six-nested loop excluding the batch dimension. Loop permutation concerns the order in which each loop level should be executed, with memory accesses to and from either the accelerator’s local memory or off-chip DRAM. Spatio-temporal mapping determines which loop level should be executed in parallel using accelerator hardware resources. Tiling factors are the loop bounds of each loop level, where each dimension can be broken down with tiles into several sub-loops. As shown in the example, the input channel dimension (\(C\)) is tiled with a tiling factor of 8, hence into two sub-loops with loop variables \(c_0\) and \(c_1\).

Figure 17: Visualization of mapping for the matmul operation in Transformer encoders/decoders onto a typical DNN accelerator. Matrix multiplication is represented as a three-nested loop. Loop permutation concerns the order in which each loop level should be executed, with memory accesses to and from either the accelerator’s local memory or off-chip DRAM. Spatio-temporal mapping determines which loop level should be executed in parallel using accelerator hardware resources. Tiling factors are the loop bounds of each loop level, where each dimension can be broken down with tiles into several sub-loops. As shown in the example, the output column dimension (\(N\)) is tiled with a tiling factor of 4, hence into two sub-loops with loop variables \(n_0\) and \(n_1\). As we will discuss in Sec. 5.5.1, even though matmuls have 3 nested loops instead of 6 as in the convolutions, finding an optimal mapping could still be as challenging.

Fig. 16 and 17 illustrate examples of key operators and their possible mappings for CNNs and Transformers, respectively. As shown in the example mappings, each level of the nested loops must be: (1) assigned to be executed either with data from DRAM or from local accelerator memory; (2) assigned to be executed spatially (i.e., in parallel) or temporally (i.e., sequentially), if the accelerator contains spatially parallelized compute resources; and (3) executed as one loop or tiled into multiple sub-loops (and if so, with particular tiling factors). In particular, in the case of Gemini, spatial mapping concerns which loop levels are assigned to be executed on the N-by-N systolic array of PEs.
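
The following sketch spells out one such point in the mapspace for the matmul of Fig. 17. The tile sizes, loop order, and the decision to treat the innermost block as the “spatial” portion are all illustrative choices; a real mapping would be expressed as accelerator instructions rather than Python loops.

```python
import numpy as np

def tiled_matmul(A, B, TM=16, TN=4, TK=16):
    """One point in the mapspace for C = A @ B:
       * tiling factors (TM, TN, TK) pick the sub-loop bounds,
       * the loop permutation below (n1 -> m1 -> k1) is the chosen dataflow,
       * the innermost TM x TN block stands in for the work that would be
         unrolled spatially across the PE array."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for n1 in range(0, N, TN):                       # outer (temporal) tile loops
        for m1 in range(0, M, TM):
            for k1 in range(0, K, TK):
                a = A[m1:m1 + TM, k1:k1 + TK]        # tiles fetched into local memory
                b = B[k1:k1 + TK, n1:n1 + TN]
                C[m1:m1 + TM, n1:n1 + TN] += a @ b   # "spatial" inner computation
    return C

A = np.random.randn(64, 96)
B = np.random.randn(96, 32)
assert np.allclose(tiled_matmul(A, B), A @ B)
```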

5.2 What Are the Key Mapping Decisions?

Mapping occurs in two steps. First, the computational graph is transformed at the graph level into a set of tensor operations. This may involve fusing successive operations, sparsifying tensors, and deciding on appropriate quantization strategies. Then, each resulting tensor operation is scheduled in order to transform it into hardware instructions of the appropriate size.

5.2.1 Graph-level.

Graph-level scheduling involves decisions that change the structure of the computational graph, rather than simply the execution schedule of tensor operations represented by individual nodes within the graph. Typical changes include the following:

  • Layer fusion or operation fusion refers to combining multiple layers (e.g., a matmul followed by a normalization layer) into a single tensor operation to be scheduled and run on the accelerator. This reduces interlayer communication, as the results of one layer can remain on-chip as the input to the next without being written to and later read from main memory, at the cost of increased intralayer communication (see the sketch after this list). As we will see in Sec. 5.5, layer fusion may not provide as much latency improvement for Transformers as it does for CNN architectures, since static fusion opportunities are less straightforward than fusing convolutions with BatchNorm layers. For Transformers, it is possible to combine several operations in the same kernel, but this may increase intralayer communication to an extent that renders such approaches infeasible. Furthermore, this can also depend on the target hardware platform.
  • Dynamic sparsification of tensors happens when the pruning decisions are made based on the activation maps. Common methods for dynamic sparsification include locality-sensitive hashing to zero out dot products that are likely to be small. This can significantly reduce the number of arithmetic operations required in the operation, as was also discussed in Sec. 4.2. Such optimizations are heavily data-dependent, as they require access to the activations and, as a result, cannot always be estimated a priori [103]. Consequently, relatively few results on sparsity-aware mapping exist, and those that do largely cover operation-level mappings for a given amount of sparsity.
  • Static sparsification of tensors happens when the pruning decisions are independent of the activations and are determined statically. As discussed in Sec. 4.2, there are various methods for static sparsification. In general, structured sparsity results in high speedup but often incurs non-trivial accuracy degradation, whereas unstructured sparsity retains accuracy even at extreme sparsity levels but is hard to accelerate. Nevertheless, the latter is likely to become increasingly important, since it reduces memory traffic, which is becoming a major bottleneck for power consumption.
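As a minimal illustration of the fusion trade-off in the first bullet above, the sketch below contrasts an unfused execution, where the matmul output is written to (simulated) off-chip memory and read back by the following element-wise operation, with a fused execution that applies the second operation tile by tile while the matmul output is still on chip. The DRAM-traffic counter is a toy proxy for interlayer communication, not a model of any particular accelerator.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, sufficient for this illustration
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def unfused(A, W):
    """Matmul, write the result to 'DRAM', read it back, then apply GELU."""
    dram_bytes = 0
    X = A @ W
    dram_bytes += X.nbytes          # write intermediate activation off chip
    dram_bytes += X.nbytes          # read it back for the next layer
    return gelu(X), dram_bytes

def fused(A, W, tile_rows=16):
    """Apply GELU per row tile while the matmul output is still on chip."""
    dram_bytes = 0
    out = np.empty((A.shape[0], W.shape[1]))
    for r in range(0, A.shape[0], tile_rows):
        tile = A[r:r + tile_rows] @ W   # stays in local memory in this toy model
        out[r:r + tile_rows] = gelu(tile)
    return out, dram_bytes              # no intermediate DRAM traffic counted

A = np.random.rand(128, 768)
W = np.random.rand(768, 768)
y1, traffic1 = unfused(A, W)
y2, traffic2 = fused(A, W)
assert np.allclose(y1, y2)
print(f"intermediate DRAM traffic: unfused={traffic1} B, fused={traffic2} B")
```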

5.2.2 Operation-level.

The operation-level scheduling step decomposes tensor operations into a set of tasks to be run on a given architecture. This consists of several different steps, each of which presents a programmer with a decision problem. These include:

  • Dividing the operation into tiles that can fit onto different layers of the memory hierarchy; the dimensions of the tiles are a choice (e.g., tile sizes in Fig. 17).
  • Determining the dataflow of the computation, i.e., the order that the tiles are executed in and the tensors that are held stationary or moved across the processor. This can be encoded as a loop ordering problem, with the innermost loops corresponding to axes of tensors being held stationary (e.g., any loop permutation in Fig. 16).
  • Deciding which axes to parallelize, and which to run serially, which we refer to as spatio-temporal mapping.
  • Deciding how to interleave communication and computation in order to minimize latency. For instance, double-buffering may divide the scratchpad into two halves, with one half being used by the processor for computation while the other is loaded with data from memory.
  • Mapping arithmetic instructions onto hardware instructions. For some architectures, this may be as simple as replacing a matmul operation of the appropriate size (achieved by a tiling) with a call to the appropriate ISA (Instruction Set Architecture) instruction. For others, it may involve selecting between different vector instructions, which may affect the decision of which axes to vectorize, and the resulting spatio-temporal mapping.

A more complete description can be found in [128].

The choice of points in the mapspace heavily affects performance, by up to several orders of magnitude, as we will discuss in Sec. 5.5.1. For this reason, the goal of a hardware mapper is to select a point in this space to minimize some cost such as energy, energy-delay product (EDP), latency, etc., on a given hardware target. However, the size of the mapspace renders exploration difficult. For example, considering only tiling, spatio-temporal mapping, and loop ordering (dataflow), the number of possible mappings for a BERT attention layer can exceed \(10^{12}\). As a result, the design and selection of mappers have been the subject of significant attention in both theory and practice.
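To get a feel for why the mapspace grows combinatorially, the sketch below counts mappings of a single matmul under a deliberately simplified model: every way of splitting each loop bound into two tiling levels, times every permutation of the three loops, times a binary spatial-or-temporal choice per loop. The dimensions and the resulting count are illustrative only and are not meant to reproduce the \(10^{12}\) figure quoted above, which comes from a richer mapping model.

```python
from itertools import permutations
from math import prod

def factor_pairs(n):
    """All ways to split a loop bound n into two integer tiling levels."""
    return [(a, n // a) for a in range(1, n + 1) if n % a == 0]

# One matmul from a BERT-like layer (illustrative sizes only).
loop_bounds = {"m": 512, "k": 768, "n": 768}

tilings = prod(len(factor_pairs(b)) for b in loop_bounds.values())
orders = len(list(permutations(loop_bounds)))   # 3! loop permutations
spatial_choices = 2 ** len(loop_bounds)         # each loop spatial or temporal

print(f"two-level tilings:      {tilings}")
print(f"loop orders:            {orders}")
print(f"spatial/temporal picks: {spatial_choices}")
print(f"total (this toy model): {tilings * orders * spatial_choices}")
```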

Furthermore, the optimal mapping can significantly differ depending on hardware architecture, and mappings that work well for one set of hardware parameters often perform poorly on others [103]. This significantly increases the difficulty of mapping within a codesign context, as one must be computed for every pair of neural network and hardware architecture.

Summary (Sec. 5.2. Key Mapping Decisions): Mapping Transformers to hardware requires decisions to be made at both the graph and operation levels. These decisions range from choosing simple numerical or categorical parameters to structural modifications of the program being run. The space of required decisions is enormous, growing combinatorially with each possible decision, but selecting a good point in the space can significantly affect performance.

5.3 Finding Performant Mappings

In order to deal with the size of the search space, many accelerator-aware mapping techniques [84, 92, 102, 129, 163, 239] and fully-fledged compiler frameworks [3, 32, 33, 114, 161, 167, 185] have been developed. These are briefly discussed below.

5.3.1 Mapping Strategies.

To deal with the size of the search space, mapping algorithms focus on a subspace of the mapspace, only making decisions about how to perform a subset of steps required to map the network onto the architecture.

Graph-level Schedulers. Most existing ML compilation frameworks (e.g., XLA [185], TensorRT [161], TVM [33], Glow [184] and CuDNN [41]) target graph-level optimizations such as operation fusion, resource allocation, graph partitioning, graph rewriting, etc. A large number of operation fusion techniques [10, 247, 252] have been developed to optimize the mapping for more data reuse across DNN layers. Among these, a few Transformer-specific operation fusion techniques have been proposed [42, 168]. In particular, [42] decomposes the Softmax layer and dynamically fuses the GPU kernels for the decomposed layers with the preceding and succeeding matmul layers in the MHA block. Relatedly, [168] shows that fusing LayerNorm layers and composing big matmuls from small matmuls are beneficial to Transformer performance on GPUs. In [104], an optimized dataflow for DNN accelerators is introduced to efficiently fuse the key matmul and Softmax layers in MHA. To learn about the operation fusion tradeoffs in the Gemini accelerator, we have performed a case study and included the analysis in Sec. 5.5.

Operation-level Mappers. In Sec. 5.2.2, we discussed that the decisions around tiling, dataflow, and spatio-temporal mapping result in an enormous search space, and that selecting a good point in this space is key to achieving high efficiency and utilization in ML accelerators. Within the scope of a given subspace, mappers can generally be divided into three categories, based on how they make their decisions: brute-force search; feedback-based search; and constrained optimization. Tab. 6 summarizes existing mappers that leverage these different techniques to navigate the mapspace.

Brute-force methods [29, 48, 163, 239] entail various sampling strategies that either exhaustively explore or randomly sample a large number of points from the mapspace. To lower the search cost, mappers in this category typically rely on developer heuristics to prune the mapspace and on lightweight performance models to compare all valid mappings, in order to find the best mapping in a reasonable amount of time. The disadvantages of this approach are two-fold: not only does a brute-force search tend to be exceedingly expensive, especially for more complex target workloads and hardware architectures, but the costly search must also be repeated whenever the target workload or accelerator architecture changes, without leveraging any prior knowledge.

Feedback-driven approaches use ML algorithms or other statistical methods [7, 33, 99, 179] either to improve the accuracy of the cost model or to directly search for the solution using black-box tuning. Although such approaches can potentially learn the scheduling space accurately, their computational cost is significant, due both to the cost of evaluating enough schedules to learn a model and to the cost of the learning-based algorithms themselves. As a result, these approaches typically apply to existing hardware or analytical models for which large-scale measurement is feasible.

Constrained-optimization approaches contrast with exhaustive search and learning-based algorithms, in that they formulate scheduling problems as a numerical optimization problem to determine variable assignments subject to a given set of constraints and objective functions. Popular techniques, such as Mixed Integer Programming (MIP), have demonstrated their applicability to solve large-scale and complex problems. In particular, polyhedral transformation has leveraged constrained-optimization based approaches for auto-vectorization and loop tiling [5, 6, 15, 21, 75, 116, 165]. These polyhedral optimizations focus on testing the feasibility of a transform and offering information to guide iterative searches. On the other hand, [92, 129] leverage the regularities in the ML workloads and hardware to formulate the mapping as optimization problems, which can then be directly solved by off-the-shelf solvers.
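As a toy version of the constrained-optimization view, the sketch below chooses tile sizes for a matmul so that the tile's working set fits a fixed scratchpad while maximizing its arithmetic intensity (a stand-in for minimizing DRAM traffic). Real formulations such as [92, 129] encode such problems as mixed-integer programs handed to off-the-shelf solvers; here, simple enumeration over divisors stands in for the solver.

```python
from itertools import product

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def tile_cost(tm, tk, tn, bytes_per_elem=1):
    """Working-set size and arithmetic intensity of one (tm, tk, tn) tile."""
    working_set = (tm * tk + tk * tn + tm * tn) * bytes_per_elem
    flops = 2 * tm * tk * tn
    return working_set, flops / working_set

M, K, N = 512, 768, 768
SCRATCHPAD_BYTES = 256 * 1024       # constraint: tile must fit in 256 kB

best = None
for tm, tk, tn in product(divisors(M), divisors(K), divisors(N)):
    size, intensity = tile_cost(tm, tk, tn)
    if size <= SCRATCHPAD_BYTES:                      # feasibility constraint
        if best is None or intensity > best[0]:       # objective
            best = (intensity, (tm, tk, tn), size)

intensity, tiles, size = best
print(f"chosen tile (tm, tk, tn) = {tiles}, working set = {size} B, "
      f"arithmetic intensity = {intensity:.1f} FLOPs/byte")
```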

Summary (Sec. 5.3. Finding Performant Mappings): A comprehensive set of strategies has been developed to address the challenge of mapping DNNs onto accelerators and general-purpose processors. The techniques originally developed to target CNNs can be applied to Transformers, as the key operations are also tensor algebra operations. At the graph level, operator fusion is an important optimization technique that defines a vast mapping space for deciding how the execution of layers is overlapped. At the operation level, mapping strategies can be broadly categorized into search strategies (either random or feedback-driven) and optimization or heuristic strategies.

5.4 Performance Modeling of Mappings

Performance models can provide mappers with performance feedback for different mappings without executing them on real hardware or running simulations on accelerators under development. They can significantly reduce the evaluation costs for mappers and be used as performance proxies to optimize mappings. Different performance models offer different levels of fidelity, runtime costs, target workload scopes, and compatibility with various mapping algorithms. The selection of the performance model is both mapper and target workload dependent.

For Transformers, the mappers can use domain-specific polynomial [92] and analytical models [122, 146, 153, 163] to provide fast comparisons among mappings. These models leverage known iteration space bounds in tensor algebra workloads, as well as statically analyzable data access patterns, to estimate performance. The polynomial models expressed in mathematical forms can also be used directly as the objectives in optimization-based mappers.
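Analytical models of this kind exploit the fact that, for tensor algebra operations, FLOPs and memory traffic follow directly from the loop bounds. A minimal roofline-style estimator for a single matmul might look like the sketch below; the peak-compute and bandwidth numbers are placeholders rather than measurements of any particular accelerator.

```python
def matmul_roofline(M, K, N, bytes_per_elem=1,
                    peak_flops=2e12, dram_bw=1e11):
    """Latency estimate for C[M,N] = A[M,K] @ B[K,N] from loop bounds alone.

    peak_flops: peak compute throughput in FLOPs/s (placeholder value)
    dram_bw:    off-chip bandwidth in bytes/s     (placeholder value)
    """
    flops = 2 * M * K * N                               # multiply + accumulate
    mops = (M * K + K * N + M * N) * bytes_per_elem     # read A, B; write C once
    intensity = flops / mops                            # FLOPs per byte
    latency = max(flops / peak_flops, mops / dram_bw)   # compute- vs memory-bound
    bound = "compute" if flops / peak_flops >= mops / dram_bw else "memory"
    return flops, mops, intensity, latency, bound

# Example: a BERT-Base FFN W1 projection with sequence length 512.
flops, mops, ai, lat, bound = matmul_roofline(512, 768, 3072)
print(f"FLOPs={flops:.3e}  MOPs={mops:.3e}  intensity={ai:.1f}  "
      f"est. latency={lat*1e6:.1f} us  ({bound}-bound)")
```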

Another class of popular performance models involves data-driven ML models [33, 84, 106]. Instead of building the performance model analytically to express known relations between mapping decisions and performance, these models use statistical techniques to iteratively fit a model to the mapping performance data collected over time. They typically require large amounts of data in order to learn and provide accurate predictions. Once trained, they can be easily integrated with ML-based mappers [84, 199].

Table 6: State-of-the-art DNN schedulers for heterogeneous accelerators.

The major drawback of prior models is that the generated mappings might not perform optimally (or even well) on the actual accelerator since the models can fail to capture the implementation differences in the hardware accurately. Cycle-exact software models based on real hardware implementation can achieve higher fidelity [187, 218], as can FPGA emulation using platforms such as Firesim [105], which can be used to model hardware in development. However, such platforms require more than a set of problem dimensions and a description of mappings; they require a stream of explicit instructions.

Generating this stream of instructions requires accounting for a large number of edge cases. For example, a simple tiling operation for a matmul, representable as a single line of tile sizes in Timeloop [163], requires both the insertion of instructions specifying memory movement between different levels of the memory hierarchy and the generation of code for the edge cases that appear when matrix dimensions are not evenly divisible by the tile size. Furthermore, the codesign process requires this process to be automatable: each mapping must be translatable to code automatically.

As a result, code generation [33, 126, 180] tools are used to actually implement mappings onto hardware (or simulators). Many of these tools integrate not only a specification of the hardware backend but also mapping decision algorithms, often tuned for that hardware target. Such tools can also be useful for neural architecture search (NAS) in order to obtain accurate performance numbers for a given hardware architecture to guide automated DNN architecture design (see Sec. 6.1 for details). However, these tools are difficult to adapt for a codesign framework where both the mapspace and the hardware target can vary.

In order to address this problem, user-schedulable languages such as Halide [179], TVM [33], Rise/Elevate [78], and Exo [95] have been developed. These tools take as input a description of the computation to be performed and a point in the mapspace. They are generally defined by a set of rewrite rules, representing certain transformations such as splitting and rearranging loops, replacing appropriate loops with ISA instructions, and fusing loops. These languages also allow the user to specify and customize the hardware instruction set and seamlessly convert found schedules into executable code by representing these points as a series of rewrite rules [158, 249].

Summary (Sec. 5.4. Performance Modeling of Mappings): Performance estimation for Transformers running on a novel architecture is essential for finding an optimal algorithm, mapping, and hardware combination. There are various open-source performance models available to estimate mapping performance on hardware, ranging from domain-specific analytical models and data-driven ML models to cycle-exact models. The selection of the performance model for a Transformer depends on the target workload size, hardware complexity, and development stage. Additionally, there are many mature code generation tools one can leverage to optimize the Transformer design for off-the-shelf hardware.

5.5 Transformer vs CNN Mapping

Prior work discussed in Sec. 5.3 for finding good mapping strategies largely focuses on mapping CNNs onto accelerators or general-purpose hardware. As with CNNs, the vast majority of cycles for Transformers are spent on matmuls from the MHA and FFN modules. In essence, existing mappers for CNNs can easily extend to scheduling Transformer matmuls. However, as we have discussed in Sec. 3.4.2, Transformer blocks include LayerNorm and Softmax operations, which can be computationally non-trivial in certain realistic scenarios (which was also observed by [42, 168]). In turn, the presence of these operations imposes constraints on scheduling the preceding and succeeding matmuls. This leads to a much more complex problem for scheduling optimizations for Transformers overall. In this subsection:

  • We characterize the mapspace of Transformer blocks in comparison to that of CNNs (Sec. 5.5.1).
  • We take a deeper dive into the issue of increased scheduling complexity of Transformer matrix operations due to the presence of LayerNorm and Softmax (Sec. 5.5.2).

5.5.1 Mapspace Characterization of Transformers.

We empirically characterize the search space of legal mappings for representative Transformer and CNN architectures. To do so, we chose BERT [52] and ResNet50 [82], respectively. A total of 100K random valid mappings were searched via the Timeloop mapper [163], and the estimated latency and energy were measured by the Timeloop model. The target spatial hardware architecture is the Gemini systolic generator [70]. Both the BERT and ResNet50 models are assumed to have been 8-bit integer quantized. We assume an input sequence length of 512, which is a typical assumption for BERT-Base and BERT-Large models.

EDP Distribution Analysis. Fig. 18 demonstrates the comparison between the mapspaces for BERT (Left) and ResNet50 (Right). For the BERT MHA module, results in the figure correspond to mappings for the matmul performing each of the query, key, and value projections. For the FFN, we take the matmul in the 𝑊1 projection, which takes part in expanding the hidden dimension to four times its value.

From ResNet50, we choose convolution operations of varying kernel sizes and shapes, spanning \(1 \times 1\), \(3 \times 3\), and \(7 \times 7\) convolution kernels. Specifically, we use the \(7 \times 7\) kernel convolution with output channel size 64 and stride 2 in the conv1 layer of ResNet50, the \(3 \times 3\) kernel convolution with 512 channels, and the \(1 \times 1\) kernel convolutions with output channel sizes 512 and 2048 that belong to the final convolution layer conv5_x. These four particular convolutions vary in channel and kernel sizes and reasonably represent the convolutions found in ResNet50.

Figure 18: A comparison of the mapspace of (Left) BERT and (Right) ResNet50. Distributions of 100K randomly sampled valid mappings are shown. Both distributions span a similar range of EDP, up to four orders of magnitude above the best (minimum) observed value. Neither distribution is significantly more skewed towards lower or higher relative EDP. Overall, we find that the mapspaces for BERT matmuls and ResNet50 convolutions are similarly vast in size, with no significant difference in the shapes of their distributions. This indicates that brute-force or random search for BERT matmul scheduling is equally as challenging as for ResNet50 operators.

From our mapspace analysis, we observe that both the BERT and ResNet50 mapspaces have a similar range of potential energy-delay product values across randomly sampled mappings. Accounting for variations between operators for each architecture, the distributions of EDP values themselves are also largely similar, in that the portion of mappings with near-optimal (low) EDP values is small for both BERT and ResNet50 operators.

As an alternative visualization, Fig. 19 compares the empirical cumulative distribution functions (CDFs) of the same set of 100K random mappings. Here, we closely examine the difference in the CDFs near the regime where near-optimal mappings are found. We observe that the relative EDP value corresponding to the tenth percentile is 7.44 for the \(3 \times 3\), 512 convolution and 12.06 for the \(1 \times 1\), 2048 convolution kernels. BERT matmuls for the MHA and FFN projections have tenth-percentile values of 9.42 and 9.84, respectively. Alternatively, we also examine the percentage of mappings with relative EDP values less than 3 times the observed minimum. This percentage represents a rough upper bound on the number of mappings that can be safely labeled as near-optimal, and the difference in percentages signifies the relative difficulty of searching for optimal mappings for different operators. We find that 1.58% of mappings for the \(3 \times 3\), 512 kernel and 2.62% for the \(1 \times 1\), 2048 kernel fall within this range of EDP. In the case of BERT, the MHA projection matmul has 1.70% of mappings with relative EDP less than 3, with 1.48% of mappings in this range for the FFN matmul.

Figure 19: Comparison of empirical cumulative distribution functions (CDFs) for Transformer matmuls and ResNet50 convolutional operations around the regime where near-optimal mappings are found. The 10th percentile values for each relative EDP distribution are: 7.44 for the \(3 \times 3\), 512 convolution kernel; 12.06 for the \(1 \times 1\), 2048 kernel; 9.42 for the BERT MHA matmul; and 9.84 for the BERT FFN matmul. The results show that the percentage of near-optimal mappings for BERT matmuls is similar to, if not smaller than, that of ResNet convolution kernels. This indicates that the search problem of finding optimal mappings can be just as challenging for BERT matmuls.
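The statistics used in this subsection, namely the 10th-percentile relative EDP and the fraction of mappings within 3× of the best observed EDP, can be computed from a set of sampled mapping costs as in the sketch below. The randomly generated EDP values stand in for Timeloop outputs and carry no significance of their own.

```python
import numpy as np

def mapspace_stats(edp_samples, near_optimal_factor=3.0):
    """Summarize a set of sampled mapping costs, relative to the best one."""
    edp = np.asarray(edp_samples, dtype=float)
    relative = edp / edp.min()                       # relative EDP, best = 1.0
    p10 = np.percentile(relative, 10)                # 10th-percentile relative EDP
    near_opt = np.mean(relative < near_optimal_factor) * 100.0
    return p10, near_opt

# Placeholder for 100K Timeloop-evaluated mappings of one operator.
rng = np.random.default_rng(0)
fake_edp = rng.lognormal(mean=2.0, sigma=1.0, size=100_000)

p10, pct = mapspace_stats(fake_edp)
print(f"10th-percentile relative EDP: {p10:.2f}")
print(f"mappings within 3x of best:   {pct:.2f}%")
```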

5.5.2 Scheduling Complexities of LayerNorm and Softmax.

While we find that matmuls in Transformers are already non-trivial targets for which to obtain efficient execution schedules to execute on DNN accelerators, the problem is further complicated by the presence of several non-linear operations, including LayerNorm and Softmax that are interposed between different matrix operations. When pursuing more aggressive optimizations, an enticing strategy is to fuse relatively high-arithmetic-intensity matmuls with the low-arithmetic-intensity normalization operations following them, such as LayerNorm and Softmax. This can be especially enticing in handling quantized workloads, where partial sums awaiting normalization are often of much higher bitwidth than the final normalized outputs. Architects familiar with CNN-type accelerators may find this especially intuitive, since convolutions are often fused with ReLU or max-pool operations.

Similarly, for Transformer Encoders, we could overlap the execution of normalization operation and the previous matmul, yet this is possible only with additional hardware support and appropriate constraints on the matmul execution schedule. To enable complete latency-hiding of nonlinear operations, the tiling factor size of either output dimension of the matmul must be maximized, so that rows/columns are immediately ready and stored at the Gemini accumulator scratchpad for computing the mean and standard deviation. We refer to this alternate scheduling approach as fusion-optimized scheduling.
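A functional sketch of this scheduling constraint is shown below: to overlap LayerNorm with the preceding matmul, the schedule must produce complete output rows, so the output-column dimension is left untiled (its tiling factor is maximized) and only the row dimension is tiled. The code models the dataflow only, i.e., what becomes ready when, not the timing or the Gemini accumulator itself.

```python
import numpy as np

def layernorm(row, eps=1e-5):
    mu, var = row.mean(), row.var()
    return (row - mu) / np.sqrt(var + eps)

def fusion_optimized(A, W, row_tile=4):
    """Compute full output rows per step so LayerNorm can start immediately.

    The column dimension is *not* tiled (tiling factor = full width), which is
    the scheduling constraint discussed above: each row of A @ W is complete
    as soon as its tile finishes, so the following LayerNorm can be overlapped
    with the next row tile instead of waiting for the whole matmul.
    """
    out = np.empty((A.shape[0], W.shape[1]))
    for r in range(0, A.shape[0], row_tile):
        rows = A[r:r + row_tile] @ W          # complete rows in the accumulator
        out[r:r + row_tile] = np.array([layernorm(x) for x in rows])
    return out

A = np.random.rand(32, 64)
W = np.random.rand(64, 64)
reference = np.array([layernorm(x) for x in (A @ W)])
assert np.allclose(fusion_optimized(A, W), reference)
```

The non-fused alternative would be free to tile both output dimensions, at the cost of writing unnormalized partial results to outer memory before LayerNorm can run.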

On the other hand, in memory-constrained edge devices, the strategy is (somewhat unintuitively) counter-productive. The normalization operations found in Transformers often require long vectors of data to be resident in on-chip local memory before any normalized output element can be produced. Furthermore, when fusing matmuls with these normalization operations, awkward matmul tile shapes are typically required. These awkward tile shapes are often much larger in either dimension, as opposed to being square-shaped, and such skewed tile shapes tend to yield far worse arithmetic intensity. This greatly reduces the performance of the matmuls, and it may increase the total memory traffic, even accounting for the high bitwidths of the unnormalized partial sums which must be sent to and from outer memory when fusion is not enabled. In Fig. 21, we take a deeper look at the performance implications of fusion-optimized scheduling for BERT matmuls. We consider the BERT-Base encoder with hidden dimension 768 and 12 attention heads. By default we assume a sequence length of 512. As the target hardware, we take the 16×16 weight-stationary systolic array Gemini with custom hardware units for I-BERT implementations of nonlinear operations (activation, normalization, etc.), as described in Sec. 3.4.4. The total latency of each adjacent pair of matmul and LayerNorm/Softmax operations is estimated via Timeloop [163]. Opportunities for overlapping computations include:

  • (1) the MHA query × key matmul and following Softmax;
  • (2) MHA 𝑊out projection and following LayerNorm; and
  • (3) FFN 𝑊2 projection and following LayerNorm.

We compare two scheduling strategies. In the first strategy, we use Gemini's default heuristic-based scheduler, which greedily maximizes loop tile factors at the local SRAM level for each of the three matmul dimensions. In this approach, we do not attempt to overlap the matmul computation with the following nonlinear operation: the matmul is scheduled independently as a standalone task, and the nonlinear operation is executed afterwards as if each were run on its own. This can lead to inefficient utilization of resources, since the matmul must complete before the nonlinear operation starts, leaving idle cycles in the pipeline. In Fig. 21, we denote this approach as non-fused scheduling.

In the second strategy, we employ a more integrated scheduling approach that aims to overlap the matmul computation with the following nonlinear operation, i.e., the aforementioned fusion-optimized scheduling. By doing so, we can better utilize the available resources and reduce idle time, improving the overall efficiency of the computation pipeline. This approach requires careful coordination between the matmul and the nonlinear operation, ensuring that data dependencies are managed correctly and that the computations are scheduled in a way that maximizes concurrency.

Figure 20: Distributions of random, valid mappings for ResNet50 operators, compared against distributions for Transformer MHA and FFN matrix multiplications, where the sizes of the matmuls are tuned so that the total MACs are equivalent. The MHA matmul dimensions were calibrated to \(240 \times 240 \times 512\), and similarly the FFN \(W_1\) projection matmul was set to \(480 \times 120 \times 512\) for \(d_{\text{FFN}} = 4 \times d\). Comparing against Fig. 18, we see that the EDP distributions remain similar for the ResNet convolutions and BERT matmuls. This implies that, even after accounting for differences in the total number of MACs, the mapspace of BERT matmuls exhibits as vast a range of relative EDPs as the mapspace of CNN convolution kernels.

Overall, the analysis of the EDP distribution from randomly sampled valid mappings indicates that BERT matmuls, despite having fewer loop levels for tiling and re-ordering than convolutions, are as challenging to schedule as CNNs. Just as graph-level and operation-level scheduling had a significant impact on the end-to-end performance and efficiency of CNN inference, appropriate scheduling is equally important for Transformer matrix operations.

EDP Distribution Analysis with a Fixed Total Number of MACs. As an additional mapspace characterization, we further force the total number of MACs to be fixed. This enables an even fairer comparison between the distributions of mapping results for Transformer and ResNet50 operators. We continue to assume a Transformer input sequence length of 512 and a feed-forward network expansion ratio of 4 times the hidden dimension size. To keep the number of MACs equal, we calculate the hidden dimension size that would yield the same total MACs as the ResNet50 conv1 layer's \(7 \times 7\) kernel with output channel dimension 64. For the matmul in the query projection of MHA, the corresponding hidden dimension size was 240. Similarly, for the matmul in the \(W_1\) projection of the FFN block, the corresponding hidden dimension size was 120. To elucidate the comparison between these synthetic BERT layers and actual ResNet50 convolutions, we plot the mapping distributions of the corresponding pairs in Fig. 20. Even after forcing equivalent numbers of MACs, the range of relative EDP values is similar between BERT matmuls and ResNet50 convolutions. This finding further accentuates how complex the scheduling problem can be for the matmuls found in Transformer models.

Figure 21: Impact of fusion-optimized scheduling for BERT MHA that enables latency hiding of LayerNorm and Softmax. Results are based on the BERT-Base architecture and the Gemini accelerator. (Left) Input sequence length is assumed to be 512, and the accumulator SRAM size is increased from 128kB to 256kB. Hiding the Softmax latency improves combined matmul and Softmax latency by 78%. However, overlapping 𝑊out projection with LayerNorm can either hurt or improve total latency, depending on the accumulator size. Overall, fusion-optimized scheduling for both matmuls in MHA yields 23% and 52% latency improvements for accumulator sizes 128kB and 256kB, respectively. (Right) The input sequence length is increased to 4096. Again, we see that overlapping the query × key matmul with Softmax improves latency by 22%. Overall, fusion of both MHA matmuls with nonlinear operation yields a 21% latency improvement.

The left plot of Fig. 21 summarizes how matmul and nonlinear-operation fusion within the MHA block is influenced by the accumulator SRAM size. In this experiment, while the on-chip scratchpad for input activations and weights is held fixed at 256kB, the output activation accumulator size is increased from 128kB to 256kB. We note two findings. First, within MHA, fusing the query × key matmul with Softmax for each attention head reduces latency regardless of accumulator size. In particular, we see that the Softmax latency is significant compared to the matmul, taking up around 78% of the total cycles, and hiding this latency significantly reduces total latency. At the same time, the query × key matmul latency is relatively unchanged by the additional scheduling constraints, mainly because the inner dimension of the matmul is small (𝑑/𝑙 = 64 for BERT-Base). Second, the mapping constraints from fusion-optimized scheduling significantly harm the execution latency of the 𝑊out projection matmul after fusing it with the following LayerNorm, resulting in 83% worse latency than the non-fused schedule. However, once the accumulator size is doubled, this performance hit on matmul scheduling is alleviated: the increased 256kB accumulator SRAM allows more partial sums to be stored in the buffer instead of spilling to DRAM, thereby reducing total latency by 4%.

In the right plot of Fig. 21, we further investigate the impact of sequence length on fusion-optimized scheduling for the MHA block. Here, the sequence length is increased from 512 to 4096, which changes the ratio of cycles spent on matmuls, Softmax, and LayerNorm in the MHA block. In particular, note that the size of the query × key matmul and the Softmax computation depends quadratically on the sequence length, while the other matmul and LayerNorm exhibit a linear dependence. When fusing the query × key matmul with the subsequent Softmax, the mapping constraints worsen matmul performance despite the larger (256kB) accumulator size. This is because, with the increased dimensions of the query × key matmul and the forced tiling factors, the scheduler can no longer avoid tiling at the DRAM level. However, by overlapping the Softmax operation and thereby eliminating the need to load and store intermediate activations (which scale quadratically with the sequence length), the latency increase from the query × key matmul can be offset, resulting in an overall 22% reduction in latency.

On the other hand, Fig. 22 shows the results of overlapping the matmul and LayerNorm in the FFN 𝑊2 projection. Even with the larger accumulator size and for both sequence lengths, we consistently observe that fusion-optimized scheduling worsens total latency by 27%. Together with the previous findings, we see that the latency improvements from fusion-optimized scheduling depend on the accumulator SRAM size and the sequence length. Furthermore, we find that, at the BERT-Base scale, it is consistently favorable to overlap the MHA query × key matmul with the ensuing Softmax but consistently disadvantageous to chain the FFN 𝑊2 projection matmul with LayerNorm. This is in contrast with previous studies on GPU kernel fusion for Transformers [42, 168], and it highlights how scheduling for Transformer matmuls becomes more complex when targeting different styles of custom hardware designs, including the Gemini accelerator.

Figure 22: Impact of fusion-optimized scheduling for the BERT FFN matmul that enables latency hiding of the LayerNorm operation. The input sequence length is varied from 512 to 4096. We observe that fusion-optimized scheduling hurts total latency by 27% in both cases. This motivates the need to carefully evaluate the impact of chaining matmul and LayerNorm execution on systolic arrays, since the impact of mapping constraints may outweigh the gains from latency hiding of nonlinear operations.

Summary (Sec. 5.5. Transformer vs. CNN Mapping): Here are the high-level takeaways from this section.

  • Scheduling for Transformer matmuls is as challenging as scheduling CNN convolution operators. Both mapspaces have similar distributions of relative EDPs and similar percentages of near-optimal mappings. Brute-force or random scheduling is not simpler for Transformer matmuls, despite them having fewer loop levels than convolutions.
  • The presence of nonlinear operations such as LayerNorm and Softmax presents additional complexities to the scheduling problem for Transformer matmuls. The latency of these nonlinear operations can be hidden by fusing their computation with the preceding matmul. This requires additional hardware support, as noted in Sec. 3.4.4, and it imposes constraints on the matmul scheduling.
  • Whether this fusion-optimized scheduling yields end-to-end latency improvements depends on the Transformer and the underlying hardware parameters. In particular, two factors matter:
    • (1) the size of the on-chip SRAM buffer for output activations; and
    • (2) the sequence length.
  • We consistently observe that overlapping the execution of query × key matmul with Softmax in the MHA block reduces latency up to 78%, compared to executing the two operations separately on a systolic array accelerator. On the other hand, scheduling to overlap the FFN 𝑊2 projection with the following LayerNorm hurts performance by 27%.

6 ADAPTING TRANSFORMER ARCHITECTURE WITH NAS

So far, we have conducted an in-depth exploration of the full-stack aspects of DNN inference, with a focus on the Transformer architecture, from the hardware level to the optimization and scheduling strategies that improve inference performance. Another important avenue in the full-stack optimization of DNNs is to optimize the DNN architecture itself and to tailor it to a specific hardware platform.

In this section, we primarily focus on automated neural architecture search (NAS) as a method for designing DNNs. Sec. 6.1 provides a general overview of NAS, and Sec. 6.2 then explores hardware-aware NAS methods. These two subsections mainly cover NAS techniques for CNNs, as NAS was initially introduced and extensively researched in the pre-Transformer era. However, we believe this provides a helpful and comprehensive background for understanding NAS. In Sec. 6.3, NAS methods specific to Transformer architectures are discussed. Finally, Sec. 6.4 provides a case study of applying a NAS method to optimize Transformer inference on a target hardware architecture.

6.1 Neural Architecture Search

Typically, DNN architectures are designed and trained to achieve the maximum accuracy for a given task, without necessarily considering the target hardware or its inference latency, memory, and power requirements. However, there often exist several different variations of a DNN architecture that achieve the same accuracy but have better hardware performance.

There is a rich literature in this area. Notable works here include MobileBERT [205], which is one of the earliest attempts, and which adopts the bottleneck structure to design a thinner version of Transformer, as well as Lite Transformer [232], which proposes the Long-Short Range Attention, in which a group of heads are replaced with convolution operations to capture short-range contexts more efficiently. SqueezeBERT [94] is another work that incorporates grouped convolutions into the Transformer architecture to reduce the model size and latency. This approach is not limited to NLP, and similar models have been proposed in computer vision (CV) [24, 36, 130, 152] and speech recognition [23, 76, 109], to name just a few.

It is often very difficult to find these DNN architectures since the search space is exponentially large, even without considering the underlying hardware platform. Even for those with expertise in DNN architecture design, the impact of an architectural change on accuracy and runtime performance can be nontrivial to predict. As such, automated NAS methods have been proposed to adapt a DNN architecture for a given constraint. However, it is critical to note that NAS methods often require prohibitive amounts of compute and trials before finding a candidate architecture. For instance, in one of the early NAS works [255], finding an optimized CNN took 22,400 GPU-hours. Moreover, NAS methods are not yet fully automated, and they often require hand-tuning the search space.

Broadly speaking, a NAS framework consists of three main components: search space; search method; and evaluation method [18, 61]. The search space consists of a set of valid operations (e.g., convolution, pooling, activation, etc.) and their connectivity that define valid DNN architectures, from which a candidate model can be drawn. Prior knowledge and human intuition regarding good DNN designs is often necessary in order to restrict the search space and improve the search efficiency. The search method defines how to explore the search space. Exhaustive search is obviously intractable. Therefore, it is critical to have methods for quickly exploring the search space and sampling candidate architectures. The evaluation method is a way of assessing how well candidate architectures perform on unseen data. The most naive method is to evaluate all candidate architectures after the full training process is complete.

However, this incurs a large overhead, and more efficient methods of estimating performance are often used as a proxy for the final accuracy. Fig. 23 schematically shows these different components. Below, we discuss each of these components in more detail. Note that the main purpose of this section is not to conduct a thorough survey of existing works, but instead to provide a broader overview of the various methodologies for improving NAS from a practitioner's standpoint. We refer readers to [18, 61, 183, 193] for more comprehensive surveys on NAS.

Figure 23: Illustration of the general structure of NAS frameworks. Candidate DNN architectures are sampled from the search space according to the search method, and then they are evaluated. The evaluation result is then used by the search method to guide better exploration of architectures in the search space.

6.1.1 Search Space.

The search space for NAS defines a set of valid DNN architectures over which the NAS framework can search. Designing a proper search space is critical, as its size and coverage can directly affect the final outcome of the NAS framework. One naive principle for designing a search space is the layer-wise search [26, 77, 254], where each layer (or operation) can be searched independently from the other layers. For instance, in [254], the RNN controller model produces a description of each individual layer in sequence to construct a candidate DNN architecture.

However, the layer-wise search space often suffers from a large search space size that grows exponentially with the depth of candidate architectures, which can degrade the search efficiency and the final performance. The cell-wise search [54, 141, 170, 250, 255] can alleviate this shortcoming by searching cells (i.e., blocks or modules that consist of multiple layers) rather than an entire architecture; the discovered cells can later be stacked up repeatedly to compose an architecture. This is motivated by many successful hand-designed DNN architectures that consist of repeating cells or blocks of a similar structure [82, 89]. NASNet [255] is one of the earliest works that proposes to search two types of cells: the normal cell, which is stacked up multiple times without changing the spatial resolution; and the reduction cell, which is inserted once every fixed number of repeated normal cells in order to reduce the spatial dimension towards the output layers. This reduces the search time by 7× compared to the previous layer-wise search method proposed by the same authors [254]. Likewise, the cell-wise search space substantially reduces the search space (as cells are much smaller than the whole network) by imposing an additional structural constraint.

6.1.2 Search Method.

Since the NAS search space is usually too large for an exhaustive search, efficient search methods are necessary to ensure overall performance. In early work on NAS, RL-based methods were used as the search method [16, 170, 250, 254, 255] (Fig. 24, a). At a high level, RL-based NAS frameworks contain a controller (i.e., an RL agent) that takes the action of sampling DNN architectures, whose evaluation accuracy after training is fed back into the controller as a reward signal to refine its sampling policy. The controller can be trained using different RL algorithms such as policy gradient [254] or Q-learning [16].

An alternative search strategy for NAS is evolutionary search [141, 182] (Fig. 24, b). Here, one initializes a population of different DNN architectures, which are then mutated (e.g., by adding, removing, or changing layers), evaluated, and selected based on their validation accuracy in every evolution step. This generates a new population for the subsequent evolution step. The search cost for evolutionary search can be quite expensive, as it requires validating all DNNs in the population in every evolution step. Therefore, it is often coupled with various methods to reduce validation costs, such as weight sharing. These will be discussed in more detail in Sec. 6.1.3.

The aforementioned methods can be regarded as black-box optimization over a discrete search space. Due to the discrete nature of the search space with a large number of tunable knobs, the search cost can become prohibitively large. This is further exacerbated by the long evaluation time of a single RL or evolution iteration, which often requires training from scratch. For instance, RL-based NASNet [255] and evolutionary search-based AmoebaNet [182] require a few thousand GPU hours for end-to-end search [183]. In contrast, DARTS [142] proposes a continuous relaxation of the search space, which allows the search space to be explored and optimized efficiently through gradient-based optimization methods (Fig. 24, c). In essence, DARTS introduces a trainable weight that allows a weighted average of multiple operations, instead of requiring the selection of a single operation. This weight can be trained alongside the other model parameters during training, and it eventually converges to favor a particular operation over the others. This method reduces the search cost from thousands of GPU hours in the preceding RL or evolutionary search based methods to a few hours. Due to this search efficiency, gradient-based search has become a popular choice for many NAS frameworks [219, 228].

Figure 24: Comparison of different NAS search methods. (a) RL-based methods employ a controller that samples architectures based on a policy, which is reinforced by the evaluation results of the sampled architecture as a reward. (b) Evolutionary search-based methods initialize a population, sample it based on the evaluation results, and then generate the next-round population by mutating the remaining architectures. (c) Gradient-based methods (e.g., continuous relaxation) train weights along with model parameters that are multiplied with each operation choice. After training, the weights converge to favor a particular operation over the others, thus approximating the sampled architecture.

6.1.3 Weight Sharing and Supernetwork.

One of the main challenges with NAS methods is the prohibitive training cost. To address this, ENAS [170] proposed weight sharing. ENAS views a DNN model as a directed acyclic graph, where the nodes represent the computation with their own trainable weights and the edges represent the information flow from one node to another. An individual candidate DNN can then be regarded as a sub-network of a larger, over-parameterized super-network (supernet). This redefines NAS as a process of searching for good sub-networks sampled from the supernet, whose weights are shared across all sub-networks. Once the supernet is trained, its sub-networks can be sampled and evaluated without the need to train the models from scratch, which significantly reduces the overall search cost.

This method, also known as supernet-based NAS, was picked up by several subsequent algorithms [17, 25, 26, 77, 142, 228, 243]. In particular, Single Path One-Shot NAS [77] constructs a supernet by stacking choice blocks. A choice block consists of multiple operation choices (e.g., convolutions with different kernel sizes or a skip operation), from which a single operation can be selected at a time. For every training step, a different sub-network is obtained and trained by uniformly sampling one operation for each choice block, with the expectation that all sub-networks with different permutations of choices are trained fully and equally. After training, an evolutionary algorithm is applied to search for optimal sub-networks from the supernet, without paying the expensive cost of from-scratch training.

However, the accuracy of sub-networks obtained from a fully-trained supernet is typically inferior to that of the same model architectures trained from scratch in a stand-alone fashion [17]. Therefore, the discovered sub-network architectures often need to be re-trained. To address this, Once-For-All [25] proposes the progressive shrinking algorithm, and BigNAS [243] proposes the sandwich rule and in-place distillation. Both aim to train a supernet in a way that its sub-networks achieve good accuracy (i.e., accuracy comparable to their from-scratch trained counterparts) without an additional training process. These methods have high practical value, as sub-networks can be sampled (e.g., via evolutionary search) and immediately deployed.
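The pieces of this subsection can be strung together in a framework-free sketch: sub-network configurations are drawn from a small search space, scored by a stand-in evaluation function (in practice, the validation accuracy of the weight-shared sub-network), and evolved by keeping the best candidates and mutating them. All names, the search space, and the scoring function below are illustrative and do not correspond to any published framework's API.

```python
import random

SEARCH_SPACE = {                       # illustrative sub-network choices
    "depth":  [4, 6, 8, 12],
    "heads":  [4, 8, 12],
    "hidden": [384, 512, 768],
}

def sample_config():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def mutate(cfg, p=0.2):
    return {k: (random.choice(SEARCH_SPACE[k]) if random.random() < p else v)
            for k, v in cfg.items()}

def evaluate(cfg):
    """Stand-in for evaluating a weight-shared sub-network on validation data."""
    return -(cfg["depth"] * cfg["hidden"] * cfg["heads"]) ** 0.5 + random.gauss(0, 1)

def evolutionary_search(rounds=10, population=20, keep=5):
    pop = [sample_config() for _ in range(population)]
    for _ in range(rounds):
        scored = sorted(pop, key=evaluate, reverse=True)[:keep]   # selection
        pop = scored + [mutate(random.choice(scored))             # mutation
                        for _ in range(population - keep)]
    return max(pop, key=evaluate)

random.seed(0)
print("best sub-network found:", evolutionary_search())
```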

6.1.4 Evaluation Method.

One needs a metric to evaluate sampled architectures on a validation dataset in order to rank the “goodness” of candidate architectures. The early NAS algorithms [16, 254] fully trained sampled architectures until convergence, which is not feasible for large datasets. A widely adopted strategy for applying NAS to larger-scale tasks is to discover an accurate cell architecture using a smaller dataset (e.g., CIFAR-10 in computer vision) and then apply it to building a larger model for a larger dataset (e.g., ImageNet) [140, 142, 149, 182, 210]. The premise here is that a DNN architecture optimized for one task can transfer well to other tasks in a similar domain. This premise has been challenged by some of the recent NAS work [26]. Supernet-based NAS algorithms can be a good alternative for avoiding the use of proxy tasks [17, 25, 26, 77, 142, 228, 243]. These algorithms require only a single iteration of supernet training, which can be performed directly on large-scale datasets without prohibitive compute requirements.

Summary (Sec. 6.1. NAS): Neural architecture search (NAS) is a promising alternative to hand-designing efficient DNNs. NAS consists of: (1) a search space that defines valid candidate architectures; (2) a search method that defines how to efficiently explore the search space; and (3) an evaluation method for evaluating the goodness of candidate architectures. Despite its potential, NAS presents its own set of challenges, which often necessitate manual tuning of the search space, and which can be prohibitively expensive in terms of time and resources. To address this, many recent advances in the NAS community have focused on improving search efficiency. Notable methodologies include: (1) the cell-based search that confines the search space size; (2) the continuous relaxation of the search space that allows efficient gradient-based optimization methods; (3) the weight sharing scheme across candidate architectures; and (4) faster evaluation methods for the candidate architecture performance.

6.2 Hardware-aware NAS

Hardware-aware NAS aims to optimize not only the accuracy of DNNs but also various performance metrics (such as latency, energy consumption, or memory usage) on target hardware platforms. One key question here is how to incorporate these metrics into the learning process. It is often difficult to quickly measure the latency or energy consumption of a candidate model. As such, most works in the literature only consider FLOPs or the total number of parameters. However, as also discussed above in Sec. 2.2, FLOPs do not necessarily correlate with latency or energy. Therefore, multiple hardware-aware NAS frameworks have been introduced to directly consider latency instead, or to use approximate metrics to estimate it (e.g., measuring the latency of individual layers and accumulating them to approximate the total latency, as opposed to measuring the end-to-end runtime). Here, we discuss popular strategies for incorporating hardware performance into NAS frameworks. For a more exhaustive survey of hardware-aware NAS techniques and their algorithmic details, see [18].

The most straightforward way is to directly measure hardware performance and use it as an additional optimization objective for NAS frameworks [138, 210]. For instance, MNasNet [210] extends the existing RL-based NAS framework to the multi-objective optimization setting. It aims to maximize accuracy while limiting latency on the target platform to less than a certain target latency. Instead of solving the multi-objective optimization problem directly, it combines the two objectives (accuracy and latency) into a single objective by taking a weighted product. This modified objective is then provided as a reward for updating the controller. By directly optimizing latency on the target platform, MNasNet finds DNN architectures that are ∼ 2× faster than MobileNetV2 [188] and NASNet [255] with comparable ImageNet classification accuracy on a Pixel phone.
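For concreteness, the weighted-product objective used by MNasNet-style multi-objective NAS can be written in a few lines; the exponent values below are illustrative defaults, and the exact formulation should be taken from the MNasNet paper [210] rather than from this sketch.

```python
def mnasnet_style_reward(accuracy, latency_ms, target_ms, alpha=-0.07, beta=-0.07):
    """Weighted product of accuracy and latency, as in MNasNet-style objectives.

    accuracy:   task accuracy of the sampled architecture (0..1)
    latency_ms: measured latency on the target device
    target_ms:  latency target for the platform
    The exponent softly penalizes (or rewards) deviation from the target.
    """
    w = alpha if latency_ms <= target_ms else beta
    return accuracy * (latency_ms / target_ms) ** w

# An architecture slightly over the latency budget gets a discounted reward.
print(mnasnet_style_reward(accuracy=0.75, latency_ms=90.0, target_ms=80.0))
print(mnasnet_style_reward(accuracy=0.74, latency_ms=70.0, target_ms=80.0))
```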

Another notable work is MCUNet [138] that targets searching DNNs for resource-constrained microcontrollers. Unlike GPUs or mobile devices, microcontrollers for tiny IoT devices lack large memory and storage. As a result, it is critical to design a model that fits their tight memory budgets. MCUNet incorporates its supernet-based NAS framework [17, 77] with TinyEngine, a lightweight inference engine that the authors have developed as part of the project. In this way, MCUNet samples sub-networks for every evolutionary step and feeds them to TinyEngine to optimize the memory scheduling and measure the optimal memory usage.

However, due to the limited number of available devices, measuring hardware performance directly can be slow and not parallelizable [238]. Furthermore, it is not possible to pre-measure the hardware performance for all possible DNNs in the search space [228]. To overcome this issue, some of the hardware-aware NAS methods incorporate operation-wise lookup tables [219, 228, 238]. Rather than storing the end-to-end hardware performance, the lookup table only contains pre-measured performance numbers of individual operations which can be summed up to estimate the overall performance of a given DNN. In FBNet [228], for instance, the latency number estimated from a lookup table is used as a regularizer term in its gradient-based NAS framework to penalize operations that would be slow on the target hardware device.
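An operation-wise lookup table reduces latency estimation to a dictionary lookup and a sum, as in the sketch below; the table keys, latency entries, and the layer list are placeholders rather than measurements from a real device.

```python
# Pre-measured per-operation latencies (microseconds) on the target device.
# Keys and values here are placeholders, not real measurements.
LATENCY_LUT = {
    ("matmul", 512, 768, 768):  120.0,
    ("matmul", 512, 768, 3072): 430.0,
    ("layernorm", 512, 768):      8.0,
    ("softmax", 512, 512):       15.0,
}

def estimate_latency(ops):
    """Estimate end-to-end latency by summing per-operation LUT entries."""
    return sum(LATENCY_LUT[op] for op in ops)

candidate_dnn = [
    ("matmul", 512, 768, 768),     # MHA projection
    ("softmax", 512, 512),
    ("layernorm", 512, 768),
    ("matmul", 512, 768, 3072),    # FFN W1 projection
]
print(f"estimated latency: {estimate_latency(candidate_dnn):.1f} us")
```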

Finally, some hardware-aware NAS frameworks rely on lightweight prediction models that can quickly predict hardware performance numbers for a given DNN. For instance, ProxylessNAS [26] has trained a model that takes as inputs a DNN configuration (e.g., operation types, input and output shapes, and other operation attributes like kernel sizes) and outputs the estimated latency on the target hardware platform.

Summary (Sec. 6.2. HW-Aware NAS): Hardware efficiency metrics can be coupled with the NAS objective in order to find architectures that balance accuracy against latency (or similar metrics). While directly measuring performance metrics in a real hardware environment is the most accurate method, it can be slow and poorly parallelizable. Instead, hardware performance can be estimated with high accuracy using an operation-wise lookup table or by training a simple prediction model.

Table 7: Summary of existing literature on Transformer-specific NAS techniques. SPOS and OFA stand for Single Path One-Shot [77] and Once-for-All [25], respectively.

6.3 Transformer-specific NAS

Early work on NAS focused on CNN models, mostly for computer vision applications. However, with the introduction and maturation of the Transformer architecture, which achieved state-of-the-art results not just for NLP but also for a variety of other tasks, a number of works have begun to explore the use of NAS methods to find more efficient alternatives. As the Transformer architecture was initially developed for NLP tasks, the earliest NAS works for Transformers were primarily in this domain.

Evolved Transformer [200] was one of the earliest attempts to apply NAS to the search for better Transformer architectures, and it did so via an evolutionary search algorithm. Inspired by NASNet [255], Evolved Transformer adopts a cell-wise search space to search two cell structures. Each of these cell structures can be stacked multiple times to form the encoder and decoder of the encoder-decoder Transformer architecture. The cell structure contains a stack of multiple blocks, and each block has its own searchable hyperparameters such as operation type, normalization type, and dimensions. The main challenge here is that NLP tasks require a much longer time to train and evaluate (e.g., the popular WMT 2014 En-De translation benchmark contains over 3 million sentence pairs). Furthermore, unlike the previous works [140, 142, 149, 182, 210] for CNNs, which found CIFAR-10 to be a reasonable proxy for the much larger ImageNet, these NLP tasks do not typically have good smaller proxy tasks. To address this, Evolved Transformer proposes to dynamically allocate resources to more promising architectures by early stopping those that fail to achieve a hurdle fitness within a small number of steps.

Due to the large computational cost of training Transformers on NLP tasks, weight sharing and supernet based NAS have become popular options. HAT [222] extends the Once-for-All [25] scheme to Transformer architectures to train a single supernet from which sub-networks with different depths, number of heads, and dimensions can be sampled. Furthermore, HAT is hardware-aware, in that it directly optimizes for latency along with accuracy using a multi-layer latency prediction model. HAT shares the benefits of Once-for-All, which allows sub-networks to be sampled through evolutionary search and deployed immediately to target hardware devices without retraining.

NAS-BERT [235] is another supernet-based NAS method for Transformers that extends Single Path One-Shot [77]. Different from the aforementioned methods, NAS-BERT proposes a NAS method that can be applied at the pre-training stage of encoder-only BERT so as to be agnostic to downstream tasks. In order to avoid the prohibitive cost of directly performing architecture search in a big supernet on the heavy pre-training task, NAS-BERT employs two novel techniques: (1) block-wise training, which splits the entire supernet into multiple blocks of successive Transformer layers that are then trained separately; and (2) progressive shrinking, which dynamically prunes less promising sub-networks based on their validation loss.

Primer [201] searches for a more efficient decoder-only Transformer for auto-regressive language modeling. Unlike the majority of NAS methods that view a model as a connection of multiple operations selected from a NAS search space, Primer views it as a single valid TensorFlow (TF) program comprised of fine-grained TF primitive operations like addition, exponentiation, convolution, and many others. Using evolutionary search, it aims to find a TF program defining a decoder block that can be stacked multiple times to form an auto-regressive language model. The hope is that this minimizes inductive bias when designing the search space, as the possible set of operations and their connectivity are no longer pre-determined by human experts. In order to reduce the heavy computational cost of auto-regressive pre-training, Primer borrows the idea of hurdles from Evolved Transformer [200]. Additionally, it uses the relatively small LM1B dataset as a proxy task to discover a model architecture, which is then transferred to much larger target tasks such as PG-19 [176] and C4 [178].

The Transformer architecture, initially developed for NLP tasks, has been adapted for use in the field of CV. Referred to as Vision Transformers (ViTs) [56, 144, 215], these models have been demonstrated to outperform popular CNN architectures in various CV applications, thus driving research towards the development of NAS techniques to automatically design better ViT architectures. However, due to the architectural similarities, these works have much in common with the NAS methodologies for NLP-targeted Transformers. For instance, Autoformer [230] and ViT-ResNAS [135] are extensions of Single Path One-Shot [77] to the ViT search space, including depth, hidden dimensions, and the number of heads of each Transformer layer. Burgerformer [236] takes a step further to take into account the micro design, i.e., the type of operations, activations, and normalization, as well. NASViT extends BigNAS [243] and AlphaNet [221] to apply the sandwich sampling rule to train a supernet. GLiT [30] proposes a hierarchical NAS scheme for searching hybrid convolution-attention architectures. It determines the number of convolutional and multi-head attention heads in each layer in the first stage of NAS, as well as the detailed hyperparameters such as dimensions in the second stage.

Table 8: NAS Search Space.

One noticeable characteristic of most of the NAS methods introduced above for Transformer architectures (both for NLP and CV applications) is their use of supernet-based, weight-shared methodologies, as summarized in Tab. 7. This is presumably due to the immense computational cost of training Transformer architectures. However, supernet-based NAS imposes large constraints on the search space, which can limit the range of architectures that can be discovered and may prevent the discovery of unique or innovative designs. Therefore, there is a need to explore better ways of balancing the flexibility and efficiency of NAS techniques.

Summary (Sec. 6.3. Transformer-specific NAS): Existing NAS frameworks have been extended to design more efficient Transformer architectures. Due to the large computational cost of training Transformer models, which is exacerbated further when combined with unsupervised pre-training methodologies, most existing methods rely heavily on a weight-sharing scheme followed by evolutionary search. A key challenge in Transformer-specific NAS is that existing work is largely limited to tuning relatively trivial hyperparameters such as hidden dimensions, depth, and number of heads, which is likely to preclude the discovery of novel Transformer variants.

6.4 Case Study: Running NAS and Co-design on the Transformer

So far, we have discussed the general concept of NAS, its application to hardware-aware scenarios, and its extension into the Transformer architecture. Here, we conduct a case study to demonstrate the performance gains of applying NAS to Transformer inference on Gemini, with the goal of optimizing not only accuracy, but also hardware costs such as latency and energy.

6.4.1 Experiment Setup.

As a baseline architecture, we use a 6-layer Transformer with all other model configurations remaining the same as BERT-Base or GPT-2 (see the details in Tab. 1). We consider a language modeling task, and we train a randomly initialized model on the WikiText-2 [154] benchmark with 37k training examples and 4k validation examples using a language modeling training objective. To evaluate model performance, we measure perplexity on the validation examples, excluding empty strings, where lower scores indicate better performance. The standalone baseline model is trained for 50 epochs with the Adam optimizer and a linear learning rate schedule, with a peak learning rate in the range {5, 2, 1, 0.5} × 10⁻⁵. The training examples are concatenated to reach a maximum sequence length of 512 and batched with a batch size of 16.
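
The data packing described above (concatenating tokenized examples into fixed 512-token blocks and batching them in groups of 16) can be sketched as follows. This is a minimal illustration assuming pre-tokenized integer sequences, not the exact preprocessing code used in the case study.

```python
from typing import Iterable, List

MAX_SEQ_LEN = 512
BATCH_SIZE = 16

def pack_examples(token_streams: Iterable[List[int]],
                  seq_len: int = MAX_SEQ_LEN) -> List[List[int]]:
    """Concatenate tokenized examples into one stream and cut it into
    fixed-length blocks, as described for the WikiText-2 setup above."""
    flat: List[int] = []
    for toks in token_streams:
        if toks:                       # empty examples contribute nothing
            flat.extend(toks)
    n_blocks = len(flat) // seq_len    # drop the trailing partial block
    return [flat[i * seq_len:(i + 1) * seq_len] for i in range(n_blocks)]

def batches(blocks: List[List[int]], batch_size: int = BATCH_SIZE):
    """Yield batches of `batch_size` fixed-length blocks."""
    for i in range(0, len(blocks) - batch_size + 1, batch_size):
        yield blocks[i:i + batch_size]

if __name__ == "__main__":
    # Toy "tokenized" examples; a real setup would tokenize WikiText-2 text.
    examples = [list(range(300)), [], list(range(900))]
    blocks = pack_examples(examples)
    print(len(blocks), len(blocks[0]))  # -> 2 blocks of 512 tokens
```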

For NAS, we adopt the BigNAS-style [243] strategy to train a supernet, and then use an evolutionary algorithm to search for sub-networks within the fully trained supernet. The NAS search space comprises various combinations of the number of layers 𝑙, the number of heads ℎ, the hidden dimension 𝑑, and the FFN dimension 𝑑FFN (see Tab. 8 for details). For supernet training, we use the same training hyperparameters as for the stand-alone training, except that in each training iteration we sample four sub-networks: the largest possible, the smallest possible, and two randomly sampled sub-networks. The model parameters are then updated using the sandwich rule, which takes the average of the gradients collected from the backward passes of these four sub-networks.
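
The sandwich rule can be illustrated with a toy weight-shared supernet. The sketch below uses a made-up elastic MLP (`ElasticMLP`) rather than the actual Transformer supernet, but the per-step sampling of the largest, smallest, and two random sub-networks, with averaged gradients and a single parameter update, follows the procedure described above.

```python
import random
import torch
import torch.nn as nn

class ElasticMLP(nn.Module):
    """Toy weight-shared supernet: one over-parameterized MLP whose hidden
    width can be sliced at run time (a stand-in for elastic heads/dims)."""
    def __init__(self, d_in=32, d_hidden_max=256, d_out=10):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(d_hidden_max, d_in) * 0.02)
        self.w2 = nn.Parameter(torch.randn(d_out, d_hidden_max) * 0.02)

    def forward(self, x, width):
        h = torch.relu(x @ self.w1[:width].t())   # use only the first `width` units
        return h @ self.w2[:, :width].t()

WIDTHS = [32, 64, 128, 256]
model = ElasticMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 32)                  # dummy batch, illustration only
y = torch.randint(0, 10, (16,))

# Sandwich rule: largest, smallest, and two random sub-networks per step;
# their gradients accumulate before a single parameter update.
opt.zero_grad()
sampled = [max(WIDTHS), min(WIDTHS)] + random.sample(WIDTHS, 2)
for width in sampled:
    loss = loss_fn(model(x, width), y) / len(sampled)  # divide to average gradients
    loss.backward()
opt.step()
```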

For the evolutionary search, we initialize a population of 40 sub-networks and perform 40 evolutionary iterations. In each iteration, the validation perplexity and the energy-delay product (EDP) of each sub-network on the target hardware are collected, and only the Pareto-optimal sub-networks are retained. Here, we use EDP as a single hardware cost metric, as it combines latency and energy into one number and thereby converts a multi-objective optimization problem into a single-objective one. The retained sub-networks are then mutated with a mutation probability of 0.2 to refill the population for the next iteration. To measure hardware cost, we use a lookup-table-based method to quickly assess the latency and energy consumption of each sub-network on the target hardware, instead of RTL (register-transfer level) simulation, which can be time-consuming. The entries in the lookup table are obtained from Timeloop [163] simulations, which provide simulated latency and energy numbers for each operation. The end-to-end latency and energy are then estimated by summing the per-operation costs. After the evolutionary search, the Pareto-optimal sub-networks are evaluated with an RTL simulator to obtain a more precise estimate of latency. For energy, we continue to use the Timeloop numbers, as it is technically challenging to measure energy consumption via RTL.
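
A minimal sketch of this search loop is shown below, with stand-in `evaluate_perplexity` and `lookup_edp` functions in place of actual model evaluation and the Timeloop-derived lookup table. The population size, iteration count, Pareto filtering, and mutation probability follow the description above; everything else is illustrative.

```python
import random

# Hypothetical stand-ins: in the real flow, perplexity comes from evaluating
# the sampled sub-network, and EDP from a Timeloop-derived lookup table that
# sums per-operation latency and energy.
def evaluate_perplexity(cfg):
    return 20.0 + sum(cfg) % 7 + random.random()

def lookup_edp(cfg):
    return sum(c * c for c in cfg) * 1e-3

def pareto_front(points):
    """Keep configs not dominated in (perplexity, EDP); lower is better for both."""
    front = []
    for cfg, ppl, edp in points:
        dominated = any(p2 <= ppl and e2 <= edp and (p2, e2) != (ppl, edp)
                        for _, p2, e2 in points)
        if not dominated:
            front.append((cfg, ppl, edp))
    return front

def mutate(cfg, choices, p=0.2):
    return tuple(random.choice(choices) if random.random() < p else c for c in cfg)

CHOICES = [1, 2, 3, 4]                         # toy per-layer options
population = [tuple(random.choice(CHOICES) for _ in range(6)) for _ in range(40)]

for _ in range(40):                            # 40 rounds of evolution
    scored = [(cfg, evaluate_perplexity(cfg), lookup_edp(cfg)) for cfg in population]
    survivors = pareto_front(scored)
    population = [cfg for cfg, _, _ in survivors]
    while len(population) < 40:                # refill the population via mutation
        population.append(mutate(random.choice([c for c, _, _ in survivors]), CHOICES))

final = pareto_front([(cfg, evaluate_perplexity(cfg), lookup_edp(cfg)) for cfg in population])
print(f"{len(final)} Pareto-optimal configurations retained")
```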

For the target hardware, we use Gemini with the optimizations applied in Sec. 3.4, including the dedicated normalization units for running non-linear operations on-chip. Based on the insights in Sec. 3.4.3, we configure Gemini with a 64 kB scratchpad and a 256 kB accumulator in order to maximize the accumulator size.

6.4.2 Experiment Results.

We show the NAS Pareto-frontier results for both latency and energy in Fig. 25 (blue curves), where each point corresponds to a different Transformer architecture found by the evolutionary search algorithm discussed above. Additionally, we plot the baseline 6-layer Transformer model trained from scratch as a reference (× mark). All EDP values are normalized by the baseline EDP.

We first present results from the evolutionary search over EDP in Fig. 26. As can be seen in the plot, the NAS framework allows us to obtain multiple Transformer architectures with better trade-offs between hardware cost and perplexity. That is, it finds architectures with similar or even better perplexity than the baseline at smaller hardware cost. As an example, we select the architecture with the lowest EDP that incurs less than a +0.1 perplexity loss; its EDP is 3.6 × 10⁹ and its perplexity is 22.51. The architecture parameters are listed in Table 9. This architecture illustrates the importance of a diverse search space: the number of attention heads varies from 6 to 12 across layers, and the fully connected layer dimensions vary from 768 to 2560. By being able to change these parameters on a per-layer basis, one may discover more Pareto-optimal architectures than if these parameters were fixed for every layer.
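
The selection rule above (lowest EDP subject to at most +0.1 perplexity over the baseline) amounts to a constrained argmin. The snippet below illustrates it on hypothetical candidate values; only the selected point's perplexity and EDP are taken from the text, and the baseline perplexity and the other candidates are made up for illustration.

```python
# Hypothetical (perplexity, EDP) pairs for Pareto-optimal sub-networks.
candidates = [
    {"ppl": 22.41, "edp": 5.1e9},
    {"ppl": 22.51, "edp": 3.6e9},   # the architecture discussed in the text
    {"ppl": 23.40, "edp": 1.9e9},
]
baseline_ppl = 22.45                 # hypothetical baseline value, illustration only

# Pick the lowest-EDP architecture within +0.1 perplexity of the baseline.
feasible = [c for c in candidates if c["ppl"] <= baseline_ppl + 0.1]
best = min(feasible, key=lambda c: c["edp"])
print(best)                          # -> {'ppl': 22.51, 'edp': 3.6e9}
```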

In Fig. 25, we separate out latency and energy, and substitute in RTL values for the latency. As one can see, it is possible to attain a 1.4× reduction in latency versus the baseline Transformer trained from scratch with 0.1 point perplexity degradation. If one could tolerate about one point degradation in perplexity, latency can be reduced by 2.4×, and possibly even further with more advanced architecture search techniques. With regards to energy, one can attain a 1.6× improvement considering 0.1 point perplexity degradation, and 4.4× if perplexity is allowed to increase by 1 point. Taking both together, it is possible to reduce EDP by 2.2× with just 0.1 point perplexity degradation, and 10.6× with 1 point perplexity degradation. These examples illustrate the power of co-design in allowing practitioners to choose a combination that best matches their needs. It is important to note that this represents a single run of our co-design methodology on a specific hardware platform, and results may vary depending on the target hardware and optimization goals.
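
Assuming the latency and energy gains are measured on the same architectures, so that they combine multiplicatively in the energy-delay product, the EDP figures follow directly from the per-metric improvements:

\[\text{EDP reduction} \approx \text{latency reduction} \times \text{energy reduction}: \quad 1.4 \times 1.6 \approx 2.2, \qquad 2.4 \times 4.4 \approx 10.6\]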

Summary (Sec. 6.4. NAS for Transformers Case Study): This case study used a supernet-based NAS method to sample diverse architectures, followed by an evolutionary search to discover Pareto-optimal architectures that trade off perplexity against energy-delay product, a measure of runtime efficiency. Many of the discovered architectures have significant latency and energy improvements over the baseline trained from scratch when running on an optimized Gemini hardware accelerator. The importance of using NAS to explore the search space is underscored by the fact that many well-performing architectures use diverse layer configurations. When trained on the WikiText-2 language modeling benchmark, these techniques found Transformer architectures with a 2.2× EDP reduction while tolerating a 0.1 point perplexity degradation, and a 10.6× reduction with a 1 point degradation, over the baseline.

Figure 25: (Left) Latency-perplexity and (Right) Energy-perplexity plots of the Transformer architectures found via evolutionary search on our optimal Gemini hardware configuration. Similar to Fig. 26, lower perplexity indicates better performance, and we plot lines to illustrate +0.1 and +1 point perplexity degradation.

7 CONCLUSION

The Transformer architecture [217] has revolutionized the field of natural language understanding [52, 125, 143, 173, 174, 177, 240], which has been further accelerated by the recent development of large language models with hundreds of billions of parameters [22, 44, 58, 86, 175, 190, 198]. The architecture has also been extended to a wide range of fields, including computer vision [24, 36, 130, 152] and speech recognition [23, 76, 109]. While Transformer models have shown significant performance improvements, their growing size and run-time complexity present a critical challenge for efficient inference. While DNN accelerators that enable fast and efficient deep learning computation can be a viable solution, there is still limited understanding of the run-time characteristics and bottlenecks of Transformer workloads, as well as of the design principles necessary for effectively running these models, in comparison to CNN architectures.

In this paper, we have conducted a comprehensive analysis of Transformer workloads, in order to better understand their run-time characteristics and identify performance bottlenecks when running on commodity hardware and accelerators (Sec. 2). Furthermore, we have performed an extensive survey of current hardware and software solutions, with the goal of identifying potential optimization opportunities in the full-stack deployment of Transformers. Specifically, our survey covered the following topics:

  • The design of hardware architectures for Transformer inference, including the impact of the non-linear operations on designing hardware accelerators (Sec. 3);
  • Optimization strategies such as pruning and quantization that can be applied to a fixed Transformer architecture for better performance (Sec. 4);
  • Mapping and scheduling of operations in the Transformer architecture and the associated challenges (Sec. 5); and
  • The use of automated NAS for designing more efficient Transformer architectures and adapting them to target hardware (Sec. 6.1).

The key findings from this study include:

  • Despite their small FLOPs count, the nonlinear operations in Transformers can have a large impact on overall performance if they are not properly accounted for when designing domain-specific accelerators. Computing nonlinear operations such as Softmax and LayerNorm requires runtime statistics of the activations, whereas the BatchNorm operations in CNNs can be absorbed into prior convolutional layers during inference (see the sketch after this list).
  • Hardware design for CNNs may not necessarily carry over to Transformers. For instance, increasing the accumulator size to enable higher output reuse yielded significant performance improvements in Gemini for Transformer applications.
  • It appears less complex to schedule matmuls in Transformers than convolutions in CNNs, since scheduling a matmul involves 3 loops, as compared to 6 for a convolution. However, we observed that scheduling matmuls involves a similar number of decision points and a wide range of performance outcomes, making it as challenging as scheduling convolutions.
  • Fusing LayerNorms with the preceding matmuls in the Transformer architecture imposes several constraints on the mapping, particularly related to tile sizes. As a result, careful consideration must be taken when deciding whether to fuse operations, contrary to the common belief that operator fusion is generally beneficial.
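
The contrast in the first finding can be made concrete with a small NumPy sketch: BatchNorm's statistics are fixed at inference time and can be folded into the preceding weights offline, whereas LayerNorm and Softmax require reductions over the activations produced at run time. The linear layer here is a stand-in for a convolution, and all values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))               # a batch of activations
W = rng.standard_normal((8, 8))
b = rng.standard_normal(8)

# BatchNorm inference uses statistics frozen during training (running mean /
# variance), so it can be folded into the preceding linear/conv weights offline.
mu, var = rng.standard_normal(8), rng.random(8) + 0.5
gamma, beta, eps = rng.standard_normal(8), rng.standard_normal(8), 1e-5

scale = gamma / np.sqrt(var + eps)
W_folded = W * scale                          # per-output-channel rescale, done once
b_folded = (b - mu) * scale + beta
y_folded = x @ W_folded + b_folded            # no normalization op left at inference

y_unfused = ((x @ W + b) - mu) * scale + beta # reference: linear layer then BatchNorm
assert np.allclose(y_folded, y_unfused)

# LayerNorm and Softmax, in contrast, need statistics of the *current*
# activations, so the reductions must be computed on-chip at run time.
h = x @ W + b
ln = (h - h.mean(-1, keepdims=True)) / np.sqrt(h.var(-1, keepdims=True) + eps)
sm = np.exp(h - h.max(-1, keepdims=True))
sm /= sm.sum(-1, keepdims=True)
```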

Figure 26: EDP-perplexity plots of the Transformer architectures found via evolutionary search on our Gemini hardware configuration. Lower perplexity indicates better performance of the trained models. For better comparison, we additionally plot lines to illustrate +0.1 and +1 point perplexity degradation.



Finally, throughout the paper, we conducted case studies to quantify the advantages of co-design and co-optimization techniques across the stack for full-stack Transformer inference. Overall, the results exhibited an 88.7× EDP improvement without a noticeable performance drop, compared to a naive implementation without full-stack considerations.

  • In Sec. 3.4, we applied hardware design techniques to avoid the high communication overhead associated with offloading unsupported operations to the host CPU. Gemini was originally designed for CNN workloads, and making it perform Softmax, LayerNorm, and GELU on-chip required additional changes. Our implementation of dedicated normalization units to support Softmax and LayerNorm had an associated 5-15% area cost and an 8% latency increase. Nonetheless, this extra overhead was compensated by the gains from running the nonlinear operations on-chip using the polynomial approximation proposed in I-BERT [111] (a toy sketch of this style of approximation follows the list below). Combined with memory hierarchy re-balancing, this provided a net 39.6× latency reduction.
  • In Sec. 6.4, we ran NAS to search for Pareto-optimal Transformer architectures given the trade-off between EDP and perplexity on a popular language modeling task. We used Timeloop-simulated numbers to estimate the cost of various architectures within a large search space and to guide the automated NAS process. The total contribution, as shown in Fig. 25, is a 2.24× EDP reduction without a noticeable perplexity drop, and a 10.56× EDP reduction with a 1 point perplexity drop.
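
For intuition on the kind of nonlinear-function approximation referenced in the first bullet, the toy sketch below fits a second-order polynomial to erf on a clipped range and uses it inside GELU. It mirrors the spirit of I-BERT's approach but is not its actual coefficients or its integer-only pipeline.

```python
import numpy as np
from math import erf, sqrt

# Toy illustration only: fit a 2nd-order polynomial to erf on a clipped range
# and use it inside GELU. This is NOT I-BERT's exact scheme.
xs = np.linspace(0.0, 2.5, 256)
coeffs = np.polyfit(xs, [erf(v) for v in xs], deg=2)

def approx_erf(x):
    a = np.clip(np.abs(x), 0.0, 2.5)          # exploit the odd symmetry erf(-x) = -erf(x)
    return np.sign(x) * np.polyval(coeffs, a)

def approx_gelu(x):
    return 0.5 * x * (1.0 + approx_erf(x / sqrt(2.0)))

grid = np.linspace(-4.0, 4.0, 101)
exact = 0.5 * grid * (1.0 + np.array([erf(v / sqrt(2.0)) for v in grid]))
print(float(np.max(np.abs(approx_gelu(grid) - exact))))  # worst-case error of this toy fit
```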

We anticipate that our in-depth analysis and results, along with the comprehensive survey presented in this paper, will facilitate further advancements in understanding Transformer inference and in optimizing its efficiency from various angles. We believe this will enable Transformers to reach their full potential and expand their applications to a much wider range of areas than they have reached so far.
