Private post draft
Contents
In this post…
Target Audience: Developers
Estimated Reading Time: 15 minutes
This post draws on the Hugging Face post "Methods and tools for efficient training on a single GPU"; for more detail, see Stas Bekman’s Machine Learning Engineering. Sources for other images and materials are credited where they appear.
Figure 1: Reconstructed by the author from Attention Is All You Need (Vaswani et al., 2017) and Full Stack Optimization of Transformer Inference: a Survey (Sehoon Kim et al., 2023).
Let’s estimate the FLOPs (floating-point operations) of Transformer models.
We’ll start with the parameters that determine a Transformer’s FLOPs. The attention layers and feed-forward network (FFN) layers account for most of the FLOPs, while layer normalization and residual connections contribute comparatively little. The estimate can therefore focus on the attention and FFN layers.
Before estimating FLOPs, let’s summarize the necessary parameters:
Focusing on the attention and FFN layers, let’s derive the formula for total FLOPs and walk through it step by step alongside the figure:
Figure 2: Reconstructed by the author from Attention Is All You Need (Vaswani et al., 2017), Full Stack Optimization of Transformer Inference: a Survey (Sehoon Kim et al., 2023), and presentation materials by Kim Jooyoung, CEO of Hyper Accel.
Let’s estimate the FLOPs of the LLaMA-2 7B model, considering the specific architecture information.
We’ll ignore the model’s layer normalization and residual connections and focus on the FLOPs of attention layers and FFN layers.
The architecture used for estimation of the LLaMA-2 7B model is as follows:
from transformers import AutoTokenizer
import transformers
import torch

# Hugging Face Hub ID of the LLaMA-2 7B checkpoint
model = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model)

# Build a text-generation pipeline in half precision
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Generate one sampled continuation of the prompt
sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

# Print the model configuration
print(pipeline.model.config)
LlamaConfig {
"_name_or_path": "meta-llama/Llama-2-7b-hf",
"architectures": [
"LlamaForCausalLM"
],
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 4096,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 32,
"pad_token_id": 0,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.31.0",
"use_cache": true,
"vocab_size": 32000
}
print(pipeline.model)
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 4096, padding_idx=0)
(layers): ModuleList(
(0-31): 32 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=4096, bias=False)
(v_proj): Linear(in_features=4096, out_features=4096, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
(up_proj): Linear(in_features=4096, out_features=11008, bias=False)
(down_proj): Linear(in_features=11008, out_features=4096, bias=False)
(act_fn): SiLUActivation()
)
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
)
)
(norm): LlamaRMSNorm()
)
(lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
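As a quick sanity check on the "7B" label, the layer shapes in the printout above can be tallied by hand. The sketch below is only an illustration: the function name is hypothetical, and it assumes that each LlamaRMSNorm holds a single weight vector of size hidden_size and that the rotary embedding has no trainable parameters.

```python
# Shapes read off the printed module tree (every Linear layer has bias=False).
VOCAB_SIZE = 32000
HIDDEN_SIZE = 4096
INTERMEDIATE_SIZE = 11008
NUM_LAYERS = 32


def count_llama_parameters(vocab, hidden, intermediate, layers):
    """Tally trainable parameters from the layer shapes (hypothetical helper)."""
    attention = 4 * hidden * hidden        # q_proj, k_proj, v_proj, o_proj
    mlp = 3 * hidden * intermediate        # gate_proj, up_proj, down_proj
    norms = 2 * hidden                     # assumption: one weight vector per RMSNorm
    per_layer = attention + mlp + norms
    embeddings = vocab * hidden            # embed_tokens
    lm_head = hidden * vocab               # separate matrix (tie_word_embeddings is false)
    final_norm = hidden                    # assumption: same RMSNorm layout as above
    return layers * per_layer + embeddings + lm_head + final_norm


total_parameters = count_llama_parameters(
    VOCAB_SIZE, HIDDEN_SIZE, INTERMEDIATE_SIZE, NUM_LAYERS
)
print(f"{total_parameters:,}")  # 6,738,415,616
```

Under these assumptions the tally lands at roughly 6.7 billion parameters, which is the figure rounded to "7B" in the memory estimates later in the post.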
The estimate uses the following parameter values: a sequence length of 512, 4,096 hidden units, 32 layers, 32 attention heads, and an FFN inner dimension of 11,008.
Calculation for LLaMA-2 7B:
\[\begin{aligned} \text{Total FLOPs} = & (\text{Total FLOPs for Attention Layers}) \\ & + (\text{Total FLOPs for FFN Layers}) \end{aligned}\]
\[\begin{aligned} \text{Total FLOPs} = & 32 \text{ Layers} \times [(2 \times 512 \times (4,096)^2) \times 32 \text{ Heads} \\ & + 4 \times 512 \times 4,096 \times 11,008] \\ = & 20,547,123,544,064 \quad \text{... (Value 1)} \end{aligned}\]
The above content can be summarized into Python code as follows:
# Given parameters
sequence_length = 512
hidden_units = 4096
number_of_layers = 32
heads = 32
ffn_inner_dim = 11008
# Calculating FLOPs
# 1. FLOPs for Self-Attention layers per layer
flops_per_attention_layer = 2 * sequence_length * hidden_units * hidden_units
# Total FLOPs for all Self-Attention layers
total_flops_attention = flops_per_attention_layer * number_of_layers * heads
# 2. FLOPs for Feed-Forward Network (FFN) layers per layer
flops_per_ffn_layer = 2 * 2 * sequence_length * hidden_units * ffn_inner_dim
# Total FLOPs for all FFN layers
total_flops_ffn = flops_per_ffn_layer * number_of_layers
# 3. Total FLOPs
total_flops = total_flops_attention + total_flops_ffn
total_flops
The total FLOPs for the LLaMA-2 7B model at a sequence length of 512 come to 20,547,123,544,064, excluding layer normalization, residual connections, and similar components.
The FLOPs needed to infer a single token follow by dividing this total by 512, which gives 40,131,100,672.
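The same arithmetic can be wrapped in a small function that also reports how the total splits between attention and FFN. This is a sketch that reuses the counting rule of the snippet above (including its multiplication of the attention term by the number of heads); the function name is hypothetical.

```python
def estimate_flops(sequence_length, hidden_units, number_of_layers, heads, ffn_inner_dim):
    """Return (attention, ffn, total) FLOPs with the same counting rule as the snippet above."""
    attention = 2 * sequence_length * hidden_units**2 * heads * number_of_layers
    ffn = 2 * 2 * sequence_length * hidden_units * ffn_inner_dim * number_of_layers
    return attention, ffn, attention + ffn


attention, ffn, total = estimate_flops(
    sequence_length=512,
    hidden_units=4096,
    number_of_layers=32,
    heads=32,
    ffn_inner_dim=11008,
)
per_token = total // 512  # spread the sequence total evenly over its 512 tokens

print(f"total FLOPs     : {total:,}")
print(f"FLOPs per token : {per_token:,}")
print(f"attention share : {attention / total:.1%}")
```

With these inputs the attention term accounts for about 86% of the estimate, so the result is most sensitive to how the attention layers are counted.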
In conclusion, the FLOPs for the LLaMA-2 model can be computed using the following expression:
\[\begin{aligned} \text{Total FLOPs} = \text{Number of Layers} \times [& (2 \times \text{Sequence Length}^2 \times \text{Hidden Units} \\ & + \text{Sequence Length} \times \text{Hidden Units}^2) \\ & \times \text{Number of Heads} \\ & + 4 \times \text{Sequence Length} \times \text{Hidden Units} \\ & \times \text{FFN Inner Dimension}] \end{aligned}\]
Simplified, this can be expressed as:
\[\begin{aligned} \text{Total FLOPs} \approx \text{Number of Layers} \times [& (2 \times (\text{Sequence Length})^2 \times (\text{Hidden Units})^2 \\ & + 4 \times \text{Sequence Length} \times \text{Hidden Units} \\ & \times \text{FFN Inner Dimension}) \\ & \times \text{Number of Heads}] \end{aligned}\]
A much coarser shortcut keeps only a single product of the dominant factors (Equation 4), where \(H\) is the number of hidden units. For the LLaMA-2 7B model it evaluates to 2,199,023,255,552:
\[\begin{aligned} \text{Simplified Total FLOPs} &= \text{Layers}^2 \times H^2 \times 4 \times \text{Heads} \quad \text{... (Equation 4)} \\ &= 32^2 \times 4096^2 \times 4 \times 32 \\ &= 2,199,023,255,552 \quad \text{... (Value 3)} \end{aligned}\]
For comparison, the detailed estimate for a sequence length of 512 was 20,547,123,544,064 FLOPs, or 40,131,100,672 FLOPs per token after dividing by 512.
Dividing Value 3 by the same sequence length of 512 gives the simplified per-token figure:
\[\text{Simplified Total FLOPs / 1 Token} = \frac{2,199,023,255,552}{512} = 4,294,967,296\]
The simplified estimate, 2,199,023,255,552, differs from the detailed value of 20,547,123,544,064, which already excludes minor components such as layer normalization and residual connections. Both figures are for a sequence length of 512.
Per token, the two estimates are therefore 40,131,100,672 (detailed) and 4,294,967,296 (simplified), a ratio of roughly 9.3.
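The two per-token figures can be reproduced and compared in a few lines. This is a sketch, not a derivation: the product used for the simplified value is the one that reproduces the 2,199,023,255,552 quoted in the text, and the variable names are illustrative.

```python
hidden_units = 4096
number_of_layers = 32
heads = 32
sequence_length = 512

# Product that reproduces the simplified total quoted in the text (Value 3).
simplified_total = number_of_layers**2 * hidden_units**2 * 4 * heads
simplified_per_token = simplified_total // sequence_length

# Detailed total for a 512-token sequence, taken from the earlier calculation.
detailed_total = 20_547_123_544_064
detailed_per_token = detailed_total // sequence_length

ratio = detailed_per_token / simplified_per_token

print(f"simplified total     : {simplified_total:,}")
print(f"simplified per token : {simplified_per_token:,}")
print(f"detailed per token   : {detailed_per_token:,}")
print(f"detailed / simplified: {ratio:.2f}")
```

The ratio comes out a little above 9, which is why a factor of ten serves as a convenient rounding.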
In conclusion, for the standard LLaMA-2 7B architecture, the total FLOPs per predicted token can be roughly approximated by scaling the simplified estimate up by a factor of ten:
\[\text{Rough Estimation of LLaMA-2 Total FLOPs / 1 Token} = (\text{Simplified Total FLOPs / 1 Token}) \times 10\]
This approximation ignores many factors, and the factor of ten in particular is only an intuitive rounding that may be inaccurate.
With GPT-4 widely presumed to use MoE (Mixture of Experts) layers, open-source developers such as Mistral AI have also released MoE models. Suppose such a model (for example Mixtral 8x7B) uses at least two of its eight experts for inference and, for simplicity, that each expert is a full 7B model. The system must then either keep all eight 7B models resident in VRAM or repeatedly load and unload them (I/O).
When multiple experts are used, the VRAM requirement grows linearly with their number. For instance, if all eight experts were LLaMA-2 7B models, naive inference with every parameter stored in 32 bits would require 4 bytes per parameter × 7 billion parameters = 28 billion bytes per model, and × 8 models = 224 GB of VRAM. Even at 16 bits (2 bytes) per parameter, it would require 112 GB. This is only for 7B models: with larger LLMs such as 13B, 70B, or 180B (although the nature of MoE makes it rare to deploy several models that large), the VRAM requirement would rise in proportion to the parameter count.
The A100 GPU ships with 40 GB of VRAM in its base configuration, and even an 80 GB A100 cannot hold 112 GB of weights, so inference on a single GPU can become difficult.
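To make the comparison with A100 capacities concrete, the sketch below computes the weight memory for eight 7B experts and the minimum number of GPUs needed to hold the weights alone. The helper names are hypothetical, and the calculation deliberately ignores activations, the KV cache, and any parameter sharing between experts.

```python
import math

PARAMS_7B = 7_000_000_000  # the post's round figure for a 7B model


def moe_weight_memory_gb(params_per_expert, num_experts, bytes_per_param):
    """Weight memory in GB (10^9 bytes), treating each expert as a separate full model."""
    return params_per_expert * num_experts * bytes_per_param / 1e9


def min_gpus(required_gb, gpu_vram_gb):
    """Smallest number of GPUs whose combined VRAM covers the weights."""
    return math.ceil(required_gb / gpu_vram_gb)


fp32_gb = moe_weight_memory_gb(PARAMS_7B, num_experts=8, bytes_per_param=4)
fp16_gb = moe_weight_memory_gb(PARAMS_7B, num_experts=8, bytes_per_param=2)

for label, required in (("float32", fp32_gb), ("float16", fp16_gb)):
    for vram in (40, 80):
        print(f"{label}: {required:.0f} GB -> {min_gpus(required, vram)} x A100 {vram} GB")
```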
In full precision (float32), every model parameter is stored in 32 bits (4 bytes). Inference with a 7B model therefore needs 4 bytes per parameter × 7 billion parameters = 28 billion bytes = 28 GB of GPU memory for the weights alone. In half precision, each parameter is stored in 16 bits (2 bytes), so 14 GB is enough. With 8-bit and 4-bit quantization, each parameter takes 8 bits (1 byte) or 4 bits (1/2 byte), requiring 7 GB or 3.5 GB respectively.
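The per-precision figures follow from one multiplication: parameters times bits per parameter, divided by eight to get bytes. A minimal sketch, with illustrative names and the post's round figure of 7 billion parameters:

```python
BITS_PER_PARAM = {"float32": 32, "float16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights only, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


memory_by_dtype = {
    name: weight_memory_gb(7_000_000_000, bits)
    for name, bits in BITS_PER_PARAM.items()
}

for name, gb in memory_by_dtype.items():
    print(f"{name:>8}: {gb:g} GB")
```

Note that 8 bits is one full byte, so the 8-bit case needs twice the memory of the 4-bit case.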
For training, the requirement depends on the method. With standard AdamW, the optimizer typically needs 8 bytes per parameter, because it keeps two moment estimates (running averages of the gradients and of the squared gradients) for every parameter. For a 7B model, this comes to 8 bytes per parameter × 7 billion parameters = 56 GB of GPU memory for the optimizer state alone, multiplied by the number of experts when MoE is adopted.
Adafactor reduces this to 4 bytes per parameter, or 28 GB in total, while 8-bit optimizers from bitsandbytes (e.g., 8-bit AdamW) need 2 bytes per parameter, or 14 GB in total.
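The optimizer figures can be tabulated the same way. The sketch below uses the per-parameter byte counts quoted in the text; the names are illustrative, and the optional expert multiplier assumes every expert is trained as a separate 7B model.

```python
# Bytes of optimizer state per parameter, as quoted in the text.
OPTIMIZER_STATE_BYTES = {"AdamW": 8, "Adafactor": 4, "8-bit AdamW": 2}


def optimizer_state_gb(num_params, bytes_per_param, num_experts=1):
    """Optimizer-state memory in GB (10^9 bytes); weights and gradients are not included."""
    return num_params * bytes_per_param * num_experts / 1e9


state_by_optimizer = {
    name: optimizer_state_gb(7_000_000_000, nbytes)
    for name, nbytes in OPTIMIZER_STATE_BYTES.items()
}

for name, gb in state_by_optimizer.items():
    print(f"{name:>12}: {gb:g} GB")

# With eight experts the AdamW state alone grows eightfold.
adamw_eight_experts_gb = optimizer_state_gb(7_000_000_000, 8, num_experts=8)
print(f"AdamW x 8 experts: {adamw_eight_experts_gb:g} GB")
```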
For further details, refer to the Hugging Face post "Methods and tools for efficient training on a single GPU".
Figure 3: The image is reconstructed by the author based on Attention Is All You Need (Vaswani et al., 2017), Full Stack Optimization of Transformer Inference: a Survey (Sehoon Kim et al., 2023), and presentation materials from CEO Kim Jooyoung of Hyper Accel.
In this post…
This post draws on the Hugging Face post "Methods and tools for efficient training on a single GPU"; for more detail, see Stas Bekman’s Machine Learning Engineering. Sources for other images and materials are credited where they appear.
Figure 1: Reconstructed by the author from Attention Is All You Need (Vaswani et al., 2017) and Full Stack Optimization of Transformer Inference: a Survey (Sehoon Kim et al., 2023).
Let’s estimate the FLOPs (floating-point operations) of Transformer models.
First, let’s look at the parameters that affect a Transformer’s FLOPs. The attention layers and FFN (feed-forward network) layers account for most of the FLOPs, while layer normalization and residual connections contribute comparatively little to the total, so the estimate can focus on the attention and FFN layers.
Before estimating FLOPs, the parameters we need are as follows.
Focusing on the FLOPs of the attention and FFN layers, the formula for total FLOPs is derived as follows; we will walk through it step by step alongside the figure.
Figure 2: Reconstructed from Attention Is All You Need (Vaswani et al., 2017), Full Stack Optimization of Transformer Inference: a Survey (Sehoon Kim et al., 2023), and materials from Hyper Accel.
Let’s estimate the FLOPs of the LLaMA-2 7B model, whose architecture is documented in detail. (We ignore the model’s layer normalization and residual connections and focus on the FLOPs of the attention and FFN layers.)
The LLaMA-2 7B architecture used for the estimate is as follows:
from transformers import AutoTokenizer
import transformers
import torch
model = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
"text-generation",
model=model,
torch_dtype=torch.float16,
device_map="auto",
)
sequences = pipeline(
'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
do_sample=True,
top_k=10,
num_return_sequences=1,
eos_token_id=tokenizer.eos_token_id,
max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
print(pipeline.model.config)
LlamaConfig {
"_name_or_path": "meta-llama/Llama-2-7b-hf",
"architectures": [
"LlamaForCausalLM"
],
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 4096,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 32,
"pad_token_id": 0,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.31.0",
"use_cache": true,
"vocab_size": 32000
}
print(pipeline.model)
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 4096, padding_idx=0)
(layers): ModuleList(
(0-31): 32 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=4096, bias=False)
(v_proj): Linear(in_features=4096, out_features=4096, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
(up_proj): Linear(in_features=4096, out_features=11008, bias=False)
(down_proj): Linear(in_features=11008, out_features=4096, bias=False)
(act_fn): SiLUActivation()
)
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
)
)
(norm): LlamaRMSNorm()
)
(lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
The parameter values needed for the estimate are summarized as follows.
An FFN generally contains two linear layers, which perform two linear transformations: from the input dimension to the intermediate dimension, and from the intermediate dimension to the output dimension.
\[\begin{aligned} (\text{FLOPs / FFN}) = & 2 \times (\text{Sequence Length}) \\ & \times (\text{Hidden Dimension}) \\ & \times (\text{FFN Dimension}) \times 2 \end{aligned}\]
\[\begin{aligned} \text{Total FLOPs for FFN layers} = & [2 \times (\text{Sequence Length}) \\ & \times (\text{Hidden Dimension}) \\ & \times (\text{FFN Dimension}) \times 2] \\ & \times (\text{Number of Layers}) \end{aligned}\]
Ignoring the FLOPs of the other components, we can estimate the FLOPs of the whole Transformer model by adding the self-attention and FFN FLOPs computed above.
\[\begin{aligned} \text{Total FLOPs} = & (\text{Number of layers}) \\ & \times (\text{FLOPs for self-attention} \\ & \quad + \text{FLOPs for FFN}) \end{aligned}\]
The above can be summarized in Python as follows:
# Given parameters (Llama-2)
sequence_length = 512
hidden_units = 4096
number_of_layers = 32
heads = 32
ffn_inner_dim = 11008
# Calculating FLOPs
# 1. FLOPs for Self-Attention layers per layer
flops_per_attention_layer = 2 * sequence_length * hidden_units * hidden_units
# Total FLOPs for all Self-Attention layers
total_flops_attention = flops_per_attention_layer * number_of_layers * heads
# 2. FLOPs for Feed-Forward Network (FFN) layers per layer
flops_per_ffn_layer = 2 * 2 * sequence_length * hidden_units * ffn_inner_dim
# Total FLOPs for all FFN layers
total_flops_ffn = flops_per_ffn_layer * number_of_layers
# 3. Total FLOPs
total_flops = total_flops_attention + total_flops_ffn
total_flops
The total FLOPs of the LLaMA-2 7B model for a sequence length of 512 come to 20,547,123,544,064, excluding layer normalization, residual connections, and similar components.
The FLOPs needed to infer a single token follow by dividing this total by 512, which gives 40,131,100,672.
\[\text{(Total FLOPs / token)} = 40,131,100,672\]
In conclusion, the FLOPs of the LLaMA-2 model can be obtained with the following expression:
\[\begin{aligned} \text{Total FLOPs} = \text{Number of Layers} \times [& (2 \times \text{Sequence Length}^2 \times \text{Hidden Units} \\ & + \text{Sequence Length} \times \text{Hidden Units}^2) \\ & \times \text{Number of Heads} \\ & + 4 \times \text{Sequence Length} \times \text{Hidden Units} \\ & \times \text{FFN Inner Dimension}] \end{aligned}\]
Simplified further:
\[\begin{aligned} \text{Simplified Total FLOPs} \approx \text{Number of Layers} \times [& (2 \times (\text{Sequence Length})^2 \times (\text{Hidden Units})^2 \\ & + 4 \times \text{Sequence Length} \times \text{Hidden Units} \\ & \times \text{FFN Inner Dimension}) \\ & \times \text{Number of Heads}] \end{aligned}\]
Reducing this to a single product of the dominant factors and applying it to the LLaMA-2 7B model gives the following result, where \(H\) is the number of hidden units:
\[\begin{aligned} \text{Simplified Total FLOPs} &= \text{Layers}^2 \times H^2 \times 4 \times \text{Heads} \\ &= 32^2 \times 4096^2 \times 4 \times 32 \\ &= 2,199,023,255,552 \end{aligned}\]
This value is the total for a sequence length of 512, so the FLOPs needed for one token again follow by dividing by 512, which gives 4,294,967,296.
(Simplified Total FLOPs / token) = 4,294,967,296 … (Value 4)
# Given parameters
sequence_length = 512
hidden_units = 4096
number_of_layers = 32
heads = 32
ffn_inner_dim = 11008
# Calculate the simplified FLOPs
simplified_flops = number_of_layers**2 * hidden_units**2 * 4 * heads
simplified_flops
In other words, for the prediction of a single token, the simplified estimate can be written compactly as follows:
\[\text{(Simplified Total FLOPs / 1 Token)} = \frac{\text{Simplified Total FLOPs}}{512} = 4,294,967,296\]
The simplified estimate, 2,199,023,255,552, differs from 20,547,123,544,064, the value computed with only the minor components (layer normalization, residual connections) excluded. Both figures are for a sequence of length 512.
Per token, the two estimates are 40,131,100,672 and 4,294,967,296 FLOPs respectively.
In conclusion, for the standard LLaMA-2 7B architecture, the total FLOPs can be estimated as ten times the simplified expression:
\[\text{(Rough Estimation of LLaMA-2 Total FLOPs / 1 Token)} = \text{(Simplified Total FLOPs / 1 Token)} \times 10\]
With this simple approximation, and ignoring various other factors, we can estimate the rough total FLOPs needed to predict each token. Note that multiplying by ten is only a way to obtain a rough, intuitive figure and may be inaccurate.
With many people presuming that GPT-4 uses MoE (Mixture of Experts, hereafter "MoE") layers, open-source developers such as Mistral AI have also released MoE models. Suppose such a model (for example Mixtral 8x7B) uses at least two of its eight experts for inference and, for simplicity, that each expert is a full 7B model. The system must then either keep all eight 7B models resident in VRAM or repeatedly load and unload them (I/O).
When multiple experts are used in this way, the required VRAM grows linearly. If all eight LLMs were LLaMA-2 7B models, naive inference with every parameter in 32 bits would require 4 bytes per parameter × 7 billion parameters = 28 billion bytes per model, and × 8 models = 224 GB of VRAM. Even at 16 bits (2 bytes), inference would require 112 GB. This is only for 7B models: with larger LLMs such as 13B, 70B, or 180B (although the nature of MoE makes it rare to deploy several models that large), the VRAM requirement would rise in proportion to the parameter count.
The A100 ships with 40 GB of VRAM in its base configuration, and inference can be difficult even on an 80 GB A100.
In full precision (float32), every model parameter is stored in 32 bits (4 bytes). Inference alone therefore needs 4 bytes per parameter × 7 billion parameters = 28 billion bytes = 28 GB of GPU memory. In half precision, each parameter is stored in 16 bits (2 bytes), so inference needs 14 GB. With 8-bit and 4-bit quantization, each parameter takes 8 bits (1 byte) or 4 bits (1/2 byte), requiring 7 GB or 3.5 GB respectively.
For training, the requirement depends on the method. With standard AdamW, the optimizer typically needs 8 bytes per parameter, because it keeps two moment estimates (running averages of the gradients and of the squared gradients) for every parameter.
For a 7B model, training therefore needs 8 bytes per parameter × 7 billion parameters = 56 GB of GPU memory for the optimizer state alone, multiplied by the number of experts when MoE is adopted.
Adafactor needs 4 bytes per parameter, or 28 GB of GPU memory in total, and even 8-bit optimizers from bitsandbytes (e.g., 8-bit AdamW) need 2 bytes per parameter, or 14 GB in total.
For details, see the Hugging Face post "Methods and tools for efficient training on a single GPU".
Figure 3: Reconstructed from Attention Is All You Need (Vaswani et al., 2017), Full Stack Optimization of Transformer Inference: a Survey (Sehoon Kim et al., 2023), and materials from Hyper Accel.
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. “Attention Is All You Need.” arXiv preprint arXiv:1706.03762 (2017).
[2] Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami. “Full Stack Optimization of Transformer Inference: a Survey” arXiv preprint arXiv:2302.14017 (2023).
[3] Decoding transformers on edge devices
[4] Hyper Accel
CC BY-NC-ND 4.0 KR. All rights reserved.
Copyright (c) 2023 Author: Minwoo Park, South Korea
All Copyright (c) Reserved.