Attn | Multi-matrix Factorization Attention

MinWoo(Daniel) Park | Tech Blog

  • Related Project: Private
  • Category: Paper Review
  • Date: 2025-01-04

Multi-matrix Factorization Attention

  • url: https://arxiv.org/abs/2412.19255
  • pdf: https://arxiv.org/pdf/2412.19255
  • html: https://arxiv.org/html/2412.19255v1
  • abstract: We propose novel attention architectures, Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants of standard Multi-Head Attention (MHA), including SOTA methods such as MLA, fail to maintain equally strong performance under stringent Key-Value cache (KV cache) constraints. MFA enhances model capacity by efficiently scaling up both the number and the dimension of attention heads through low-rank matrix factorization in the Query-Key (QK) circuit. Extending MFA, MFA-KR further reduces memory requirements by repurposing the key cache as values through value projection re-parameterization. MFA’s design enables strong model capacity under a tight KV cache budget, while MFA-KR suits even harsher KV cache limits with a minor performance trade-off. Notably, in our extensive and large-scale experiments, the proposed architecture outperforms MLA and performs comparably to MHA, while reducing KV cache usage by up to 56% and 93.7%, respectively.
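
As a rough aid to reading the abstract, the PyTorch sketch below illustrates the general shape of the two ideas: a low-rank factorized query projection that makes it cheap to scale the number and dimension of query heads, a single shared key/value head so the KV cache stays small, and a `key_reuse` flag that mimics MFA-KR by deriving values from the cached keys. All dimensions, layer names, and the exact placement of the factorization are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class MFASketch(nn.Module):
    """Minimal sketch of the MFA idea, not the authors' implementation.

    Assumptions (illustrative): a low-rank factorized query projection (the QK
    circuit) scales the number and dimension of query heads cheaply, while a
    single shared key head and value head keep the KV cache small. The
    key_reuse flag mimics MFA-KR by re-parameterizing values from the keys.
    """

    def __init__(self, d_model=1024, n_heads=16, head_dim=128, q_rank=256,
                 key_reuse=False):
        super().__init__()
        self.n_heads, self.head_dim, self.key_reuse = n_heads, head_dim, key_reuse
        # Low-rank factorization of the query projection:
        # d_model -> q_rank -> n_heads * head_dim.
        self.q_down = nn.Linear(d_model, q_rank, bias=False)
        self.q_up = nn.Linear(q_rank, n_heads * head_dim, bias=False)
        # Single shared key head: only one head's keys go into the cache.
        self.k_proj = nn.Linear(d_model, head_dim, bias=False)
        if key_reuse:
            # MFA-KR-style: values come from the cached keys via a learned
            # re-parameterization, so only the key cache is needed.
            self.v_from_k = nn.Linear(head_dim, head_dim, bias=False)
        else:
            self.v_proj = nn.Linear(d_model, head_dim, bias=False)
        self.out_proj = nn.Linear(n_heads * head_dim, d_model, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        # Many query heads, produced through the low-rank bottleneck.
        q = self.q_up(self.q_down(x)).view(B, T, self.n_heads, self.head_dim)
        q = q.transpose(1, 2)                        # (B, H, T, D)
        k = self.k_proj(x).unsqueeze(1)              # (B, 1, T, D), shared across heads
        v = self.v_from_k(k) if self.key_reuse else self.v_proj(x).unsqueeze(1)
        # Plain causal attention; the shared K/V broadcast over the H query heads.
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5   # (B, H, T, T)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(causal, float("-inf"))
        out = scores.softmax(dim=-1) @ v             # (B, H, T, D)
        return self.out_proj(out.transpose(1, 2).reshape(B, T, -1))


x = torch.randn(2, 8, 1024)
print(MFASketch()(x).shape)                 # torch.Size([2, 8, 1024])
print(MFASketch(key_reuse=True)(x).shape)   # same shape, smaller cache footprint
```

In this sketch the cache footprint is one key head (plus one value head without key reuse) per token, independent of how many query heads are used, which is the property the abstract attributes to MFA and MFA-KR under tight KV cache budgets.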