
Keyformer: KV Cache Reduction through Attention Sparsification for Efficient Generative Inference
March 27, 2024
TL;DR: Generative AI inference is often bottlenecked by the growing KV cache. Numerous strategies have been proposed to compress the KV cache and allow longer inference-time context lengths. However, most of…
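
To make the bottleneck concrete, here is a back-of-the-envelope sketch of how KV cache memory scales with context length. The model dimensions below (32 layers, 32 heads, head dimension 128, fp16 storage) are illustrative assumptions for a 7B-class transformer, not figures from this post:

```python
def kv_cache_bytes(num_layers: int, num_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Total KV cache size: 2 tensors (K and V) per layer, each of shape
    [batch, heads, seq_len, head_dim], stored at `bytes_per_elem`
    (2 bytes for fp16)."""
    return 2 * num_layers * num_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Illustrative 7B-class model at a 32k-token context
size_gib = kv_cache_bytes(num_layers=32, num_heads=32,
                          head_dim=128, seq_len=32_768) / 2**30
print(f"KV cache: {size_gib:.1f} GiB per sequence")  # 16.0 GiB at fp16
```

Because the cache grows linearly with sequence length (and with batch size), long contexts can consume more accelerator memory than the model weights themselves, which is what motivates compression approaches like the one discussed here.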