
Keyformer: KV Cache Reduction through Attention Sparsification for Efficient Generative Inference
March 27, 2024
TL;DR: Generative AI inference is often bottlenecked by the growing KV cache. Numerous strategies have been proposed to compress the KV cache and allow longer inference-time context lengths. However, most of…
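
To make the bottleneck concrete, here is a back-of-the-envelope sketch of how KV cache memory scales with context length. The model dimensions below (32 layers, 32 heads, head dimension 128, fp16 storage) are illustrative assumptions for a 7B-class transformer, not figures from this post:

```python
def kv_cache_bytes(num_layers: int, num_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Total KV cache size: 2 tensors (K and V) per layer, each of shape
    [batch, heads, seq_len, head_dim], stored at `bytes_per_elem`
    (2 bytes for fp16)."""
    return 2 * num_layers * num_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Illustrative 7B-class model at a 32k-token context
size_gib = kv_cache_bytes(num_layers=32, num_heads=32,
                          head_dim=128, seq_len=32_768) / 2**30
print(f"KV cache: {size_gib:.1f} GiB per sequence")  # 16.0 GiB at fp16
```

Because the cache grows linearly with sequence length (and with batch size), long contexts can consume more accelerator memory than the model weights themselves, which is what motivates compression approaches like the one discussed here.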