Keyformer: KV Cache Reduction through Attention Sparsification for Efficient Generative Inference

March 27, 2024

TL;DR: Generative AI inference is often bottlenecked by the growing KV cache. Numerous strategies have been proposed to compress the KV cache to allow longer inference-time context lengths. However, most of…
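
To make the memory pressure concrete, here is a back-of-the-envelope sketch of how KV cache size scales with context length. The model dimensions below (a 7B-parameter-class configuration: 32 layers, 32 KV heads, head dimension 128, fp16) are illustrative assumptions, not figures from this post.

```python
# Illustrative KV cache sizing; all model dimensions below are assumptions.
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    # Each layer stores one key and one value vector per KV head, per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size

# Example: a 7B-class model serving a single 32K-token sequence in fp16.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      seq_len=32_768, batch_size=1)
print(f"{size / 2**30:.1f} GiB")  # -> 16.0 GiB, growing linearly with seq_len
```

At these assumed dimensions the cache alone reaches 16 GiB for one sequence, which is why compressing it is attractive for long-context inference.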