Transformer-based large language models (LLMs) have emerged as the underpinning architecture for modern natural language processing. Today, the at-scale deployment of generative AI is gated by the prohibitive cost of running LLM inference on state-of-the-art systems. Furthermore, low-latency LLM inference, which is either impossible or expensive today, could unlock new use cases such as chain-of-thought reasoning, pair programming, and agentic workflows.
To reduce serving costs while delivering acceptable latencies, the industry has gravitated towards smaller models, sparse models such as mixture-of-experts (MoE), and alternative attention mechanisms such as grouped-query attention (GQA). Nevertheless, the key issues of expensive deployment and high inference latency remain.
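To make concrete why GQA is attractive for inference, here is a minimal NumPy sketch (illustrative only, not d-Matrix's or any specific model's implementation): several query heads share one key/value head, so the KV cache that must be held during generation shrinks by the grouping factor relative to standard multi-head attention.

```python
# Minimal sketch of grouped-query attention (GQA). Shapes and names are
# illustrative assumptions; the point is that k and v have fewer heads than q,
# so the cached K/V tensors are smaller by the grouping factor.
import numpy as np

def gqa_attention(q, k, v):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d)."""
    n_q_heads, seq_len, d = q.shape
    n_kv_heads = k.shape[0]
    group = n_q_heads // n_kv_heads        # query heads per shared KV head
    causal = np.triu(np.full((seq_len, seq_len), -np.inf), 1)  # mask future tokens
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                    # index of the shared K/V head
        scores = q[h] @ k[kv].T / np.sqrt(d) + causal
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[kv]
    return out

# Example: 8 query heads sharing 2 KV heads -> the KV cache is 4x smaller
# than it would be with full multi-head attention.
q = np.random.randn(8, 16, 64)
k = np.random.randn(2, 16, 64)
v = np.random.randn(2, 16, 64)
print(gqa_attention(q, k, v).shape)  # (8, 16, 64)
```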
In this guest lecture for Prof. Sophia Shao's Hardware for Machine Learning class at UC Berkeley, d-Matrix co-founder Sudeep Bhoja and his team discuss a ground-up, co-designed hardware and software architecture optimized for generative inference. Stepping through the key characteristics of the LLM inference workload alongside d-Matrix's novel approach, Bhoja explains how his team designed a modular, chiplet-based, CGRA-like architecture tailor-made for LLM inference and walks through how the architecture scales out from chiplets to multiple nodes.
In addition to the hardware considerations, the team looks at the associated software design of modern systems, including collective communication algorithms and the distributed inference serving stack, focusing on how they interoperate with model-architecture innovations and full-stack techniques. Through this examination, the d-Matrix team demonstrates ultra-low-latency, high-throughput LLM inference.
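As background for the collective communication portion of the talk, the sketch below simulates ring all-reduce, one of the standard collective algorithms a tensor-parallel serving stack relies on. The simulated in-process "ranks" are an illustrative assumption, not d-Matrix's implementation.

```python
# Minimal simulation of ring all-reduce: P ranks each contribute an array and
# all end up with the elementwise sum, moving only one chunk per step around
# a ring instead of shipping whole arrays to every peer.
import numpy as np

def ring_all_reduce(per_rank_data):
    """Sum arrays across P simulated ranks using the ring algorithm."""
    P = len(per_rank_data)
    # Each rank splits its array into P chunks.
    bufs = [list(np.array_split(x.astype(float), P)) for x in per_rank_data]

    # Reduce-scatter: in P-1 steps, each rank passes one chunk to its right
    # neighbour, which adds it to its own copy. Afterwards rank r holds the
    # fully reduced chunk (r + 1) mod P.
    for t in range(P - 1):
        for r in range(P):
            idx = (r - t) % P
            bufs[(r + 1) % P][idx] = bufs[(r + 1) % P][idx] + bufs[r][idx]

    # All-gather: in P-1 more steps, circulate the finished chunks so every
    # rank ends up with the complete reduced array.
    for t in range(P - 1):
        for r in range(P):
            idx = (r + 1 - t) % P
            bufs[(r + 1) % P][idx] = bufs[r][idx].copy()

    return [np.concatenate(b) for b in bufs]

# Example: 4 ranks, each contributing a vector of 8 elements.
data = [np.arange(8) + 10 * r for r in range(4)]
out = ring_all_reduce(data)
assert all(np.allclose(o, sum(data)) for o in out)
print(out[0])  # elementwise sum across the 4 ranks
```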