AI Inference is on an unsustainable trajectory
The arrival of ChatGPT in 2022 captured the world’s imagination around AI, and there has been no looking back. The discussion has largely focused on model training: how large the models are, how much money it takes to train them, and how much power training consumes.
What we don’t hear enough about is that once a model is trained, ROI is realized by deploying it in production, i.e. performing inference, and that inference volume is projected to grow significantly, leading to an unsustainable increase in cost and power consumption.
At the recent OCP Global Summit in Silicon Valley, our VP of Product, Sree Ganesan, dug into the shift happening in GenAI. Her talk, “How d-Matrix Is Leveraging ODSA’s BoW Die-to-Die Link to Transform Generative AI Inference from Unsustainable to Attainable,” focused on the unique challenges of GenAI workloads and what d-Matrix is doing about them.
Generative AI workloads have unique challenges
d-Matrix anticipated the demands of generative AI from the start. Founded in 2019, the company made an early bet on Transformer models, focusing on the dual nature of generative inference: a two-part workload of prompt processing followed by token generation. Prompt processing is compute-bound; it ingests the user’s entire prompt to produce the first output tokens. Token generation, on the other hand, is bound by memory bandwidth, because every generated token requires streaming the model weights and accumulated context from memory. This is where traditional architectures fail to scale: they hit a memory wall that caps performance. In addition, growing model sizes and context lengths demand ever more memory capacity. Together, these requirements drive high compute costs and power consumption.
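To see why token generation hits a memory wall while prompt processing does not, consider a back-of-the-envelope roofline calculation. The sketch below uses illustrative numbers we have assumed (a hypothetical 70B-parameter model served with 8-bit weights), not measurements of any particular system:

```python
# Rough arithmetic intensity (FLOPs per byte of weights read) for the two
# phases of generative inference. All numbers are illustrative assumptions.

params = 70e9                  # hypothetical 70B-parameter model
bytes_per_param = 1            # 8-bit weights
flops_per_token = 2 * params   # ~2 FLOPs per parameter per token

# Prefill: the whole prompt is processed in one batched pass, so the weights
# are read once but reused across every prompt token -> high intensity.
prompt_len = 2048
prefill_intensity = (flops_per_token * prompt_len) / (params * bytes_per_param)

# Decode: each new token reruns the model, so the full set of weights is
# streamed from memory again for every single token -> ~2 FLOPs per byte.
decode_intensity = flops_per_token / (params * bytes_per_param)

print(f"prefill: ~{prefill_intensity:.0f} FLOPs/byte")  # ~4096
print(f"decode:  ~{decode_intensity:.0f} FLOPs/byte")   # ~2
```

Modern accelerators deliver far more compute per byte of memory bandwidth than the roughly 2 FLOPs/byte the decode phase offers, so during token generation the compute units sit idle waiting on memory. That gap is the memory wall.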
d-Matrix solves these challenges by leveraging open standards and open source
We tackle these challenges with a novel digital in-memory compute (DIMC) architecture that integrates memory and compute into one cohesive solution, breaking through the memory wall and making generative inference more efficient. The architecture scales using chiplets connected through high-speed die-to-die links. These interconnects are based on the Open Domain-Specific Architecture (ODSA) Bunch of Wires (BoW) open standard and support low-energy, low-latency data transfer between chiplets in an “all-to-all” topology, which is critical for fast token generation. These technologies come together in our first product, d-Matrix Corsair, an industry-standard PCIe card with two packages per card. Each Corsair package contains four chiplets, creating a scalable memory-compute complex with up to 150 TB/s of memory bandwidth. Corsair also has native support for block floating-point numerics through another recent open standard, the OCP Microscaling (MX) format. By supporting MXINT8 and MXINT4 numerics, Corsair increases inference efficiency and aligns with the broader AI community’s direction.
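To give a feel for what block floating-point numerics like MXINT8 look like, here is a simplified Python/NumPy sketch of the core idea: each small block of values shares one power-of-two scale and stores compact 8-bit integer elements. This illustrates the concept only; the exact MXINT8 encoding (block size, scale format, element format) is defined by the OCP Microscaling specification, not by this toy code:

```python
import numpy as np

BLOCK = 32  # MX formats scale small blocks of elements together

def quantize_blockfp(x):
    """Toy block floating-point quantizer: one shared power-of-two scale
    per 32-element block, int8 elements. Assumes len(x) % BLOCK == 0."""
    x = x.reshape(-1, BLOCK)
    max_abs = np.abs(x).max(axis=1, keepdims=True)
    # Choose each block's exponent so its largest value fits in int8.
    exp = np.ceil(np.log2(np.maximum(max_abs, 1e-38) / 127.0))
    scale = 2.0 ** exp
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_blockfp(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(4 * BLOCK).astype(np.float32)
q, scale = quantize_blockfp(x)
err = np.abs(dequantize_blockfp(q, scale).reshape(-1) - x).max()
print(f"max reconstruction error: {err:.4f}")
```

Because the scale is shared across a block rather than stored per element, the representation keeps most of int8’s storage and bandwidth savings while still tracking the dynamic range of each block of weights, which is exactly what helps on bandwidth-bound workloads.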
Supporting Corsair’s hardware is d-Matrix’s Aviator software stack. Developed to integrate smoothly with the open AI software ecosystem, Aviator makes model conversion easy, letting users bring trained models from GPUs or other systems onto d-Matrix hardware. Built with open-source software such as OpenBMC, MLIR, PyTorch, and the Triton DSL for custom kernel creation, Aviator includes native support for distributed inference across multiple Corsair cards and servers, which is necessary for handling large-scale, memory-intensive generative AI models.
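As one concrete example of the open tooling involved: Triton is an open-source, Python-embedded DSL for writing custom kernels. The sketch below is the canonical Triton vector-add kernel, included to show the kind of code the DSL enables; as written it targets a GPU, and how Aviator maps Triton kernels onto Corsair is d-Matrix-specific and not shown here:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged tail of the vector
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # x and y are device tensors of the same shape.
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Writing kernels at this level, rather than against a proprietary low-level API, is part of what lets developer skills and much of the surrounding software carry over between hardware targets.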
Taking an open-standards approach makes Corsair compatible with a wide range of data centers and AI servers, so our customers can easily integrate d-Matrix into their existing infrastructure. We prioritized compatibility with the open ecosystem to make it easy for enterprises to adopt Corsair without overhauling or interfering with existing AI solutions, and to ensure the broadest possible access to high-performance generative inference.
To unleash the full potential of GenAI and make it widely accessible, we at d-Matrix believe it must be delivered affordably and sustainably, without sacrificing performance. With Corsair, d-Matrix is taking generative AI from “unsustainable” to “attainable”: finally, commercially viable.
Listen to Sree’s talk on how d-Matrix is advancing Generative AI here:
All trademarks, logos and brand names are the property of their respective owners.
Corsair™ is a trademark of d-Matrix.