d-Matrix engineers will be sharing their latest research at NeurIPS again this year. Among the innovations are new techniques for transforming workloads for more efficient inference.
“I’d put d-Matrix R&D and engineering teams up against any in the world in their ability to solve complex problems. The team continues to innovate on top of other first-principle approaches and calculations we have built into our solutions. Very proud of the team.” – Sid Sheth, CEO.
The 2024 NeurIPS Efficient Natural Language and Speech Processing (ENLSP) Workshop focuses on real-world use cases and on making large language and foundation models more efficient in terms of architecture, training, and inference. We will be presenting two papers at this workshop:
Scaling laws for post-training quantized large language models
Generalization abilities of well-trained large language models (LLMs) are known to scale predictably as a function of model size. In contrast to the existence of practical scaling laws governing pre-training, the quality of LLMs after post-training compression remains highly unpredictable, often requiring case-by-case validation in practice. In this work, we attempted to close this gap for post-training weight quantization of LLMs by conducting a systematic empirical study on multiple LLM families quantized to numerous low-precision tensor data types using popular weight quantization techniques. We identified key scaling factors pertaining to characteristics of the local loss landscape, based on which the performance of quantized LLMs can be reasonably well predicted by a statistical model.
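To make the flavor of this result concrete, here is a minimal sketch of fitting a statistical predictor of quantization-induced quality loss from model size and weight bit width. The functional form, the data points, and the helper names are illustrative assumptions only; they are not the scaling factors or the statistical model derived in the paper, which are based on properties of the local loss landscape.

```python
# Hypothetical sketch: predict post-quantization loss degradation from model size
# and weight bit width. Data and functional form are placeholders, not paper results.
import numpy as np

# Toy observations: (parameters in billions, weight bits, measured loss increase)
observations = np.array([
    [1.0, 4, 0.42],
    [7.0, 4, 0.18],
    [13.0, 4, 0.11],
    [1.0, 3, 1.10],
    [7.0, 3, 0.55],
    [13.0, 3, 0.37],
])

# Assume log(delta_loss) is roughly linear in log(model size) and bit width.
N, b, delta = observations[:, 0], observations[:, 1], observations[:, 2]
X = np.column_stack([np.ones_like(N), np.log(N), b])
coef, *_ = np.linalg.lstsq(X, np.log(delta), rcond=None)

def predict_delta_loss(params_billion: float, bits: int) -> float:
    """Predict quantization-induced loss increase for an unseen (size, bits) point."""
    x = np.array([1.0, np.log(params_billion), bits])
    return float(np.exp(x @ coef))

print(predict_delta_loss(30.0, 4))
```

The point of such a model is exactly what the abstract describes: once the relevant scaling factors are identified, the quality of a quantized LLM can be estimated up front instead of validated case by case.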
Post Training Quantization of Large Language Models with Microscaling Formats
Large Language Models (LLMs) have distinguished themselves with outstanding performance in complex language modeling tasks, yet they come with significant computational and storage challenges. This paper explores the potential of quantization to mitigate these challenges. We systematically study the combined application of three well-known post-training techniques, SmoothQuant, AWQ, and GPTQ, and provide a comprehensive analysis of their interactions and implications for advancing LLM quantization. We enhance the versatility of these methods by enabling quantization to microscaling (MX) formats, extending the applicability of these PTQ algorithms beyond their original fixed-point format targets. We show that combining different PTQ methods enables us to quantize models to 4-bit weights and 8-bit activations using the MINT format with negligible accuracy loss compared to the uncompressed baseline.
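For readers unfamiliar with microscaling, the sketch below illustrates the basic idea behind block-scaled (MX-style) weight quantization: small blocks of weights share one power-of-two scale and store low-bit integer elements. The block size, scale choice, and rounding here are simplifying assumptions for illustration and do not reproduce the exact formats or PTQ pipeline evaluated in the paper.

```python
# Minimal sketch of block-scaled ("MX-style") fake quantization of a weight vector.
import numpy as np

def quantize_mx_block(w: np.ndarray, elem_bits: int = 4, block_size: int = 32):
    """Fake-quantize a 1-D weight vector block by block; returns the dequantized copy."""
    qmax = 2 ** (elem_bits - 1) - 1          # symmetric signed range, e.g. [-7, 7] for 4 bits
    out = np.empty_like(w, dtype=np.float32)
    for start in range(0, w.size, block_size):
        block = w[start:start + block_size].astype(np.float32)
        amax = np.abs(block).max()
        if amax == 0:
            out[start:start + block_size] = 0.0
            continue
        # Shared power-of-two scale so the largest element fits in the integer grid.
        scale = 2.0 ** np.ceil(np.log2(amax / qmax))
        q = np.clip(np.round(block / scale), -qmax, qmax)
        out[start:start + block_size] = q * scale
    return out

w = np.random.randn(4096).astype(np.float32) * 0.02
w_q = quantize_mx_block(w, elem_bits=4, block_size=32)
print("max abs error:", np.abs(w - w_q).max())
```

Techniques such as SmoothQuant, AWQ, and GPTQ then reshape or reorder the weights and activations so that this kind of low-bit representation loses as little accuracy as possible.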
Our research also pushes the boundaries of model efficiency through non-traditional hardware and co-design principles. This design focus takes center stage at another workshop at NeurIPS.
The ML with New Compute Paradigms (MLNCP) Workshop at NeurIPS 2024 aims to establish new synergies between ML models and non-traditional hardware. We will be presenting the following paper:
SLaNC: Static LayerNorm Calibration
The ever-increasing sizes of Large Language Models (LLMs), now beyond hundreds of billions of parameters, have generated enormous pressure on the manufacturers of dedicated hardware accelerators and made the innovative design of the latter one of the most rapidly expanding fields of the AI industry. Various approaches have been explored to enable efficient and accurate processing of LLMs on the available accelerators given their computational and storage limitations. Among these, various quantization techniques have become the main focus of the community as a means of reducing the compute, communication, and storage requirements. Quantization to lower-precision formats naturally poses a number of challenges caused by the limited range of the available value representations. When it comes to processing the popular Transformer models on hardware, one of the main issues becomes the calculation of LayerNorm, simply because accumulation of the variance requires a much wider dynamic range than the hardware supports. In this article, we address this matter and propose a computationally efficient scaling technique that can be easily applied to Transformer models during inference.
Our method suggests a straightforward way of scaling the LayerNorm inputs based on the static weights of the immediately preceding linear layers. The scaling factors are computed offline, based solely on the linear layer weights, hence no latency or computational overhead is added during inference. Most importantly, our technique ensures that no numerical issues such as overflow or underflow could happen during the compute. This approach offers smooth, accurate and resource-effective inference across a wide range of hardware architectures. The article provides theoretical justification as well as supporting numerical simulations.
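The sketch below illustrates the underlying intuition: LayerNorm is (nearly) invariant to a uniform rescaling of its input, so dividing the input by a constant computed offline from the preceding linear layer's weights keeps the variance accumulation within a hardware-friendly range without changing the normalized output. The specific scaling factor used here (the spectral norm of the weight matrix) is an assumption for illustration; the paper derives its own scaling factors and provides the accompanying analysis.

```python
# Illustrative sketch of static LayerNorm calibration: pre-scale the LayerNorm
# input by a constant derived offline from the preceding linear layer's weights.
# The spectral-norm scale below is a placeholder, not the paper's exact formula.
import numpy as np

def static_layernorm_scale(W: np.ndarray) -> float:
    """Offline: a bound on how much the preceding linear layer can amplify inputs."""
    return float(np.linalg.norm(W, ord=2))  # largest singular value

def layernorm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    mu = x.mean(axis=-1, keepdims=True)
    var = ((x - mu) ** 2).mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 1024)).astype(np.float32) * 4.0   # deliberately large weights
x = rng.normal(size=(1, 1024)).astype(np.float32)
y = x @ W.T

# Runtime: divide the linear-layer output by the precomputed factor before LayerNorm.
# The normalized result is essentially unchanged, but variance accumulation now
# operates on values with a much narrower dynamic range.
s = static_layernorm_scale(W)
print(np.allclose(layernorm(y), layernorm(y / s), atol=1e-4))
```

Because the factor depends only on static weights, it is computed once offline and adds no latency or compute at inference time, which is the property the abstract emphasizes.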
d-Matrix has over a hundred researchers and engineers dedicated to the latest technologies. We thank our team for their efforts to push the world forward. If you are an innovator or engineer who pushes those boundaries, we'd love to talk to you. We're hiring!