The Complete Recipe to Unlock AI Reasoning at Enterprise Scale

From train-time to inference-time compute   

Scaling laws are moving beyond pre-training to post-training and test-time scaling. Reasoning language models like OpenAI o1 and o3 are engineered to emulate human-like problem-solving. Techniques like Chain-of-Thought (CoT) improve model capabilities by generating intermediate “think” steps. While this “thinking longer” approach improves accuracy on complex tasks, it also significantly increases the computational load at inference time – i.e., more inference-time compute. Moreover, most high-quality reasoning models originated in closed-source frontier AI labs and have typically been priced higher than other language models, putting access out of reach for most developers and enterprises.
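As a rough illustration (the prompt wording below is ours, not taken from any model’s documentation), a chain-of-thought style prompt simply asks the model to write out its intermediate reasoning before committing to an answer:

```python
# Illustrative chain-of-thought prompt: the model is asked to emit intermediate
# "think" steps before its final answer. Wording is a generic example, not tied
# to any particular model or API.
question = "A train travels 180 km in 2.5 hours. What is its average speed?"

cot_prompt = (
    "Solve the problem step by step, showing your reasoning, "
    "then state the final answer on its own line.\n\n"
    f"Problem: {question}\n"
    "Reasoning:"
)

print(cot_prompt)
```

Every extra reasoning token the model emits in response to a prompt like this is additional inference-time compute.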

The DeepSeek moment – open and efficient reasoning models

DeepSeek was a huge catalyst for the industry, setting a new AI benchmark by innovating beyond traditional scaling laws and delivering open, efficient models. The team overcame compute and data constraints through curated high-quality data, a novel architecture, and engineering breakthroughs. The DeepSeek-V3 base model was openly released in December 2024 alongside details of the efficiency techniques it uses, such as Multi-Token Prediction (MTP), Mixture-of-Experts (MoE), and lower-precision FP8 arithmetic. A month later they revealed DeepSeek-R1, a frontier open reasoning model that rivals OpenAI’s o1. DeepSeek-R1 uses synthetic data and reinforcement learning (RL) to improve its capabilities.

R1 highlighted the importance of inference-time compute, cementing the philosophical shift from developing “all-knowing” models that immediately answer everything (System 1 thinking) to models that discover better solutions through deliberative thinking (System 2 thinking).

Another breakthrough was distilling the knowledge of DeepSeek-R1 into smaller open-source models such as Llama (8B and 70B) and Qwen (1.5B, 7B, 14B, and 32B) through supervised fine-tuning on high-quality training data curated from R1. This demonstrated an effective and economical way to bring advanced reasoning capabilities to the smaller models used in everyday enterprise and consumer applications.
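A minimal sketch of what such a distillation dataset could look like (the field names and the `<think>` formatting are illustrative choices on our part, not the published R1 distillation schema): each record pairs a prompt with a teacher-generated reasoning trace and final answer, and the smaller model is fine-tuned to reproduce that completion.

```python
# Hypothetical record layout for a reasoning-distillation SFT dataset.
# Field names and content are illustrative; this is not DeepSeek's actual format.
distillation_example = {
    "prompt": "Is 391 a prime number? Explain your reasoning.",
    "teacher_reasoning": (
        "391 = 17 * 23, since 17 * 23 = 391. "
        "Because it has divisors other than 1 and itself, it is not prime."
    ),
    "final_answer": "No, 391 is not prime (391 = 17 x 23).",
}

# During supervised fine-tuning, the student model is trained to reproduce the
# reasoning trace and answer given only the prompt.
target_completion = (
    f"<think>{distillation_example['teacher_reasoning']}</think>\n"
    f"{distillation_example['final_answer']}"
)
print(target_completion)
```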

While frontier reasoning models had been closed source, DeepSeek took the open-weights path, releasing not only the model weights but also detailed techniques and recipes in its paper. Many leading AI labs have already started similar projects, opening the floodgates of open-source innovation. Within just 10 days of the R1 launch, new open base models – Ai2 Tulu3-405B and Alibaba Qwen2.5-Max – were announced that claimed to surpass GPT-4o and DeepSeek-V3, and reasoning models are expected soon as well. Hugging Face has launched the Open-R1 project, a fully open reproduction of DeepSeek-R1’s data and training pipeline, to validate its claims and push the boundaries of open research.

But the recipe is incomplete

The launch of DeepSeek-R1 is a watershed moment for Reasoning AI. But it still has a scaling problem, just of a different kind. We traded the complexity of training for higher complexity of inference: more inference-time compute is needed for long chains of thought and for exploring multiple paths before presenting a solution to the user. This translates directly into much longer inference times (sometimes 100x or more) and higher power cost per user query. For example, DeepSeek-R1 would take about 28 minutes to generate 100K “think” tokens when running on a GPU at 60 tokens/sec. Multiplied across millions of inference requests, this quickly reverses the economics of scale.
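That figure follows from simple arithmetic; a quick sanity check using the same illustrative numbers:

```python
# Back-of-the-envelope latency for a long chain of thought.
think_tokens = 100_000        # tokens generated while "thinking"
tokens_per_second = 60        # sustained decode throughput on a GPU (illustrative)

seconds = think_tokens / tokens_per_second
print(f"{seconds / 60:.1f} minutes")   # ~27.8 minutes, i.e. roughly 28 minutes
```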

Token generation in LLMs is inherently memory-bandwidth bound, exhibiting very low arithmetic intensity. There is a significant gap between the compute throughput (TOPS) and the memory bandwidth that GPUs provide, so extremely large batch sizes are required to reach compute-bound operation. However, for workloads like R1 the memory demands keep growing with batch size, and the workload never reaches a compute-bound state. GPUs would need roughly 20x more memory bandwidth to overcome this challenge, and delivering that would take over 2 kW of power just to move data out of HBM. Brute-force scaling of memory bandwidth with HBM is not practical.
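A rough roofline-style estimate makes the imbalance concrete. The numbers below (peak compute, HBM bandwidth, bytes per parameter) are illustrative assumptions rather than the specs of any particular GPU, and the estimate ignores KV-cache traffic, which in practice pushes the required batch size much higher:

```python
# Roofline-style sketch: is autoregressive decode compute-bound or memory-bound?
# All hardware numbers below are illustrative assumptions, not vendor specs.
peak_flops = 1000e12          # assumed peak compute: 1000 TFLOPS
hbm_bandwidth = 4e12          # assumed HBM bandwidth: 4 TB/s

# The ops:byte ratio the hardware can sustain before memory becomes the bottleneck.
hw_ops_per_byte = peak_flops / hbm_bandwidth          # 250 FLOPs per byte

# Per decode step, roughly 2 FLOPs are performed per parameter and (without
# batching) every parameter is read from memory once. At 1 byte per parameter
# (e.g. FP8 weights), arithmetic intensity is about 2 FLOPs/byte per request.
ai_per_request = 2.0

# Batching reuses the same weights across requests, so arithmetic intensity grows
# roughly linearly with batch size until KV-cache traffic and other limits bite.
batch_to_reach_compute_bound = hw_ops_per_byte / ai_per_request
print(f"~{batch_to_reach_compute_bound:.0f} concurrent requests needed")  # ~125
```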

The missing ingredient for the widespread deployment of advanced reasoning capabilities in the enterprise is efficient inference hardware.

The missing ingredient – efficient inference compute

At d-Matrix we focused on the missing piece and developed Corsair, an inference compute platform purpose-built for memory-bound workloads. Our novel memory-compute integration offers 150 TB/s of aggregate memory bandwidth (compared to the 4-8 TB/s of HBM bandwidth), allowing Corsair to reach compute-memory balanced operation at reasonable batch sizes (< 100). This translates to high hardware utilization, ultra-low-latency token generation, and a more interactive experience for many users simultaneously. It is an especially good fit for MoE models like DeepSeek-R1, whose arithmetic intensity falls even further than that of dense models, pushing them deeper into the memory-bound regime.
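To see why MoE models sit deeper in the memory-bound regime, consider a rough per-step estimate (the parameter counts below are illustrative assumptions, not DeepSeek-R1’s actual configuration): only a small fraction of the parameters is active per token, so FLOPs shrink, while a large and diverse batch can still touch most of the expert weights in memory.

```python
# Why MoE decode has lower arithmetic intensity than a dense model of the same
# total size. Parameter counts are illustrative assumptions only.
total_params = 600e9          # total parameters (weights that may be read)
active_params = 40e9          # parameters active per token (routed + shared experts)
bytes_per_param = 1           # e.g. FP8 weights

flops_per_token = 2 * active_params

# Dense model: every token exercises all weights, so batching amortizes reads well.
# MoE with a large, diverse batch: different tokens hit different experts, so a
# large fraction of the total weights may still be streamed from memory each step.
bytes_read_per_step = total_params * bytes_per_param   # pessimistic: all experts touched
batch = 64
ai_moe = batch * flops_per_token / bytes_read_per_step
ai_dense = batch * 2 * total_params / (total_params * bytes_per_param)

print(f"MoE arithmetic intensity   ~{ai_moe:.0f} FLOPs/byte")    # ~9
print(f"Dense arithmetic intensity ~{ai_dense:.0f} FLOPs/byte")  # ~128
```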

Corsair also enables power-efficient inference of reasoning models. d-Matrix’s novel Digital In-Memory Compute (DIMC) architecture enables in-place computation, reduces data movement, and makes matrix multiplication much more efficient. It also supports block floating point numerics, which are more power- and area-efficient. The result is token generation that can consume as little as one-third of the power of a GPU.
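For intuition on block floating point: a block of values shares a single exponent while each element keeps only a small integer mantissa, shrinking both storage and multiplier width. The sketch below is a simplified illustration of the idea, not the exact numeric format Corsair implements.

```python
import numpy as np

# Simplified block floating point encoding: one shared exponent per block,
# small integer mantissas per element. Illustrative only.
def bfp_encode(x, block_size=16, mantissa_bits=8):
    x = np.asarray(x, dtype=np.float32).reshape(-1, block_size)
    # Shared exponent chosen from the largest magnitude in each block.
    exp = np.ceil(np.log2(np.abs(x).max(axis=1, keepdims=True) + 1e-30))
    scale = 2.0 ** (exp - (mantissa_bits - 1))
    mantissa = np.clip(np.round(x / scale),
                       -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1)
    return mantissa.astype(np.int16), exp

def bfp_decode(mantissa, exp, mantissa_bits=8):
    return mantissa * 2.0 ** (exp - (mantissa_bits - 1))

vals = np.random.randn(64).astype(np.float32)
m, e = bfp_encode(vals)
err = np.abs(bfp_decode(m, e).ravel() - vals).max()
print(f"max reconstruction error: {err:.4f}")
```

Because the exponent is amortized across the block, multiply-accumulate hardware only needs narrow integer multipliers, which is where the power and area savings come from.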

Unlocking the power of reasoning models for enterprises

By addressing critical challenges around latency, cost and power, d-Matrix Corsair empowers businesses to unlock the full potential of reasoning language models.  

  • Enhanced User Experiences: Ultra-low latency enables real-time interactions while performing advanced problem-solving for use cases like software code generation and autonomous agents.  
  • Cost-Effective Scaling: The reduced power consumption and cost enable enterprises of all sizes to adopt and deploy models at scale.  
  • Environmental Sustainability: An energy-efficient architecture aligns with corporate sustainability goals, allowing businesses to scale AI without a proportional increase in energy usage.  
  • Ease of Deployment: An industry standard PCIe form factor makes it easy to deploy in AI datacenters. Paired with Aviator software that integrates with broadly adopted open frameworks such as PyTorch, it allows developers to easily deploy their models in production. 

Summary

Recent breakthroughs in AI models have expanded scaling laws from training to inference. Reasoning models require significantly more inference-time compute, leading to slow responses and unsustainable scaling of compute. d-Matrix addresses these pain points with Corsair, an efficient inference compute platform that drastically reduces latency while remaining power-efficient and cost-effective, making at-scale deployments practical for enterprises. By combining advanced open models with d-Matrix’s efficient hardware solution, a complete recipe emerges that makes Reasoning AI commercially viable for enterprises.

Learn more

Go deeper with a talk from our Distinguished AI SW Architect, Satyam Srivastava – “Making Intelligence Attainable via Novel Architectures” 

What is d-Matrix – Explainer video  

Learn more about d-Matrix Corsair – Product Brief, Whitepaper