In this talk, d-Matrix CTO/Cofounder Sudeep Bhoja discusses the impact that the release of the DeepSeek R1 model is having on inference compute.
Stepping through the evolution of reasoning models and the significance of inference-time compute in enhancing model performance, Sudeep examines the techniques, methods, and implications in detail.
Highlights:
- Reasoning models rely on “inference-time compute” and will unlock the golden age of inference.
- DeepSeek R1 is only the first of many open models that will compete with frontier models. Distillation makes smaller models much more capable.
- Model architecture and algorithmic techniques can unlock efficiency today.
- Models are highly memory-bound, so GPUs end up underutilized.
- Deploying on an efficient inference compute platform like d-Matrix Corsair results in faster responses, cost savings, and energy efficiency.
Inference-Time Compute: Sudeep shares the characteristics of inference-time compute on models large and small, noting that the more computation you do during inference, the better the model's output gets. There is a balancing act, though: as you enhance the model with more inference-time compute, latency also increases, leaving users waiting longer to see a response.
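The talk does not prescribe one specific technique, but self-consistency (sample N chains of thought, then majority-vote the final answers) is a common way to spend inference-time compute. The sketch below illustrates the compute-versus-latency trade-off; `sample_answer` is a hypothetical stub standing in for a real model call, not an API from the talk.

```python
# Illustrative sketch of one inference-time compute technique: self-consistency
# (best-of-N sampling with majority voting). `sample_answer` is a hypothetical
# stand-in for a sampled chain-of-thought call to a reasoning model.
import random
from collections import Counter

def sample_answer(prompt: str, temperature: float = 0.8) -> str:
    # Hypothetical model call: returns the final answer extracted from one
    # sampled reasoning trace. Stubbed here with a noisy distribution.
    return random.choice(["42", "42", "42", "41"])

def self_consistency(prompt: str, n_samples: int) -> str:
    # More samples = more inference-time compute = better answers on average,
    # but latency and cost grow proportionally with n_samples.
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?", n_samples=16))
```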
Reviewing performance numbers, he steps through how synthetic datasets are generated from these new open-source models and what distillation into smaller models involves. By creating a distillation dataset from a larger teacher model and running supervised fine-tuning on smaller student models, those student models become much more capable.
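As a rough illustration of that recipe, here is a minimal sketch of the two stages: teacher-generated synthetic data, then supervised fine-tuning of the student, assuming a Hugging Face-style workflow. The checkpoint names, prompt, and hyperparameters are placeholders, not the ones DeepSeek or d-Matrix used.

```python
# Sketch of distillation: a large teacher generates reasoning traces, and a
# small student is supervised fine-tuned (SFT) on them. Names are placeholders.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

TEACHER = "org/large-reasoning-model"  # hypothetical teacher checkpoint
STUDENT = "org/small-base-model"       # hypothetical student checkpoint

tok = AutoTokenizer.from_pretrained(TEACHER)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER)

# Step 1: have the teacher generate reasoning traces for a set of prompts.
prompts = ["Q: If x + 3 = 7, what is x? Think step by step.\nA:"]
records = []
for p in prompts:
    inputs = tok(p, return_tensors="pt").to(teacher.device)
    out = teacher.generate(**inputs, max_new_tokens=512,
                           do_sample=True, temperature=0.7)
    records.append({"text": tok.decode(out[0], skip_special_tokens=True)})

# Step 2: supervised fine-tuning of the student on the teacher's traces.
student_tok = AutoTokenizer.from_pretrained(STUDENT)
student_tok.pad_token = student_tok.pad_token or student_tok.eos_token
student = AutoModelForCausalLM.from_pretrained(STUDENT)

ds = Dataset.from_list(records).map(
    lambda ex: student_tok(ex["text"], truncation=True, max_length=1024),
    remove_columns=["text"],
)
Trainer(
    model=student,
    args=TrainingArguments(output_dir="distilled-student", num_train_epochs=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(student_tok, mlm=False),
).train()
```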
Finally, Sudeep explains that reasoning models are highly memory-bound and end up underutilizing GPUs that were optimized for training. He highlights the potential of new architectures and purpose-built ASICs like our d-Matrix Corsair, which delivers efficient inference-time compute, dramatically reduces latency, improves energy efficiency, and is ideal for the age of inference.
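A back-of-the-envelope roofline estimate shows why autoregressive decode is memory-bound. The GPU figures below are approximate public specs for an NVIDIA H100, chosen here purely for illustration; they are assumptions, not numbers from the talk.

```python
# Why batch-1 decode is memory-bound: every generated token requires reading
# all model weights once, so arithmetic intensity (FLOPs per byte moved) is
# tiny compared to what a training-optimized GPU needs to stay busy.
params = 70e9        # 70B-parameter model (assumed example size)
bytes_per_param = 2  # FP16/BF16 weights

flops_per_token = 2 * params                 # ~2 FLOPs per parameter per token
bytes_per_token = params * bytes_per_param   # all weights read once per token

peak_flops = 989e12  # ~989 TFLOPS dense BF16 (approx. H100 SXM spec)
peak_bw = 3.35e12    # ~3.35 TB/s HBM3 bandwidth (approx. H100 SXM spec)

intensity = flops_per_token / bytes_per_token  # ~1 FLOP per byte
balance = peak_flops / peak_bw                 # ~295 FLOPs/byte to saturate compute

print(f"arithmetic intensity: {intensity:.1f} FLOPs/byte")
print(f"GPU balance point:    {balance:.0f} FLOPs/byte")
print(f"compute utilization:  {intensity / balance:.1%}")  # well under 1%
```

With an intensity near 1 FLOP/byte against a balance point near 300, the GPU's math units sit mostly idle while it streams weights from memory, which is the underutilization Sudeep describes and the gap an inference-first architecture like Corsair targets.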