Accelerate DeepSeek R1 Distilled Models Locally on AMD Ryzen™ AI NPU and iGPU

Feb 11, 2025

Overview

AMD Ryzen™ AI architecture seamlessly integrates dedicated NPUs, AMD Radeon™ Graphics (iGPU), and AMD Ryzen processing cores (CPU) to enable advanced AI capabilities on a heterogeneous processor. This hybrid architecture optimizes the deployment of Large Language Models (LLMs), leveraging state-of-the-art hardware across various compute engines within the processor to deliver exceptional performance in AI applications.

In this blog, we will explore how to enable DeepSeek distilled models on Ryzen AI 300 series processors. We'll cover the use of ONNX Runtime GenAI for model representation and the AMD Quark quantizer for optimizing models with INT4 weights and bfloat16 activation precision. Additionally, we'll discuss deploying these models on hybrid NPU+iGPU systems to achieve a fast time to first token and efficient text generation.

DeepSeek Distilled Models

Reasoning models like DeepSeek represent a new class of LLMs designed to tackle highly complex tasks by employing a chain-of-thought process. This involves breaking down tasks into multiple smaller logical steps and reasoning through them to arrive at a conclusion. DeepSeek R1 distilled reasoning models are designed for conversational and problem-solving tasks, performing competitively with other state-of-the-art LLMs. By distilling knowledge from a larger model into a smaller one, these models enable efficient deployment in environments with limited compute resources, such as edge devices and mobile platforms. Quantizing these models to INT4 parameter precision further reduces the model size and delivers higher performance; for example, the weights of an 8B-parameter model shrink from roughly 16 GB at BF16 to roughly 4 GB at INT4.

Model Optimization

The three models evaluated were DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-1.5B, and DeepSeek-R1-Distill-Qwen-7B. To enable these models, the distilled models are quantized with the AMD Quark quantizer, and all linear layers are compressed to 4-bit precision (INT4) using Activation-aware Weight Quantization (AWQ). The language model head layer is also compressed to 4-bit precision to further optimize the model and enable faster processing with minimal loss of accuracy, as shown in Table 2. The optimized model is exported to ONNX format, and inference execution uses the ONNX Runtime GenAI software stack.
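
The full quantization flow is driven by the AMD Quark toolchain (see Additional Resources). As a rough mental model of what group-wise INT4 weight compression does to each linear layer, below is a minimal NumPy sketch. It omits AWQ's activation-aware scaling step (which uses activation statistics to protect salient weight channels) and the ONNX export; the function names are illustrative and are not Quark APIs.

    import numpy as np

    def quantize_int4_groupwise(weights: np.ndarray, group_size: int = 128):
        """Quantize a 2-D weight matrix to unsigned 4-bit codes per group of
        `group_size` input channels; returns codes, scales, and zero points."""
        out_features, in_features = weights.shape
        assert in_features % group_size == 0
        w = weights.reshape(out_features, in_features // group_size, group_size)

        w_min = w.min(axis=-1, keepdims=True)
        w_max = w.max(axis=-1, keepdims=True)
        scale = np.maximum((w_max - w_min) / 15.0, 1e-8)   # 4-bit range is 0..15
        zero = np.round(-w_min / scale)

        codes = np.clip(np.round(w / scale + zero), 0, 15).astype(np.uint8)
        return codes, scale, zero

    def dequantize(codes, scale, zero):
        """Reconstruct float weights from the 4-bit codes to check the error."""
        w = (codes.astype(np.float32) - zero) * scale
        return w.reshape(w.shape[0], -1)

    # Quick check: the quantization error should stay small relative to the weights.
    w = np.random.randn(32, 256).astype(np.float32)
    codes, scale, zero = quantize_int4_groupwise(w)
    print("mean abs error:", np.abs(dequantize(codes, scale, zero) - w).mean())

In the real flow, the stored INT4 codes plus per-group scales and zero points replace the original BF16 weights, which is where the roughly 4x reduction in weight memory comes from.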

The implementation of DeepSeek distilled models on Ryzen AI 300 series processors employs a hybrid flow that leverages the strengths of both NPU and iGPU. Ryzen AI software analyzes the optimized model to identify compute and bandwidth-intensive operations, as well as the corresponding precision requirements. The software then partitions the model optimally, scheduling different layers and operations on the NPU and iGPU to achieve the best time-to-first-token (TTFT) in the prefill phase and the fastest token generation (TPS) in the decode phase. This approach is designed to maximize the use of available compute resources, leading to optimal performance and energy efficiency.

Following are a few benefits of the AMD hybrid flow:

  • Efficiency: By distributing the workload between the NPU and iGPU, the hybrid flow optimizes resource utilization, resulting in better workload efficiency and power-efficient LLM generation.
  • Scalability: The hybrid flow is highly scalable, allowing for seamless integration with various hardware configurations, which enhances the platform's versatility.
  • Performance: Leveraging both NPU and iGPU provides a balanced approach to handling complex AI tasks, giving the user high throughput, low latency, and stable performance.

Below is an example generated by the compiled DeepSeek-R1-Distill-Llama-8B model.

Prompt:

"Please solve following problem and explain it to me. Then give me final answer at the end with a single number preceded by string '#### '. Question: Rory orders 2 subs for $7.50 each, 2 bags of chips for $1.50 each and 2 cookies for $1.00 each for delivery.  There’s a 20% delivery fee added at check out and she wants to add a $5.00 tip.  What will her delivery order cost?\nAnswer:"

Response from compiled model deployed on hybrid NPU+iGPU Ryzen AI:

Rory's total cost is calculated by adding the cost of all items, the delivery fee, and the tip. First, calculate the cost of the subs, chips, and cookies. Then, add the 20% delivery fee to the total cost of the items. Finally, add the $5.00 tip to the final amount to get the total delivery order cost.
Okay, so I need to figure out how much Rory's delivery order will cost. Let me break it down step by step.

First, she orders 2 subs, each costing $7.50. So, I should multiply the number of subs by the price per sub. That would be 2 times $7.50. Let me do that calculation: 2 * 7.50 equals $15.00. So, the subs cost $15 in total.

Next, she buys 2 bags of chips, each at $1.50. Again, I'll multiply the number of bags by the price per bag. That's 2 * 1.50, which is $3.00. So, the chips add $3 to the total.

Then, she orders 2 cookies, each costing $1.00. Multiplying 2 by 1.00 gives $2.00. So, the cookies are $2 altogether.

Now, adding up all the items: $15 (subs) + $3 (chips) + $2 (cookies) equals $20. So, the total cost of the items is $20.

But wait, there's a 20% delivery fee added at checkout. I need to calculate 20% of $20 and then add that to the total. 20% of $20 is 0.20 * 20, which is $4. So, adding the delivery fee: $20 + $4 = $24.

Finally, Rory wants to add a $5.00 tip. So, I'll add that to the previous total. $24 + $5 equals $29.

Let me double-check my calculations to make sure I didn't make any mistakes.

Subs: 2 * 7.50 = 15.00
Chips: 2 * 1.50 = 3.00
Cookies: 2 * 1.00 = 2.00
Total items: 15 + 3 + 2 = 20.00
Delivery fee: 20% of 20 = 4.00
Total with delivery: 20 + 4 = 24.00
Tip: 24 + 5 = 29.00

Everything adds up correctly. So, the total delivery order cost is $29.

The total cost of Rory's delivery order is calculated by adding the cost of all items, the 20% delivery fee, and the $5.00 tip.

First, calculate the cost of the subs, chips, and cookies:
- 2 subs at $7.50 each: 2 × 7.50 = $15.00
- 2 bags of chips at $1.50 each: 2 × 1.50 = $3.00
- 2 cookies at $1.00 each: 2 × 1.00 = $2.00

Adding these together: $15 + $3 + $2 = $20.00.

Next, add the 20% delivery fee:
- 20% of $20.00 = 0.20 × 20 = $4.00
- Total with delivery fee: $20 + $4 = $24.00

Finally, add the $5.00 tip:
- Total delivery order cost: $24 + $5 = $29.00

#### 29
total tokens: 781

Performance

The performance of the distilled models was evaluated under various conditions, focusing on metrics such as inference time, resource utilization, and accuracy. The hybrid flow's efficiency in distributing workloads between the NPU and iGPU was also assessed. Table 1 shows the performance of the models for different prompt lengths on Ryzen AI 300 series processor-based laptops¹.

Seq Len | Qwen-1.5B TTFT [s] | Qwen-1.5B TPS [tok/s] | Qwen-7B TTFT [s] | Qwen-7B TPS [tok/s] | Llama-8B TTFT [s] | Llama-8B TPS [tok/s]
2048    | 1.41               | 52.0                  | 5.12             | 18.4                | 5.01              | 17.6
1024    | 0.68               | 60.0                  | 2.64             | 20.1                | 2.68              | 19.2
512     | 0.41               | 63.3                  | 1.57             | 21.0                | 1.65              | 20.0
256     | 0.26               | 65.6                  | 1.10             | 21.4                | 1.14              | 20.5
128     | 0.22               | 66.9                  | 0.90             | 20.4                | 0.94              | 20.7

Table 1: DeepSeek distilled model performance on Ryzen AI 300 series processors (column headers abbreviate DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Llama-8B)
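
For reference, TTFT and TPS of a deployed model can be measured with a simple generation loop. The sketch below uses the onnxruntime-genai Python bindings; the model path is a placeholder, and exact method names (for example, append_tokens and get_sequence) can differ between onnxruntime-genai releases, so treat it as illustrative rather than the exact benchmarking harness used for Table 1.

    import time
    import onnxruntime_genai as og

    # Placeholder path to a compiled hybrid NPU+iGPU model directory.
    model = og.Model("path/to/deepseek-r1-distill-hybrid-onnx")
    tokenizer = og.Tokenizer(model)

    prompt = "Please solve the following problem ..."
    input_tokens = tokenizer.encode(prompt)

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=len(input_tokens) + 256)
    generator = og.Generator(model, params)

    start = time.perf_counter()
    generator.append_tokens(input_tokens)     # prefill
    generator.generate_next_token()           # first generated token
    ttft = time.perf_counter() - start

    n_generated = 1
    while not generator.is_done():
        generator.generate_next_token()
        n_generated += 1
    elapsed = time.perf_counter() - start

    tps = (n_generated - 1) / (elapsed - ttft) if n_generated > 1 else 0.0
    print(f"TTFT: {ttft:.2f} s   TPS: {tps:.1f} tok/s")
    print(tokenizer.decode(generator.get_sequence(0)))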

Enabling DeepSeek R1 distilled models on Ryzen AI marks a significant leap forward in AI technology. By leveraging the power of ONNX Runtime GenAI and AMD Quark for generating INT4 weights, and implementing a hybrid flow that combines NPU and iGPU, Ryzen AI delivers exceptional AI performance and efficiency.

Accuracy Metrics

To evaluate the model’s performance after optimization, compilation, and deployment on Ryzen AI, we used perplexity scores and the tinyGSM8K metric.

Metric                                | Reference CPU FP32 | Ryzen AI NPU/iGPU hybrid-llm INT4²
Perplexity (seq. len. = 2048)         | 13.138             | 13.87
tinyGSM8K (fewshot=0, flexible match) | 42.78%             | 41.16%

Table 2: Optimized model accuracy measurement for DeepSeek-R1-Distill-Llama-8B

As you can see, the accuracy of the model is largely retained through the process of quantization, optimization, and partitioning on NPU and iGPU.
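
For context on the perplexity numbers in Table 2, perplexity is the exponential of the average per-token negative log-likelihood over fixed-length chunks of text. Below is a minimal sketch of the reference FP32 measurement using Hugging Face transformers; the evaluation corpus, chunking scheme, and loading details are assumptions and not necessarily the exact recipe used for Table 2.

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
    model.eval()

    # Placeholder evaluation corpus; the actual dataset is not specified here.
    ids = tokenizer(open("eval_corpus.txt").read(), return_tensors="pt").input_ids[0]

    seq_len, nll_sum, n_tokens = 2048, 0.0, 0
    with torch.no_grad():
        for start in range(0, ids.numel(), seq_len):
            chunk = ids[start:start + seq_len].unsqueeze(0)
            if chunk.shape[1] < 2:
                break
            # With labels == input_ids, the model shifts internally and returns
            # the mean negative log-likelihood over the predicted tokens.
            out = model(chunk, labels=chunk)
            n = chunk.shape[1] - 1
            nll_sum += out.loss.item() * n
            n_tokens += n

    print(f"perplexity @ seq. len. {seq_len}: {math.exp(nll_sum / n_tokens):.3f}")

The tinyGSM8K score is obtained with the lm-evaluation-harness listed under Additional Resources, using zero-shot prompting and flexible answer matching.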

Conclusion

Model and algorithmic innovations are driving rapid improvements in the capability of LLMs. Ubiquitous deployment of these new models is supported by open software stacks like ONNX Runtime GenAI and heterogeneous processor architectures like the Ryzen AI 300 series, which combines CPU, iGPU, and NPU. Agile, hybrid deployment delivers the efficiency, performance, and accuracy needed for real-time LLM applications and for supporting future model innovations.

Next Steps

To get started now with Ryzen AI hybrid LLM models, visit the Ryzen AI documentation here: https://ryzenai.docs.amd.com/en/latest/hybrid_oga.html

Quark-optimized DeepSeek distilled models are available in the AMD Hugging Face collection here: https://huggingface.co/collections/amd/amd-ryzenai-deepseek-r1-distill-hybrid-67a53471e9d5f14bece775d2
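
To fetch one of these models locally, the standard Hugging Face Hub download works; the repo_id below is a placeholder for whichever model repository you pick from the collection.

    from huggingface_hub import snapshot_download

    # repo_id is a placeholder; substitute the exact model repository chosen
    # from the AMD collection linked above.
    path = snapshot_download(
        repo_id="amd/<deepseek-r1-distill-hybrid-model>",
        local_dir="models/deepseek-r1-distill-hybrid",
    )
    print("Model downloaded to:", path)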

Additional Resources

  1. AMD Quark model optimizer: https://quark.docs.amd.com/latest/
  2. lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness
  3. Ryzen AI software: https://ryzenai.docs.amd.com/en/latest/index.html
  4. ONNX Runtime GenAI: https://github.com/microsoft/onnxruntime-genai
  5. TurnkeyML: https://github.com/onnx/turnkeyml
Footnotes
  1. Performance tested by AMD in February 2025, using a CRB system with the following specifications: AMD Ryzen AI 9 HX 370, Radeon 890M graphics @ 2.6 GHz, 32 GB DDR5 @ 7500 MT/s, NPU driver version 32.0.203.240, and GPU driver version 32.0.12010.8007. PC manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers. STX-108
  2. Accuracy tested by AMD in February 2025, using a CRB system with the following specifications: AMD Ryzen AI 9 HX 370, Radeon 890M graphics @ 2.6 GHz, 32 GB DDR5 @ 7500 MT/s, NPU driver version 32.0.203.240, and GPU driver version 32.0.11018.17001. PC manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers. STX-109