This page summarizes performance measurements on AMD Instinct™ GPUs for popular AI models.
The data in the following tables is a reference point to help users evaluate observed performance. It should not be taken as the peak performance that AMD GPUs and ROCm™ software can deliver.
- AI Inference
- AI Training
AI Inference
Throughput Measurements
The table below shows throughput measurements for a client-server scenario under maximum load, where a local inference client feeds requests at an infinite rate (no delay between requests).
These results are based on the Docker container (rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250410), which was released on April 13, 2025.
| Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
|---|---|---|---|---|---|---|---|
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 16364.9 |
| | | | 128 | 4096 | 1500 | 1500 | 12171.0 |
| | | | 500 | 2000 | 2000 | 2000 | 13290.4 |
| | | | 2048 | 2048 | 1500 | 1500 | 8216.5 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4331.6 |
| | | | 128 | 4096 | 1500 | 1500 | 3409.9 |
| | | | 500 | 2000 | 2000 | 2000 | 3184.0 |
| | | | 2048 | 2048 | 500 | 500 | 2154.3 |
TP stands for Tensor Parallelism.
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.2.2 + amdgpu driver 6.8.5
Reproduce these results on your system by following the instructions in the measuring inference performance with vLLM on AMD GPUs user guide.
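For orientation only, the following minimal sketch shows what such an offline, maximum-load throughput run looks like with vLLM's Python API. The official numbers come from the benchmark scripts shipped in the Docker container above; the prompt construction, request count, and engine arguments here are illustrative assumptions, not the exact benchmark configuration.

```python
# Minimal sketch of an offline throughput measurement with vLLM (assumptions:
# prompt text, request count, and engine arguments are illustrative only).
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="amd/Llama-3.1-70B-Instruct-FP8-KV",
    tensor_parallel_size=8,   # TP size from the table
    max_num_seqs=3200,        # maximum concurrent sequences
)

# Submitting all prompts at once approximates a client feeding requests at an
# infinite rate (maximum load).
prompts = ["The quick brown fox jumps over the lazy dog. " * 16] * 3200
params = SamplingParams(max_tokens=2048, ignore_eos=True)  # force 2048 output tokens

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Throughput: {generated / elapsed:.1f} output tokens/s")
```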
Latency Measurements
The table below shows latency measurements, which assess the time from when the system receives an input to when the model produces its result.
These results are based on the Docker container (rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250410), which was released on April 13, 2025.
| Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (sec) |
|---|---|---|---|---|---|---|
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 17.411 |
| | | | 2 | 128 | 2048 | 18.750 |
| | | | 4 | 128 | 2048 | 19.059 |
| | | | 8 | 128 | 2048 | 20.857 |
| | | | 16 | 128 | 2048 | 22.670 |
| | | | 32 | 128 | 2048 | 25.495 |
| | | | 64 | 128 | 2048 | 34.187 |
| | | | 128 | 128 | 2048 | 48.754 |
| | | | 1 | 2048 | 2048 | 17.699 |
| | | | 2 | 2048 | 2048 | 18.919 |
| | | | 4 | 2048 | 2048 | 19.220 |
| | | | 8 | 2048 | 2048 | 21.545 |
| | | | 16 | 2048 | 2048 | 24.329 |
| | | | 32 | 2048 | 2048 | 29.461 |
| | | | 64 | 2048 | 2048 | 40.148 |
| | | | 128 | 2048 | 2048 | 61.382 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 46.601 |
| | | | 2 | 128 | 2048 | 46.947 |
| | | | 4 | 128 | 2048 | 48.971 |
| | | | 8 | 128 | 2048 | 53.021 |
| | | | 16 | 128 | 2048 | 55.836 |
| | | | 32 | 128 | 2048 | 64.947 |
| | | | 64 | 128 | 2048 | 81.408 |
| | | | 128 | 128 | 2048 | 115.296 |
| | | | 1 | 2048 | 2048 | 46.998 |
| | | | 2 | 2048 | 2048 | 47.619 |
| | | | 4 | 2048 | 2048 | 51.086 |
| | | | 8 | 2048 | 2048 | 55.706 |
| | | | 16 | 2048 | 2048 | 61.049 |
| | | | 32 | 2048 | 2048 | 75.842 |
| | | | 64 | 2048 | 2048 | 103.074 |
| | | | 128 | 2048 | 2048 | 157.705 |
TP stands for Tensor Parallelism.
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.2.2 + amdgpu driver 6.8.5
Reproduce these results on your system by following the instructions in the measuring inference performance with ROCm vLLM Docker on AMD GPUs user guide.
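For orientation only, a single latency data point corresponds to timing one batch end to end. The sketch below (batch size 1, 2048 output tokens) is an illustrative assumption of what such a measurement looks like with vLLM's Python API, not the benchmark script used for the table.

```python
# Minimal latency sketch with vLLM (batch size 1, 2048 output tokens); the
# prompt and engine arguments are illustrative assumptions.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="amd/Llama-3.1-70B-Instruct-FP8-KV", tensor_parallel_size=8)
params = SamplingParams(max_tokens=2048, ignore_eos=True)

batch = ["Summarize the history of GPU computing."]  # batch size 1
start = time.perf_counter()
llm.generate(batch, params)
print(f"End-to-end latency: {time.perf_counter() - start:.3f} s")
```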
AI Training
The tables below show training performance data: text generation training throughput measured on the AMD Instinct™ platform at each listed sequence length and batch size, reported in TFLOPS per second per GPU.
For FLUX, image generation training throughput is measured with the FLUX.1-dev model at the largest batch size that fits in memory, and is reported in frames per second (FPS) per GPU.
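As a rough guide to how the TFLOPS/s/GPU figures relate to measured step time, the sketch below applies the common ~6 × parameters FLOPs-per-token approximation for dense transformer training. The step time used is a hypothetical value for illustration; the published figures come from the training scripts in the Docker containers, which may count FLOPs differently.

```python
# Rough TFLOPS/s/GPU estimate from step time, using the ~6 * N FLOPs-per-token
# approximation (forward + backward, attention terms ignored). Illustrative only.
def tflops_per_gpu(params_billion, seq_len, batch_per_gpu, step_time_s):
    flops_per_token = 6 * params_billion * 1e9
    tokens_per_step = seq_len * batch_per_gpu  # tokens processed by one GPU per step
    return flops_per_token * tokens_per_step / step_time_s / 1e12

# e.g. an 8B model, 8192-token sequences, per-GPU batch 3, hypothetical 2.2 s step:
print(f"{tflops_per_gpu(8, 8192, 3, 2.2):.1f} TFLOPS/s/GPU")  # ~536
```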
PyTorch training results on the AMD Instinct™ MI300X platform
These results are based on the Docker container (rocm/pytorch-training:v25.4), which was released on March 10, 2025.
| Models | Precision | Batch Size | Sequence Length | TFLOPS/s/GPU |
|---|---|---|---|---|
| Llama 3.1 70B with FSDP | BF16 | 4 | 8192 | 426.79 |
| Llama 3.1 8B with FSDP | BF16 | 3 | 8192 | 542.94 |
| Llama 3.1 8B with FSDP | FP8 | 3 | 8192 | 737.40 |
| Llama 3.1 8B with FSDP | BF16 | 6 | 4096 | 523.79 |
| Llama 3.1 8B with FSDP | FP8 | 6 | 4096 | 735.44 |
| Mistral 7B with FSDP | BF16 | 3 | 8192 | 483.17 |
| Mistral 7B with FSDP | FP8 | 4 | 8192 | 723.30 |
| FLUX | BF16 | 10 | - | 4.51 (FPS/GPU)* |
*Note: FLUX performance is measured in FPS/GPU rather than TFLOPS/s/GPU.
Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)
Reproduce these results on your system by following the instructions in the measuring training performance with ROCm PyTorch Docker on AMD GPUs user guide.
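The "with FSDP" rows shard model parameters, gradients, and optimizer state across the 8 GPUs using PyTorch Fully Sharded Data Parallel. The minimal sketch below shows the general wrapping pattern; the model choice, precision policy, and launch details are illustrative assumptions and not the training script from the container.

```python
# Minimal FSDP wrapping sketch (illustrative; not the container's training script).
# Launch with one process per GPU, e.g. `torchrun --nproc_per_node=8 train.py`.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())  # single-node case: rank == local rank

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = FSDP(
    model.cuda(),
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                   reduce_dtype=torch.bfloat16),
)
# A standard optimizer/dataloader loop follows; loss.backward() triggers FSDP's
# sharded gradient reduce-scatter across the GPUs.
```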
PyTorch training results on the AMD Instinct MI325X platform
These results are based on the Docker container (rocm/pytorch-training:v25.4), which was released on March 10, 2025.
| Models | Precision | Batch Size | Sequence Length | TFLOPS/s/GPU |
|---|---|---|---|---|
| Llama 3.1 70B with FSDP | BF16 | 7 | 8192 | 526.13 |
| Llama 3.1 8B with FSDP | BF16 | 3 | 8192 | 643.01 |
| Llama 3.1 8B with FSDP | FP8 | 5 | 8192 | 893.68 |
| Llama 3.1 8B with FSDP | BF16 | 8 | 4096 | 625.96 |
| Llama 3.1 8B with FSDP | FP8 | 10 | 4096 | 894.98 |
| Mistral 7B with FSDP | BF16 | 5 | 8192 | 590.23 |
| Mistral 7B with FSDP | FP8 | 6 | 8192 | 860.39 |
Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)
Reproduce these results on your system by following the instructions in the measuring training performance with ROCm PyTorch Docker on AMD GPUs user guide.
Megatron-LM training results on the AMD Instinct™ MI300X platform
These results are based on the Docker container (rocm/megatron-lm:v25.4), which was released on March 18, 2025.
Sequence length 8192

| Model | # of nodes | Sequence length | MBS | GBS | Data Type | TP | PP | CP | TFLOPs/s/GPU |
|---|---|---|---|---|---|---|---|---|---|
| llama3.1-8B | 1 | 8192 | 2 | 128 | FP8 | 1 | 1 | 1 | 697.91 |
| llama3.1-8B | 2 | 8192 | 2 | 256 | FP8 | 1 | 1 | 1 | 690.33 |
| llama3.1-8B | 4 | 8192 | 2 | 512 | FP8 | 1 | 1 | 1 | 686.74 |
| llama3.1-8B | 8 | 8192 | 2 | 1024 | FP8 | 1 | 1 | 1 | 675.50 |

Sequence length 4096

| Model | # of nodes | Sequence length | MBS | GBS | Data Type | TP | PP | CP | TFLOPs/s/GPU |
|---|---|---|---|---|---|---|---|---|---|
| llama2-7B | 1 | 4096 | 4 | 256 | FP8 | 1 | 1 | 1 | 689.90 |
| llama2-7B | 2 | 4096 | 4 | 512 | FP8 | 1 | 1 | 1 | 682.04 |
| llama2-7B | 4 | 4096 | 4 | 1024 | FP8 | 1 | 1 | 1 | 676.83 |
| llama2-7B | 8 | 4096 | 4 | 2048 | FP8 | 1 | 1 | 1 | 686.25 |
Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)
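In these tables, MBS is the micro-batch size per GPU, GBS the global batch size, and TP, PP, and CP the tensor-, pipeline-, and context-parallel sizes. With TP = PP = CP = 1, every GPU is a data-parallel replica, and the number of gradient-accumulation steps per iteration is implied by GBS = MBS × data-parallel size × accumulation steps. The helper below is an illustrative check of that relation, not part of the benchmark.

```python
# Illustrative check of the Megatron-LM batch-size relation
# GBS = MBS * data_parallel_size * gradient_accumulation_steps.
def grad_accum_steps(gbs, mbs, gpus_per_node, nodes, tp, pp, cp):
    data_parallel = (gpus_per_node * nodes) // (tp * pp * cp)
    return gbs // (mbs * data_parallel)

# llama3.1-8B row at 8 nodes: GBS 1024, MBS 2, 64 GPUs, TP = PP = CP = 1
print(grad_accum_steps(1024, 2, 8, 8, 1, 1, 1))  # -> 8 accumulation steps per iteration
```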
For DeepSeek-V2-Lite (16B parameters), the table below shows training performance data: text generation training throughput measured on the AMD Instinct™ MI300X platform with GEMM tuning enabled, reported in TFLOPS per second per GPU.
These results are based on the Docker container (rocm/megatron-lm:v25.4), which was released on March 18, 2025.
| Model | # of GPUs | Sequence length | MBS | GBS | Data Type | TP | PP | CP | EP | SP | Recompute | TFLOPs/s/GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-V2-Lite | 8 | 4096 | 4 | 256 | BF16 | 1 | 1 | 1 | 8 | On | None | 10570 |
Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)
Reproduce these results on your system by following the instructions in the measuring training performance with ROCm Megatron-LM Docker on AMD GPUs user guide.
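On ROCm, GEMM tuning of the kind mentioned above is commonly done through PyTorch's TunableOp feature, which benchmarks candidate GEMM kernels per shape and caches the fastest choice. Whether the container enables it exactly this way is an assumption; the environment variables below are the documented TunableOp controls.

```python
# Illustrative TunableOp setup (an assumption about how "GEMM tuning" is enabled;
# the container may use a different mechanism). Variables must be set before torch
# is imported.
import os

os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"    # turn TunableOp on
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"     # allow tuning of new GEMM shapes
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "tunableop_results.csv"  # cache results

import torch

a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
torch.matmul(a, b)  # first call for a shape triggers tuning; later calls reuse the cache
```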