This page summarizes performance measurements on AMD Instinct™ GPUs for popular AI models.
The data in the following tables is a reference point to help users evaluate observed performance. It should not be taken as the peak performance that AMD GPUs and ROCm™ software can deliver.
- AI Inference
- AI Training
AI Inference
Throughput Measurements
The table below shows throughput measurements for a client-server scenario under maximum load, where a local inference client feeds requests at an infinite rate (no delay between requests).
These results are based on the Docker container (rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250410), which was released on April 13, 2025.
| Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
|---|---|---|---|---|---|---|---|
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 16364.9 |
| | | | 128 | 4096 | 1500 | 1500 | 12171.0 |
| | | | 500 | 2000 | 2000 | 2000 | 13290.4 |
| | | | 2048 | 2048 | 1500 | 1500 | 8216.5 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4331.6 |
| | | | 128 | 4096 | 1500 | 1500 | 3409.9 |
| | | | 500 | 2000 | 2000 | 2000 | 3184.0 |
| | | | 2048 | 2048 | 500 | 500 | 2154.3 |
TP stands for Tensor Parallelism.
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.2.2 + amdgpu driver 6.8.5
Reproduce these results on your system by following the instructions in the measuring inference performance with vLLM on AMD GPUs user guide.
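For orientation only, the following minimal sketch shows what such an offline, maximum-load throughput run looks like with vLLM's Python API. The official numbers come from the benchmark scripts shipped in the Docker container above; the prompt construction, request count, and engine arguments here are illustrative assumptions, not the exact benchmark configuration.

```python
# Minimal sketch of an offline throughput measurement with vLLM (assumptions:
# prompt text, request count, and engine arguments are illustrative only).
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="amd/Llama-3.1-70B-Instruct-FP8-KV",
    tensor_parallel_size=8,   # TP size from the table
    max_num_seqs=3200,        # maximum concurrent sequences
)

# Submitting all prompts at once approximates a client feeding requests at an
# infinite rate (maximum load).
prompts = ["The quick brown fox jumps over the lazy dog. " * 16] * 3200
params = SamplingParams(max_tokens=2048, ignore_eos=True)  # force 2048 output tokens

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Throughput: {generated / elapsed:.1f} output tokens/s")
```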
Latency Measurements
The table below shows latency measurements, which assess the time from when the system receives an input to when the model produces its result.
These results are based on the Docker container (rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250410), which was released on April 13, 2025.
| Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (sec) |
|---|---|---|---|---|---|---|
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 17.411 |
| | | | 2 | 128 | 2048 | 18.750 |
| | | | 4 | 128 | 2048 | 19.059 |
| | | | 8 | 128 | 2048 | 20.857 |
| | | | 16 | 128 | 2048 | 22.670 |
| | | | 32 | 128 | 2048 | 25.495 |
| | | | 64 | 128 | 2048 | 34.187 |
| | | | 128 | 128 | 2048 | 48.754 |
| | | | 1 | 2048 | 2048 | 17.699 |
| | | | 2 | 2048 | 2048 | 18.919 |
| | | | 4 | 2048 | 2048 | 19.220 |
| | | | 8 | 2048 | 2048 | 21.545 |
| | | | 16 | 2048 | 2048 | 24.329 |
| | | | 32 | 2048 | 2048 | 29.461 |
| | | | 64 | 2048 | 2048 | 40.148 |
| | | | 128 | 2048 | 2048 | 61.382 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 46.601 |
| | | | 2 | 128 | 2048 | 46.947 |
| | | | 4 | 128 | 2048 | 48.971 |
| | | | 8 | 128 | 2048 | 53.021 |
| | | | 16 | 128 | 2048 | 55.836 |
| | | | 32 | 128 | 2048 | 64.947 |
| | | | 64 | 128 | 2048 | 81.408 |
| | | | 128 | 128 | 2048 | 115.296 |
| | | | 1 | 2048 | 2048 | 46.998 |
| | | | 2 | 2048 | 2048 | 47.619 |
| | | | 4 | 2048 | 2048 | 51.086 |
| | | | 8 | 2048 | 2048 | 55.706 |
| | | | 16 | 2048 | 2048 | 61.049 |
| | | | 32 | 2048 | 2048 | 75.842 |
| | | | 64 | 2048 | 2048 | 103.074 |
| | | | 128 | 2048 | 2048 | 157.705 |
TP stands for Tensor Parallelism.
Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.2.2 + amdgpu driver 6.8.5
Reproduce these results on your system by following the instructions in the measuring inference performance with ROCm vLLM Docker on AMD GPUs user guide.
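For orientation only, a single latency data point corresponds to timing one batch end to end. The sketch below (batch size 1, 2048 output tokens) is an illustrative assumption of what such a measurement looks like with vLLM's Python API, not the benchmark script used for the table.

```python
# Minimal latency sketch with vLLM (batch size 1, 2048 output tokens); the
# prompt and engine arguments are illustrative assumptions.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="amd/Llama-3.1-70B-Instruct-FP8-KV", tensor_parallel_size=8)
params = SamplingParams(max_tokens=2048, ignore_eos=True)

batch = ["Summarize the history of GPU computing."]  # batch size 1
start = time.perf_counter()
llm.generate(batch, params)
print(f"End-to-end latency: {time.perf_counter() - start:.3f} s")
```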
AI Training
The tables below show training performance data: text generation training throughput measured on the AMD Instinct™ platform at each listed sequence length and batch size, reported in TFLOPS per second per GPU.
For FLUX, image generation training throughput is measured with the FLUX.1-dev model at the largest batch size that fits in memory, and is reported in frames per second (FPS) per GPU.
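As a rough guide to how the TFLOPS/s/GPU figures relate to measured step time, the sketch below applies the common ~6 × parameters FLOPs-per-token approximation for dense transformer training. The step time used is a hypothetical value for illustration; the published figures come from the training scripts in the Docker containers, which may count FLOPs differently.

```python
# Rough TFLOPS/s/GPU estimate from step time, using the ~6 * N FLOPs-per-token
# approximation (forward + backward, attention terms ignored). Illustrative only.
def tflops_per_gpu(params_billion, seq_len, batch_per_gpu, step_time_s):
    flops_per_token = 6 * params_billion * 1e9
    tokens_per_step = seq_len * batch_per_gpu  # tokens processed by one GPU per step
    return flops_per_token * tokens_per_step / step_time_s / 1e12

# e.g. an 8B model, 8192-token sequences, per-GPU batch 3, hypothetical 2.2 s step:
print(f"{tflops_per_gpu(8, 8192, 3, 2.2):.1f} TFLOPS/s/GPU")  # ~536
```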
PyTorch training results on the AMD Instinct™ MI300X platform
These results are based on the Docker container (rocm/pytorch-training:v25.4), which was released on March 10, 2025.
| Models | Precision | Batch Size | Sequence Length | TFLOPS/s/GPU |
|---|---|---|---|---|
| Llama 3.1 70B with FSDP | BF16 | 4 | 8192 | 426.79 |
| Llama 3.1 8B with FSDP | BF16 | 3 | 8192 | 542.94 |
| Llama 3.1 8B with FSDP | FP8 | 3 | 8192 | 737.40 |
| Llama 3.1 8B with FSDP | BF16 | 6 | 4096 | 523.79 |
| Llama 3.1 8B with FSDP | FP8 | 6 | 4096 | 735.44 |
| Mistral 7B with FSDP | BF16 | 3 | 8192 | 483.17 |
| Mistral 7B with FSDP | FP8 | 4 | 8192 | 723.30 |
| FLUX | BF16 | 10 | - | 4.51 (FPS/GPU)* |
*Note: FLUX performance is measured in FPS/GPU rather than TFLOPS/s/GPU.
Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)
Reproduce these results on your system by following the instructions in the measuring training performance with ROCm PyTorch Docker on AMD GPUs user guide.
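The "with FSDP" rows shard model parameters, gradients, and optimizer state across the 8 GPUs using PyTorch Fully Sharded Data Parallel. The minimal sketch below shows the general wrapping pattern; the model choice, precision policy, and launch details are illustrative assumptions and not the training script from the container.

```python
# Minimal FSDP wrapping sketch (illustrative; not the container's training script).
# Launch with one process per GPU, e.g. `torchrun --nproc_per_node=8 train.py`.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())  # single-node case: rank == local rank

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = FSDP(
    model.cuda(),
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                   reduce_dtype=torch.bfloat16),
)
# A standard optimizer/dataloader loop follows; loss.backward() triggers FSDP's
# sharded gradient reduce-scatter across the GPUs.
```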
PyTorch training results on the AMD Instinct MI325X platform
These results are based on the Docker container (rocm/pytorch-training:v25.4), which was released on March 10, 2025.
| Models | Precision | Batch Size | Sequence Length | TFLOPS/s/GPU |
|---|---|---|---|---|
| Llama 3.1 70B with FSDP | BF16 | 7 | 8192 | 526.13 |
| Llama 3.1 8B with FSDP | BF16 | 3 | 8192 | 643.01 |
| Llama 3.1 8B with FSDP | FP8 | 5 | 8192 | 893.68 |
| Llama 3.1 8B with FSDP | BF16 | 8 | 4096 | 625.96 |
| Llama 3.1 8B with FSDP | FP8 | 10 | 4096 | 894.98 |
| Mistral 7B with FSDP | BF16 | 5 | 8192 | 590.23 |
| Mistral 7B with FSDP | FP8 | 6 | 8192 | 860.39 |
Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)
Reproduce these results on your system by following the instructions in the measuring training performance with ROCm PyTorch Docker on AMD GPUs user guide.
Megatron-LM training results on the AMD Instinct™ MI300X platform
These results are based on the Docker container (rocm/megatron-lm:v25.4), which was released on March 18, 2025.
Sequence length 8192

| Model | # of nodes | Sequence length | MBS | GBS | Data Type | TP | PP | CP | TFLOPs/s/GPU |
|---|---|---|---|---|---|---|---|---|---|
| llama3.1-8B | 1 | 8192 | 2 | 128 | FP8 | 1 | 1 | 1 | 697.91 |
| llama3.1-8B | 2 | 8192 | 2 | 256 | FP8 | 1 | 1 | 1 | 690.33 |
| llama3.1-8B | 4 | 8192 | 2 | 512 | FP8 | 1 | 1 | 1 | 686.74 |
| llama3.1-8B | 8 | 8192 | 2 | 1024 | FP8 | 1 | 1 | 1 | 675.50 |

Sequence length 4096

| Model | # of nodes | Sequence length | MBS | GBS | Data Type | TP | PP | CP | TFLOPs/s/GPU |
|---|---|---|---|---|---|---|---|---|---|
| llama2-7B | 1 | 4096 | 4 | 256 | FP8 | 1 | 1 | 1 | 689.90 |
| llama2-7B | 2 | 4096 | 4 | 512 | FP8 | 1 | 1 | 1 | 682.04 |
| llama2-7B | 4 | 4096 | 4 | 1024 | FP8 | 1 | 1 | 1 | 676.83 |
| llama2-7B | 8 | 4096 | 4 | 2048 | FP8 | 1 | 1 | 1 | 686.25 |
Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)
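In these tables, MBS is the micro-batch size per GPU, GBS the global batch size, and TP, PP, and CP the tensor-, pipeline-, and context-parallel sizes. With TP = PP = CP = 1, every GPU is a data-parallel replica, and the number of gradient-accumulation steps per iteration is implied by GBS = MBS × data-parallel size × accumulation steps. The helper below is an illustrative check of that relation, not part of the benchmark.

```python
# Illustrative check of the Megatron-LM batch-size relation
# GBS = MBS * data_parallel_size * gradient_accumulation_steps.
def grad_accum_steps(gbs, mbs, gpus_per_node, nodes, tp, pp, cp):
    data_parallel = (gpus_per_node * nodes) // (tp * pp * cp)
    return gbs // (mbs * data_parallel)

# llama3.1-8B row at 8 nodes: GBS 1024, MBS 2, 64 GPUs, TP = PP = CP = 1
print(grad_accum_steps(1024, 2, 8, 8, 1, 1, 1))  # -> 8 accumulation steps per iteration
```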
For DeepSeek-V2-Lite (16B parameters), the table below shows training performance data: text generation training throughput measured on the AMD Instinct™ MI300X platform with GEMM tuning enabled, reported in TFLOPS per second per GPU.
These results are based on the Docker container (rocm/megatron-lm:v25.4), which was released on March 18, 2025.
| Model | # of GPUs | Sequence length | MBS | GBS | Data Type | TP | PP | CP | EP | SP | Recompute | TFLOPs/s/GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-V2-Lite | 8 | 4096 | 4 | 256 | BF16 | 1 | 1 | 1 | 8 | On | None | 10570 |
Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)
Reproduce these results on your system by following the instructions in the measuring training performance with ROCm Megatron-LM Docker on AMD GPUs user guide.
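On ROCm, GEMM tuning of the kind mentioned above is commonly done through PyTorch's TunableOp feature, which benchmarks candidate GEMM kernels per shape and caches the fastest choice. Whether the container enables it exactly this way is an assumption; the environment variables below are the documented TunableOp controls.

```python
# Illustrative TunableOp setup (an assumption about how "GEMM tuning" is enabled;
# the container may use a different mechanism). Variables must be set before torch
# is imported.
import os

os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"    # turn TunableOp on
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"     # allow tuning of new GEMM shapes
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "tunableop_results.csv"  # cache results

import torch

a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
torch.matmul(a, b)  # first call for a shape triggers tuning; later calls reuse the cache
```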