This page summarizes performance measurements on AMD Instinct™ GPUs for popular AI models.

The data in the following tables is a reference point to help users evaluate observed performance. It should not be considered the peak performance that AMD GPUs and ROCm™ software can deliver.

AI Inference

Throughput Measurements

The table below shows performance data for a client-server throughput scenario in which a local inference client feeds requests at an infinite rate, so the server operates under maximum load.

These results are based on the Docker container (rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250410), which was released on April 13, 2025.

| Model | Precision | TP Size | Input (tokens) | Output (tokens) | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
|---|---|---|---|---|---|---|---|
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 16364.9 |
| | | | 128 | 4096 | 1500 | 1500 | 12171.0 |
| | | | 500 | 2000 | 2000 | 2000 | 13290.4 |
| | | | 2048 | 2048 | 1500 | 1500 | 8216.5 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4331.6 |
| | | | 128 | 4096 | 1500 | 1500 | 3409.9 |
| | | | 500 | 2000 | 2000 | 2000 | 3184.0 |
| | | | 2048 | 2048 | 500 | 500 | 2154.3 |

TP stands for Tensor Parallelism.

Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.2.2 + amdgpu driver 6.8.5

Reproduce these results on your system by following the instructions in the measuring inference performance with vLLM on AMD GPUs user guide.
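
As a rough illustration of how the tokens-per-second figures above are derived, the sketch below uses vLLM's offline Python API inside the container. It is a minimal sketch under assumptions, not the benchmark script behind the published numbers: the dummy fixed-length prompts, the argument names, and the token accounting are illustrative and may need adjusting for your vLLM version.

```python
# Minimal sketch of an offline throughput measurement with vLLM's Python API.
# Assumed setup mirroring one row of the table above; not the exact benchmark
# script used for the published numbers.
import time
from vllm import LLM, SamplingParams

INPUT_LEN, OUTPUT_LEN = 128, 2048        # Input / Output columns
NUM_PROMPTS = MAX_NUM_SEQS = 3200        # Num Prompts / Max Num Seqs columns

llm = LLM(
    model="amd/Llama-3.1-70B-Instruct-FP8-KV",
    tensor_parallel_size=8,              # TP Size column
    max_num_seqs=MAX_NUM_SEQS,
)

# Dummy fixed-length prompts; a real run would tokenize text to exactly INPUT_LEN tokens.
prompts = [{"prompt_token_ids": [1] * INPUT_LEN} for _ in range(NUM_PROMPTS)]
sampling = SamplingParams(max_tokens=OUTPUT_LEN, ignore_eos=True)

start = time.perf_counter()
llm.generate(prompts, sampling)          # all requests queued at once: maximum load
elapsed = time.perf_counter() - start

total_tokens = NUM_PROMPTS * (INPUT_LEN + OUTPUT_LEN)
print(f"Throughput: {total_tokens / elapsed:.1f} tokens/s")
```

Whether prompt tokens are counted toward the reported throughput depends on the benchmark script, so treat the final print as an approximation of the table's metric.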

Latency Measurements

The table below shows latency measurements, which capture the time from when the system receives an input to when the model produces a result.

These results are based on the Docker container (rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250410), which was released on April 13, 2025.

| Model | Precision | TP Size | Batch Size | Input (tokens) | Output (tokens) | MI300X Latency (sec) |
|---|---|---|---|---|---|---|
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 17.411 |
| | | | 2 | 128 | 2048 | 18.750 |
| | | | 4 | 128 | 2048 | 19.059 |
| | | | 8 | 128 | 2048 | 20.857 |
| | | | 16 | 128 | 2048 | 22.670 |
| | | | 32 | 128 | 2048 | 25.495 |
| | | | 64 | 128 | 2048 | 34.187 |
| | | | 128 | 128 | 2048 | 48.754 |
| | | | 1 | 2048 | 2048 | 17.699 |
| | | | 2 | 2048 | 2048 | 18.919 |
| | | | 4 | 2048 | 2048 | 19.220 |
| | | | 8 | 2048 | 2048 | 21.545 |
| | | | 16 | 2048 | 2048 | 24.329 |
| | | | 32 | 2048 | 2048 | 29.461 |
| | | | 64 | 2048 | 2048 | 40.148 |
| | | | 128 | 2048 | 2048 | 61.382 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 46.601 |
| | | | 2 | 128 | 2048 | 46.947 |
| | | | 4 | 128 | 2048 | 48.971 |
| | | | 8 | 128 | 2048 | 53.021 |
| | | | 16 | 128 | 2048 | 55.836 |
| | | | 32 | 128 | 2048 | 64.947 |
| | | | 64 | 128 | 2048 | 81.408 |
| | | | 128 | 128 | 2048 | 115.296 |
| | | | 1 | 2048 | 2048 | 46.998 |
| | | | 2 | 2048 | 2048 | 47.619 |
| | | | 4 | 2048 | 2048 | 51.086 |
| | | | 8 | 2048 | 2048 | 55.706 |
| | | | 16 | 2048 | 2048 | 61.049 |
| | | | 32 | 2048 | 2048 | 75.842 |
| | | | 64 | 2048 | 2048 | 103.074 |
| | | | 128 | 2048 | 2048 | 157.705 |

TP stands for Tensor Parallelism.

Server: Dual AMD EPYC 9554 64-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 1 NUMA node per socket, System BIOS 1.8, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.2.2 + amdgpu driver 6.8.5 

Reproduce these results on your system by following the instructions in the measuring inference performance with ROCm vLLM Docker on AMD GPUs user guide.
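
For orientation, the sketch below shows one way to time a single batched generation with vLLM's offline Python API, mirroring the Batch Size, Input, and Output columns above. It is a simplified, assumed stand-in for the latency benchmark, not the exact script used for the published numbers.

```python
# Minimal sketch: end-to-end latency of one batched generation with vLLM's Python API.
# Batch size, input length, and output length mirror one row of the table above;
# prompt construction is a dummy placeholder, and timings will differ from the table.
import time
from vllm import LLM, SamplingParams

BATCH_SIZE, INPUT_LEN, OUTPUT_LEN = 8, 128, 2048

llm = LLM(model="amd/Llama-3.1-70B-Instruct-FP8-KV", tensor_parallel_size=8)
prompts = [{"prompt_token_ids": [1] * INPUT_LEN} for _ in range(BATCH_SIZE)]
sampling = SamplingParams(max_tokens=OUTPUT_LEN, ignore_eos=True)

llm.generate(prompts, sampling)          # warm-up pass, not timed

start = time.perf_counter()
llm.generate(prompts, sampling)          # timed: input received -> result produced
print(f"Latency: {time.perf_counter() - start:.3f} s")
```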

AI Training

The table below shows training performance data, where the AMD Instinct™ platform measures text-generation training throughput at a given sequence length and batch size. It focuses on TFLOPS per GPU.

For FLUX, image-generation training throughput is measured with the FLUX.1-dev model at the largest batch size that fits in memory, and it focuses on frames per second (FPS) per GPU.
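
To make the TFLOPS/s/GPU metric concrete, the back-of-the-envelope sketch below converts an assumed per-step time into an achieved-TFLOPS figure using the common ~6 × parameters FLOPs-per-token rule for dense transformer training (forward plus backward). The step time is illustrative, and the published tables may use a more detailed FLOP count (for example, one that includes attention FLOPs), so this is only a rough guide to how the metric is read.

```python
# Back-of-the-envelope conversion from step time to achieved TFLOPS/s/GPU,
# using the common ~6 * parameters FLOPs-per-token approximation for dense
# transformer training. STEP_TIME_S is an assumed, illustrative measurement.
PARAMS = 8e9          # e.g., an 8B-parameter model
BATCH_SIZE = 3        # per-GPU batch size
SEQ_LEN = 8192        # sequence length
STEP_TIME_S = 2.17    # assumed time per training step per GPU, in seconds

tokens_per_step = BATCH_SIZE * SEQ_LEN                # 24,576 tokens
flops_per_step = 6 * PARAMS * tokens_per_step         # forward + backward estimate
tflops_per_gpu = flops_per_step / STEP_TIME_S / 1e12
print(f"{tflops_per_gpu:.1f} TFLOPS/s/GPU")           # ~543.6 with these assumed numbers
```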

PyTorch training results on the AMD Instinct™ MI300X platform

These results are based on the Docker container (rocm/pytorch-training:v25.4), which was released on March 10, 2025.

| Models | Precision | Batch Size | Sequence Length | TFLOPS/s/GPU |
|---|---|---|---|---|
| Llama 3.1 70B with FSDP | BF16 | 4 | 8192 | 426.79 |
| Llama 3.1 8B with FSDP | BF16 | 3 | 8192 | 542.94 |
| Llama 3.1 8B with FSDP | FP8 | 3 | 8192 | 737.40 |
| Llama 3.1 8B with FSDP | BF16 | 6 | 4096 | 523.79 |
| Llama 3.1 8B with FSDP | FP8 | 6 | 4096 | 735.44 |
| Mistral 7B with FSDP | BF16 | 3 | 8192 | 483.17 |
| Mistral 7B with FSDP | FP8 | 4 | 8192 | 723.30 |
| FLUX | BF16 | 10 | - | 4.51 (FPS/GPU)* |

*Note: FLUX performance is measured in FPS/GPU rather than TFLOPS/s/GPU.

Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)

Reproduce these results on your system by following the instructions in the measuring training performance with ROCm PyTorch Docker on AMD GPUs user guide.

PyTorch training results on the AMD Instinct MI325X platform

These results are based on the Docker container (rocm/pytorch-training:v25.4), which was released on March 10, 2025.

| Models | Precision | Batch Size | Sequence Length | TFLOPS/s/GPU |
|---|---|---|---|---|
| Llama 3.1 70B with FSDP | BF16 | 7 | 8192 | 526.13 |
| Llama 3.1 8B with FSDP | BF16 | 3 | 8192 | 643.01 |
| Llama 3.1 8B with FSDP | FP8 | 5 | 8192 | 893.68 |
| Llama 3.1 8B with FSDP | BF16 | 8 | 4096 | 625.96 |
| Llama 3.1 8B with FSDP | FP8 | 10 | 4096 | 894.98 |
| Mistral 7B with FSDP | BF16 | 5 | 8192 | 590.23 |
| Mistral 7B with FSDP | FP8 | 6 | 8192 | 860.39 |

Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI325X (256GB HBM3E 1000W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)

Reproduce these results on your system by following the instructions in the measuring training performance with ROCm PyTorch Docker on AMD GPUs user guide.

Megatron-LM training results on the AMD Instinct™ MI300X platform

These results are based on the Docker container (rocm/megatron-lm:v25.4), which was released on March 18, 2025.

Sequence length 8192

| Model | # of nodes | Sequence length | MBS | GBS | Data Type | TP | PP | CP | TFLOPs/s/GPU |
|---|---|---|---|---|---|---|---|---|---|
| llama3.1-8B | 1 | 8192 | 2 | 128 | FP8 | 1 | 1 | 1 | 697.91 |
| llama3.1-8B | 2 | 8192 | 2 | 256 | FP8 | 1 | 1 | 1 | 690.33 |
| llama3.1-8B | 4 | 8192 | 2 | 512 | FP8 | 1 | 1 | 1 | 686.74 |
| llama3.1-8B | 8 | 8192 | 2 | 1024 | FP8 | 1 | 1 | 1 | 675.50 |

Sequence length 4096

| Model | # of nodes | Sequence length | MBS | GBS | Data Type | TP | PP | CP | TFLOPs/s/GPU |
|---|---|---|---|---|---|---|---|---|---|
| llama2-7B | 1 | 4096 | 4 | 256 | FP8 | 1 | 1 | 1 | 689.90 |
| llama2-7B | 2 | 4096 | 4 | 512 | FP8 | 1 | 1 | 1 | 682.04 |
| llama2-7B | 4 | 4096 | 4 | 1024 | FP8 | 1 | 1 | 1 | 676.83 |
| llama2-7B | 8 | 4096 | 4 | 2048 | FP8 | 1 | 1 | 1 | 686.25 |

Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)

For DeepSeek-V2-Lite with 16B parameters, the table below shows training performance data, where the AMD Instinct™ MI300X platform measures text-generation training throughput with GEMM tuning enabled. It focuses on TFLOPS per GPU.

This result is based on the Docker container (rocm/megatron-lm:v25.4), which was released on March 18, 2025.

| Model | # of GPUs | Sequence length | MBS | GBS | Data Type | TP | PP | CP | EP | SP | Recompute | TFLOPs/s/GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-V2-Lite | 8 | 4096 | 4 | 256 | BF16 | 1 | 1 | 1 | 8 | On | None | 10570 |

Server: Dual AMD EPYC 9654 96-core processor-based production server with 8x AMD MI300X (192GB HBM3 750W) GPUs, 2 NUMA nodes per socket, System BIOS 5.27, Ubuntu® 22.04.5 LTS, Host GPU driver ROCm 6.3.0 (pre-release)

Reproduce these results on your system by following the instructions in the measuring training performance with ROCm Megatron-LM Docker on AMD GPUs user guide.