AI Inference Performance Validation on AMD Instinct™ MI300X Accelerators Using vLLM Docker Image
Nov 22, 2024
Overview
AMD provides a fully optimized vLLM docker image designed for efficient inference of Large Language Models (LLMs) on AMD Instinct™ MI300X accelerators. The pre-built docker image, configured with the recommended performance settings, offers an efficient way to validate benchmark performance on MI300X accelerators.
Why Use the vLLM Docker Image?
The vLLM docker image bundles ROCm™ 6.2.1, vLLM 0.6.4, and PyTorch 2.5.0, and implements key ease-of-use optimizations.
Optimizations:
- Manages all ROCm dependencies, enabling developers to quickly validate expected inference performance.
- Integrates high-performance Composable Kernel (CK)-compatible Flash Attention v2 custom kernels and modules into vLLM to improve performance.
- Skinny GEMM optimization for low-batch decoding, making matrix multiplications faster and more efficient when handling small batches.
- Pre-configured, ready-to-use tunable ops for FP16/FP8, enabling faster and more efficient processing for deep learning and numerical simulation applications.
- Triton-compatible Flash Attention v2, combining the efficiency of Flash Attention with the optimization power of the Triton framework for faster, more scalable attention computation.
- ROCm-compatible custom paged attention, which breaks large sequences of data into manageable “pages” to reduce memory overhead and improve efficiency.
- The latest hipBLASLt library for high-performance linear algebra operations on FP8 data types.
In this blog, we guide you through quickly validating the expected inference performance numbers on Instinct MI300X accelerators.
Reproducing the Benchmark Results
To run the benchmarking tests, we have simplified the process for developers into three simple steps. Depending on your needs, there are two options to reproduce the results: Model Automation and Dashboarding (MAD) or standalone benchmarking for inference. MAD benchmarking helps developers gather comprehensive performance metrics from a running docker image. Standalone benchmarking is used when you need to tweak or debug the code or the docker container at runtime.
MAD-Integrated Benchmarking: Latency and Offline Throughput
Installation
Step 1: Download the docker image on the host machine.
docker pull rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
For MAD Integrated Benchmarking
Step 2: Clone the ROCm MAD repository.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Run the Performance Benchmark Test
Step 3: Run a performance benchmark test (example: Llama3.1 8B).
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_llama-3.1-8b --keep-model-dir --live-output --timeout 28800
The latency and throughput reports for the model are collected at the following paths:
~/run_directory/reports_float16/summary/Meta-Llama-3.1-8B-Instruct_latency_report.csv
~/run_directory/reports_float16/summary/Meta-Llama-3.1-8B-Instruct_throughput_report.csv
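For a quick look at the results from the command line, you can pretty-print a report CSV (a convenience step only; the exact columns depend on the MAD benchmark scripts):
column -s, -t ~/run_directory/reports_float16/summary/Meta-Llama-3.1-8B-Instruct_latency_report.csv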
To measure the performance of other models, change the --tags argument to one of the following (an example command follows the table):
| Model name (float16) | Model name (float8) |
|---|---|
| pyt_vllm_llama-3.1-8b | pyt_vllm_llama-3.1-8b_fp8 |
| pyt_vllm_llama-3.1-70b | pyt_vllm_llama-3.1-70b_fp8 |
| pyt_vllm_llama-3.1-405b | pyt_vllm_llama-3.1-405b_fp8 |
| pyt_vllm_llama-2-7b | |
| pyt_vllm_llama-2-70b | |
| pyt_vllm_mixtral-8x7b | pyt_vllm_mixtral-8x7b_fp8 |
| pyt_vllm_mixtral-8x22b | pyt_vllm_mixtral-8x22b_fp8 |
| pyt_vllm_mistral-7b | |
| pyt_vllm_qwen2-7b | |
| pyt_vllm_qwen2-72b | |
| pyt_vllm_jais-13b | |
| pyt_vllm_jais-30b | |
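For example, to benchmark Mixtral 8x7B instead of Llama 3.1 8B, rerun Step 3 with the corresponding tag (illustrative; the other arguments are unchanged):
python3 tools/run_models.py --tags pyt_vllm_mixtral-8x7b --keep-model-dir --live-output --timeout 28800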
Standalone Benchmarking: Latency and Offline Throughput
Installation
Step 1: Download the docker image on the host machine and launch a container.
docker pull rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name vllm_v0.6.4 rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
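Once inside the container, you can optionally confirm that the MI300X accelerators are visible before benchmarking (a quick sanity check; rocm-smi is typically included in ROCm-based images):
rocm-smi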
Clone the ROCm MAD Repository Inside the Docker Container
Step 2: Now, clone the ROCm MAD repository inside the running docker container.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
Run the Performance Benchmark Test
Step 3: Run a performance benchmark test (example: Llama3.1 8B).
export HF_TOKEN="your personal Hugging Face token to access gated models"
./vllm_benchmark_report.sh -s all -m meta-llama/Meta-Llama-3.1-8B-Instruct -g 1 -d float16
The latency and throughput reports for the model are collected at the following paths:
~/reports_float16/summary/Meta-Llama-3.1-8B-Instruct_latency_report.csv
~/reports_float16/summary/Meta-Llama-3.1-8B-Instruct_throughput_report.csv
To measure the performance of other models under different conditions, change the -s, -m, -g, and -d arguments of ./vllm_benchmark_report.sh -s $test_option -m $model_repo -g $num_gpu -d $datatype as shown in the following table (an example invocation follows the table):
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $model_repo (float16) | meta-llama/Meta-Llama-3.1-8B-Instruct | Llama 3.1 8B |
| | meta-llama/Meta-Llama-3.1-70B-Instruct | Llama 3.1 70B |
| | meta-llama/Meta-Llama-3.1-405B-Instruct | Llama 3.1 405B |
| | meta-llama/Llama-2-7b-chat-hf | Llama 2 7B |
| | meta-llama/Llama-2-70b-chat-hf | Llama 2 70B |
| | mistralai/Mixtral-8x7B-Instruct-v0.1 | Mixtral 8x7B |
| | mistralai/Mixtral-8x22B-Instruct-v0.1 | Mixtral 8x22B |
| | mistralai/Mistral-7B-Instruct-v0.3 | Mistral 7B |
| | Qwen/Qwen2-7B-Instruct | Qwen2 7B |
| | Qwen/Qwen2-72B-Instruct | Qwen2 72B |
| | core42/jais-13b-chat | JAIS 13B |
| | core42/jais-30b-chat-v3 | JAIS 30B |
| $model_repo (float8) | amd/Meta-Llama-3.1-8B-Instruct-FP8-KV | Llama 3.1 8B |
| | amd/Meta-Llama-3.1-70B-Instruct-FP8-KV | Llama 3.1 70B |
| | amd/Meta-Llama-3.1-405B-Instruct-FP8-KV | Llama 3.1 405B |
| | amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV | Mixtral 8x7B |
| | amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV | Mixtral 8x22B |
| $num_gpu | 1 or 8 | Number of GPUs |
| $datatype | float16, float8 | Data type |
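For example, to measure both latency and throughput for Llama 3.1 70B in FP8 on 8 GPUs (an illustrative combination of the options above):
./vllm_benchmark_report.sh -s all -m amd/Meta-Llama-3.1-70B-Instruct-FP8-KV -g 8 -d float8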
Benchmarking Conditions
Developers can effortlessly switch between model types to obtain either a complete set of results or focus on specific outputs when evaluating a model's performance. When assessing performance, it is important to consider both latency and throughput. Latency measures the time the model takes to process a single request, from the moment the input tokens are provided until the output tokens are received. In contrast, offline throughput measures the rate of generated output tokens across multiple requests with different input and output length combinations. You can also change these conditions in ./vllm_benchmark_report.sh.
Preconfigured latency measurement conditions
| Scenario | Batch size | Input length | Output length |
|---|---|---|---|
| Short prefill | 1, 2, 4, 8, 16, 32, 64, 128, 256 | 128 | 1 |
| Long prefill | 1, 2, 4, 8, 16, 32, 64, 128, 256 | 2048 | 1 |
| Decode | 1, 2, 4, 8, 16, 32, 64, 128, 256 | 1 | 128 |
Preconfigured throughput measurement conditions
| Requests | Input length | Output length |
|---|---|---|
| 30000 | 128 | 128 |
| 3000 | 2048 | 128 |
| 3000 | 128 | 2048 |
| 1500 | 2048 | 2048 |
Building an Interactive Chatbot
Setting up the vLLM Backend Server
Step 1: Download the docker image and the api_server code on the host machine.
docker pull rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
wget https://raw.githubusercontent.com/ROCm/vllm/a466f09d7f20ca073f21e3f64b8c9487e4c4ff4b/vllm/entrypoints/sync_openai/api_server.py
Step 2: Run the vLLM server on 4 GPUs (example: Llama3.1 405B FP8).
DOCKER_IMG=rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
MODEL=amd/Meta-Llama-3.1-405B-Instruct-FP8-KV
PORT=8080
docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128G -p $PORT:8011 --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --env VLLM_USE_TRITON_FLASH_ATTN=0 --env PYTORCH_TUNABLEOP_ENABLED=1 --env HF_TOKEN=$HF_TOKEN $DOCKER_IMG python3 /workspace/api_server.py \
--model $MODEL \
--swap-space 16 \
--disable-log-requests \
--dtype float16 \
--tensor-parallel-size 4 \
--host 0.0.0.0 \
--port 8011 \
--num-scheduler-steps 1 \
--distributed-executor-backend "mp"
Step 3: Wait until the server is ready.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8011 (Press CTRL+C to quit)
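If you are scripting the client side, you can poll the endpoint from the host until the server answers successfully. This is a minimal sketch, assuming the host port mapping from Step 2 (8080) and the model name used above; adjust the polling interval as needed.
# Retry the chat completions endpoint until it returns a successful HTTP response
until curl -sf -o /dev/null http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "amd/Meta-Llama-3.1-405B-Instruct-FP8-KV", "messages": [{"role": "user", "content": "ping"}]}'; do
  echo "Waiting for the vLLM server..."
  sleep 30
done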
Access the vLLM Server from the Client
Step 1: Open a new terminal on the host server and access the server from the command line.
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "amd/Meta-Llama-3.1-405B-Instruct-FP8-KV",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"}
]
}'
This gives the following response from the vLLM server.
{"id":"chatcmpl-ce25f1a0b9a84b2b834af904e513ff41","object":"chat.completion","created":1730892805,"model":"amd/Meta-Llama-3.1-405B-Instruct-FP8-KV","choices":[{"index":0,"message":{"role":"assistant","content":"The Los Angeles Dodgers won the 2020 World Series. They defeated the Tampa Bay Rays in the series 4 games to 2, winning the final game on October 27, 2020.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":52,"total_tokens":94,"completion_tokens":42},"prompt_logprobs":null}
Step 2: Set up the Gradio UI on the host server.
wget https://raw.githubusercontent.com/ROCm/vllm/a466f09d7f20ca073f21e3f64b8c9487e4c4ff4b/examples/gradio_openai_chatbot_webserver.py
pip install gradio openai
python3 gradio_openai_chatbot_webserver.py --model-url http://localhost:8080/v1 --model amd/Meta-Llama-3.1-405B-Instruct-FP8-KV --temp 0.9
This creates a public URL. Your URL will differ from the example https://3f9cdf86f741d62b16.gradio.live shown here.
* Running on local URL: http://127.0.0.1:8001
* Running on public URL: https://3f9cdf86f741d62b16.gradio.live
Open the public URL (for example, https://3f9cdf86f741d62b16.gradio.live) in your web browser to interact with the chatbot.

What’s Next?
AMD collaborates with innovators like vLLM and other partners to deliver performance and ease of use across diverse workloads. This approach empowers developers to drive innovation, optimize applications, and scale efficiently while benefiting from a flexible, community-driven environment.
Use the above docker image to validate the AI inference benchmark performance numbers on Instinct MI300X accelerators. Additionally, to learn how Fireworks AI, an AMD partner, used the vLLM docker image for inferencing and benchmarking, read FireAttention V3: Enabling AMD as a Viable Alternative for GPU Inference.
Additional resources
- For an overview of the optional performance features of vLLM with ROCm software, visit https://github.com/ROCm/vllm/blob/main/ROCm_performance.md.
- To learn more about the options for latency and throughput benchmark scripts, visit https://github.com/ROCm/vllm/tree/main/benchmarks.
- To stay ahead of the curve in AI innovation for inference, visit AMD Infinity Hub.
- To learn how to run LLM models from Hugging Face or your own model, visit Using ROCm for AI.
- To learn how to optimize inference on LLMs, visit Fine-tuning LLMs and inference optimization.
- For a list of other ready-made Docker images for ROCm, check out the ROCm Docker image support matrix.