What’s New in AMD ROCm 6.4: Breakthrough Inference, Plug-and-Play Containers & Modular Deployment for Scalable AI on AMD Instinct GPUs
Apr 14, 2025

The scale and complexity of modern AI workloads continue to grow—but so do the expectations around performance and ease of deployment. ROCm 6.4 is a leap forward for organizations building the future of AI and HPC on AMD Instinct™ GPUs. With growing support across leading AI frameworks, optimized containers, and modular infrastructure tools, ROCm software continues to gain momentum—empowering customers to innovate faster, operate smarter, and stay in control of their AI infrastructure.
Whether you’re deploying inference across multi-node clusters, training multi-billion-parameter models, or managing large GPU clusters, ROCm 6.4 software offers a seamless path to high performance with AMD Instinct GPUs.
This blog spotlights five key innovations in ROCm 6.4 that directly address common challenges faced by AI researchers, model developers, and infrastructure teams—making AI development fast, simple, and scalable.
1. ROCm Containers for Training and Inference: Plug-and-Play AI on Instinct GPUs
Setting up and maintaining optimized environments for training and inference is time-consuming, error-prone, and slows down iteration cycles. ROCm 6.4 software introduces a powerful suite of ready-to-run, pre-optimized containers for both training and inference—designed specifically for AMD Instinct GPUs.
- vLLM (Inference Container) – Built for low-latency LLM inference with plug-and-play support for open models such as the latest Gemma 3 (day-0), Llama, Mistral, Cohere, and more. Read about Gemma 3 on Instinct GPUs here. Other relevant links: Docker Container, User Guide, Performance Numbers
- SGLang (Inference Container) – Optimized for DeepSeek R1 and agentic workflows, delivering high throughput and efficiency with DeepGEMM, FP8 support, and parallel multi-head attention. Key SGLang resources: Docker Container, User Guide
- PyTorch (Training Container) – Includes performance-tuned builds of PyTorch with support for advanced attention mechanisms, helping enable seamless LLM training on AMD Instinct MI300X GPUs. Now optimized for Llama 3.1 (8B, 70B), Llama 2 (70B), and FLUX.1-dev. Access the PyTorch training Docker image for ROCm and training resources here: Docker Container, User Guide, Performance Numbers, Performance Validation.
- Megatron-LM (Training Container) – A custom ROCm-tuned fork of Megatron-LM designed to efficiently train large-scale language models, including Llama 3.1, Llama 2, and DeepSeek-V2-Lite. Access the Megatron-LM Docker image and training resources here: Docker Container, User Guide, Performance Numbers, Performance Validation
These containers provide AI researchers with faster access to turnkey environments for evaluating new models and running experiments. Model developers can take advantage of pre-tuned support for today’s most advanced LLMs—including Llama 3.1, Gemma 3, and DeepSeek—without needing to spend time on complex configuration. And for infrastructure teams, these containers deliver consistent, reproducible deployment across development, testing, and production environments, enabling smoother scale-out and simplified maintenance.
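To see how plug-and-play these environments are, consider a quick sanity check from inside one of them. The sketch below assumes you have launched the ROCm PyTorch training container with your GPUs passed through; on ROCm builds, PyTorch exposes AMD Instinct GPUs through the familiar torch.cuda API, so no code changes are needed.

```python
# Minimal sanity check inside a ROCm PyTorch container.
# Assumes the container was started with the GPU devices passed through.
import torch

print(torch.__version__)          # ROCm wheels typically carry a "+rocm"-style suffix
print(torch.version.hip)          # HIP version string on ROCm builds (None on CUDA builds)
print(torch.cuda.is_available())  # True if an Instinct GPU is visible
print(torch.cuda.device_count())  # number of GPUs available to the container
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g., an AMD Instinct MI300X
```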
2. PyTorch for ROCm Gets a Major Upgrade: Faster Attention for Faster Training
Training large language models (LLMs) continues to push the limits of compute and memory—and inefficient attention mechanisms can quickly become a major bottleneck, slowing iteration and increasing infrastructure costs. ROCm 6.4 software delivers major performance enhancements within the PyTorch framework, including optimized Flex Attention, TopK, and Scaled Dot-Product Attention (SDPA).
- Flex Attention: Delivers a significant performance leap over ROCm 6.3, dramatically reducing training time and memory overhead—especially in LLM workloads that rely on advanced attention mechanisms.
- TopK: TopK operations now run up to 3x faster, accelerating inference response time while preserving output quality (source)
- SDPA: Optimized Scaled Dot-Product Attention delivers smoother long-context inference (see the usage sketch below).
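To make these concrete, here is a minimal sketch of all three operations in everyday PyTorch code. The tensor shapes, the causal mask, and k=50 are illustrative assumptions rather than tuned settings; Flex Attention requires a recent PyTorch build, and in practice you would wrap it in torch.compile for best performance.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention.flex_attention import flex_attention

# Illustrative shapes: (batch, heads, sequence_length, head_dim); assumes a GPU.
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# SDPA dispatches to the fastest fused attention kernel available on the backend.
out_sdpa = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Flex Attention expresses the mask/bias as a score_mod function -- here, causal masking.
def causal(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

out_flex = flex_attention(q, k, v, score_mod=causal)

# TopK, e.g., keeping the 50 highest-scoring tokens while sampling during decoding.
logits = torch.randn(1, 32000, device="cuda")
values, indices = torch.topk(logits, k=50, dim=-1)
```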
These improvements translate into faster training times, reduced memory overhead, and more efficient hardware utilization. As a result, AI researchers can run more experiments in less time, model developers can fine-tune larger models more efficiently, and ultimately Instinct GPU customers benefit from lower time-to-train and improved return on infrastructure investments.
These upgrades are available out of the box in the ROCm PyTorch container. To learn more about PyTorch for ROCm training, read the blog here.
3. Next-Gen Inference Performance on AMD Instinct GPUs with SGLang and vLLM
Delivering low-latency, high-throughput inference for large language models is a constant challenge—especially as new models emerge and expectations around deployment speed rise. ROCm 6.4 addresses this head-on with inference-optimized builds of vLLM and SGLang, specifically tuned for AMD Instinct GPUs. With robust support for leading models like Grok, DeepSeek R1, Gemma 3, and Llama 3.1 (8B, 70B, 405B), this release empowers AI researchers to achieve faster time-to-results on large-scale benchmarks, while model developers can deploy real-world inference pipelines with minimal tuning or rework. Meanwhile, infrastructure teams benefit from stable, production-ready containers with regular updates—helping ensure performance, reliability, and consistency at scale.
- SGLang with DeepSeek R1: Achieved record-setting throughput on Instinct MI300X
>> Read the blog
- vLLM with Gemma 3: Day-0 support for seamless deployment on Instinct GPUs
>> Dive into Gemma 3 deployment on AMD Instinct GPUs
Together, these tools provide a full-stack inference environment, with stable and dev containers updated bi-weekly and weekly, respectively.
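As an illustration of how little code a deployment takes, the sketch below uses vLLM's offline Python API, as you might run it inside the inference container. The model name and sampling settings are illustrative assumptions, and the weights must be accessible (for example, from Hugging Face) for it to run.

```python
# Minimal vLLM offline-inference sketch for an Instinct GPU.
from vllm import LLM, SamplingParams

# Hypothetical model choice; any model supported by the container works.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(
    ["Why does low-latency inference matter for production LLM services?"],
    params,
)
for output in outputs:
    print(output.outputs[0].text)
```

The same container can instead expose an OpenAI-compatible HTTP endpoint via vllm serve when you need networked inference rather than an in-process engine.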
4. Seamless Instinct GPU Cluster Management with AMD GPU Operator
Scaling and managing GPU workloads across Kubernetes clusters often involves manual driver updates, operational downtime, and limited visibility into GPU health—all of which can hinder performance and reliability. With ROCm 6.4, the AMD GPU Operator brings automation to GPU scheduling, driver lifecycle management, and real-time telemetry—streamlining cluster operations end-to-end. This means infrastructure teams can perform upgrades with minimal disruption, AI and HPC administrators can confidently deploy AMD Instinct GPUs in air-gapped and secure environments with full observability, and Instinct customers benefit from higher uptime, reduced operational risk, and more resilient AI infrastructure.
New features include:
- Automated cordon, drain, and reboot for rolling updates.
- Expanded support for Red Hat OpenShift 4.16–4.17 and Ubuntu 22.04/24.04, helping ensure compatibility with modern cloud and enterprise environments.
- Prometheus-based Device Metrics Exporter for real-time health tracking (see the sketch below).
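Because the Device Metrics Exporter speaks the standard Prometheus text format, any scraper can consume it. The sketch below pulls metrics directly from Python; the endpoint address and the metric-name filter are assumptions, so substitute whatever your deployment exposes (in production, Prometheus itself would do the scraping).

```python
# Sketch: read GPU health metrics from the Device Metrics Exporter.
import urllib.request

METRICS_URL = "http://localhost:5000/metrics"  # hypothetical exporter endpoint

with urllib.request.urlopen(METRICS_URL) as resp:
    for line in resp.read().decode("utf-8").splitlines():
        # Prometheus text format: "# HELP"/"# TYPE" comments, then
        # lines of the form: metric_name{labels} value
        if not line.startswith("#") and "gpu" in line.split("{")[0].lower():
            print(line)
```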
>> Learn more about the GPU Operator from the blog here.
5. Software Modularity with the New Instinct GPU Driver
Tightly coupled driver stacks slow down upgrade cycles, increase maintenance risk, and reduce compatibility across environments. ROCm 6.4 software introduces the Instinct GPU Driver, a modular driver architecture that separates the kernel driver from the ROCm user space.
Key benefits:
- Infrastructure teams can now update drivers or ROCm libraries independently
- Longer 12-month compatibility window (vs. 6 months in prior releases)
- More flexible deployment across bare metal, containers, and ISV applications
This reduces the risk of breaking changes and simplifies fleet-wide updates—especially useful for cloud providers, government organizations, and enterprises with strict SLAs.
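One practical consequence is that you can verify the kernel driver and the ROCm user space independently. A minimal sketch, assuming a typical Linux install (the sysfs and /opt/rocm paths below are assumptions and vary by distribution and packaging):

```python
# Sketch: read the kernel-driver and ROCm user-space versions separately,
# illustrating that the two components now version independently.
from pathlib import Path

components = {
    "amdgpu kernel driver": Path("/sys/module/amdgpu/version"),  # assumed sysfs path
    "ROCm user space": Path("/opt/rocm/.info/version"),          # assumed install path
}

for label, path in components.items():
    if path.exists():
        print(f"{label}: {path.read_text().strip()}")
    else:
        print(f"{label}: not found at {path}")
```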
>> Learn more about ROCm going modular here
Bonus Highlight: AITER for Inference Acceleration
ROCm 6.4 software includes AITER, a high-performance inference library with drop-in, pre-optimized kernels—no manual tuning required.
AITER delivers (source):
- Up to 17X faster decoder execution
- 14X gains in multi-head attention
- 2X LLM inference throughput
>> Read the full AITER blog here
Ready to Take the Leap?
Explore the full potential of ROCm 6.4 software and see how AMD Instinct GPUs can power your next big breakthrough. The ROCm Documentation Hub and other resources are currently being updated with the latest ROCm 6.4 content—details will be available very soon, so stay tuned!
>> Read the comprehensive ROCm 6.4 feature enhancements blog here
Stay updated with the latest developments, tips, and insights by visiting AMD ROCm Blogs. Don’t forget to sign up for the RSS feed to receive regular updates directly to your inbox.
Key Contributors:
Ronnie Chatterjee – Director, Product Management
Jayacharan Kolla – Product Manager
Aditya Bhattacharji – Software Development Engineer
Saad Rahim – SMTS Software Development Engineer
Farshad Ghodsian – SMTS Product Application Eng.
Marco Grond – Developer Relations Manager
