What’s New in AMD ROCm 6.4: Breakthrough Inference, Plug-and-Play Containers & Modular Deployment for Scalable AI on AMD Instinct GPUs
Apr 14, 2025

The scale and complexity of modern AI workloads continue to grow—but so do the expectations around performance and ease of deployment. ROCm 6.4 is a leap forward for organizations building the future of AI and HPC on AMD Instinct™ GPUs. With growing support across leading AI frameworks, optimized containers, and modular infrastructure tools, ROCm software continues to gain momentum—empowering customers to innovate faster, operate smarter, and stay in control of their AI infrastructure.
Whether you’re deploying inference across multi-node clusters, training multi-billion-parameter models, or managing large GPU clusters, ROCm 6.4 software offers a seamless path to high performance with AMD Instinct GPUs.
This blog spotlights five key innovations in ROCm 6.4 that directly address common challenges faced by AI researchers, model developers, and infrastructure teams—making AI development fast, simple, and scalable.
1. ROCm Containers for Training and Inference: Plug-and-Play AI on Instinct GPUs
Setting up and maintaining optimized environments for training and inference is time-consuming, error-prone, and slows down iteration cycles. ROCm 6.4 software introduces a powerful suite of ready-to-run, pre-optimized containers for both training and inference—designed specifically for AMD Instinct GPUs.
- vLLM (Inference Container) – Built for low-latency LLM inference with plug-and-play support for open models such as the latest Gemma 3 (day-0), Llama, Mistral, Cohere, and more. Read about Gemma 3 on Instinct GPUs here. Other relevant links: Docker Container, User Guide, Performance Numbers
- SGLang (Inference Container) – Optimized for DeepSeek R1 and agentic workflows, delivering high throughput and efficiency with DeepGEMM, FP8 support, and parallel multi-head attention. Key SGLang resources: Docker Container, User Guide
- PyTorch (Training Container) – Includes performance-tuned builds of PyTorch with support for advanced attention mechanisms, helping enable seamless LLM training on AMD Instinct MI300X GPUs. Now optimized for Llama 3.1 (8B, 70B), Llama 2 (70B), and FLUX.1-dev. Access the PyTorch training Docker image for ROCm and training resources here: Docker Container, User Guide, Performance Numbers, Performance Validation.
- Megatron-LM (Training Container) – A custom ROCm-tuned fork of Megatron-LM designed to efficiently train large-scale language models, including Llama 3.1, Llama 2, and DeepSeek-V2-Lite. Access the Megatron-LM Docker image and training resources here: Docker Container, User Guide, Performance Numbers, Performance Validation
These containers provide AI researchers with faster access to turnkey environments for evaluating new models and running experiments. Model developers can take advantage of pre-tuned support for today’s most advanced LLMs—including Llama 3.1, Gemma 3, and DeepSeek—without needing to spend time on complex configuration. And for infrastructure teams, these containers deliver consistent, reproducible deployment across development, testing, and production environments, enabling smoother scale-out and simplified maintenance.
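To see how plug-and-play these environments are, consider a quick sanity check from inside one of them. The sketch below assumes you have launched the ROCm PyTorch training container with your GPUs passed through; on ROCm builds, PyTorch exposes AMD Instinct GPUs through the familiar torch.cuda API, so no code changes are needed.

```python
# Minimal sanity check inside a ROCm PyTorch container.
# Assumes the container was started with the GPU devices passed through.
import torch

print(torch.__version__)          # ROCm wheels typically carry a "+rocm"-style suffix
print(torch.version.hip)          # HIP version string on ROCm builds (None on CUDA builds)
print(torch.cuda.is_available())  # True if an Instinct GPU is visible
print(torch.cuda.device_count())  # number of GPUs available to the container
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g., an AMD Instinct MI300X
```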
2. PyTorch for ROCm Gets a Major Upgrade: Faster Attention for Faster Training
Training large language models (LLMs) continues to push the limits of compute and memory—and inefficient attention mechanisms can quickly become a major bottleneck, slowing iteration and increasing infrastructure costs. ROCm 6.4 software delivers major performance enhancements within the PyTorch framework, including optimized Flex Attention, TopK, and Scaled Dot-Product Attention (SDPA).
- Flex Attention: Delivers a significant performance leap over ROCm 6.3, dramatically reducing training time and memory overhead—especially in LLM workloads that rely on advanced attention mechanisms.
- TopK: TopK operations now run up to 3x faster, accelerating inference response time while preserving output quality (source)
- SDPA: Optimized Scaled Dot-Product Attention delivers smoother long-context inference (see the usage sketch below).
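To make these concrete, here is a minimal sketch of all three operations in everyday PyTorch code. The tensor shapes, the causal mask, and k=50 are illustrative assumptions rather than tuned settings; Flex Attention requires a recent PyTorch build, and in practice you would wrap it in torch.compile for best performance.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention.flex_attention import flex_attention

# Illustrative shapes: (batch, heads, sequence_length, head_dim); assumes a GPU.
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# SDPA dispatches to the fastest fused attention kernel available on the backend.
out_sdpa = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Flex Attention expresses the mask/bias as a score_mod function -- here, causal masking.
def causal(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

out_flex = flex_attention(q, k, v, score_mod=causal)

# TopK, e.g., keeping the 50 highest-scoring tokens while sampling during decoding.
logits = torch.randn(1, 32000, device="cuda")
values, indices = torch.topk(logits, k=50, dim=-1)
```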
These improvements translate into faster training times, reduced memory overhead, and more efficient hardware utilization. As a result, AI researchers can run more experiments in less time, model developers can fine-tune larger models more efficiently, and ultimately Instinct GPU customers benefit from lower time-to-train and improved return on infrastructure investments.
These upgrades are available out of the box in the ROCm PyTorch container. To learn more about PyTorch for ROCm training, read the blog here.
3. Next-Gen Inference Performance on AMD Instinct GPUs with SGLang and vLLM
Delivering low-latency, high-throughput inference for large language models is a constant challenge—especially as new models emerge and expectations around deployment speed rise. ROCm 6.4 addresses this head-on with inference-optimized builds of vLLM and SGLang, specifically tuned for AMD Instinct GPUs. With robust support for leading models like Grok, DeepSeek R1, Gemma 3, and Llama 3.1 (8B, 70B, 405B), this release empowers AI researchers to achieve faster time-to-results on large-scale benchmarks, while model developers can deploy real-world inference pipelines with minimal tuning or rework. Meanwhile, infrastructure teams benefit from stable, production-ready containers with regular updates—helping ensure performance, reliability, and consistency at scale.
- SGLang with DeepSeek R1: Achieved record-setting throughput on Instinct MI300X
>> Read the blog
- vLLM with Gemma 3: Day-0 support for seamless deployment on Instinct GPUs
>> Dive into Gemma 3 deployment on AMD Instinct GPUs
Together, these tools provide a full-stack inference environment, with stable and dev containers updated bi-weekly and weekly, respectively.
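As an illustration of how little code a deployment takes, the sketch below uses vLLM's offline Python API, as you might run it inside the inference container. The model name and sampling settings are illustrative assumptions, and the weights must be accessible (for example, from Hugging Face) for it to run.

```python
# Minimal vLLM offline-inference sketch for an Instinct GPU.
from vllm import LLM, SamplingParams

# Hypothetical model choice; any model supported by the container works.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(
    ["Why does low-latency inference matter for production LLM services?"],
    params,
)
for output in outputs:
    print(output.outputs[0].text)
```

The same container can instead expose an OpenAI-compatible HTTP endpoint via vllm serve when you need networked inference rather than an in-process engine.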
4. Seamless Instinct GPU Cluster Management with AMD GPU Operator
Scaling and managing GPU workloads across Kubernetes clusters often involves manual driver updates, operational downtime, and limited visibility into GPU health—all of which can hinder performance and reliability. With ROCm 6.4, the AMD GPU Operator brings automation to GPU scheduling, driver lifecycle management, and real-time telemetry—streamlining cluster operations end-to-end. This means infrastructure teams can perform upgrades with minimal disruption, AI and HPC administrators can confidently deploy AMD Instinct GPUs in air-gapped and secure environments with full observability, and Instinct customers benefit from higher uptime, reduced operational risk, and more resilient AI infrastructure.
New features include:
- Automated cordon, drain, and reboot for rolling updates.
- Expanded support for Red Hat OpenShift 4.16–4.17 and Ubuntu 22.04/24.04, helping ensure compatibility with modern cloud and enterprise environments.
- Prometheus-based Device Metrics Exporter for real-time health tracking (see the sketch below).
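Because the Device Metrics Exporter speaks the standard Prometheus text format, any scraper can consume it. The sketch below pulls metrics directly from Python; the endpoint address and the metric-name filter are assumptions, so substitute whatever your deployment exposes (in production, Prometheus itself would do the scraping).

```python
# Sketch: read GPU health metrics from the Device Metrics Exporter.
import urllib.request

METRICS_URL = "http://localhost:5000/metrics"  # hypothetical exporter endpoint

with urllib.request.urlopen(METRICS_URL) as resp:
    for line in resp.read().decode("utf-8").splitlines():
        # Prometheus text format: "# HELP"/"# TYPE" comments, then
        # lines of the form: metric_name{labels} value
        if not line.startswith("#") and "gpu" in line.split("{")[0].lower():
            print(line)
```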
>> Learn more about the GPU Operator from the blog here.
5. Software Modularity with the New Instinct GPU Driver
Tightly coupled driver stacks slow down upgrade cycles, increase maintenance risk, and reduce compatibility across environments. ROCm 6.4 software introduces the Instinct GPU Driver, a modular driver architecture that separates the kernel driver from the ROCm user space.
Key benefits:
- Infrastructure teams can now update drivers or ROCm libraries independently
- Longer 12-month compatibility window (vs. 6 months in prior releases)
- More flexible deployment across bare metal, containers, and ISV applications
This reduces the risk of breaking changes and simplifies fleet-wide updates—especially useful for cloud providers, government organizations, and enterprises with strict SLAs.
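One practical consequence is that you can verify the kernel driver and the ROCm user space independently. A minimal sketch, assuming a typical Linux install (the sysfs and /opt/rocm paths below are assumptions and vary by distribution and packaging):

```python
# Sketch: read the kernel-driver and ROCm user-space versions separately,
# illustrating that the two components now version independently.
from pathlib import Path

components = {
    "amdgpu kernel driver": Path("/sys/module/amdgpu/version"),  # assumed sysfs path
    "ROCm user space": Path("/opt/rocm/.info/version"),          # assumed install path
}

for label, path in components.items():
    if path.exists():
        print(f"{label}: {path.read_text().strip()}")
    else:
        print(f"{label}: not found at {path}")
```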
>> Learn more about ROCm going modular here
Bonus Highlight: AITER for Inference Acceleration
ROCm 6.4 software includes AITER, a high-performance inference library with drop-in, pre-optimized kernels—no manual tuning required.
AITER delivers (source):
- Up to 17X faster decoder execution
- 14X gains in multi-head attention
- 2X LLM inference throughput
>> Read the full AITER blog here
Ready to Take the Leap?
Explore the full potential of ROCm 6.4 software and see how AMD Instinct GPUs can power your next big breakthrough. The ROCm Documentation Hub and other resources are currently being updated with the latest ROCm 6.4 content—details will be available very soon, so stay tuned!
>> Read the comprehensive ROCm 6.4 feature enhancements blog here
Stay updated with the latest developments, tips, and insights by visiting AMD ROCm Blogs. Don’t forget to sign up for the RSS feed to receive regular updates directly to your inbox.
Key Contributors:
Ronnie Chatterjee – Director, Product Management
Jayacharan Kolla – Product Manager
Aditya Bhattacharji – Software Development Engineer
Saad Rahim – SMTS Software Development Engineer
Farshad Ghodsian – SMTS Product Application Eng.
Marco Grond – Developer Relations Manager
