Scalable AI, Served Fresh: The AMD Blueprint for High-Performance AI/HPC Infrastructure
Dec 17, 2024

The AI era is here, and it's hungry—hungry for performance, scalability, and efficiency. Whether you're building next-gen data centers, fine-tuning your AI/ML workloads, or crafting cutting-edge HPC solutions, one thing is clear: the right ingredients matter. Enter AMD Instinct™ MI300 Series GPU accelerators and their newly published Cluster Design Reference Architecture Guide—the ultimate recipe for scalable AI/HPC success.
This blog will guide you through the AMD recommended ingredients, the secret sauce, and the cooking techniques needed to create an AI/HPC infrastructure that’s as efficient as it is powerful. Let’s get cooking.
Ingredient 1: High-Performance Compute Nodes
Every great dish starts with the key ingredients. In the AMD AI kitchen, those ingredients are the AMD MI300 Series Compute Nodes. With 8 AMD Instinct MI300X or MI325X Series GPU accelerators per node, powered by 4th-Gen AMD Infinity Fabric™, you get exceptional interconnect bandwidth and low-latency performance.
Each node is a powerhouse, equipped with dual AMD EPYC™ host processors, up to 6TB of DDR5 memory, and ultra-fast NVMe SSDs. Seamless connectivity is ensured through 8 high-performance AMD Pensando™ Pollara 400G Network Interface Cards (NICs). These NICs support industry-standard RDMA over Converged Ethernet (RoCEv2) and UEC Ready RDMA, featuring a cutting-edge programmable P4 engine that fine-tunes transport and congestion control. The result? Optimized data transfers between GPUs for peak efficiency and ultra-low latency.
Internally, the 8 AMD Instinct GPUs are linked via AMD Infinity Fabric for high-speed, low-latency communication and connect to the host CPUs through PCIe Gen5, ensuring robust performance at every level. Each node also houses two DSC3-400 Distributed Services Cards, the third-generation AMD Pensando DPUs, delivering essential SDN, security, storage offloads, and telemetry services to support the escalating demands of AI workloads.
Figure 1: Accelerator Mesh network with AMD Instinct MI300 Series accelerators
Think of these nodes as your premium proteins—each one designed to help maximize the training and inferencing potential of your AI models. From deep neural networks to high-throughput HPC tasks, this ingredient ensures your infrastructure delivers the flavor of performance you need.
For our AI/ML engineers, this means reduced training times and enhanced performance for inferencing, enabling fast, more efficient real-time predictions. For IT decision-makers, it’s an investment in efficiency and scalability. And for data center architects? These nodes are the building blocks of AI alchemy.
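To make this concrete, here is a minimal sketch, assuming a ROCm build of PyTorch on one of these nodes, that enumerates the accelerators and probes GPU-to-GPU peer access (carried over Infinity Fabric within the node). On ROCm, PyTorch exposes AMD GPUs through the familiar torch.cuda API:

```python
# Minimal sketch: enumerate the accelerators in one MI300-class node and
# probe GPU peer access. Assumes a ROCm build of PyTorch is installed.
import torch

assert torch.cuda.is_available(), "No ROCm-visible GPUs found"
print(f"HIP runtime: {torch.version.hip}")  # set on ROCm builds, None on CUDA builds
n = torch.cuda.device_count()               # expect 8 on an MI300X/MI325X node
print(f"Visible accelerators: {n}")

for i in range(n):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")

# Peer access between GPUs in the node (carried over Infinity Fabric on
# MI300-class nodes) can be probed device by device:
for i in range(n):
    peers = [j for j in range(n)
             if j != i and torch.cuda.can_device_access_peer(i, j)]
    print(f"  GPU {i} can directly access: {peers}")
```

On a healthy 8-GPU node, each device should report direct access to its seven peers.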
Ingredient 2: A Robust Network Fabric
A feast isn’t complete without the right seasoning, and in this case, it’s a robust network fabric. The AMD recipe calls for frontend, backend, and accelerator networks, each carefully designed to meet the demands of massive AI workloads.
Figure 2: Frontend, Accelerator and Backend Networks
- High-Speed Accelerator Network: A mesh of GPUs connected via AMD Infinity Fabric, enabling high-speed, low-latency inter-GPU communication.
- Backend Scale-Out Network: Built using PCIe 5.0 NICs with RDMA over Converged Ethernet (RoCEv2) or UEC Ready RDMA, this network scales clusters while minimizing congestion and managing traffic with precision.
- Frontend Network: Handles data ingestion, storage, in-band management, secure user access, and multi-tenancy, helping ensure seamless operations.
For HPC enthusiasts and researchers, these fabrics help minimize congestion and maximize throughput. For AI/ML engineers, the network fabric enables seamless scalability for training large models and accelerates inferencing workflows by helping ensure fast, efficient GPU-to-GPU and node-to-node communication. For system integrators, the modular network design adapts to any use case and to topologies such as fat tree, rail, or even Dragonfly, making deployment a breeze.
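As a hedged illustration of how a training job rides this fabric, the sketch below initializes PyTorch distributed communication over the backend network. It assumes a ROCm build of PyTorch, where the "nccl" backend maps to AMD's RCCL library, which honors the NCCL_* environment variables; the interface and RDMA device names are placeholders for whatever your nodes actually expose:

```python
# Minimal sketch: bringing up a distributed PyTorch job over the backend
# RoCEv2 fabric. The interface/device names below are placeholders
# (assumptions) -- substitute the names on your own nodes.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # assumption: bootstrap interface
os.environ.setdefault("NCCL_IB_HCA", "ionic_0")      # assumption: backend RDMA device
os.environ.setdefault("NCCL_IB_GID_INDEX", "3")      # commonly used GID index for RoCEv2

# Rank, world size, and master address are normally injected by the
# launcher (torchrun, Slurm, Kubernetes, etc.).
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# A small all-reduce exercises GPU-to-GPU and node-to-node paths end to end.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
print(f"rank {rank}: all-reduce sum = {t.item()} (expect the world size)")
dist.destroy_process_group()
```

Launched with torchrun or a Slurm step across nodes, the final all-reduce value should equal the world size, confirming the fabric is carrying collective traffic.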
Ingredient 3: The Open-Source Software Platform
No recipe is complete without the sauce that ties everything together. Enter the AMD ROCm™ Platform—an open-source toolkit designed to unleash the full potential of heterogeneous architectures. From low-level kernel programming to end-user applications, ROCm’s flexibility and transparency cater to developers who demand control and scalability.
Key Feature Highlight: HIPIFY translates CUDA code to run on AMD GPUs, while optimized binaries for PyTorch, TensorFlow, and ONNX ensure compatibility. The ROCm Data Center Tool (RDC) and ROCm System Management Interface (SMI) simplify cluster management, and ROCm provides end-to-end optimization tools.
Proven Standards: AMD invests in open standards and ecosystem collaboration, ensuring that developers, OEMs, and partners can innovate seamlessly on AMD platforms.
- AI Models and Algorithms: PyTorch, TensorFlow, ONNX (AI ecosystem optimized for AMD)
- Workflow Orchestration and Job Scheduling: Lamini, Slurm, Kubernetes
- Cluster Management: Container applications (Red Hat OpenShift)
- Data Center Management: AMD ROCm (ROCm SMI, ROCm Data Center Tool)
- Hardware: AMD Instinct MI300 Series GPU Accelerators
Figure 3: Software Stack with AMD ROCm, Container and Infrastructure Blocks
With ROCm, developers can fine-tune workloads, while partners and OEMs can integrate seamlessly with AMD to create innovative solutions. It’s the secret ingredient that enhances every bite.
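Before scheduling real work, the stack can be sanity-checked from Python. The sketch below assumes a ROCm build of PyTorch and that rocm-smi is on the PATH; exact rocm-smi flags can vary by ROCm release:

```python
# Minimal sketch: verifying the ROCm software stack end to end before
# submitting real jobs. Assumes a ROCm build of PyTorch and rocm-smi.
import subprocess
import torch

# 1. Confirm the framework was built against HIP/ROCm rather than CUDA.
print("HIP runtime:", torch.version.hip)

# 2. Run a small op on the accelerator as an end-to-end smoke test.
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
torch.cuda.synchronize()
print("matmul OK:", y.shape)

# 3. Ask rocm-smi (the ROCm System Management Interface) for utilization,
#    the same kind of data the ROCm Data Center Tool aggregates cluster-wide.
result = subprocess.run(["rocm-smi", "--showuse"],
                        capture_output=True, text=True)
print(result.stdout)
```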
Ingredient 4: Modular Design for Any Scale
Cooking for one or a banquet? No problem. The AMD cluster design is as modular as a recipe that scales from home cooking to catering for thousands. Each scalable unit consists of up to 64 nodes with a 128-port switch, enabling flexible deployment at any scale. Whether you’re operating a small cluster for inferencing or deploying a huge supercomputer for training, AMD has you covered.
This modularity is important for data center architects who can design racks and nodes that fit their exact power and cooling requirements. For IT leaders, it means investment protection as your AI ambitions grow. And for system integrators, it’s a toolkit that makes deploying solutions as easy as possible.
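For a back-of-the-envelope feel for the scaling math, the sketch below tallies nodes, GPUs, and backend NIC ports per scalable unit using the figures above. It is illustrative arithmetic only, not a substitute for the reference guide's port-level topology planning:

```python
# Back-of-the-envelope sizing for the scalable-unit design described above:
# up to 64 nodes per unit, 8 GPUs and 8 backend NICs per node.
NODES_PER_UNIT = 64
GPUS_PER_NODE = 8
NICS_PER_NODE = 8

def unit_summary(units: int) -> dict:
    """Totals for a cluster built from `units` scalable units."""
    nodes = units * NODES_PER_UNIT
    return {
        "nodes": nodes,
        "gpus": nodes * GPUS_PER_NODE,
        "backend_nic_ports": nodes * NICS_PER_NODE,
    }

for units in (1, 2, 4):
    print(f"{units} unit(s): {unit_summary(units)}")
# 1 unit(s): {'nodes': 64, 'gpus': 512, 'backend_nic_ports': 512}
```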
Cooking Technique: The Blueprint
No great chef wings it, and no great AI architect should either. The new AMD Instinct MI300 Series Cluster Reference Guide is your recipe book for success. It lays out step-by-step instructions for building a scalable, efficient AI cluster that meets your exact needs.
From detailed topology designs (fat tree, rail) to hardware recommendations (NICs, switches, storage), this guide takes the guesswork out of scaling AI infrastructure. For integrators, researchers, and decision-makers alike, this blueprint is the bridge between ambition and execution.
Why the AMD Recipe Stands Out
While other formulas are in play and some are still being refined, AMD has mastered the art of balance, delivering proven solutions that help leading customers achieve their AI ambitions with confidence.
- Performance: Industry-leading GPUs, memory, and interconnect speeds that help ensure every workload is optimized.
- Flexibility: Open standards and modular designs mean you’re not locked into one approach.
- Support: AMD partners with PyTorch, Hugging Face, Databricks, Lamini, Dell, Lenovo, and more, creating a rich ecosystem of solutions.
This isn’t just a recipe—it’s a culinary revolution for the AI era.
The Final Dish: A Scalable AI Feast
Imagine an infrastructure that seamlessly scales as your AI ambitions grow. A cluster where every GPU, every node, and every network is optimized for peak performance. That’s what AMD delivers—a feast for your data center that’s ready to tackle the challenges of AI and HPC.
Whether you're a data center architect designing the perfect layout, an AI researcher optimizing model training or inference, or an IT leader making strategic investments, AMD has the solution that satisfies every need.
Ready to Bring this Recipe to Life?
Download the AMD Instinct MI300 Series Cluster Reference Guide today and discover the in-depth blueprint for building scalable, high-performance AI clusters. Don’t just follow the AI trend—lead it with AMD.
Your future-proof AI infrastructure awaits. Let’s work towards your success together.
Chief Contributors:
Soumen Karmakar – Sr. Product Manager
Jay Bennett - Product Management Director
Key Contributors:
Jason Gmitter - Technical Marketing Director
Azeem Suleman - Technical Marketing Director
DISCLAIMER: The information contained herein is for informational purposes only and is subject to change without notice. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD products are as set forth in a signed agreement between the parties or in AMD's Standard Terms and Conditions of Sale. GD-18u.
© 2024 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, EPYC, Instinct, Pensando, Infinity Fabric, ROCm, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective owners. Certain AMD technologies may require third-party enablement or activation. Supported features may vary by operating system. Please confirm with the system manufacturer for specific features. No technology or product can be completely secure.
