Building Large Language Models with the power of AMD
TurkuNLP scaled to 192 nodes on the LUMI supercomputer, powered by AMD EPYC™ CPUs and AMD Instinct™ GPUs, to build Large Language Models for Finnish.
High-performance servers are foundational to Enterprise AI. AMD EPYC™ server CPUs paired with leading GPUs deliver impressive performance for AI training and large-model workloads.
Live Webinar
Learn how the winning combination of AMD EPYC™ processors and industry-leading GPU accelerators provides the muscle needed to tackle the most demanding Enterprise AI challenges.
GPU accelerators have become the workhorse for modern AI, excelling in training large, complex models and supporting efficient real-time inference at scale. However, maximizing the potential of your GPU investment requires a powerful CPU partner.
GPUs are the right tool for many AI workloads.
Combining the power of GPUs with the right CPU can significantly enhance AI efficiency for certain workloads. Look for these key CPU features:
AMD EPYC processors are an ideal choice for unlocking the full potential of your large AI workloads. They help maximize GPU accelerator performance and overall AI workload efficiency. Plus, with advanced security features and a long, consistent commitment to open standards, AMD EPYC processors enable businesses to confidently deploy the next phase of their AI journey.
GPU accelerator-based solutions fueled by AMD EPYC CPUs power many of the world's fastest supercomputers and cloud instances, offering enterprises a proven platform for optimizing data-driven workloads and achieving groundbreaking results in AI.
CPUs play a crucial role in orchestrating and synchronizing data transfers between GPUs, handling kernel launch overheads, and managing data preparation. This "conductor" function ensures that GPUs operate at peak efficiency.
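One way to picture this "conductor" role is a producer/consumer pipeline: a CPU thread prepares the next batch while the accelerator is busy with the current one, so the GPU never waits on data. The sketch below is illustrative only (plain Python standard library, with `time.sleep` standing in for real tokenization and GPU kernels); none of the names come from an AMD or NVIDIA API.

```python
# Illustrative sketch of the CPU "conductor" role: a producer thread
# prepares batches concurrently with the consumer (standing in for a GPU),
# so compute and data preparation overlap instead of running serially.
import queue
import threading
import time

def prepare_batch(i):
    # CPU-side work: decode, tokenize, collate (simulated with a short sleep).
    time.sleep(0.01)
    return [i] * 4  # a "batch" of 4 identical samples

def producer(q, n_batches):
    for i in range(n_batches):
        q.put(prepare_batch(i))  # overlaps with the consumer's compute
    q.put(None)                  # sentinel: no more batches

def consume(q):
    processed = []
    while (batch := q.get()) is not None:
        time.sleep(0.01)         # stand-in for a GPU kernel launch + compute
        processed.append(sum(batch))
    return processed

q = queue.Queue(maxsize=2)       # bounded prefetch queue (backpressure)
t = threading.Thread(target=producer, args=(q, 8))
t.start()
results = consume(q)
t.join()
print(results)                   # batch i sums to 4*i: [0, 4, 8, ..., 28]
```

In real frameworks this pattern appears as prefetching data loaders with worker processes and pinned host memory; a faster CPU shortens the per-batch preparation step and keeps the queue full.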
Some workloads benefit from high CPU clock speeds, which streamline data processing, transfers, and concurrent execution, boosting GPU efficiency.
As a proof of concept that higher CPU frequencies boost Llama2-7B workload throughput, we tested custom AMD EPYC 9554 CPUs in a 2P server equipped with 8x NVIDIA H100 GPUs.1
Processors that combine high performance, low power consumption, efficient data handling, and effective power management capabilities enable your AI infrastructure to operate at peak performance while optimizing energy consumption and cost.
AMD EPYC processors power the world’s most energy-efficient servers, delivering exceptional performance and helping reduce energy costs.2 Deploy them with confidence to create energy-efficient solutions and help optimize your AI journey.
In AMD EPYC 9004 Series processors, AMD Infinity Power Management offers excellent default performance and allows fine-tuning for workload-specific behavior.
Choose from several certified or validated GPU-accelerated solutions hosted by AMD EPYC CPUs to supercharge your AI workloads.
Prefer AMD Instinct accelerator-powered solutions?
Using other GPUs? Ask for AMD EPYC CPU powered solutions available from leading platform solution providers including Asus, Dell, Gigabyte, HPE, Lenovo, and Supermicro.
Ask for instances combining AMD EPYC CPU with GPUs for AI/ML workloads from major cloud providers including AWS, Azure, Google, IBM Cloud, and OCI.
Server configurations: 2P EPYC 9554 (CPU with customized frequencies, 64C/128T, 16 cores active), 1.5TB memory (24x 64GB DDR5-5600 running at 4800 MT/s), 3.2 TB SSD, Ubuntu® 22.04.4 LTS, with 8x NVIDIA H100 80GB HBM3, HuggingFace Transformers v4.31.0, NVIDIA PyTorch 23.12, PEFT v0.4.0, Python 3.10.12, CUDA 12.3.2.001, TensorRT-LLM v0.9.0.dev2024, cuDNN 8.9.7.29+cuda12.2, NVIDIA-SMI driver version 550.54.15, TRT v8.6.1.6+cuda12.0.1.011, Transformer Engine v1.1
Llama2-7B Fine Tuning: BS per device=4, seqln=128, avg over 4 runs, 10 epochs per run, FP16
Llama2-7B Training (1K): BS=56 (7x8 GPUs), seqln=1k, Gradients on GPU
Llama2-7B Training (2K): BS=24 (3x8 GPUs), seqln=2k, Gradients on GPU
Results:
CPU Frequency                                  2000 MHz   2500 MHz   3000 MHz
Fine Tuning: avg. run time (seconds)             649.38     584.24     507.10
  Throughput increase                             0.00%     11.15%     28.06%
Training, 1K seq. length: run time (seconds)     276.08     238.81     230.82
  Throughput increase                             0.00%     15.61%     19.61%
Training, 2K seq. length: run time (seconds)     883.85     807.94     778.72
  Throughput increase                             0.00%      9.40%     13.50%
Results may vary due to factors including system configurations, software versions, and BIOS settings. NOTE: This performance data is a proof of concept, collected on a 2P custom AMD EPYC™ 9554 host processor at various frequencies with 8x NVIDIA H100 80GB accelerators. 4th Gen AMD EPYC processors do not allow end users to adjust frequencies.