HPCG
Introduction
The High-Performance Conjugate Gradients (HPCG) Benchmark project is an effort to create a new metric for ranking HPC systems. HPCG is intended as a complement to the High Performance LINPACK (HPL) benchmark, currently used to rank the TOP500 computing systems. The computational and data access patterns of HPL are still representative of some important scalable applications, but not all. HPCG is designed to exercise computational and data access patterns that more closely match a different and broad set of important applications, and to give incentive to computer system designers to invest in capabilities that will have impact on the collective performance of these applications.
Official Website: https://www.hpcg-benchmark.org/
For the best benchmark scores on AMD Zen architectures, we recommend using the AMD Zen HPCG binaries, which are optimized for EPYC platforms. For further details, refer to AMD Zen HPCG.
Build HPCG using Spack
Please refer to Getting Started with Spack using AMD Zen Software Studio for instructions on setting up Spack.
# Example for building HPCG with AOCC
$ spack install hpcg %aocc +openmp ^openmpi fabrics=cma,ucx
Explanation of the command options:
Symbol | Meaning |
---|---|
%aocc | Build HPCG with the AOCC compiler. |
+openmp | Enable OpenMP support in HPCG. |
^openmpi fabrics=cma,ucx | Use Open MPI as the MPI provider, with CMA for efficient intra-node communication and the UCX fabric as a fallback when required. Note: where possible, explicitly set the fabric appropriate to the host system; refer to Open MPI with AMD Zen Software Studio for more guidance. |
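As an optional check before installing, Spack can print the fully concretized spec so the selected compiler, MPI provider, and variants can be reviewed. This sketch uses the standard spack spec command (-I shows install status, -l shows dependency hashes):
# Optional: preview the concretized spec before installing
$ spack spec -I -l hpcg %aocc +openmp ^openmpi fabrics=cma,ucx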
Configuring HPCG Run Parameters
The HPCG run parameters can be configured through the hpcg.dat file or via command-line arguments.
To produce results that comply with the rules for valid official HPCG runs (e.g. for leaderboard submissions), HPCG runs must be configured to meet the following criteria:
- Problem size - (Line 3) This is the size of the local (per rank) 3D matrix, so for a fixed problem size the total memory usage still scales with the number of MPI ranks. A valid run must use a problem size large enough that the data arrays accessed in the CG iteration loop do not fit entirely in the CPU caches; if they did, DRAM latency and bandwidth would not influence the result. The HPCG guidelines state that the problem size should be large enough to occupy at least 25% of main memory (a quick way to estimate this target is sketched after this list).
- Run time - (Line 4) HPCG can be run in just a few minutes from start to finish. Our testing suggests that 60 seconds is enough to produce consistent results; however, official runs must report a run time of at least 1800 seconds (30 minutes) in the output file.
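As a quick aid for sizing against the 25% guideline above, the total system memory can be read from the operating system and a minimum working-set target derived from it. This is an illustrative sketch only (it is not part of HPCG) and assumes a Linux host with the standard free and awk utilities:
# Illustrative only: derive the minimum HPCG working-set target from the 25% guideline
TOTAL_GIB=$(free -g | awk '/^Mem:/ {print $2}')
echo "HPCG data arrays should occupy at least $(( TOTAL_GIB / 4 )) GiB across all ranks on this node"
The memory actually used for a given problem size is reported in the HPCG output file, so the chosen dimensions can be checked against this target after a short trial run.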
Sample hpcg.dat file for a dual-socket AMD 5th Gen EPYC™ 9755 processor system with 256 (2 x 128) cores and 512 GB of memory.
hpcg.dat
HPCG benchmark input file
Comment line - may contain any useful string
192 192 192 # dimensions of local (per rank) 3D matrix
1800 # Minimum runtime in seconds
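If editing hpcg.dat is inconvenient, the same parameters can also be passed on the command line, as noted above. The option names below are those used by the reference HPCG 3.1 sources and may differ in vendor-optimized binaries, so verify them against the documentation for your build:
# Equivalent command-line configuration: 192^3 local grid per rank, 1800 s minimum runtime
# (append these arguments to the xhpcg invocation shown in the run script below)
xhpcg --nx=192 --ny=192 --nz=192 --rt=1800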
Running HPCG
HPCG is run using mpirun in the standard way for an MPI or MPI+OpenMP application. If you have built HPCG with OpenMP support (recommended for best performance), launch it with a process mapping that keeps the OpenMP threads of each MPI rank within a single CCX (on an AMD EPYC™ CPU, a CCX is a group of cores that share an L3 cache), as threads spanning CCXs will degrade performance. Systems should also be configured with SMT (hardware multithreading) switched off.
With Open MPI this is achieved using the --map-by option. For example, to run HPCG with two OpenMP threads per rank on an AMD EPYC™ Gen 2 or later CPU:
Run Script for AMD EPYC™ Processors
#!/bin/bash
# Loading HPCG built with AOCC
spack load hpcg %aocc
# Number of cores sharing an L3 cache (a CCX): 8 for most EPYC Gen 2-5 CPUs, 4 for EPYC Gen 1
# For frequency optimised "F-parts", check documentation
CORES_PER_L3CACHE=8
NUM_CORES=$(nproc)
# OpenMP Settings
export OMP_PROC_BIND=true
export OMP_PLACES=cores
export OMP_NUM_THREADS=2
# MPI settings
MPI_RANKS=$(( $NUM_CORES / $OMP_NUM_THREADS ))
RANKS_PER_L3CACHE=$(( $CORES_PER_L3CACHE / $OMP_NUM_THREADS ))
MPI_OPTS="-np $MPI_RANKS --bind-to core --map-by ppr:$RANKS_PER_L3CACHE:l3cache:pe=$OMP_NUM_THREADS"
# Run HPCG
mpirun $MPI_OPTS xhpcg
Note: Users should update the value of CORES_PER_L3CACHE to match that of the CPU they are using, for example CORES_PER_L3CACHE=4 for AMD EPYC™ Gen 1 (Naples) CPUs. Users of Frequency Optimized AMD EPYC™ CPUs ("F parts") should refer to product documentation to find the appropriate value for their specific CPU SKU.
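Rather than looking the value up per SKU, CORES_PER_L3CACHE can also be confirmed directly on the host by asking the kernel which CPUs share cpu0's level-3 cache. This is a sketch that assumes the standard Linux sysfs layout (cache index3 is normally the L3); with SMT enabled the list contains hardware threads, so halve the count to get physical cores per CCX:
# Sketch: list the CPUs that share cpu0's L3 cache (index3 is typically the L3 on Linux)
# e.g. "0-7" means 8 cores per CCX; with SMT on, the SMT siblings also appear in the list
cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list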
Note: The above build and run steps were tested with HPCG 3.1, AOCC 5.0.0, and Open MPI 5.0.5 on Red Hat Enterprise Linux release 8.9 (Ootpa) using Spack v0.23.0.dev0 (commit id: 2da812cbad).
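After a successful run, HPCG writes its results to a text file in the working directory, and the official GFLOP/s rating appears in the Final Summary section of that file. A quick way to extract it (the file name pattern and exact summary wording may vary between HPCG versions and vendor builds):
# Print the reported rating from the HPCG output file (name pattern may vary by version)
grep "GFLOP/s rating" HPCG-Benchmark*.txt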
For technical support on the tools, benchmarks and applications that AMD offers on this page and related inquiries, reach out to us at toolchainsupport@amd.com.