Introduction

The STREAM benchmark is a simple, synthetic benchmark program that measures sustainable main memory bandwidth in MB/s and the corresponding computation rate for simple vector kernels.

The general rule for running STREAM is that each array must be at least 4x the size of the sum of all the last-level caches used in the run, or 1 million elements, whichever is larger. 
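
The sizing rule above can be turned into a quick calculation. The following is a minimal sketch, not part of STREAM or Spack: `min_stream_elements` is a hypothetical helper name, and it assumes the cache size is given in MiB (so 1 GB of L3 is treated as 1024 MiB).

```shell
#!/bin/bash
# Sketch: smallest STREAM array length (in elements) allowed by the sizing rule.
# min_stream_elements is a hypothetical helper, not part of STREAM or Spack.
#   $1 = combined last-level cache used in the run, in MiB (assumed unit)
#   $2 = bytes per element (8 for double, 4 for float); defaults to 8
min_stream_elements() {
    local llc_mib=$1
    local elem_bytes=${2:-8}
    # 4x the combined last-level cache, expressed in elements
    local by_cache=$(( 4 * llc_mib * 1024 * 1024 / elem_bytes ))
    # The rule also sets a lower bound of 1 million elements
    local floor=1000000
    if (( by_cache > floor )); then
        echo "$by_cache"
    else
        echo "$floor"
    fi
}

min_stream_elements 1024 8   # 1 GiB of combined L3, double precision
min_stream_elements 768 8    # 768 MiB of combined L3, double precision
```

Any array size at or above the printed value satisfies the rule; choosing a somewhat larger value leaves a safety margin.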

STREAM uses four kernels for analysis:

  1. "Copy" measures transfer rates in the absence of arithmetic.
  2. "Scale" adds a simple arithmetic operation.
  3. "Sum" adds a third operand to allow multiple load/store ports on vector machines to be tested.
  4. "Triad" allows chained/overlapped/fused multiply/add operations.
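
To relate the kernels to the reported MB/s figure: STREAM counts two arrays of traffic per iteration for Copy and Scale (one read, one write) and three for Sum and Triad (two reads, one write), then divides the bytes moved by the best time. The sketch below reproduces that arithmetic. `stream_mbps` is a hypothetical helper name, and the timing passed in the example call is an arbitrary placeholder, not a measurement.

```shell
#!/bin/bash
# Sketch: convert a kernel's execution time into a STREAM-style rate in MB/s.
# stream_mbps is a hypothetical helper, not part of STREAM.
#   $1 = kernel name (copy, scale, sum/add, triad)
#   $2 = array length in elements
#   $3 = bytes per element (8 for double, 4 for float)
#   $4 = execution time in seconds
stream_mbps() {
    local kernel=$1 n=$2 elem_bytes=$3 seconds=$4 arrays
    case "$kernel" in
        copy|scale)    arrays=2 ;;   # one array read, one array written
        sum|add|triad) arrays=3 ;;   # two arrays read, one array written
        *) echo "unknown kernel: $kernel" >&2; return 1 ;;
    esac
    # MB/s = bytes moved / seconds / 10^6
    awk -v a="$arrays" -v n="$n" -v b="$elem_bytes" -v t="$seconds" \
        'BEGIN { printf "%.1f\n", a * n * b / t / 1.0e6 }'
}

# Placeholder timing of 0.05 s for Triad on 650000000 doubles
stream_mbps triad 650000000 8 0.05
```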

Official Website: https://www.cs.virginia.edu/stream/

 

Build STREAM using Spack

See the AMD Zen Software Studio documentation for getting started with Spack.

    # Building STREAM with double datatype, 5.2GB arrays and 100 iterations
    # for a dual socket AMD 5th Gen EPYC™ Processor with 1GB of cumulative L3 cache
    $ spack install stream +openmp %aocc stream_array_size=650000000 ntimes=100 stream_type=double

    # Building STREAM with double datatype, 3.4GB arrays and 100 iterations
    # for a dual socket AMD 4th Gen EPYC™ Processor with 768MB of cumulative L3 cache
    $ spack install stream +openmp %aocc stream_array_size=430080000 ntimes=100 stream_type=double

Explanation of the command options:

| Symbol | Meaning |
|---|---|
| +openmp | Enable building with OpenMP. |
| %aocc | Build STREAM using the AOCC compiler. |
| stream_array_size | Array size for the STREAM benchmark. The general recommendation is that "STREAM_ARRAY_SIZE" must be at least 4x the size of the sum of all the last-level caches in the system. |
| ntimes | Iteration count to run each kernel in STREAM. |
| stream_type | float or double. Choose single or double datatype for the STREAM arrays. |
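
Before building, it can help to check how much memory a given `stream_array_size` and `stream_type` will need, since STREAM allocates three arrays of that length. The sketch below is an illustration only: `stream_footprint_gb` is a hypothetical helper name, and GB is taken as 10^9 bytes.

```shell
#!/bin/bash
# Sketch: memory footprint implied by stream_array_size and stream_type.
# stream_footprint_gb is a hypothetical helper, not part of STREAM or Spack.
#   $1 = stream_array_size (elements per array)
#   $2 = stream_type (double or float); defaults to double
stream_footprint_gb() {
    local n=$1 stream_type=${2:-double} elem_bytes
    case "$stream_type" in
        double) elem_bytes=8 ;;
        float)  elem_bytes=4 ;;
        *) echo "unsupported stream_type: $stream_type" >&2; return 1 ;;
    esac
    # STREAM works on three arrays of equal length
    awk -v n="$n" -v b="$elem_bytes" \
        'BEGIN { printf "per-array=%.2fGB total=%.2fGB\n", n * b / 1e9, 3 * n * b / 1e9 }'
}

stream_footprint_gb 650000000 double
stream_footprint_gb 430080000 double
```

The total should fit comfortably in the system's free memory.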

Running STREAM

The STREAM benchmark is used to measure the memory bandwidth of a system. On AMD EPYC™ processors, the maximum memory bandwidth is typically achieved by underpopulating each core complex (CCX, a group of cores with a shared L3 cache), that is, by running fewer threads than there are cores. Using one, two, or four threads per CCX is typically optimal, depending on the processor architecture, and SMT (hardware multithreading) should be switched off. The required OpenMP core layout therefore depends on the AMD EPYC™ SKU in use. For example, 1st and 2nd Generation AMD EPYC™ CPUs (Naples and Rome) have up to 4 cores per L3 cache, while most 3rd to 5th Generation AMD EPYC™ CPUs have 8 cores per L3 cache. Some frequency-optimized SKUs, such as the 7xF3 family, have fewer cores per L3 cache.
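
One way to find the cores/CCX value on a running Linux system is to count the CPUs that share an L3 cache with CPU 0. The sketch below assumes that the sysfs cache entry `index3` is the L3 cache and that the kernel exposes `/sys/devices/system/cpu/smt/active`; both are typical on recent x86 Linux kernels but are not guaranteed. `count_cpulist` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: count the CPUs named in a Linux cpulist string such as "0-7" or "0-3,64-67".
# count_cpulist is a hypothetical helper name.
count_cpulist() {
    local list=$1 total=0 range lo hi
    local IFS=,
    for range in $list; do          # split the list on commas
        lo=${range%-*}              # text before the dash (or the whole entry)
        hi=${range#*-}              # text after the dash (or the whole entry)
        total=$(( total + hi - lo + 1 ))
    done
    echo "$total"
}

# Assumption: index3 is the L3 cache for cpu0 on this system.
l3_list=/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
if [ -r "$l3_list" ]; then
    echo "CPUs sharing an L3 cache with cpu0: $(count_cpulist "$(cat "$l3_list")")"
fi

# Assumption: this file exists; it reads 1 when SMT is active and 0 when it is off.
smt_active=/sys/devices/system/cpu/smt/active
if [ -r "$smt_active" ]; then
    echo "SMT active: $(cat "$smt_active")"
fi
```

With SMT switched off, the reported count equals the number of cores per CCX; with SMT on, it counts hardware threads instead.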

The table below gives recommended OMP_PLACES values for each threads/CCX configuration, depending on the cores/CCX configuration of the processor. The values are for a single socket; the dual-socket examples in the run script double the number of places.

| Cores per CCX (L3 Cache) | Total Cores | OMP_PLACES for 1 thread/CCX | OMP_PLACES for 2 threads/CCX | OMP_PLACES for 4 threads/CCX | Example AMD CPU SKU |
|---|---|---|---|---|---|
| 1 | 8 | 0:8:1 | N/A | N/A | AMD EPYC™ 72F3 |
| 2 | 16 | 0:8:2 | 0:16:1 | N/A | AMD EPYC™ 73F3 |
| 3 | 24 | 0:8:3 | 0:8:3,1:8:3 | N/A | AMD EPYC™ 74F3 |
| 4 | 32 | 0:8:4 | 0:16:2 | 0:32:1 | AMD EPYC™ 75F3, AMD EPYC™ 7371 |
| 8 | 64 | 0:8:8 | 0:16:4 | 0:32:2 | AMD EPYC™ 7763 |
| 8 | 96 | 0:12:8 | 0:24:4 | 0:48:2 | AMD EPYC™ 9654 |
| 8 | 128 | 0:16:8 | 0:32:4 | 0:64:2 | AMD EPYC™ 9755 |
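
The pattern in the table can be generated rather than looked up. The following is a sketch under stated assumptions: cores are numbered contiguously within each CCX, SMT is off, and `omp_places` is a hypothetical helper name. When the thread count divides the cores per CCX evenly it emits a single `start:count:stride` interval; otherwise it emits one interval per thread slot, as in the 3 cores/CCX row.

```shell
#!/bin/bash
# Sketch: derive an OMP_PLACES string from the CCX layout.
# omp_places is a hypothetical helper; it assumes contiguous core numbering
# within each CCX and SMT switched off.
#   $1 = cores per CCX, $2 = total cores to use, $3 = threads per CCX
omp_places() {
    local cores_per_ccx=$1 total_cores=$2 threads_per_ccx=$3
    local num_ccx=$(( total_cores / cores_per_ccx ))
    if (( threads_per_ccx > cores_per_ccx )); then
        echo "N/A"                  # more threads requested than cores in a CCX
        return 1
    fi
    if (( cores_per_ccx % threads_per_ccx == 0 )); then
        # Evenly spaced cores: a single interval start:count:stride
        echo "0:$(( num_ccx * threads_per_ccx )):$(( cores_per_ccx / threads_per_ccx ))"
    else
        # Uneven split: one interval per thread slot, offset by one core each
        local places="" i
        for (( i = 0; i < threads_per_ccx; i++ )); do
            places+="${places:+,}${i}:${num_ccx}:${cores_per_ccx}"
        done
        echo "$places"
    fi
}

omp_places 8 96 4     # single socket, 96 cores, 4 threads/CCX
omp_places 3 24 2     # single socket, 24 cores, 2 threads/CCX
omp_places 8 192 4    # dual socket, 2 x 96 cores, 4 threads/CCX
```

Passing the core count of both sockets as the second argument yields the doubled place count used for dual-socket systems.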

 

Run Script for AMD EPYC™ Processors

    #!/bin/bash
    # Load STREAM build with AOCC into environment
    # NOTE: if you have compiled multiple versions you may need to be more specific.
    # Spack will complain if your request is ambiguous and could refer to multiple
    # packages. (https://spack.readthedocs.io/en/latest/basic_usage.html#ambiguous-specs)
    spack load stream %aocc

    # For optimal STREAM performance, it is recommended to set the following OS parameters (requires root/sudo access)
    echo always > /sys/kernel/mm/transparent_hugepage/enabled    # Enable transparent hugepages
    echo always > /sys/kernel/mm/transparent_hugepage/defrag     # Enable transparent hugepage defragmentation
    echo 3 > /proc/sys/vm/drop_caches                            # Clear caches to maximize available RAM
    echo 1 > /proc/sys/vm/compact_memory                         # Rearrange RAM usage to maximize the size of free blocks

    # Optimize OpenMP performance behavior
    export OMP_SCHEDULE=static  # Disable dynamic loop scheduling
    export OMP_PROC_BIND=TRUE   # Bind threads to specific resources
    export OMP_DYNAMIC=false    # Disable dynamic thread pool sizing

    # OMP_PLACES is used for binding OpenMP threads to cores
    # See: https://www.openmp.org/spec-html/5.0/openmpse53.html
    # Set OMP_PLACES and OMP_NUM_THREADS for ONE processor only: keep the block
    # that matches your system and comment out the other.

    ############# FOR AMD EPYC™ 9654 ##################
    # For example, a dual socket AMD 4th Gen EPYC™ Processor with 192 (96x2) cores,
    # with 4 threads per L3 cache: 96 total places, stride by 2 cores:
    export OMP_PLACES=0:96:2
    export OMP_NUM_THREADS=96

    ############# FOR AMD EPYC™ 9755 ##################
    # For example, a dual socket AMD 5th Gen EPYC™ Processor with 256 (128x2) cores,
    # with 1 thread per L3 cache: 32 total places, stride by 8 cores:
    # export OMP_PLACES=0:32:8
    # export OMP_NUM_THREADS=32

    # Running stream
    stream_c.exe
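
After a run, the best rate for each kernel can be pulled out of the output for logging or comparison. The sketch below assumes the result table printed by stream.c 5.10, in which each result line starts with `Copy:`, `Scale:`, `Add:`, or `Triad:` followed by the best rate in MB/s (the "Sum" kernel is labeled `Add` in the output). `stream_best_rates` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: extract "<kernel> <best rate in MB/s>" pairs from STREAM output.
# stream_best_rates is a hypothetical helper; it assumes result lines of the
# form "Copy:  <rate>  <avg time>  <min time>  <max time>".
# Reads the files given as arguments, or standard input if none are given.
stream_best_rates() {
    awk '/^(Copy|Scale|Add|Triad):/ { sub(":", "", $1); print $1, $2 }' "$@"
}

# Usage (assumes stream_c.exe is on PATH, as in the run script):
#   stream_c.exe | tee stream.log
#   stream_best_rates stream.log
```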

Note: The above build and run steps were tested with STREAM-5.10 and AOCC-5.0.0 on Red Hat Enterprise Linux release 8.9 (Ootpa) using Spack v0.23.0.dev0 (commit id: 2da812cbad).

For technical support on the tools, benchmarks and applications that AMD offers on this page and related inquiries, reach out to us at toolchainsupport@amd.com