STREAM Benchmark

Introduction

The STREAM benchmark is a simple, synthetic benchmark program that measures sustainable main memory bandwidth in MB/s and the corresponding computation rate for simple vector kernels.

The general rule for running STREAM is that each array must be at least 4x the size of the sum of all the last-level caches used in the run, or 1 million elements, whichever is larger.
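The sizing rule above reduces to a short calculation. The following is a minimal sketch and not part of STREAM or Spack: the function name `min_stream_elems` and its arguments are hypothetical, and it assumes the combined last-level cache size is given in MiB.

```shell
# min_stream_elems <total_llc_MiB> <bytes_per_element>
# Prints the smallest array length (in elements) that satisfies the rule:
# at least 4x the combined last-level cache, and never below 1 million elements.
min_stream_elems() {
    local llc_mib=$1 elem_bytes=$2
    local by_cache=$(( 4 * llc_mib * 1024 * 1024 / elem_bytes ))
    local floor=1000000
    if [ "$by_cache" -gt "$floor" ]; then
        echo "$by_cache"
    else
        echo "$floor"
    fi
}

# Assumed example: 1024 MiB of combined L3 and 8-byte (double) elements
min_stream_elems 1024 8   # prints 536870912
```

Any value at or above the printed minimum satisfies the rule; larger arrays simply lengthen the run.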

STREAM uses four kernels for analysis:

"Copy" measures transfer rates in the absence of arithmetic.
"Scale" adds a simple arithmetic operation.
"Sum" (reported as "Add" in the benchmark output) adds a third operand to allow multiple load/store ports on vector machines to be tested.
"Triad" allows chained/overlapped/fused multiply/add operations.
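To make the reported rates easier to interpret, the sketch below reproduces the byte-counting convention used by the reference stream.c: Copy and Scale touch two arrays per element, Sum and Triad touch three, and the rate is bytes moved divided by the best time. The function name `stream_rate_mbs` is hypothetical, and the timing value in the example is a placeholder, not a measurement.

```shell
# stream_rate_mbs <kernel> <array_elems> <bytes_per_elem> <best_time_seconds>
# Prints the bandwidth in MB/s (1 MB = 10^6 bytes).
stream_rate_mbs() {
    local arrays
    case $1 in
        copy|scale) arrays=2 ;;   # one array read, one array written
        sum|triad)  arrays=3 ;;   # two arrays read, one array written
        *) echo "unknown kernel: $1" >&2; return 1 ;;
    esac
    awk -v n="$2" -v b="$3" -v t="$4" -v a="$arrays" \
        'BEGIN { printf "%.1f\n", 1.0e-6 * a * b * n / t }'
}

# Placeholder timing: 650,000,000 doubles, Triad, best time of 0.05 s
stream_rate_mbs triad 650000000 8 0.05   # prints 312000.0
```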

Official Website: https://www.cs.virginia.edu/stream/

Build STREAM using Spack

Please refer to this link for getting started with Spack using AMD Zen Software Studio.

    # Build STREAM with double datatype, 5.2 GB arrays and 100 iterations
    # for a dual-socket AMD 5th Gen EPYC™ Processor with 1 GB of cumulative L3 cache
    $ spack install stream +openmp %aocc stream_array_size=650000000 ntimes=100 stream_type=double

    # Build STREAM with double datatype, 3.4 GB arrays and 100 iterations
    # for a dual-socket AMD 4th Gen EPYC™ Processor with 768 MB of cumulative L3 cache
    $ spack install stream +openmp %aocc stream_array_size=430080000 ntimes=100 stream_type=double

Explanation of the command options:

Symbol	Meaning
+openmp	Enable building with OpenMP.
%aocc	Build STREAM using AOCC compiler.
stream_array_size	Array size for the STREAM benchmark. General recommendation is that “STREAM_ARRAY_SIZE” must be at least 4x the size of the sum of all the last-level caches in the system.
ntimes	Iteration count to run each kernel in STREAM.
stream_type	Data type of the STREAM arrays: float (single precision) or double (double precision).
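Before building with a large stream_array_size, it can help to confirm that the arrays fit in memory. The sketch below assumes the reference stream.c layout of three work arrays (a, b, c) and ignores any padding. The function name `stream_footprint_gb` is hypothetical.

```shell
# stream_footprint_gb <array_elems> <bytes_per_elem>
# Prints "<per-array GB> <total GB for three arrays>" (1 GB = 10^9 bytes).
stream_footprint_gb() {
    awk -v n="$1" -v b="$2" \
        'BEGIN { per = n * b / 1e9; printf "%.1f %.1f\n", per, 3 * per }'
}

# stream_array_size=650000000 with stream_type=double
stream_footprint_gb 650000000 8   # prints 5.2 15.6
```

The total can be compared against the MemAvailable value in /proc/meminfo on the target machine.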

Running STREAM

The STREAM benchmark is used to measure the memory bandwidth of a system. On AMD EPYC™ processors, the maximum memory bandwidth is typically achieved by underpopulating each core complex (CCX, a group of cores that share an L3 cache) with threads. One, two, or four threads per CCX is typically optimal, depending on the processor architecture, and SMT (hardware multithreading) should be switched off. The required OpenMP core layout therefore depends on the AMD EPYC™ SKU in use. For example, 1st and 2nd Generation AMD EPYC™ CPUs (Naples and Rome) have up to 4 cores per L3 cache, while most 3rd and 4th Generation AMD EPYC™ CPUs have 8 cores per L3 cache. Some AMD EPYC™ Frequency Optimized SKUs, such as the 7xF3 family, have fewer cores per L3 cache.
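The number of cores per L3 cache and the SMT state can usually be read from Linux sysfs. The following is a sketch under stated assumptions: it assumes cache index3 is the L3 cache (typical on x86, but not guaranteed) and that the kernel exposes /sys/devices/system/cpu/smt/active. The helper name `count_cpulist` is hypothetical. With SMT enabled, the count includes sibling hardware threads, not only physical cores.

```shell
# count_cpulist <cpulist>
# Counts the logical CPUs in a Linux cpulist string such as "0-7,96-103".
count_cpulist() {
    local total=0 range lo hi
    for range in $(printf '%s' "$1" | tr ',' ' '); do
        lo=${range%-*}    # "0-7" -> 0, "5" -> 5
        hi=${range#*-}    # "0-7" -> 7, "5" -> 5
        total=$(( total + hi - lo + 1 ))
    done
    echo "$total"
}

l3=/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
smt=/sys/devices/system/cpu/smt/active
if [ -r "$l3" ]; then
    echo "Logical CPUs sharing an L3 with cpu0: $(count_cpulist "$(cat "$l3")")"
fi
if [ -r "$smt" ]; then
    echo "SMT active (1 = on, 0 = off): $(cat "$smt")"
fi
```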

The table below gives the recommended OMP_PLACES value for each threads/CCX configuration, depending on the cores/CCX configuration of the processor.

Cores per CCX (L3 Cache)	Total Cores	OMP_PLACES for 1 thread/CCX	OMP_PLACES for 2 threads/CCX	OMP_PLACES for 4 threads/CCX	Example AMD CPU SKU
1	8	0:8:1	N/A	N/A	AMD EPYC™ 72F3
2	16	0:8:2	0:16:1	N/A	AMD EPYC™ 73F3
3	24	0:8:3	0:8:3,1:8:3	N/A	AMD EPYC™ 74F3
4	32	0:8:4	0:16:2	0:32:1	AMD EPYC™ 75F3, AMD EPYC™ 7371
8	64	0:8:8	0:16:4	0:32:2	AMD EPYC™ 7763
8	96	0:12:8	0:24:4	0:48:2	AMD EPYC™ 9654
8	128	0:16:8	0:32:4	0:64:2	AMD EPYC™ 9755
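The pattern in the table can be generated rather than looked up. The sketch below is illustrative only: the function name `omp_places` is hypothetical, and it assumes SMT is off and cores are numbered contiguously within each CCX, as the table does. For multi-socket systems, pass the total core count across all sockets.

```shell
# omp_places <total_cores> <cores_per_ccx> <threads_per_ccx>
# Prints an OMP_PLACES value in start:count:stride form, or N/A.
omp_places() {
    local total=$1 cpc=$2 tpc=$3
    local nccx=$(( total / cpc ))
    if [ "$tpc" -gt "$cpc" ]; then
        echo "N/A"                      # more threads than cores in a CCX
        return 0
    fi
    if [ $(( cpc % tpc )) -eq 0 ]; then
        # Even split: a single strided interval covers every CCX
        echo "0:$(( nccx * tpc )):$(( cpc / tpc ))"
    else
        # Uneven split (e.g. 2 threads on 3 cores): one interval per thread slot
        local k=0 out=""
        while [ "$k" -lt "$tpc" ]; do
            out="${out:+$out,}$k:$nccx:$cpc"
            k=$(( k + 1 ))
        done
        echo "$out"
    fi
}

omp_places 96 8 4    # prints 0:48:2
omp_places 24 3 2    # prints 0:8:3,1:8:3
```

The matching OMP_NUM_THREADS is the number of places, that is, the number of CCXs multiplied by the threads per CCX.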

Run Script for AMD EPYC™ Processors

    #!/bin/bash
    # Load the STREAM build compiled with AOCC into the environment.
    # NOTE: if you have compiled multiple versions you may need to be more specific.
    # Spack will complain if your request is ambiguous and could refer to multiple
    # packages. (https://spack.readthedocs.io/en/latest/basic_usage.html#ambiguous-specs)
    spack load stream %aocc

    # For optimal STREAM performance, it is recommended to set the following OS parameters (requires root/sudo access)
    echo always > /sys/kernel/mm/transparent_hugepage/enabled    # Enable transparent hugepages
    echo always > /sys/kernel/mm/transparent_hugepage/defrag     # Defragment memory as needed to allocate hugepages
    echo 3 > /proc/sys/vm/drop_caches                            # Clear caches to maximize available RAM
    echo 1 > /proc/sys/vm/compact_memory                         # Rearrange RAM usage to maximize the size of free blocks

    # Optimize OpenMP performance behavior
    export OMP_SCHEDULE=static  # Disable dynamic loop scheduling
    export OMP_PROC_BIND=TRUE   # Bind threads to specific resources
    export OMP_DYNAMIC=false    # Disable dynamic thread pool sizing

    # OMP_PLACES is used for binding OpenMP threads to cores
    # See: https://www.openmp.org/spec-html/5.0/openmpse53.html
    # Keep only the block below that matches your system; if both are left in,
    # the later exports override the earlier ones.

    ############# FOR AMD EPYC™ 9654 ##################
    # For example, a dual-socket AMD 4th Gen EPYC™ Processor with 192 (96x2) cores,
    # with 4 threads per L3 cache: 96 total places, stride by 2 cores:
    export OMP_PLACES=0:96:2
    export OMP_NUM_THREADS=96

    ############# FOR AMD EPYC™ 9755 ##################
    # For example, a dual-socket AMD 5th Gen EPYC™ Processor with 256 (128x2) cores,
    # with 1 thread per L3 cache: 32 total places, stride by 8 cores:
    export OMP_PLACES=0:32:8
    export OMP_NUM_THREADS=32

    # Run STREAM
    stream_c.exe
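When comparing OMP_PLACES settings, it is convenient to pull a single figure out of each run. The sketch below assumes the results table printed by the reference stream.c, where each kernel line starts with its name followed by the "Best Rate MB/s" column. The helper name `best_rate` is hypothetical, and the sample lines use made-up placeholder numbers rather than measurements.

```shell
# best_rate <kernel>
# Reads STREAM output on stdin and prints the Best Rate MB/s for one kernel
# (Copy, Scale, Add or Triad).
best_rate() {
    awk -v k="$1:" '$1 == k { print $2 }'
}

# Placeholder lines in the layout STREAM prints; the numbers are made up.
sample='Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            1000.0     0.010000     0.010000     0.010000
Triad:           2000.0     0.020000     0.020000     0.020000'

printf '%s\n' "$sample" | best_rate Triad   # prints 2000.0
```

In practice the output of a real run can be saved with `stream_c.exe | tee stream.log` and then queried with `best_rate Triad < stream.log`.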

Note: The above build and run steps were tested with STREAM-5.10 and AOCC-5.0.0 on Red Hat Enterprise Linux release 8.9 (Ootpa) using Spack v0.23.0.dev0 (commit id: 2da812cbad).

For technical support on the tools, benchmarks and applications that AMD offers on this page and related inquiries, reach out to us at toolchainsupport@amd.com.
