Example perf: stream
Build stream benchmark
module purge
module add compiler/gnu
gcc -std=c11 -Ofast -march=native -fopenmp \
    -g \
    -o stream stream.OpenMP.c
Set up perftools and OpenMP environment
module add devel/perf
export OMP_NUM_THREADS=76
export OMP_DISPLAY_ENV=VERBOSE
export OMP_PLACES=cores
Record performance data of benchmark stream for use with perf report and perf annotate
perf record ./stream -n 2500000000
OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP = '201511'
  OMP_DYNAMIC = 'FALSE'
  OMP_NESTED = 'FALSE'
  OMP_NUM_THREADS = '76'
  OMP_SCHEDULE = 'DYNAMIC'
  OMP_PROC_BIND = 'TRUE'
  OMP_PLACES = '{0,76},{1,77},{2,78},{3,79},{4,80},{5,81},{6,82},{7,83},{8,84},{9,85},{10,86},{11,87},{12,88},{13,89},{14,90},{15,91},{16,92},{17,93},{18,94},{19,95},{20,96},{21,97},{22,98},{23,99},{24,100},{25,101},{26,102},{27,103},{28,104},{29,105},{30,106},{31,107},{32,108},{33,109},{34,110},{35,111},{36,112},{37,113},{38,114},{39,115},{40,116},{41,117},{42,118},{43,119},{44,120},{45,121},{46,122},{47,123},{48,124},{49,125},{50,126},{51,127},{52,128},{53,129},{54,130},{55,131},{56,132},{57,133},{58,134},{59,135},{60,136},{61,137},{62,138},{63,139},{64,140},{65,141},{66,142},{67,143},{68,144},{69,145},{70,146},{71,147},{72,148},{73,149},{74,150},{75,151}'
  OMP_STACKSIZE = '0'
  OMP_WAIT_POLICY = 'PASSIVE'
  OMP_THREAD_LIMIT = '4294967295'
  OMP_MAX_ACTIVE_LEVELS = '2147483647'
  OMP_CANCELLATION = 'FALSE'
  OMP_DEFAULT_DEVICE = '0'
  OMP_MAX_TASK_PRIORITY = '0'
  OMP_DISPLAY_AFFINITY = 'FALSE'
  OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A'
  GOMP_CPU_AFFINITY = ''
  GOMP_STACKSIZE = '0'
  GOMP_SPINCOUNT = '300000'
OPENMP DISPLAY ENVIRONMENT END
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 2499999936 (elements)
Memory per array = 19073.5 MiB (= 18.6 GiB).
Total memory required = 57220.5 MiB (= 55.9 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
OpenMP version (yyyymm): 201511
Number of Threads requested = 76
Number of Threads counted = 76
-------------------------------------------------------------
Your clock granularity appears to be 1000 ticks per microseconds.
Each test below will take on the order of 138796 microseconds.
   (= 138796910 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Med time     Min time     Max time
Copy:           294387.4     0.136475     0.135875     0.138270
Scale:          296457.2     0.135449     0.134927     0.142989
Add:            306837.1     0.195771     0.195544     0.196339
Triad:          305035.9     0.197035     0.196698     0.206095
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 91.847 MB perf.data (2407320 samples) ]
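The figures in the STREAM output can be checked by hand. The following is a minimal sketch, assuming this STREAM variant counts bytes like the reference STREAM: two arrays for Copy and Scale, three arrays for Add and Triad, 8 bytes per element, and 1 MB = 10^6 bytes. The function names `stream_rate` and `array_mib` are hypothetical helpers introduced for illustration only.

```shell
#!/bin/sh
# Sketch: recompute STREAM figures from the printed output.
# Assumption: bytes are counted as in the reference STREAM
# (2 arrays for Copy/Scale, 3 for Add/Triad; 1 MB = 10^6 bytes).

# stream_rate ARRAYS ELEMENTS MIN_TIME_SECONDS -> bandwidth in MB/s
stream_rate() {
    awk -v arrays="$1" -v n="$2" -v t="$3" \
        'BEGIN { printf "%.1f\n", arrays * 8 * n / t / 1e6 }'
}

# array_mib ELEMENTS -> memory per array in MiB (8 bytes per element)
array_mib() {
    awk -v n="$1" 'BEGIN { printf "%.1f\n", n * 8 / (1024 * 1024) }'
}

array_mib 2499999936                 # memory per array
stream_rate 2 2499999936 0.135875    # Copy
stream_rate 3 2499999936 0.196698    # Triad
```

This prints 19073.5, 294388.2 and 305036.1. The first value matches the reported memory per array. The two rates differ from the reported 294387.4 and 305035.9 only in the last digits, presumably because the printed minimum times are rounded to six decimal places.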
Create performance report
perf report
27.13%  stream  stream        [.] tuned_STREAM_Triad._omp_fn.0
27.09%  stream  stream        [.] tuned_STREAM_Add._omp_fn.0
19.45%  stream  libc-2.28.so  [.] __memcpy_avx_unaligned_erms
18.52%  stream  stream        [.] tuned_STREAM_Scale._omp_fn.0
...
- Interactive navigation in performance report:
  - h: get help
  - a: jump to annotated assembler code
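For scripted use, the report can also be filtered as plain text. The sketch below assumes that the text layout of `perf report --stdio` matches the lines shown above (overhead in the first column, symbol in the last). `hot_symbols` is a hypothetical helper, not part of perf.

```shell
#!/bin/sh
# hot_symbols MIN_PERCENT: read report lines on stdin and print overhead and
# symbol for entries at or above MIN_PERCENT. Hypothetical helper, not part of perf.
hot_symbols() {
    awk -v min="$1" '
        $1 ~ /^[0-9.]+%$/ {
            pct = $1
            sub(/%/, "", pct)                 # strip the percent sign
            if (pct + 0 >= min + 0) print $1, $NF
        }'
}

# Sample input: the four lines from the report above.
report='27.13% stream stream [.] tuned_STREAM_Triad._omp_fn.0
27.09% stream stream [.] tuned_STREAM_Add._omp_fn.0
19.45% stream libc-2.28.so [.] __memcpy_avx_unaligned_erms
18.52% stream stream [.] tuned_STREAM_Scale._omp_fn.0'

printf '%s\n' "$report" | hot_symbols 20
```

With a threshold of 20, only the Triad and Add kernels remain. On a real profile the same filter could be fed from `perf report --stdio | hot_symbols 20`; comment lines and headers are skipped because they do not start with a percentage.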
Create report with annotated source
perf annotate --source --disassembler-style=intel
...
// Instructs the compiler to use non-temporal (that is, streaming) stores
#pragma vector nontemporal
#endif
#pragma omp simd aligned (a, b, c : alignment_bytes)
for (long int j = 0; j < STREAM_ARRAY_SIZE_thread; j++)
    a[j] = b[j] + scalar * c[j];
...
 48.54 │50:   vmovupd     ymm1,YMMWORD PTR [rsi+rax*1]
 48.71 │      vfmadd213pd ymm1,ymm2,YMMWORD PTR [rcx+rax*1]
  1.33 │      vmovupd     YMMWORD PTR [rdx+rax*1],ymm1
  0.00 │      add         rax,0x20
       │      cmp         rax,r8
  1.42 │    ↑ jne         50
...
Source file location: <...>/stream.OpenMP.c:387
...
- Interactive navigation in annotated assembler code:
  - h: get help
  - H: jump to the hottest instruction (the one with the most samples)
  - k: toggle line number view on/off
  - s: toggle source code view on/off
- Assembler code:
  - vfmadd: vector fused multiply-add
  - vmovupd: vector move of unaligned packed double-precision floating-point values
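The annotated loop also allows a rough estimate of the floating-point rate. Each iteration advances by 0x20 = 32 bytes, that is four doubles per array, with two loads, one fused multiply-add and one store. Per element this is 2 floating-point operations for 24 bytes of memory traffic. The sketch below applies that ratio to the Triad rate from the STREAM output. `triad_mflops` is a hypothetical helper, and the byte count follows STREAM's convention of two reads and one write per element.

```shell
#!/bin/sh
# triad_mflops RATE_MB_PER_S: floating-point rate implied by a Triad bandwidth.
# Per element: 2 flops (one multiply, one add) and
# 24 bytes (two 8-byte loads, one 8-byte store).
triad_mflops() {
    awk -v rate="$1" 'BEGIN { printf "%.1f\n", rate * 2 / 24 }'
}

triad_mflops 305035.9    # Triad rate in MB/s from the STREAM output
```

This prints 25419.7, so about 25 GFLOP/s in total across the 76 threads. The low ratio of 2 flops per 24 bytes is consistent with a kernel that is limited by memory bandwidth, which would also explain why nearly all samples fall on the two instructions that read from memory.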
Why does this example use the GNU compiler instead of the Intel compiler?
- The Intel compiler inlines all functions, so all time is attributed to the main function.
- The Intel compiler optimizes the assembler code heavily, which makes it difficult to match assembler instructions to C source code lines.