Example perf: stream
- Build stream benchmark
    module purge
    module add compiler/gnu/8
    gcc -std=c11 -Ofast -march=native -flto -fopenmp \
        -g \
        stream.c -o stream
- Set up OpenMP environment
    export OMP_NUM_THREADS=20
    export OMP_DISPLAY_ENV=VERBOSE
    export OMP_PLACES=cores
- Record performance data of benchmark stream for use with perf report and perf annotate
    perf record ./stream -n 2500000000
    OPENMP DISPLAY ENVIRONMENT BEGIN
      _OPENMP = '201511'
      OMP_DYNAMIC = 'FALSE'
      OMP_NESTED = 'FALSE'
      OMP_NUM_THREADS = '20'
      OMP_SCHEDULE = 'DYNAMIC'
      OMP_PROC_BIND = 'TRUE'
      OMP_PLACES = '{0,20},{1,21},{2,22},{3,23},{4,24},{5,25},{6,26},{7,27},{8,28},{9,29},{10,30},{11,31},{12,32},{13,33},{14,34},{15,35},{16,36},{17,37},{18,38},{19,39}'
      OMP_STACKSIZE = '0'
      OMP_WAIT_POLICY = 'PASSIVE'
      OMP_THREAD_LIMIT = '4294967295'
      OMP_MAX_ACTIVE_LEVELS = '2147483647'
      OMP_CANCELLATION = 'FALSE'
      OMP_DEFAULT_DEVICE = '0'
      OMP_MAX_TASK_PRIORITY = '0'
      GOMP_CPU_AFFINITY = ''
      GOMP_STACKSIZE = '0'
      GOMP_SPINCOUNT = '300000'
    OPENMP DISPLAY ENVIRONMENT END
    -------------------------------------------------------------
    STREAM version $Revision: 5.10 $
    -------------------------------------------------------------
    This system uses 8 bytes per array element.
    -------------------------------------------------------------
    Array size = 2500000000 (elements) (elements)
    Memory per array = 19073.5 MiB (= 18.6 GiB).
    Total memory required = 57220.5 MiB (= 55.9 GiB).
    Each kernel will be executed 10 times.
     The *best* time for each kernel (excluding the first iteration)
     will be used to compute the reported bandwidth.
    -------------------------------------------------------------
    Number of Threads requested = 20
    Number of Threads counted = 20
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 1 microseconds.
    Each test below will take on the order of 386554 microseconds.
       (= 386554 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function    Best Rate MB/s  Med time     Min time     Max time
    Copy:           99789.3     0.484283     0.400845     0.519110
    Scale:          74203.0     0.608462     0.539061     0.636535
    Add:            77438.6     0.790049     0.774808     0.801701
    Triad:          79069.4     0.788340     0.758827     0.814521
    -------------------------------------------------------------
    Solution Validates: avg error less than 1.000000e-13 on all three arrays
    -------------------------------------------------------------
    [ perf record: Woken up 85 times to write data ]
    [ perf record: Captured and wrote 94.379 MB perf.data (2467598 samples) ]
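The rates in the STREAM table can be cross-checked from the minimum times. The following is a minimal sketch, assuming the usual STREAM accounting (Copy and Scale move two arrays per iteration, Add and Triad move three, and 1 MB = 10^6 bytes). The function name `stream_rate_mbs` is made up for this illustration and is not part of the benchmark.

```shell
#!/bin/sh
# stream_rate_mbs ARRAYS ELEMENTS MIN_TIME_S   (hypothetical helper)
# Prints the bandwidth in MB/s (1 MB = 10^6 bytes) for a kernel that
# touches ARRAYS arrays of ELEMENTS 8-byte values in MIN_TIME_S seconds.
stream_rate_mbs() {
    awk -v arrays="$1" -v n="$2" -v t="$3" \
        'BEGIN { printf "%.1f\n", arrays * 8 * n / t / 1e6 }'
}

# Triad reads b and c and writes a: three arrays.
stream_rate_mbs 3 2500000000 0.758827
# Copy reads one array and writes one: two arrays.
stream_rate_mbs 2 2500000000 0.400845
```

The results should agree with the "Best Rate" column up to the rounding of the printed minimum times.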
- Create performance report
    perf report
    23.90%  stream  stream        [.] tuned_STREAM_Triad._omp_fn.14
    23.78%  stream  stream        [.] tuned_STREAM_Add._omp_fn.13
    17.90%  stream  stream        [.] tuned_STREAM_Scale._omp_fn.12
    13.40%  stream  libc-2.17.so  [.] __memcpy_ssse3
    ...
- Interactive navigation in performance report:
  - h: get help
  - a: jump to annotated assembler code
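For scripted use, the report can also be processed as text. `perf report --stdio` prints the report to standard output instead of opening the interactive browser. The following is a minimal sketch of a filter for such output, assuming lines of the shape shown in the report above (overhead percentage in the first column, symbol in the last). The helper name `kernel_share` is hypothetical, and the sample lines are copied from the report above.

```shell
#!/bin/sh
# kernel_share PATTERN   (hypothetical helper)
# Reads perf-report-style lines on stdin and prints the summed overhead
# (in percent) of all lines whose symbol (last field) matches PATTERN.
kernel_share() {
    awk -v pat="$1" '
        $1 ~ /%$/ && $NF ~ pat { sub(/%/, "", $1); total += $1 }
        END { printf "%.2f\n", total }'
}

# Sample lines copied from the report above.
kernel_share 'tuned_STREAM_' <<'EOF'
23.90%  stream  stream        [.] tuned_STREAM_Triad._omp_fn.14
23.78%  stream  stream        [.] tuned_STREAM_Add._omp_fn.13
17.90%  stream  stream        [.] tuned_STREAM_Scale._omp_fn.12
13.40%  stream  libc-2.17.so  [.] __memcpy_ssse3
EOF
```

For the sample lines this prints 65.58, the combined share of the Triad, Add and Scale kernels. With a real `perf.data` file, the same filter could be fed from `perf report --stdio`.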
- Create report with annotated source
    perf annotate
    ...
          │592   // Instructs the compiler to use non-temporal (that is, streaming) stores
          │593   #pragma vector nontemporal
          │594   #endif
          │595   #pragma omp simd aligned (a, b, c : alignment_bytes)
          │596   for (long int j = 0; j < STREAM_ARRAY_SIZE_thread; j++)
          │597     a[j] = b[j] + scalar * c[j];
    ...
     0.49 │ e0:┌─→vmovup ymm0,YMMWORD PTR [r11+rax*1]
    44.81 │    │  vfmadd ymm0,ymm2,YMMWORD PTR [r13+rax*1+0x0]
    49.67 │    │  add    rcx,0x1
     1.79 │    │  vmovup YMMWORD PTR [rdx+rax*1],ymm0
     2.82 │    │  add    rax,0x20
     0.42 │    │  cmp    rcx,r12
     0.00 │    └──jb     e0
    ...
- Interactive navigation in annotated assembler code:
  - h: get help
  - H: jump to the hottest place (the instruction with the most samples)
  - k: toggle line number view on/off
  - s: toggle source code view on/off
- Assembler code:
  - vfmadd: Vector fused-multiply-add
  - vmovupd: Vector move unaligned packed double-precision floating-point values
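The vector width can be read off the annotated loop: `add rax,0x20` advances the array offset by 0x20 bytes per iteration, which is the size of one 256-bit ymm register. A minimal sketch of the arithmetic, assuming 8-byte elements as reported by STREAM; the helper name `doubles_per_iteration` is made up for this illustration.

```shell
#!/bin/sh
# doubles_per_iteration STRIDE_BYTES   (hypothetical helper)
# Number of 8-byte doubles covered by one loop iteration that advances
# the array offset by STRIDE_BYTES (decimal or 0x-prefixed hex).
doubles_per_iteration() {
    echo $(( $1 / 8 ))
}

# The annotated loop advances rax by 0x20 bytes per iteration.
doubles_per_iteration 0x20
```

This prints 4, so each pass through the loop computes four elements of `a[j] = b[j] + scalar * c[j]`.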
Why does this example use the GNU compiler instead of the Intel compiler?
- The Intel compiler inlines all functions, so all time is attributed to the main function.
- The Intel compiler optimizes the assembler code heavily, which makes it difficult to match assembler instructions to C source lines.
Last modified on Apr 5, 2019, 1:25:56 PM