Example /usr/bin/time: stream

Build benchmark

module add compiler/intel/19.1
icc -std=c11 -Ofast -xHost -ipo -qopenmp \
    stream.c -o stream

Serial execution

export OMP_NUM_THREADS=1
/usr/bin/time ./stream -n 1000000000
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 1000000000 (elements)
Memory per array = 7629.4 MiB (= 7.5 GiB).
Total memory required = 22888.2 MiB (= 22.4 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
OpenMP version (yyyymm): 201611
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity appears to be 1000 ticks per microseconds.
Each test below will take on the order of 696360 microseconds.
   (= 696360085 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Med time     Min time     Max time
Copy:           21819.6     0.734463     0.733285     0.735423
Scale:          21213.1     0.755385     0.754252     0.757456
Add:            18866.7     1.272457     1.272085     1.274740
Triad:          18830.8     1.275592     1.274509     1.280414
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
48.67user 2.92system 0:51.69elapsed 99%CPU (0avgtext+0avgdata 23441340maxresident)k
2048inputs+0outputs (11major+52234minor)pagefaults 0swaps

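What causes the high system time? Memory page allocation and zeroing: the kernel has to allocate and clear every page of the 22.4 GiB of arrays when it is first touched.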

Relation between user, system, and elapsed time:

  User time        48.67 seconds
+ Sys time          2.92 seconds
= CPU time         51.59 seconds
~ Elapsed time     51.69 seconds (0:51.69 = 0 minutes 51.69 seconds)
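
The sum above can also be scripted. The following is a minimal sketch, assuming GNU time (its -f option with the %U, %S and %e specifiers, and -o for an output file) and awk are available; the helper name cpu_seconds and the file name time.log are hypothetical.

```shell
# Hypothetical helper: add user and system time (both given in seconds).
cpu_seconds() {
    LC_ALL=C awk -v u="$1" -v s="$2" 'BEGIN { printf "%.2f\n", u + s }'
}

# Values taken from the serial run above.
cpu_seconds 48.67 2.92    # prints 51.59

# With GNU time the values could be captured directly, for example:
#   /usr/bin/time -f "%U %S %e" -o time.log ./stream -n 1000000000
#   read -r user sys elapsed < time.log
#   cpu_seconds "$user" "$sys"
```

For a single-threaded run the result should stay slightly below the elapsed time, as it does here.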

Relation between vector size and maxresident:

  Number of arrays    3 vectors (a, b, c)
* Size / array        1,000,000,000 elements / vector
* Size / element      8 bytes / double
= 24,000,000,000 bytes
= 23,437,500 kbytes (1 kbyte = 1024 bytes)
~ maxresident: 23,441,340 kbytes
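
The same estimate as a small shell sketch, assuming a shell with 64-bit integer arithmetic (the intermediate product exceeds 32 bits); the helper name expected_kib is hypothetical.

```shell
# Hypothetical helper: expected footprint in kbytes (1024 bytes) for
# <arrays> arrays of <elements> elements with <bytes> bytes per element.
expected_kib() {
    echo $(( $1 * $2 * $3 / 1024 ))
}

expected_kib 3 1000000000 8                              # prints 23437500

# Difference between the reported maxresident and the estimate.
echo $(( 23441340 - $(expected_kib 3 1000000000 8) ))    # prints 3840
```

The remaining 3,840 kbytes are presumably the program itself, its stack and the runtime libraries.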

Parallel execution

export OMP_NUM_THREADS=76
export KMP_AFFINITY="verbose,granularity=core,respect,scatter"
/usr/bin/time ./stream -n 1000000000
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0-151
OMP: Info #214: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #156: KMP_AFFINITY: 152 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #285: KMP_AFFINITY: topology layer "LL cache" is equivalent to "socket".
OMP: Info #285: KMP_AFFINITY: topology layer "L3 cache" is equivalent to "socket".
OMP: Info #285: KMP_AFFINITY: topology layer "L2 cache" is equivalent to "core".
OMP: Info #285: KMP_AFFINITY: topology layer "L1 cache" is equivalent to "core".
OMP: Info #191: KMP_AFFINITY: 2 sockets x 38 cores/socket x 2 threads/core (76 total cores)
OMP: Info #216: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to socket 0 core 0 thread 0 
...
OMP: Info #171: KMP_AFFINITY: OS proc 151 maps to socket 1 core 37 thread 1 
OMP: Info #144: KMP_AFFINITY: Threads may migrate across 1 innermost levels of machine
OMP: Info #252: KMP_AFFINITY: pid 4093998 tid 4093998 thread 0 bound to OS proc set 0,76
...
OMP: Info #252: KMP_AFFINITY: pid 4093998 tid 4094074 thread 75 bound to OS proc set 75,151
...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 999999944 (elements)
Memory per array = 7629.4 MiB (= 7.5 GiB).
Total memory required = 22888.2 MiB (= 22.4 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
OpenMP version (yyyymm): 201611
Number of Threads requested = 76
Number of Threads counted = 76
-------------------------------------------------------------
Your clock granularity appears to be 1000 ticks per microseconds.
Each test below will take on the order of 55652 microseconds.
   (= 55652783 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Med time     Min time     Max time
Copy:          315994.4     0.050745     0.050634     0.052998
Scale:         313958.3     0.051009     0.050962     0.052837
Add:           319920.9     0.075328     0.075019     0.075730
Triad:         318699.6     0.075450     0.075306     0.075544
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
227.16user 15.61system 0:03.22elapsed 7535%CPU (0avgtext+0avgdata 23451024maxresident)k
536inputs+0outputs (1major+116483minor)pagefaults 0swaps

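Relation between user, system, and elapsed time:

  User time       227.16 seconds
+ Sys time         15.61 seconds
= CPU time        242.77 seconds
/ CPU usage        75.35 (7535 %CPU / 100)
= Elapsed time      3.22 seconds (0:03.22 = 0 minutes 3.22 seconds)

On average about 75 of the 76 threads were busy during the run.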
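
The division above as a short shell sketch, assuming awk is available; the helper name elapsed_from_cpu is hypothetical.

```shell
# Hypothetical helper: estimate elapsed seconds from user time, system time
# and the %CPU value reported by /usr/bin/time.
elapsed_from_cpu() {
    LC_ALL=C awk -v u="$1" -v s="$2" -v p="$3" \
        'BEGIN { printf "%.2f\n", (u + s) / (p / 100) }'
}

# Values taken from the parallel run above.
elapsed_from_cpu 227.16 15.61 7535    # prints 3.22
```

Because /usr/bin/time prints rounded values, recomputing %CPU from the printed times may differ from the reported figure in the last digits.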