Example /usr/bin/time: stream

Build benchmark

module add compiler/intel/18.0
icc -std=c11 -Ofast -xHost -ipo -qopenmp \
    stream.c -o stream

Serial execution

export OMP_NUM_THREADS=1
/usr/bin/time ./stream -n 1000000000

Output:

-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 1000000000 (elements) (elements)
Memory per array = 7629.4 MiB (= 7.5 GiB).
Total memory required = 22888.2 MiB (= 22.4 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 4183426 microseconds.
   (= 4183426 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           19025.2     1.088974     0.840990     1.476501
Scale:          19947.5     1.110411     0.802105     1.578164
Add:            16278.6     1.734772     1.474325     2.261711
Triad:          15381.1     1.733459     1.560358     1.955591
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
56.73user 36.21system 1:32.95elapsed 99%CPU (0avgtext+0avgdata 23439444maxresident)k
0inputs+0outputs (0major+19348593minor)pagefaults 0swaps

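What causes the high system time? The kernel has to allocate and wipe (zero) the memory pages for the arrays, which also shows up as the large number of minor page faults.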
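
As a rough plausibility check, the system time can be divided by the number of minor page faults reported by /usr/bin/time. The sketch below assumes, for illustration only, that all system time was spent handling those faults. The result is therefore an upper bound per fault, not a measurement.

```shell
# Values copied from the /usr/bin/time output of the serial run.
sys_seconds=36.21
minor_faults=19348593

# Assumption: all system time is attributed to minor page faults.
us_per_fault=$(awk -v s="$sys_seconds" -v f="$minor_faults" \
    'BEGIN { printf "%.2f", s / f * 1e6 }')

echo "at most ${us_per_fault} microseconds of system time per minor page fault"
```

With the numbers above this gives about 1.87 microseconds per fault.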

Relation between user, sys and elapsed time:

User time 56.73 seconds
+ Sys time 36.21 seconds
= 92.94 seconds
= 1:32.94 minutes
~ Elapsed time 1:32.95 minutes
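
The same sum can be reproduced on the command line. This is a minimal sketch that uses awk for the floating-point arithmetic; the variable names are chosen for illustration and the values are copied from the serial run.

```shell
# Values copied from the /usr/bin/time line of the serial run.
user=56.73
sys=36.21

# Total CPU time in seconds (user + sys).
cpu=$(awk -v u="$user" -v s="$sys" 'BEGIN { printf "%.2f", u + s }')

# Convert seconds to m:ss.ss, the layout /usr/bin/time uses for elapsed time.
cpu_mmss=$(awk -v t="$cpu" \
    'BEGIN { m = int(t / 60); printf "%d:%05.2f", m, t - m * 60 }')

echo "CPU time: ${cpu} s = ${cpu_mmss} (elapsed: 1:32.95)"
```

For a single-threaded run, CPU time and elapsed time should be nearly equal, which matches the reported 99%CPU.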

Relation between vector size and maxresident:

Number of arrays 3 (vectors a, b, c)
* Size / array 1000000000 elements / vector
* Size / element 8 bytes / double
= 24000000000 bytes
= 23437500 kbytes
~ maxresident: 23439444 kbytes
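
The expected memory footprint can be checked with integer shell arithmetic. This sketch assumes a shell with 64-bit arithmetic (e.g. bash) and treats 1 kbyte as 1024 bytes, as /usr/bin/time does for maxresident. The variable names are illustrative.

```shell
# Problem size of the STREAM run (three double-precision vectors a, b, c).
arrays=3
elements=1000000000
bytes_per_element=8

total_bytes=$((arrays * elements * bytes_per_element))
total_kib=$((total_bytes / 1024))

# maxresident as reported by /usr/bin/time for the serial run.
maxresident=23439444
overhead_kib=$((maxresident - total_kib))

echo "arrays: ${total_kib} kbytes, maxresident: ${maxresident} kbytes"
echo "difference: ${overhead_kib} kbytes"
```

The difference of roughly 2 MiB is whatever else the process keeps resident besides the three arrays.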

Parallel execution

export OMP_NUM_THREADS=28
export KMP_AFFINITY="verbose,granularity=core,respect,scatter"
/usr/bin/time ./stream -n 1000000000

Output:

OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #156: KMP_AFFINITY: 56 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 14 cores/pkg x 2 threads/core (28 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 28 maps to package 0 core 0 thread 1 
...
OMP: Info #144: KMP_AFFINITY: Threads may migrate across 1 innermost levels of machine
OMP: Info #242: KMP_AFFINITY: pid 9090 tid 9090 thread 0 bound to OS proc set {0,28}
OMP: Info #242: KMP_AFFINITY: pid 9090 tid 9091 thread 1 bound to OS proc set {14,42}
...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 999999980 (elements) (elements)
Memory per array = 7629.4 MiB (= 7.5 GiB).
Total memory required = 22888.2 MiB (= 22.4 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 28
Number of Threads counted = 28
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 134990 microseconds.
   (= 134990 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          114364.0     0.204036     0.139904     0.236524
Scale:         108420.9     0.208714     0.147573     0.232585
Add:            94078.1     0.267538     0.255107     0.275008
Triad:          92456.1     0.271514     0.259583     0.316614
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
261.81user 77.71system 0:15.35elapsed 2211%CPU (0avgtext+0avgdata 23441696maxresident)k
0inputs+0outputs (0major+11917304minor)pagefaults 0swaps

Relation between user, sys and elapsed time:

User time 261.81 seconds
+ Sys time 77.71 seconds
= 339.52 seconds
/ 22.11 (= 2211 %CPU)
= 15.36 seconds
~ Elapsed time 0:15.35 minutes
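
The calculation for the parallel run can be sketched the same way: total CPU time divided by the CPU utilisation gives an estimate of the elapsed time. The values are copied from the /usr/bin/time output above and the variable names are illustrative. Because %CPU is printed as a whole number, the estimate matches the reported elapsed time only to within rounding.

```shell
# Values copied from the /usr/bin/time line of the parallel run.
user=261.81
sys=77.71
percent_cpu=2211

# Total CPU time summed over all threads.
cpu=$(awk -v u="$user" -v s="$sys" 'BEGIN { printf "%.2f", u + s }')

# Estimated wall-clock time: CPU time / (percent CPU / 100).
estimate=$(awk -v c="$cpu" -v p="$percent_cpu" \
    'BEGIN { printf "%.2f", c / (p / 100) }')

echo "CPU time: ${cpu} s, estimated elapsed: ${estimate} s (reported: 15.35 s)"
```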
Last modified on Apr 1, 2019, 4:42:19 PM