Example /usr/bin/time: stream
Build benchmark
module add compiler/intel/19.1
icc -std=c11 -Ofast -xHost -ipo -qopenmp \
-o stream stream.c
Serial execution
export OMP_NUM_THREADS=1
/usr/bin/time ./stream -n 1000000000
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 1000000000 (elements)
Memory per array = 7629.4 MiB (= 7.5 GiB).
Total memory required = 22888.2 MiB (= 22.4 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
OpenMP version (yyyymm): 201611
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity appears to be 1000 ticks per microseconds.
Each test below will take on the order of 696360 microseconds.
(= 696360085 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Med time Min time Max time
Copy: 21819.6 0.734463 0.733285 0.735423
Scale: 21213.1 0.755385 0.754252 0.757456
Add: 18866.7 1.272457 1.272085 1.274740
Triad: 18830.8 1.275592 1.274509 1.280414
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
48.67user 2.92system 0:51.69elapsed 99%CPU (0avgtext+0avgdata 23441340maxresident)k
2048inputs+0outputs (11major+52234minor)pagefaults 0swaps
What causes the high system time? Memory page allocation and wiping: the kernel has to allocate and zero every page of the 22.4 GiB of arrays when it is first touched.
Relation between user, sys and elapsed time:
  | User time | 48.67 seconds |
+ | Sys time | 2.92 seconds |
= | CPU time (user + sys) | 51.59 seconds |
~ | Elapsed time | 51.69 seconds (0:51.69, format min:sec) |
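The relation above can be checked directly from the summary line that /usr/bin/time prints. The following is a minimal sketch, assuming the default GNU time output format shown in the serial run; the helper name `parse_time_line` is hypothetical and not part of any tool.

```python
import re

# Summary line copied from the serial run above.
TIME_LINE = ("48.67user 2.92system 0:51.69elapsed 99%CPU "
             "(0avgtext+0avgdata 23441340maxresident)k")

def parse_time_line(line):
    """Return (user_s, sys_s, elapsed_s, cpu_percent) from a default-format line."""
    m = re.search(r"([\d.]+)user ([\d.]+)system ([\d:.]+)elapsed (\d+)%CPU", line)
    if m is None:
        raise ValueError("not a default-format /usr/bin/time summary line")
    # Elapsed time is printed as [hours:]minutes:seconds; fold it into seconds.
    elapsed_s = 0.0
    for part in m.group(3).split(":"):
        elapsed_s = elapsed_s * 60 + float(part)
    return float(m.group(1)), float(m.group(2)), elapsed_s, int(m.group(4))

user_s, sys_s, elapsed_s, cpu_percent = parse_time_line(TIME_LINE)
cpu_s = user_s + sys_s

print(f"user + sys = {cpu_s:.2f} s")
print(f"elapsed    = {elapsed_s:.2f} s")
print(f"CPU usage  = {100 * cpu_s / elapsed_s:.1f} % (time reports {cpu_percent} %)")
```

For a single-threaded run the sum of user and sys time should come out just below the elapsed time, which is what the reported 99 %CPU expresses.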
Relation between vector size and maxresident:
  | Number of arrays | 3 | vectors a, b, c |
* | Size / array | 1,000,000,000 | elements / vector |
* | Size / element | 8 | bytes / double |
= | Total size | 24,000,000,000 | bytes |
= | Total size | 23,437,500 | kbytes (1 kbyte = 1024 bytes) |
~ | maxresident | 23,441,340 | kbytes |
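The same arithmetic as a short sketch, using only the numbers from the serial run above. The variable names are illustrative, and the script assumes that maxresident is reported in units of 1024 bytes, as the table does.

```python
# Expected memory footprint of the three STREAM vectors a, b, c.
n_arrays = 3
n_elements = 1_000_000_000      # value passed via -n
bytes_per_element = 8           # "This system uses 8 bytes per array element."

total_bytes = n_arrays * n_elements * bytes_per_element
total_kib = total_bytes // 1024             # kbytes as used by /usr/bin/time
total_mib = total_bytes / 1024**2           # compare with STREAM's own report

maxresident_kib = 23_441_340                # from the /usr/bin/time summary line
overhead_kib = maxresident_kib - total_kib  # everything that is not a, b, c

print(f"arrays:      {total_kib:,} kbytes ({total_mib:.1f} MiB)")
print(f"maxresident: {maxresident_kib:,} kbytes")
print(f"overhead:    {overhead_kib:,} kbytes")
```

The MiB figure agrees with the "Total memory required = 22888.2 MiB" line printed by STREAM, and the remaining 3,840 kbytes are the part of the resident set not occupied by the three vectors (code, libraries, stack and similar).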
Parallel execution
export OMP_NUM_THREADS=76
export KMP_AFFINITY="verbose,granularity=core,respect,scatter"
/usr/bin/time ./stream -n 1000000000
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0-151
OMP: Info #214: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #156: KMP_AFFINITY: 152 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #285: KMP_AFFINITY: topology layer "LL cache" is equivalent to "socket".
OMP: Info #285: KMP_AFFINITY: topology layer "L3 cache" is equivalent to "socket".
OMP: Info #285: KMP_AFFINITY: topology layer "L2 cache" is equivalent to "core".
OMP: Info #285: KMP_AFFINITY: topology layer "L1 cache" is equivalent to "core".
OMP: Info #191: KMP_AFFINITY: 2 sockets x 38 cores/socket x 2 threads/core (76 total cores)
OMP: Info #216: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to socket 0 core 0 thread 0
...
OMP: Info #171: KMP_AFFINITY: OS proc 151 maps to socket 1 core 37 thread 1
OMP: Info #144: KMP_AFFINITY: Threads may migrate across 1 innermost levels of machine
OMP: Info #252: KMP_AFFINITY: pid 4093998 tid 4093998 thread 0 bound to OS proc set 0,76
...
OMP: Info #252: KMP_AFFINITY: pid 4093998 tid 4094074 thread 75 bound to OS proc set 75,151
...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 999999944 (elements)
Memory per array = 7629.4 MiB (= 7.5 GiB).
Total memory required = 22888.2 MiB (= 22.4 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
OpenMP version (yyyymm): 201611
Number of Threads requested = 76
Number of Threads counted = 76
-------------------------------------------------------------
Your clock granularity appears to be 1000 ticks per microseconds.
Each test below will take on the order of 55652 microseconds.
(= 55652783 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Med time Min time Max time
Copy: 315994.4 0.050745 0.050634 0.052998
Scale: 313958.3 0.051009 0.050962 0.052837
Add: 319920.9 0.075328 0.075019 0.075730
Triad: 318699.6 0.075450 0.075306 0.075544
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
227.16user 15.61system 0:03.22elapsed 7535%CPU (0avgtext+0avgdata 23451024maxresident)k
536inputs+0outputs (1major+116483minor)pagefaults 0swaps
Relation between user, sys and elapsed time:
  | User time | 227.16 seconds |
+ | Sys time | 15.61 seconds |
= | CPU time (user + sys) | 242.77 seconds |
/ | CPU usage | 75.35 (= 7535 %CPU) |
= | Elapsed time | 3.22 seconds (0:03.22, format min:sec) |
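For the parallel run the CPU time is summed over all threads, so it has to be divided by the CPU usage to get back to the elapsed time. A minimal sketch with the values from the summary line above; the variable names are illustrative only.

```python
# Values from the /usr/bin/time summary line of the 76-thread run.
user_s = 227.16
sys_s = 15.61
cpu_percent = 7535          # 100 % corresponds to one fully busy core
n_threads = 76              # OMP_NUM_THREADS

cpu_s = user_s + sys_s                  # CPU time summed over all threads
busy_cores = cpu_percent / 100          # average number of busy cores
elapsed_est = cpu_s / busy_cores        # should reproduce the elapsed time

print(f"user + sys      = {cpu_s:.2f} s")
print(f"busy cores      = {busy_cores:.2f} of {n_threads}")
print(f"elapsed (estim) = {elapsed_est:.2f} s")
```

Read this way, 7535 %CPU means that on average about 75 of the 76 requested threads were busy for the whole run.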