Example /usr/bin/time: stream
Build benchmark
module add compiler/intel/18.0
icc -std=c11 -Ofast -xHost -ipo -qopenmp \
    stream.c -o stream
Serial execution
export OMP_NUM_THREADS=1
/usr/bin/time ./stream -n 1000000000
Output:
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 1000000000 (elements) (elements)
Memory per array = 7629.4 MiB (= 7.5 GiB).
Total memory required = 22888.2 MiB (= 22.4 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 4183426 microseconds.
   (= 4183426 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           19025.2     1.088974     0.840990     1.476501
Scale:          19947.5     1.110411     0.802105     1.578164
Add:            16278.6     1.734772     1.474325     2.261711
Triad:          15381.1     1.733459     1.560358     1.955591
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
56.73user 36.21system 1:32.95elapsed 99%CPU (0avgtext+0avgdata 23439444maxresident)k
0inputs+0outputs (0major+19348593minor)pagefaults 0swaps
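What causes the high system time? Memory page allocation and wiping: the kernel has to map and zero every page of the three arrays when it is first touched, and that work is accounted as system time.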
Relation user, sys and elapsed time:
  | User time    | 56.73 seconds
+ | Sys time     | 36.21 seconds
= | CPU time     | 92.94 seconds
= |              | 1:32.94 minutes
~ | Elapsed time | 1:32.95 minutes
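The relation above can be checked directly from the report line that /usr/bin/time prints. The following is a minimal sketch, not part of the original page: the helper name `parse_elapsed` and the regular expression are illustrative assumptions, and the integer truncation in the last step is an assumption about how the %CPU figure is derived.

```python
import re

# First line of the /usr/bin/time report from the serial run above.
report = "56.73user 36.21system 1:32.95elapsed 99%CPU"

def parse_elapsed(text):
    """Convert an [h:]m:s elapsed string into seconds (hypothetical helper)."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds

match = re.match(r"([\d.]+)user ([\d.]+)system ([\d:.]+)elapsed (\d+)%CPU", report)
user = float(match.group(1))
system = float(match.group(2))
elapsed = parse_elapsed(match.group(3))

# With a single thread, user + sys should be close to the elapsed time.
cpu_time = user + system
# Assumption: the reported %CPU is this ratio truncated to an integer.
percent_cpu = int(100 * cpu_time / elapsed)

print(f"user + sys = {cpu_time:.2f} s")
print(f"elapsed    = {elapsed:.2f} s")
print(f"%CPU       = {percent_cpu}")
```

For a single-threaded run the CPU time and the elapsed time nearly coincide, so the ratio stays just below 100 %.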
Relation vector size and maxresident:
  | Number of arrays | 3           | vectors a, b, c
* | Size / array     | 1000000000  | elements / vector
* | Size / element   | 8           | bytes / double
= |                  | 24000000000 | bytes
= |                  | 23437500    | kbytes
~ | maxresident      | 23439444    | kbytes
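The same arithmetic as a short script, as a sketch only. The variable names are illustrative, and the script assumes that "kbytes" means units of 1024 bytes, which is what the division in the table implies.

```python
# Values taken from the serial run above.
arrays = 3                    # vectors a, b, c
elements = 1_000_000_000      # value passed with -n
bytes_per_element = 8         # one double
maxresident_kib = 23439444    # reported by /usr/bin/time

total_bytes = arrays * elements * bytes_per_element
total_kib = total_bytes // 1024          # assumption: 1 kbyte = 1024 bytes
total_mib = total_bytes / 1024**2        # comparable to STREAM's "Total memory required"

# Whatever is resident beyond the three arrays (program code, libraries, stack, ...).
overhead_kib = maxresident_kib - total_kib

print(f"arrays      : {total_bytes} bytes = {total_kib} kbytes = {total_mib:.1f} MiB")
print(f"maxresident : {maxresident_kib} kbytes")
print(f"difference  : {overhead_kib} kbytes")
```

The three arrays account for almost all of the resident set; the remainder is under 2 MiB.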
Parallel execution
export OMP_NUM_THREADS=28
export KMP_AFFINITY="verbose,granularity=core,respect,scatter"
/usr/bin/time ./stream -n 1000000000
Output:
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #156: KMP_AFFINITY: 56 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 14 cores/pkg x 2 threads/core (28 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 28 maps to package 0 core 0 thread 1
...
OMP: Info #144: KMP_AFFINITY: Threads may migrate across 1 innermost levels of machine
OMP: Info #242: KMP_AFFINITY: pid 9090 tid 9090 thread 0 bound to OS proc set {0,28}
OMP: Info #242: KMP_AFFINITY: pid 9090 tid 9091 thread 1 bound to OS proc set {14,42}
...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 999999980 (elements) (elements)
Memory per array = 7629.4 MiB (= 7.5 GiB).
Total memory required = 22888.2 MiB (= 22.4 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 28
Number of Threads counted = 28
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 134990 microseconds.
   (= 134990 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          114364.0     0.204036     0.139904     0.236524
Scale:         108420.9     0.208714     0.147573     0.232585
Add:            94078.1     0.267538     0.255107     0.275008
Triad:          92456.1     0.271514     0.259583     0.316614
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
261.81user 77.71system 0:15.35elapsed 2211%CPU (0avgtext+0avgdata 23441696maxresident)k
0inputs+0outputs (0major+11917304minor)pagefaults 0swaps
Relation user, sys and elapsed time:
  | User time    | 261.81 seconds
+ | Sys time     | 77.71 seconds
= | CPU time     | 339.52 seconds
/ | %CPU         | 2211 % (= 22.11)
~ |              | 15.35 seconds
= |              | 0:15.35 minutes
= | Elapsed time | 0:15.35 minutes
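For the parallel run the %CPU figure ties the three times together: it is the CPU time divided by the elapsed time. The sketch below recomputes it from the numbers above. The variable names are illustrative, the truncation to an integer is an assumption about how the figure is reported, and the per-thread utilisation is an extra derived figure that is not part of the /usr/bin/time output.

```python
# Values taken from the parallel run above.
user_s = 261.81
sys_s = 77.71
elapsed_s = 15.35
threads = 28                  # OMP_NUM_THREADS

cpu_time = user_s + sys_s

# Average number of cores kept busy over the whole run.
busy_cores = cpu_time / elapsed_s
# Assumption: the reported %CPU is this ratio in percent, truncated to an integer.
percent_cpu = int(100 * busy_cores)

# Fraction of the 28 requested threads that was busy on average (derived figure).
utilisation = busy_cores / threads

print(f"user + sys  = {cpu_time:.2f} s")
print(f"busy cores  = {busy_cores:.2f}")
print(f"%CPU        = {percent_cpu}")
print(f"utilisation = {100 * utilisation:.0f} % of {threads} threads")
```

On average about 22 of the 28 threads are busy, since the elapsed time covers the whole run and not only the timed kernels.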
Last modified on Apr 1, 2019, 4:42:19 PM