Example: likwid-bench on Intel Xeon Broadwell
- List available micro benchmarks
```bash
likwid-bench -a | \
    grep -e stream_avx -e stream_mem_avx
```
```
stream_avx - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX
stream_avx512 - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX-512
stream_avx512_fma - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX-512 FMAs
stream_avx_fma - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX FMAs
stream_mem_avx - Double-precision stream triad A(i) = B(i)*c + C(i), uses AVX and non-temporal stores
stream_mem_avx_fma - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX FMAs and non-temporal stores
```
- List properties of test
```bash
likwid-bench -l stream_mem_avx_fma
```
```
Name: stream_mem_avx_fma
Description: Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX FMAs and non-temporal stores
Number of streams: 3
Loop stride: 16
Data Type: Double precision float
Flops per element: 2
Bytes per element: 24
Load bytes per element: 16
Store bytes per element: 8
Load Ops: 2
Store Ops: 1
Constant instructions: 17
Loop instructions: 15
Loop micro Ops (μOPs): 22
```
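The per-element numbers follow directly from the triad kernel: two of the three double-precision streams are loaded, one is written with a non-temporal store, and each element costs one FMA. A quick shell sanity check (plain arithmetic, nothing LIKWID-specific):

```bash
# Per-element traffic of A(i) = B(i)*c + C(i) with doubles (8 bytes each)
echo "load bytes per element : $(( 2 * 8 ))"         # B(i) and C(i) are loaded
echo "store bytes per element: $(( 1 * 8 ))"         # A(i) is stored (non-temporal)
echo "bytes per element      : $(( 2 * 8 + 1 * 8 ))"
echo "flops per element      : 2"                    # one FMA = multiply + add
echo "code balance (B/flop)  : $(( 24 / 2 ))"
```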
- List available thread domains
```bash
likwid-bench -p
```
```
Number of Domains 11
Domain 0:
        Tag N: 0 28 1 29 2 30 3 31 4 32 5 33 6 34 7 35 8 36 9 37 10 38 11 39 12 40 13 41 14 42 15 43 16 44 17 45 18 46 19 47 20 48 21 49 22 50 23 51 24 52 25 53 26 54 27 55
Domain 1:
        Tag S0: 0 28 1 29 2 30 3 31 4 32 5 33 6 34 7 35 8 36 9 37 10 38 11 39 12 40 13 41
Domain 2:
        Tag S1: 14 42 15 43 16 44 17 45 18 46 19 47 20 48 21 49 22 50 23 51 24 52 25 53 26 54 27 55
Domain 3:
        Tag C0: 0 28 1 29 2 30 3 31 4 32 5 33 6 34
Domain 4:
        Tag C1: 7 35 8 36 9 37 10 38 11 39 12 40 13 41
Domain 5:
        Tag C2: 14 42 15 43 16 44 17 45 18 46 19 47 20 48
Domain 6:
        Tag C3: 21 49 22 50 23 51 24 52 25 53 26 54 27 55
Domain 7:
        Tag M0: 0 28 1 29 2 30 3 31 4 32 5 33 6 34
Domain 8:
        Tag M1: 7 35 8 36 9 37 10 38 11 39 12 40 13 41
Domain 9:
        Tag M2: 14 42 15 43 16 44 17 45 18 46 19 47 20 48
Domain 10:
        Tag M3: 21 49 22 50 23 51 24 52 25 53 26 54 27 55
```
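The tags describe the machine topology: N is the whole node, S0/S1 the two sockets, C0–C3 the last-level cache groups (which here coincide with the four NUMA memory domains M0–M3, as with cluster-on-die enabled), and each entry pairs a physical core with its hyperthread sibling (e.g. 0 and 28). If in doubt, the layout can be cross-checked against the hardware topology:

```bash
# Cross-check the thread domains against the hardware topology
# (likwid-topology is part of the LIKWID tool suite)
likwid-topology
```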
- Run micro benchmark `stream_mem_avx_fma` on memory domains 0 to 3 with 7 threads each, skipping hyperthreads. Each workgroup `-w M<i>:8GB:7:1:2` requests an 8 GB working set in memory domain `M<i>` and 7 threads with chunk size 1 and stride 2, so every second hardware thread (the SMT sibling) is skipped.

```bash
likwid-bench \
    -t stream_mem_avx_fma \
    -i 40 \
    -w M0:8GB:7:1:2 \
    -w M1:8GB:7:1:2 \
    -w M2:8GB:7:1:2 \
    -w M3:8GB:7:1:2
```
```
Warning: Sanitizing vector length to a multiple of the loop stride 16 and thread count 7 from 333333333 elements (-589934600 bytes) to 333333280 elements (-589935872 bytes)
Allocate: Process running on core 0 (Domain M0) - Vector length 333333280/2666666240 Offset 0 Alignment 512
Allocate: Process running on core 0 (Domain M0) - Vector length 333333280/2666666240 Offset 0 Alignment 512
Allocate: Process running on core 0 (Domain M0) - Vector length 333333280/2666666240 Offset 0 Alignment 512
Warning: Sanitizing vector length to a multiple of the loop stride 16 and thread count 7 from 333333333 elements (-589934600 bytes) to 333333280 elements (-589935872 bytes)
Allocate: Process running on core 7 (Domain M1) - Vector length 333333280/2666666240 Offset 0 Alignment 512
Allocate: Process running on core 7 (Domain M1) - Vector length 333333280/2666666240 Offset 0 Alignment 512
Allocate: Process running on core 7 (Domain M1) - Vector length 333333280/2666666240 Offset 0 Alignment 512
Warning: Sanitizing vector length to a multiple of the loop stride 16 and thread count 7 from 333333333 elements (-589934600 bytes) to 333333280 elements (-589935872 bytes)
Allocate: Process running on core 14 (Domain M2) - Vector length 333333280/2666666240 Offset 0 Alignment 512
Allocate: Process running on core 14 (Domain M2) - Vector length 333333280/2666666240 Offset 0 Alignment 512
Allocate: Process running on core 14 (Domain M2) - Vector length 333333280/2666666240 Offset 0 Alignment 512
Warning: Sanitizing vector length to a multiple of the loop stride 16 and thread count 7 from 333333333 elements (-589934600 bytes) to 333333280 elements (-589935872 bytes)
Allocate: Process running on core 21 (Domain M3) - Vector length 333333280/2666666240 Offset 0 Alignment 512
Allocate: Process running on core 21 (Domain M3) - Vector length 333333280/2666666240 Offset 0 Alignment 512
Allocate: Process running on core 21 (Domain M3) - Vector length 333333280/2666666240 Offset 0 Alignment 512
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: stream_mem_avx_fma
--------------------------------------------------------------------------------
Using 4 work groups
Using 28 threads
--------------------------------------------------------------------------------
Running without Marker API. Activate Marker API with -m on commandline.
--------------------------------------------------------------------------------
Group: 2 Thread 1 Global Thread 15 running on core 15 - Vector length 47619040 Offset 47619040
Group: 1 Thread 6 Global Thread 13 running on core 13 - Vector length 47619040 Offset 285714240
Group: 0 Thread 0 Global Thread 0 running on core 0 - Vector length 47619040 Offset 0
Group: 2 Thread 4 Global Thread 18 running on core 18 - Vector length 47619040 Offset 190476160
Group: 3 Thread 3 Global Thread 24 running on core 24 - Vector length 47619040 Offset 142857120
Group: 2 Thread 2 Global Thread 16 running on core 16 - Vector length 47619040 Offset 95238080
Group: 3 Thread 4 Global Thread 25 running on core 25 - Vector length 47619040 Offset 190476160
Group: 2 Thread 0 Global Thread 14 running on core 14 - Vector length 47619040 Offset 0
Group: 1 Thread 3 Global Thread 10 running on core 10 - Vector length 47619040 Offset 142857120
Group: 3 Thread 2 Global Thread 23 running on core 23 - Vector length 47619040 Offset 95238080
Group: 0 Thread 5 Global Thread 5 running on core 5 - Vector length 47619040 Offset 238095200
Group: 3 Thread 5 Global Thread 26 running on core 26 - Vector length 47619040 Offset 238095200
Group: 0 Thread 6 Global Thread 6 running on core 6 - Vector length 47619040 Offset 285714240
Group: 0 Thread 1 Global Thread 1 running on core 1 - Vector length 47619040 Offset 47619040
Group: 0 Thread 2 Global Thread 2 running on core 2 - Vector length 47619040 Offset 95238080
Group: 0 Thread 3 Global Thread 3 running on core 3 - Vector length 47619040 Offset 142857120
Group: 1 Thread 5 Global Thread 12 running on core 12 - Vector length 47619040 Offset 238095200
Group: 1 Thread 0 Global Thread 7 running on core 7 - Vector length 47619040 Offset 0
Group: 1 Thread 1 Global Thread 8 running on core 8 - Vector length 47619040 Offset 47619040
Group: 1 Thread 2 Global Thread 9 running on core 9 - Vector length 47619040 Offset 95238080
Group: 2 Thread 3 Global Thread 17 running on core 17 - Vector length 47619040 Offset 142857120
Group: 3 Thread 1 Global Thread 22 running on core 22 - Vector length 47619040 Offset 47619040
Group: 2 Thread 5 Global Thread 19 running on core 19 - Vector length 47619040 Offset 238095200
Group: 3 Thread 6 Global Thread 27 running on core 27 - Vector length 47619040 Offset 285714240
Group: 0 Thread 4 Global Thread 4 running on core 4 - Vector length 47619040 Offset 190476160
Group: 3 Thread 0 Global Thread 21 running on core 21 - Vector length 47619040 Offset 0
Group: 1 Thread 4 Global Thread 11 running on core 11 - Vector length 47619040 Offset 190476160
Group: 2 Thread 6 Global Thread 20 running on core 20 - Vector length 47619040 Offset 285714240
--------------------------------------------------------------------------------
Cycles:                 22030045389
CPU Clock:              1995371847
Cycle Clock:            1995371847
Time:                   1.104057e+01 sec
Iterations:             1120
Iterations per thread:  40
Inner loop executions:  2976190
Size (Byte):            31999994880
Size per thread:        1142856960
Number of Flops:        106666649600
MFlops/s:               9661.33
Data volume (Byte):     1279999795200
MByte/s:                115936.01
Cycles per update:      0.413063
Cycles per cacheline:   3.304507
Loads per update:       2
Stores per update:      1
Load bytes per element: 16
Store bytes per elem.:  8
Load/store ratio:       2.00
Instructions:           49999992017
UOPs:                   73333321600
--------------------------------------------------------------------------------
```
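The derived rates in the result block can be reproduced from the raw values; a small awk sketch (assuming MByte/s = data volume / time / 10^6 and MFlops/s = flops / time / 10^6, which matches the reported numbers):

```bash
# Recompute the derived metrics from the values reported above
awk 'BEGIN {
    time  = 1.104057e+01       # Time (sec)
    bytes = 1279999795200      # Data volume (Byte)
    flops = 106666649600       # Number of Flops
    printf "MByte/s : %.2f\n", bytes / time / 1.0e6   # ~115936
    printf "MFlops/s: %.2f\n", flops / time / 1.0e6   # ~9661
}'
```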
- Loop over number of cores used
```bash
NUM_DOMAINS=4
for ((NUM_CORES=1; NUM_CORES <= 28; NUM_CORES++))
do
    # Distribute cores in round robin mode
    declare -i -a CORES_PER_DOMAIN=()
    for (( COUNT=0; COUNT < NUM_CORES; COUNT++ ))
    do
        let CORES_PER_DOMAIN[$((COUNT % NUM_DOMAINS))]++
    done
    COMMAND=( likwid-bench -t stream_mem_avx_fma -i 100 )
    for ((DOMAIN=0; DOMAIN < NUM_DOMAINS; DOMAIN++))
    do
        if [[ ${CORES_PER_DOMAIN[${DOMAIN}]} -gt 0 ]]; then
            COMMAND+=( -w M${DOMAIN}:8GB:${CORES_PER_DOMAIN[${DOMAIN}]}:1:2 )
        fi
    done
    "${COMMAND[@]}"
done 2>/dev/null | grep 'MByte/s:'
```
The effective commands for 1 to 4, 8, 16 and 28 cores are:

```bash
likwid-bench -t stream_mem_avx_fma -i 100 -w M0:8GB:1:1:2
likwid-bench -t stream_mem_avx_fma -i 100 -w M0:8GB:1:1:2 -w M1:8GB:1:1:2
likwid-bench -t stream_mem_avx_fma -i 100 -w M0:8GB:1:1:2 -w M1:8GB:1:1:2 -w M2:8GB:1:1:2
likwid-bench -t stream_mem_avx_fma -i 100 -w M0:8GB:1:1:2 -w M1:8GB:1:1:2 -w M2:8GB:1:1:2 -w M3:8GB:1:1:2
likwid-bench -t stream_mem_avx_fma -i 100 -w M0:8GB:2:1:2 -w M1:8GB:2:1:2 -w M2:8GB:2:1:2 -w M3:8GB:2:1:2
likwid-bench -t stream_mem_avx_fma -i 100 -w M0:8GB:4:1:2 -w M1:8GB:4:1:2 -w M2:8GB:4:1:2 -w M3:8GB:4:1:2
likwid-bench -t stream_mem_avx_fma -i 100 -w M0:8GB:7:1:2 -w M1:8GB:7:1:2 -w M2:8GB:7:1:2 -w M3:8GB:7:1:2
```
#Cores | stream_mem_avx_fma (MByte/s) | % of Max |
---|---|---|
1 | 16063.80 | 13 % |
2 | 31227.47 | 25 % |
3 | 47110.67 | 37 % |
4 | 64245.38 | 51 % |
8 | 102775.05 | 81 % |
16 | 122272.93 | 97 % |
28 | 126187.02 | 100 % |
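The "% of Max" column is just each bandwidth relative to the 28-core result; it can be recomputed from the measured values:

```bash
# Recompute the "% of Max" column from the measured bandwidths (MByte/s)
MAX=126187.02
for BW in 16063.80 31227.47 47110.67 64245.38 102775.05 122272.93 126187.02
do
    awk -v bw="$BW" -v max="$MAX" \
        'BEGIN { printf "%10.2f MByte/s -> %3.0f %%\n", bw, 100 * bw / max }'
done
```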
=> A single core reaches only about 1/8 of the full memory bandwidth
=> 4 cores are needed to reach half of the full memory bandwidth
=> All cores are needed to reach the full memory bandwidth
- Loop over the memory size used (L1 (32 kB), L2 (256 kB), L3 (25 MB)). The same sweep was also run with `-t stream_avx_fma` for the comparison columns in the table below.
```bash
MEM_SIZE=2
while (( MEM_SIZE <= 16*1024*1024 ))
do
    likwid-bench \
        -t stream_mem_avx_fma \
        -i 100 \
        -w M0:${MEM_SIZE}KB:7:1:2 \
        -w M1:${MEM_SIZE}KB:7:1:2 \
        -w M2:${MEM_SIZE}KB:7:1:2 \
        -w M3:${MEM_SIZE}KB:7:1:2
    let MEM_SIZE*=2
done 2>/dev/null | grep -e 'Size (Byte):' -e 'MByte/s:'
```
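The grep output interleaves one `Size (Byte):` line and one `MByte/s:` line per run. A small awk sketch pairs them and converts the size to kB as used in the table below (the log file name is just a placeholder for wherever the loop output was saved):

```bash
# Pair each "Size (Byte):" line with the following "MByte/s:" line
# and print the total working-set size in kB next to the bandwidth.
grep -e 'Size (Byte):' -e 'MByte/s:' stream_mem_avx_fma.log | \
    awk '/Size \(Byte\):/ { size = $3 }
         /MByte\/s:/      { printf "%14.1f kB  %12s MByte/s\n", size / 1024, $2 }'
```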
MEM_SIZE (kB) | stream_mem_avx_fma (MByte/s) | stream_avx_fma (MByte/s) | stream_mem_avx_fma / stream_avx_fma | stream_avx_fma / stream_mem_avx_fma |
---|---|---|---|---|
10.5 | 18750 | 26672 | 0.70 | 1.42 |
21.0 | 29354 | 121348 | 0.24 | 4.13 |
52.5 | 49569 | 111199 | 0.45 | 2.24 |
115.5 | 85302 | 310242 | 0.27 | 3.64 |
241.5 | 84466 | 303598 | 0.28 | 3.59 |
493.5 | 88457 | 1168594 | 0.08 | 13.21 |
997.5 | 89229 | 897323 | 0.10 | 10.06 |
1995.0 | 115272 | 866554 | 0.13 | 7.52 |
3990.0 | 149004 | 849090 | 0.18 | 5.70 |
7990.5 | 152373 | 385012 | 0.40 | 2.53 |
15991.5 | 177235 | 231882 | 0.76 | 1.31 |
31993.5 | 184341 | 221023 | 0.83 | 1.20 |
63997.5 | 197247 | 361096 | 0.55 | 1.83 |
127995.0 | 167116 | 105212 | 1.59 | 0.63 |
255990.0 | 129356 | 99404 | 1.30 | 0.77 |
511990.5 | 128616 | 99228 | 1.30 | 0.77 |
1023991.5 | 128589 | 99264 | 1.30 | 0.77 |
2047993.5 | 128013 | 99239 | 1.29 | 0.78 |
4095997.5 | 124954 | 97184 | 1.29 | 0.78 |
8191995.0 | 123248 | 95195 | 1.29 | 0.77 |
16383990.0 | 124524 | 94926 | 1.31 | 0.76 |
32767990.5 | 124579 | 94747 | 1.31 | 0.76 |
65535991.5 | 124994 | 94375 | 1.32 | 0.76 |
=> `stream_avx_fma` does not use streaming stores. All stores go to the cache first. As long as the working set fits into the cache, `stream_avx_fma` is much faster than `stream_mem_avx_fma`.
=> `stream_mem_avx_fma` uses streaming stores. All stores go directly to main memory. Once the working set no longer fits into the cache, `stream_mem_avx_fma` is faster than `stream_avx_fma`.
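To see the crossover directly, both kernels can be run back to back at a cache-resident and a memory-resident size (the sizes below are illustrative picks, not taken from the runs above):

```bash
# Compare cached stores vs. streaming stores at two illustrative sizes
for KERNEL in stream_avx_fma stream_mem_avx_fma
do
    for SIZE in 1MB 2GB    # roughly cache-resident vs. clearly memory-resident
    do
        likwid-bench -t ${KERNEL} -i 100 -w M0:${SIZE}:7:1:2
    done
done 2>/dev/null | grep -e 'Test:' -e 'Size (Byte):' -e 'MByte/s:'
```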