Example: likwid-bench on Intel Xeon Haswell
- List available micro benchmarks
likwid-bench -a | grep -e stream_avx -e stream_mem_avx
stream_avx - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX
stream_avx512 - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX-512
stream_avx_fma - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX FMAs
stream_mem_avx - Double-precision stream triad A(i) = B(i)*c + C(i), uses AVX and non-temporal stores
stream_mem_avx_fma - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX FMAs and non-temporal stores
- List properties of test
likwid-bench -l stream_mem_avx_fma
Name: stream_mem_avx_fma
Description: Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX FMAs and non-temporal stores
Number of streams: 3
Loop stride: 16
Data Type: Double precision float
Flops per element: 2
Bytes per element: 24
Load bytes per element: 16
Store bytes per element: 8
Load Ops: 2
Store Ops: 1
Constant instructions: 17
Loop instructions: 15
Loop micro Ops (μOPs): 22
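These properties follow from the triad itself: per element, B(i) and C(i) are loaded (2 x 8 bytes), A(i) is written once with a non-temporal store (8 bytes, no write-allocate), and the FMA counts as 2 flops. A minimal shell sketch of the resulting per-element traffic and arithmetic intensity (my own calculation, not likwid-bench output):

# Per-element traffic of the triad A(i) = B(i)*c + C(i) with non-temporal stores
LOAD_BYTES=16    # B(i) and C(i), 8 bytes each
STORE_BYTES=8    # A(i), streaming store: no write-allocate read
FLOPS=2          # one FMA = multiply + add
echo "Bytes per element: $((LOAD_BYTES + STORE_BYTES))"                                  # 24
echo "Flops per byte:    $(echo "scale=4; $FLOPS/($LOAD_BYTES+$STORE_BYTES)" | bc -l)"   # .0833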
- Switch off hyper-threads to run benchmark only on physical cores
sudo -i
for NUM in {20..39}
do
    echo 0 > /sys/devices/system/cpu/cpu${NUM}/online
done
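To bring the hyper-threads back online after the measurements, write 1 to the same sysfs files (inverse of the loop above):

sudo -i
for NUM in {20..39}
do
    echo 1 > /sys/devices/system/cpu/cpu${NUM}/online
done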
- List available thread domains
likwid-bench -p
Number of Domains 7
Domain 0: Tag N: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Domain 1: Tag S0: 0 1 2 3 4 5 6 7 8 9
Domain 2: Tag S1: 10 11 12 13 14 15 16 17 18 19
Domain 3: Tag C0: 0 1 2 3 4 5 6 7 8 9
Domain 4: Tag C1: 10 11 12 13 14 15 16 17 18 19
Domain 5: Tag M0: 0 1 2 3 4 5 6 7 8 9
Domain 6: Tag M1: 10 11 12 13 14 15 16 17 18 19
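The tags stand for the whole node (N), the sockets (S0/S1), the last-level cache groups (C0/C1) and the NUMA memory domains (M0/M1). Any of them can be used as the domain part of a -w workgroup; for example, to run 10 threads with a 1 GB working set on socket 0 only (size chosen here just for illustration):

likwid-bench -t stream_mem_avx_fma -w S0:1GB:10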
- Run micro benchmark stream_mem_avx_fma on memory domains 0 and 1 with 10 threads each
likwid-bench -t stream_mem_avx_fma -w M0:16GB:10 -w M1:16GB:10
Warning: Sanitizing vector length to a multiple of the loop stride 16 and thread count 10 from 666666666 elements (-1179869200 bytes) to 666666560 elements (-1179871744 bytes)
Allocate: Process running on core 0 (Domain M0) - Vector length 666666560/5333332480 Offset 0 Alignment 512
Allocate: Process running on core 0 (Domain M0) - Vector length 666666560/5333332480 Offset 0 Alignment 512
Allocate: Process running on core 0 (Domain M0) - Vector length 666666560/5333332480 Offset 0 Alignment 512
Warning: Sanitizing vector length to a multiple of the loop stride 16 and thread count 10 from 666666666 elements (-1179869200 bytes) to 666666560 elements (-1179871744 bytes)
Allocate: Process running on core 10 (Domain M1) - Vector length 666666560/5333332480 Offset 0 Alignment 512
Allocate: Process running on core 10 (Domain M1) - Vector length 666666560/5333332480 Offset 0 Alignment 512
Allocate: Process running on core 10 (Domain M1) - Vector length 666666560/5333332480 Offset 0 Alignment 512
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: stream_mem_avx_fma
--------------------------------------------------------------------------------
Using 2 work groups
Using 20 threads
--------------------------------------------------------------------------------
Running without Marker API. Activate Marker API with -m on commandline.
--------------------------------------------------------------------------------
Automatic iteration count detection: 8 iterations per thread
Sanitizing iterations count per thread to 10
Group: 0 Thread 0 Global Thread 0 running on core 0 - Vector length 66666656 Offset 0
Group: 0 Thread 1 Global Thread 1 running on core 1 - Vector length 66666656 Offset 66666656
Group: 0 Thread 2 Global Thread 2 running on core 2 - Vector length 66666656 Offset 133333312
Group: 0 Thread 3 Global Thread 3 running on core 3 - Vector length 66666656 Offset 199999968
Group: 0 Thread 4 Global Thread 4 running on core 4 - Vector length 66666656 Offset 266666624
Group: 0 Thread 5 Global Thread 5 running on core 5 - Vector length 66666656 Offset 333333280
Group: 0 Thread 6 Global Thread 6 running on core 6 - Vector length 66666656 Offset 399999936
Group: 0 Thread 7 Global Thread 7 running on core 7 - Vector length 66666656 Offset 466666592
Group: 0 Thread 8 Global Thread 8 running on core 8 - Vector length 66666656 Offset 533333248
Group: 0 Thread 9 Global Thread 9 running on core 9 - Vector length 66666656 Offset 599999904
Group: 1 Thread 0 Global Thread 10 running on core 10 - Vector length 66666656 Offset 0
Group: 1 Thread 1 Global Thread 11 running on core 11 - Vector length 66666656 Offset 66666656
Group: 1 Thread 2 Global Thread 12 running on core 12 - Vector length 66666656 Offset 133333312
Group: 1 Thread 3 Global Thread 13 running on core 13 - Vector length 66666656 Offset 199999968
Group: 1 Thread 4 Global Thread 14 running on core 14 - Vector length 66666656 Offset 266666624
Group: 1 Thread 5 Global Thread 15 running on core 15 - Vector length 66666656 Offset 333333280
Group: 1 Thread 6 Global Thread 16 running on core 16 - Vector length 66666656 Offset 399999936
Group: 1 Thread 7 Global Thread 17 running on core 17 - Vector length 66666656 Offset 466666592
Group: 1 Thread 8 Global Thread 18 running on core 18 - Vector length 66666656 Offset 533333248
Group: 1 Thread 9 Global Thread 19 running on core 19 - Vector length 66666656 Offset 599999904
--------------------------------------------------------------------------------
Cycles:                 7874556080
CPU Clock:              2600020831
Cycle Clock:            2600020831
Time:                   3.028651e+00 sec
Iterations:             200
Iterations per thread:  10
Inner loop executions:  4166666
Size (Byte):            31999994880
Size per thread:        1599999744
Number of Flops:        26666662400
MFlops/s:               8804.80
Data volume (Byte):     319999948800
MByte/s:                105657.58
Cycles per update:      0.590592
Cycles per cacheline:   4.724734
Loads per update:       2
Stores per update:      1
Load bytes per element: 16
Store bytes per elem.:  8
Load/store ratio:       2.00
Instructions:           12499998017
UOPs:                   18333330400
--------------------------------------------------------------------------------
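The reported rates are simply data volume and flop count divided by the runtime; a quick cross-check of the numbers above (plain shell arithmetic with bc, 1 MByte = 10^6 bytes):

echo "scale=2; 319999948800 / 3.028651 / 10^6" | bc -l   # ~105657 MByte/s
echo "scale=2; 26666662400 / 3.028651 / 10^6" | bc -l    # ~8804 MFlops/s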
- Loop over number of cores used
NUM_CORES=1
while (( NUM_CORES <= 10 ))
do
    likwid-bench \
        -t stream_mem_avx_fma \
        -w M0:16GB:${NUM_CORES} \
        -w M1:16GB:${NUM_CORES}
    let NUM_CORES++
done
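Each run prints a full result block like the one above; for the table below only the bandwidth matters. A minimal sketch that filters it out, assuming the "MByte/s:" label from the output format shown above:

NUM_CORES=1
while (( NUM_CORES <= 10 ))
do
    BW=$(likwid-bench -t stream_mem_avx_fma \
                      -w M0:16GB:${NUM_CORES} \
                      -w M1:16GB:${NUM_CORES} 2>/dev/null \
         | grep "MByte/s:" | awk '{print $2}')
    echo "${NUM_CORES} cores: ${BW} MByte/s"
    let NUM_CORES++
done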
#Cores | stream_mem_avx_fma (MByte/s) | % of Max |
---|---|---|
1 | 26833.17 | 27 % |
2 | 47471.31 | 48 % |
3 | 74179.19 | 75 % |
4 | 82817.26 | 84 % |
5 | 87214.52 | 88 % |
6 | 92342.30 | 94 % |
7 | 90999.28 | 92 % |
8 | 95753.62 | 97 % |
9 | 94630.97 | 96 % |
10 | 98636.65 | 100 % |
=> All cores are needed to get full memory bandwidth
=> One core can only get about 1/4 of full memory bandwidth
- Loop over memory size used (caches: L1 32 kB and L2 256 kB per core, L3 25 MB shared)
MEM_SIZE=2
while (( MEM_SIZE <= 16*1024*1024 ))
do
    likwid-bench \
        -t stream_mem_avx_fma \
        -w M0:${MEM_SIZE}KB:10 \
        -w M1:${MEM_SIZE}KB:10
    let MEM_SIZE*=2
done
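The stream_avx_fma column in the table below comes from the same sweep with the cache-friendly kernel (regular instead of non-temporal stores); a sketch of that companion loop:

MEM_SIZE=2
while (( MEM_SIZE <= 16*1024*1024 ))
do
    likwid-bench \
        -t stream_avx_fma \
        -w M0:${MEM_SIZE}KB:10 \
        -w M1:${MEM_SIZE}KB:10
    let MEM_SIZE*=2
done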
MEM_SIZE (kB) | stream_mem_avx_fma (MByte/s) | stream_avx_fma (MByte/s) | stream_mem_avx_fma / stream_avx_fma | stream_avx_fma / stream_mem_avx_fma |
---|---|---|---|---|
4 | 42945.16 | 824739.25 | 0.05 | 19.20 |
8 | 83232.09 | 1236519.03 | 0.07 | 14.86 |
16 | 167897.25 | 1623597.13 | 0.10 | 9.67 |
32 | 215223.51 | 2277613.35 | 0.09 | 10.58 |
64 | 279613.92 | 2801204.54 | 0.10 | 10.02 |
128 | 269573.06 | 3118208.38 | 0.09 | 11.57 |
256 | 277547.37 | 3042734.41 | 0.09 | 10.96 |
512 | 277803.36 | 1009421.85 | 0.28 | 3.63 |
1024 | 275349.61 | 1045113.89 | 0.26 | 3.80 |
2048 | 270680.18 | 858678.77 | 0.32 | 3.17 |
4096 | 267400.84 | 520921.82 | 0.51 | 1.95 |
8192 | 273372.58 | 511276.22 | 0.53 | 1.87 |
16384 | 273382.08 | 506923.04 | 0.54 | 1.85 |
32768 | 229349.34 | 131467.11 | 1.74 | 0.57 |
65536 | 116941.18 | 85563.47 | 1.37 | 0.73 |
131072 | 111224.09 | 85077.25 | 1.31 | 0.76 |
262144 | 111083.54 | 84562.72 | 1.31 | 0.76 |
524288 | 110716.00 | 83594.98 | 1.32 | 0.76 |
1048576 | 109442.93 | 83056.92 | 1.32 | 0.76 |
2097152 | 107109.03 | 82054.04 | 1.31 | 0.77 |
4194304 | 106147.95 | 77421.50 | 1.37 | 0.73 |
8388608 | 100948.89 | 73255.16 | 1.38 | 0.73 |
16777216 | 91840.69 | 71505.27 | 1.28 | 0.78 |
=> stream_avx_fma does not use streaming stores. All stores go to the cache first. As long as the working set fits into the cache, stream_avx_fma is much faster than stream_mem_avx_fma.
=> stream_mem_avx_fma uses streaming stores. All stores go directly to main memory and bypass the cache, so no write-allocate transfer is needed. Once the working set no longer fits into the cache, stream_mem_avx_fma is faster than stream_avx_fma (roughly 24 instead of 32 bytes of memory traffic per element, which matches the ~1.3x ratio in the table above).