
Example: likwid-bench on Intel Xeon Haswell

  • List available micro benchmarks
    likwid-bench -a | grep -e stream_avx -e stream_mem_avx
    
    stream_avx - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX
    stream_avx512 - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX-512
    stream_avx_fma - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX FMAs
    stream_mem_avx - Double-precision stream triad A(i) = B(i)*c + C(i), uses AVX and non-temporal stores
    stream_mem_avx_fma - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX FMAs and non-temporal stores
    
  • List the properties of a test
    likwid-bench -l stream_mem_avx_fma
    
    Name: stream_mem_avx_fma
    Description: Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX FMAs and non-temporal stores
    Number of streams: 3
    Loop stride: 16
    Data Type: Double precision float
    Flops per element: 2
    Bytes per element: 24
    Load bytes per element: 16
    Store bytes per element: 8
    Load Ops: 2
    Store Ops: 1
    Constant instructions: 17
    Loop instructions: 15
    Loop micro Ops (μOPs): 22
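
    The byte and flop counts are consistent: 16 load bytes + 8 store bytes give
    24 bytes per element, and the triad B(i)*c + C(i) needs one multiply and one
    add (a single FMA), i.e. 2 flops per element and an arithmetic intensity of
    1/12 flop/byte. A quick check (plain awk, not a LIKWID command):

    awk 'BEGIN {
        bytes = 16 + 8   # load bytes + store bytes per element
        flops = 2        # one FMA: multiply + add
        printf "bytes per element:     %d\n", bytes
        printf "arithmetic intensity:  %.4f flop/byte\n", flops / bytes
    }'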
    
  • Switch off the hyper-threads (here CPUs 20-39) so that the benchmark runs only on physical cores
    sudo -i
    for NUM in {20..39}
    do
        echo 0 > /sys/devices/system/cpu/cpu${NUM}/online
    done
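
    To bring the logical cores back online after the measurements, write 1 to the
    same sysfs files (assuming the hyper-threads are again CPUs 20-39):

    for NUM in {20..39}
    do
        echo 1 > /sys/devices/system/cpu/cpu${NUM}/online
    done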
    
  • List available thread domains
    likwid-bench -p
    
    Number of Domains 7
    Domain 0:
            Tag N: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
    Domain 1:
            Tag S0: 0 1 2 3 4 5 6 7 8 9
    Domain 2:
            Tag S1: 10 11 12 13 14 15 16 17 18 19
    Domain 3:
            Tag C0: 0 1 2 3 4 5 6 7 8 9
    Domain 4:
            Tag C1: 10 11 12 13 14 15 16 17 18 19
    Domain 5:
            Tag M0: 0 1 2 3 4 5 6 7 8 9
    Domain 6:
            Tag M1: 10 11 12 13 14 15 16 17 18 19
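
    The tags denote the node (N), the sockets (S0, S1), the last-level cache
    groups (C0, C1) and the NUMA memory domains (M0, M1). The underlying topology
    can be cross-checked with likwid-topology (the -c and -g switches print cache
    details and an ASCII topology graph in current LIKWID versions):

    likwid-topology -c -g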
    
  • Run the micro benchmark stream_mem_avx_fma on memory domains M0 and M1 with a 16 GB working set and 10 threads each
    likwid-bench -t stream_mem_avx_fma -w M0:16GB:10 -w M1:16GB:10
    
    Warning: Sanitizing vector length to a multiple of the loop stride 16 and thread count 10 from 666666666 elements (-1179869200 bytes) to 666666560 elements (-1179871744 bytes)
    Allocate: Process running on core 0 (Domain M0) - Vector length 666666560/5333332480 Offset 0 Alignment 512
    Allocate: Process running on core 0 (Domain M0) - Vector length 666666560/5333332480 Offset 0 Alignment 512
    Allocate: Process running on core 0 (Domain M0) - Vector length 666666560/5333332480 Offset 0 Alignment 512
    Warning: Sanitizing vector length to a multiple of the loop stride 16 and thread count 10 from 666666666 elements (-1179869200 bytes) to 666666560 elements (-1179871744 bytes)
    Allocate: Process running on core 10 (Domain M1) - Vector length 666666560/5333332480 Offset 0 Alignment 512
    Allocate: Process running on core 10 (Domain M1) - Vector length 666666560/5333332480 Offset 0 Alignment 512
    Allocate: Process running on core 10 (Domain M1) - Vector length 666666560/5333332480 Offset 0 Alignment 512
    --------------------------------------------------------------------------------
    LIKWID MICRO BENCHMARK
    Test: stream_mem_avx_fma
    --------------------------------------------------------------------------------
    Using 2 work groups
    Using 20 threads
    --------------------------------------------------------------------------------
    Running without Marker API. Activate Marker API with -m on commandline.
    --------------------------------------------------------------------------------
    Automatic iteration count detection: 8 iterations per thread
    Sanitizing iterations count per thread to 10
    Group: 0 Thread 0 Global Thread  0 running on core  0 - Vector length 66666656 Offset 0
    Group: 0 Thread 1 Global Thread  1 running on core  1 - Vector length 66666656 Offset 66666656
    Group: 0 Thread 2 Global Thread  2 running on core  2 - Vector length 66666656 Offset 133333312
    Group: 0 Thread 3 Global Thread  3 running on core  3 - Vector length 66666656 Offset 199999968
    Group: 0 Thread 4 Global Thread  4 running on core  4 - Vector length 66666656 Offset 266666624
    Group: 0 Thread 5 Global Thread  5 running on core  5 - Vector length 66666656 Offset 333333280
    Group: 0 Thread 6 Global Thread  6 running on core  6 - Vector length 66666656 Offset 399999936
    Group: 0 Thread 7 Global Thread  7 running on core  7 - Vector length 66666656 Offset 466666592
    Group: 0 Thread 8 Global Thread  8 running on core  8 - Vector length 66666656 Offset 533333248
    Group: 0 Thread 9 Global Thread  9 running on core  9 - Vector length 66666656 Offset 599999904
    Group: 1 Thread 0 Global Thread 10 running on core 10 - Vector length 66666656 Offset 0
    Group: 1 Thread 1 Global Thread 11 running on core 11 - Vector length 66666656 Offset 66666656
    Group: 1 Thread 2 Global Thread 12 running on core 12 - Vector length 66666656 Offset 133333312
    Group: 1 Thread 3 Global Thread 13 running on core 13 - Vector length 66666656 Offset 199999968
    Group: 1 Thread 4 Global Thread 14 running on core 14 - Vector length 66666656 Offset 266666624
    Group: 1 Thread 5 Global Thread 15 running on core 15 - Vector length 66666656 Offset 333333280
    Group: 1 Thread 6 Global Thread 16 running on core 16 - Vector length 66666656 Offset 399999936
    Group: 1 Thread 7 Global Thread 17 running on core 17 - Vector length 66666656 Offset 466666592
    Group: 1 Thread 8 Global Thread 18 running on core 18 - Vector length 66666656 Offset 533333248
    Group: 1 Thread 9 Global Thread 19 running on core 19 - Vector length 66666656 Offset 599999904
    --------------------------------------------------------------------------------
    Cycles:                 7874556080
    CPU Clock:              2600020831
    Cycle Clock:            2600020831
    Time:                   3.028651e+00 sec
    Iterations:             200
    Iterations per thread:  10
    Inner loop executions:  4166666
    Size (Byte):            31999994880
    Size per thread:        1599999744
    Number of Flops:        26666662400
    MFlops/s:               8804.80
    Data volume (Byte):     319999948800
    MByte/s:                105657.58
    Cycles per update:      0.590592
    Cycles per cacheline:   4.724734
    Loads per update:       2
    Stores per update:      1
    Load bytes per element: 16
    Store bytes per elem.:  8
    Load/store ratio:       2.00
    Instructions:           12499998017
    UOPs:                   18333330400
    --------------------------------------------------------------------------------
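
    The reported rates follow directly from the data volume, the flop count and
    the runtime above; since the kernel moves 24 bytes per 2 flops, MFlops/s is
    simply MByte/s divided by 12. Cross-check with the numbers from this run:

    awk 'BEGIN {
        t     = 3.028651        # Time (sec)
        vol   = 319999948800    # Data volume (Byte)
        flops = 26666662400     # Number of Flops
        printf "MByte/s:  %.2f\n", vol / t / 1e6      # ~105657.58
        printf "MFlops/s: %.2f\n", flops / t / 1e6    # ~8804.80
    }'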
    
  • Loop over the number of cores used per memory domain
    NUM_CORES=1
    while (( NUM_CORES <= 10 ))
    do
        likwid-bench \
            -t stream_mem_avx_fma \
            -w M0:16GB:${NUM_CORES} \
            -w M1:16GB:${NUM_CORES}
        let NUM_CORES++
    done
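
    To extract only the bandwidth figure from a run, filter the MByte/s line of
    the output, e.g. for 4 cores per domain:

    likwid-bench -t stream_mem_avx_fma \
        -w M0:16GB:4 -w M1:16GB:4 \
        | grep 'MByte/s:'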
    
#Cores   stream_mem_avx_fma (MByte/s)   % of Max
 1        26833.17                        27 %
 2        47471.31                        48 %
 3        74179.19                        75 %
 4        82817.26                        84 %
 5        87214.52                        88 %
 6        92342.30                        94 %
 7        90999.28                        92 %
 8        95753.62                        97 %
 9        94630.97                        96 %
10        98636.65                       100 %

=> All cores are needed to reach the full memory bandwidth

=> A single core reaches only about 1/4 of the full memory bandwidth

  • Loop over the memory size used, crossing the cache levels (L1 32 kB, L2 256 kB, L3 25 MB)
    MEM_SIZE=2
    while (( MEM_SIZE <= 16*1024*1024 ))
    do
        likwid-bench \
            -t stream_mem_avx_fma \
            -w M0:${MEM_SIZE}KB:10 \
            -w M1:${MEM_SIZE}KB:10
        let MEM_SIZE*=2
    done
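
    The stream_avx_fma reference values in the table below can be obtained with
    the analogous loop; only the kernel name changes:

    MEM_SIZE=2
    while (( MEM_SIZE <= 16*1024*1024 ))
    do
        likwid-bench \
            -t stream_avx_fma \
            -w M0:${MEM_SIZE}KB:10 \
            -w M1:${MEM_SIZE}KB:10
        let MEM_SIZE*=2
    done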
    
MEM_SIZE (KB)   stream_mem_avx_fma   stream_avx_fma   stream_mem_avx_fma   stream_avx_fma
                (MByte/s)            (MByte/s)        / stream_avx_fma     / stream_mem_avx_fma
       4             42945.16            824739.25        0.05                19.20
       8             83232.09           1236519.03        0.07                14.86
      16            167897.25           1623597.13        0.10                 9.67
      32            215223.51           2277613.35        0.09                10.58
      64            279613.92           2801204.54        0.10                10.02
     128            269573.06           3118208.38        0.09                11.57
     256            277547.37           3042734.41        0.09                10.96
     512            277803.36           1009421.85        0.28                 3.63
    1024            275349.61           1045113.89        0.26                 3.80
    2048            270680.18            858678.77        0.32                 3.17
    4096            267400.84            520921.82        0.51                 1.95
    8192            273372.58            511276.22        0.53                 1.87
   16384            273382.08            506923.04        0.54                 1.85
   32768            229349.34            131467.11        1.74                 0.57
   65536            116941.18             85563.47        1.37                 0.73
  131072            111224.09             85077.25        1.31                 0.76
  262144            111083.54             84562.72        1.31                 0.76
  524288            110716.00             83594.98        1.32                 0.76
 1048576            109442.93             83056.92        1.32                 0.76
 2097152            107109.03             82054.04        1.31                 0.77
 4194304            106147.95             77421.50        1.37                 0.73
 8388608            100948.89             73255.16        1.38                 0.73
16777216             91840.69             71505.27        1.28                 0.78

=> stream_avx_fma does not use streaming (non-temporal) stores; all stores go to the cache first. As long as the working set fits into the cache, stream_avx_fma is much faster than stream_mem_avx_fma

=> stream_mem_avx_fma uses streaming (non-temporal) stores; all stores bypass the cache and go directly to main memory. Once the working set no longer fits into the cache, stream_mem_avx_fma is faster than stream_avx_fma
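
The different store behaviour can also be verified by measuring the actual memory controller traffic, e.g. by wrapping the benchmark with likwid-perfctr and its MEM performance group (a sketch; the group name and core list are assumptions for this machine, and the -m switches enable the Marker API as noted in the run output above):

    likwid-perfctr -C M0:0-9 -g MEM -m \
        likwid-bench -t stream_mem_avx_fma -w M0:16GB:10 -m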
