
Example: likwid-bench on Intel Xeon Broadwell

  • List available micro benchmarks (filtered here for the stream triad variants)
    likwid-bench -a | \
        grep -e stream_avx -e stream_mem_avx
    
    stream_avx - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX
    stream_avx512 - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX-512
    stream_avx512_fma - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX-512 FMAs
    stream_avx_fma - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX FMAs
    stream_mem_avx - Double-precision stream triad A(i) = B(i)*c + C(i), uses AVX and non-temporal stores
    stream_mem_avx_fma - Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX FMAs and non-temporal stores
    
  • List the properties of a test
    likwid-bench -l stream_mem_avx_fma
    
    Name: stream_mem_avx_fma
    Description: Double-precision stream triad A(i) = B(i)*c + C(i), optimized for AVX FMAs and non-temporal stores
    Number of streams: 3
    Loop stride: 16
    Data Type: Double precision float
    Flops per element: 2
    Bytes per element: 24
    Load bytes per element: 16
    Store bytes per element: 8
    Load Ops: 2
    Store Ops: 1
    Constant instructions: 17
    Loop instructions: 15
    Loop micro Ops (μOPs): 22
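
  • Worked example: what the test properties imply (a minimal sketch in plain
    bash arithmetic, not part of likwid; the vector length 333333280 anticipates
    the 8GB-per-domain run below). Each element of the triad A(i) = B(i)*c + C(i)
    loads B(i) and C(i) (2 x 8 Byte) and stores A(i) (8 Byte), i.e. 24 Byte and
    one FMA (2 flops) per element.
    N=333333280            # elements per workgroup (8 GB / 24 Byte, sanitized)
    BYTES_PER_ELEMENT=24   # 16 Byte loaded + 8 Byte stored
    FLOPS_PER_ELEMENT=2    # one fused multiply-add per element
    echo "Bytes per workgroup and iteration: $(( N * BYTES_PER_ELEMENT ))"   # 7999998720
    echo "Flops per workgroup and iteration: $(( N * FLOPS_PER_ELEMENT ))"   # 666666560
    echo "Inner loop executions per thread:  $(( N / 7 / 16 ))"              # 2976190 (7 threads, loop stride 16)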
    
  • List available thread domains
    likwid-bench -p
    
    Number of Domains 11
    Domain 0:
            Tag N: 0 28 1 29 2 30 3 31 4 32 5 33 6 34 7 35 8 36 9 37 10 38 11 39 12 40 13 41 14 42 15 43 16 44 17 45 18 46 19 47 20 48 21 49 22 50 23 51 24 52 25 53 26 54 27 55
    Domain 1:
            Tag S0: 0 28 1 29 2 30 3 31 4 32 5 33 6 34 7 35 8 36 9 37 10 38 11 39 12 40 13 41
    Domain 2:
            Tag S1: 14 42 15 43 16 44 17 45 18 46 19 47 20 48 21 49 22 50 23 51 24 52 25 53 26 54 27 55
    Domain 3:
            Tag C0: 0 28 1 29 2 30 3 31 4 32 5 33 6 34
    Domain 4:
            Tag C1: 7 35 8 36 9 37 10 38 11 39 12 40 13 41
    Domain 5:
            Tag C2: 14 42 15 43 16 44 17 45 18 46 19 47 20 48
    Domain 6:
            Tag C3: 21 49 22 50 23 51 24 52 25 53 26 54 27 55
    Domain 7:
            Tag M0: 0 28 1 29 2 30 3 31 4 32 5 33 6 34
    Domain 8:
            Tag M1: 7 35 8 36 9 37 10 38 11 39 12 40 13 41
    Domain 9:
            Tag M2: 14 42 15 43 16 44 17 45 18 46 19 47 20 48
    Domain 10:
            Tag M3: 21 49 22 50 23 51 24 52 25 53 26 54 27 55
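
  • The tags encode the domain type: N is the whole node, S<n> a socket, C<n> a
    last-level cache group and M<n> a NUMA memory domain. Within each list the
    hardware threads alternate between a physical core and its hyperthread
    sibling (0 28 1 29 ...). A minimal sketch (plain bash, not part of likwid)
    illustrating how the workgroup suffix 7:1:2 (threads:chunk:stride) used
    below selects only the physical cores of domain M0; the run output further
    down confirms this selection:
    M0_LIST=(0 28 1 29 2 30 3 31 4 32 5 33 6 34)   # hardware threads of domain M0
    NTHREADS=7 CHUNK=1 STRIDE=2
    SELECTED=()
    # take CHUNK entries, then jump STRIDE positions ahead, until NTHREADS are selected
    for (( i=0; ${#SELECTED[@]} < NTHREADS; i+=STRIDE )); do
        SELECTED+=( "${M0_LIST[@]:i:CHUNK}" )
    done
    echo "M0:<size>:7:1:2 runs on hardware threads: ${SELECTED[*]}"   # -> 0 1 2 3 4 5 6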
    
  • Run the micro benchmark stream_mem_avx_fma on memory domains 0 to 3 with 7 threads each, skipping hyperthreads
    likwid-bench \
        -t stream_mem_avx_fma \
        -i 40 \
        -w M0:8GB:7:1:2 \
        -w M1:8GB:7:1:2 \
        -w M2:8GB:7:1:2 \
        -w M3:8GB:7:1:2
    
    Warning: Sanitizing vector length to a multiple of the loop stride 16 and thread count 7 from 333333333 elements (-589934600 bytes) to 333333280 elements (-589935872 bytes)
    Allocate: Process running on core 0 (Domain M0) - Vector length 333333280/2666666240 Offset 0 Alignment 512
    Allocate: Process running on core 0 (Domain M0) - Vector length 333333280/2666666240 Offset 0 Alignment 512
    Allocate: Process running on core 0 (Domain M0) - Vector length 333333280/2666666240 Offset 0 Alignment 512
    Warning: Sanitizing vector length to a multiple of the loop stride 16 and thread count 7 from 333333333 elements (-589934600 bytes) to 333333280 elements (-589935872 bytes)
    Allocate: Process running on core 7 (Domain M1) - Vector length 333333280/2666666240 Offset 0 Alignment 512
    Allocate: Process running on core 7 (Domain M1) - Vector length 333333280/2666666240 Offset 0 Alignment 512
    Allocate: Process running on core 7 (Domain M1) - Vector length 333333280/2666666240 Offset 0 Alignment 512
    Warning: Sanitizing vector length to a multiple of the loop stride 16 and thread count 7 from 333333333 elements (-589934600 bytes) to 333333280 elements (-589935872 bytes)
    Allocate: Process running on core 14 (Domain M2) - Vector length 333333280/2666666240 Offset 0 Alignment 512
    Allocate: Process running on core 14 (Domain M2) - Vector length 333333280/2666666240 Offset 0 Alignment 512
    Allocate: Process running on core 14 (Domain M2) - Vector length 333333280/2666666240 Offset 0 Alignment 512
    Warning: Sanitizing vector length to a multiple of the loop stride 16 and thread count 7 from 333333333 elements (-589934600 bytes) to 333333280 elements (-589935872 bytes)
    Allocate: Process running on core 21 (Domain M3) - Vector length 333333280/2666666240 Offset 0 Alignment 512
    Allocate: Process running on core 21 (Domain M3) - Vector length 333333280/2666666240 Offset 0 Alignment 512
    Allocate: Process running on core 21 (Domain M3) - Vector length 333333280/2666666240 Offset 0 Alignment 512
    
    --------------------------------------------------------------------------------
    LIKWID MICRO BENCHMARK
    Test: stream_mem_avx_fma
    --------------------------------------------------------------------------------
    Using 4 work groups
    Using 28 threads
    --------------------------------------------------------------------------------
    Running without Marker API. Activate Marker API with -m on commandline.
    --------------------------------------------------------------------------------
    
    Group: 2 Thread 1 Global Thread 15 running on core 15 - Vector length 47619040 Offset 47619040
    Group: 1 Thread 6 Global Thread 13 running on core 13 - Vector length 47619040 Offset 285714240
    Group: 0 Thread 0 Global Thread 0 running on core 0 - Vector length 47619040 Offset 0
    Group: 2 Thread 4 Global Thread 18 running on core 18 - Vector length 47619040 Offset 190476160
    Group: 3 Thread 3 Global Thread 24 running on core 24 - Vector length 47619040 Offset 142857120
    Group: 2 Thread 2 Global Thread 16 running on core 16 - Vector length 47619040 Offset 95238080
    Group: 3 Thread 4 Global Thread 25 running on core 25 - Vector length 47619040 Offset 190476160
    Group: 2 Thread 0 Global Thread 14 running on core 14 - Vector length 47619040 Offset 0
    Group: 1 Thread 3 Global Thread 10 running on core 10 - Vector length 47619040 Offset 142857120
    Group: 3 Thread 2 Global Thread 23 running on core 23 - Vector length 47619040 Offset 95238080
    Group: 0 Thread 5 Global Thread 5 running on core 5 - Vector length 47619040 Offset 238095200
    Group: 3 Thread 5 Global Thread 26 running on core 26 - Vector length 47619040 Offset 238095200
    Group: 0 Thread 6 Global Thread 6 running on core 6 - Vector length 47619040 Offset 285714240
    Group: 0 Thread 1 Global Thread 1 running on core 1 - Vector length 47619040 Offset 47619040
    Group: 0 Thread 2 Global Thread 2 running on core 2 - Vector length 47619040 Offset 95238080
    Group: 0 Thread 3 Global Thread 3 running on core 3 - Vector length 47619040 Offset 142857120
    Group: 1 Thread 5 Global Thread 12 running on core 12 - Vector length 47619040 Offset 238095200
    Group: 1 Thread 0 Global Thread 7 running on core 7 - Vector length 47619040 Offset 0
    Group: 1 Thread 1 Global Thread 8 running on core 8 - Vector length 47619040 Offset 47619040
    Group: 1 Thread 2 Global Thread 9 running on core 9 - Vector length 47619040 Offset 95238080
    Group: 2 Thread 3 Global Thread 17 running on core 17 - Vector length 47619040 Offset 142857120
    Group: 3 Thread 1 Global Thread 22 running on core 22 - Vector length 47619040 Offset 47619040
    Group: 2 Thread 5 Global Thread 19 running on core 19 - Vector length 47619040 Offset 238095200
    Group: 3 Thread 6 Global Thread 27 running on core 27 - Vector length 47619040 Offset 285714240
    Group: 0 Thread 4 Global Thread 4 running on core 4 - Vector length 47619040 Offset 190476160
    Group: 3 Thread 0 Global Thread 21 running on core 21 - Vector length 47619040 Offset 0
    Group: 1 Thread 4 Global Thread 11 running on core 11 - Vector length 47619040 Offset 190476160
    Group: 2 Thread 6 Global Thread 20 running on core 20 - Vector length 47619040 Offset 285714240
    
    --------------------------------------------------------------------------------
    
    Cycles:                 22030045389
    CPU Clock:              1995371847
    Cycle Clock:            1995371847
    Time:                   1.104057e+01 sec
    Iterations:             1120
    Iterations per thread:  40
    Inner loop executions:  2976190
    Size (Byte):            31999994880
    Size per thread:        1142856960
    Number of Flops:        106666649600
    MFlops/s:               9661.33
    Data volume (Byte):     1279999795200
    MByte/s:                115936.01
    Cycles per update:      0.413063
    Cycles per cacheline:   3.304507
    Loads per update:       2
    Stores per update:      1
    Load bytes per element: 16
    Store bytes per elem.:  8
    Load/store ratio:       2.00
    Instructions:           49999992017
    UOPs:                   73333321600
    
    --------------------------------------------------------------------------------
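
  • Cross-check: the reported rates follow from the other summary numbers (a
    minimal sketch in plain awk, not part of likwid)
    awk 'BEGIN {
        time   = 1.104057e+01    # Time in seconds
        size   = 31999994880     # Size (Byte): 4 workgroups x 3 streams x 333333280 elements x 8 Byte
        iters  = 40              # Iterations per thread
        flops  = 106666649600    # Number of Flops
        volume = size * iters    # Data volume (Byte)
        printf "Data volume (Byte): %.0f\n", volume               # 1279999795200
        printf "MByte/s:            %.2f\n", volume / time / 1e6  # ~115936
        printf "MFlops/s:           %.2f\n", flops / time / 1e6   # ~9661
    }'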
    
  • Loop over the number of cores used, distributing them round-robin over the four memory domains (selected generated commands and the measured bandwidths follow; a variant that labels each result with its core count is sketched after the conclusions below)
    NUM_DOMAINS=4
    for ((NUM_CORES=1; NUM_CORES <= 28; NUM_CORES++))
    do
        # Distribute cores in round robin mode
        declare -i -a CORES_PER_DOMAIN=()
        for (( COUNT=0; COUNT < NUM_CORES; COUNT++))
        do
            let CORES_PER_DOMAIN[$((COUNT % NUM_DOMAINS))]++
        done
        COMMAND=(
            likwid-bench
                -t stream_mem_avx_fma
                -i 100
        )
        for ((DOMAIN=0; DOMAIN<NUM_DOMAINS; DOMAIN++))
        do
            if [[ ${CORES_PER_DOMAIN[${DOMAIN}]} -gt 0 ]]; then
                COMMAND+=( -w M${DOMAIN}:8GB:${CORES_PER_DOMAIN[${DOMAIN}]}:1:2 )
            fi
        done
        "${COMMAND[@]}"
    done 2>/dev/null |
       grep 'MByte/s:'
    
    likwid-bench -t stream_mem_avx_fma -i 100 -w M0:8GB:1:1:2
    likwid-bench -t stream_mem_avx_fma -i 100 -w M0:8GB:1:1:2 -w M1:8GB:1:1:2
    likwid-bench -t stream_mem_avx_fma -i 100 -w M0:8GB:1:1:2 -w M1:8GB:1:1:2 -w M2:8GB:1:1:2
    likwid-bench -t stream_mem_avx_fma -i 100 -w M0:8GB:1:1:2 -w M1:8GB:1:1:2 -w M2:8GB:1:1:2 -w M3:8GB:1:1:2
    likwid-bench -t stream_mem_avx_fma -i 100 -w M0:8GB:2:1:2 -w M1:8GB:2:1:2 -w M2:8GB:2:1:2 -w M3:8GB:2:1:2
    likwid-bench -t stream_mem_avx_fma -i 100 -w M0:8GB:4:1:2 -w M1:8GB:4:1:2 -w M2:8GB:4:1:2 -w M3:8GB:4:1:2
    likwid-bench -t stream_mem_avx_fma -i 100 -w M0:8GB:7:1:2 -w M1:8GB:7:1:2 -w M2:8GB:7:1:2 -w M3:8GB:7:1:2
    
#Cores   stream_mem_avx_fma (MByte/s)   % of Max
     1                       16063.80       13 %
     2                       31227.47       25 %
     3                       47110.67       37 %
     4                       64245.38       51 %
     8                      102775.05       81 %
    16                      122272.93       91 %
    28                      126187.02      100 %

=> a single core reaches only about 1/8 of the full memory bandwidth
=> 4 cores are needed for about 1/2 of the full memory bandwidth
=> all cores are needed to reach the full memory bandwidth
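
  • A variant of the loop above (a sketch, not the original script) that labels
    each measurement with its core count, so the table can be reproduced
    directly from the output:
    NUM_DOMAINS=4
    for NUM_CORES in 1 2 3 4 8 16 28
    do
        # Distribute cores in round robin mode
        declare -i -a CORES_PER_DOMAIN=()
        for (( COUNT=0; COUNT < NUM_CORES; COUNT++ )); do
            let CORES_PER_DOMAIN[$((COUNT % NUM_DOMAINS))]++
        done
        COMMAND=( likwid-bench -t stream_mem_avx_fma -i 100 )
        for (( DOMAIN=0; DOMAIN < NUM_DOMAINS; DOMAIN++ )); do
            if [[ ${CORES_PER_DOMAIN[${DOMAIN}]:-0} -gt 0 ]]; then
                COMMAND+=( -w M${DOMAIN}:8GB:${CORES_PER_DOMAIN[${DOMAIN}]}:1:2 )
            fi
        done
        BANDWIDTH=$( "${COMMAND[@]}" 2>/dev/null | awk '/MByte\/s:/ {print $2}' )
        echo "${NUM_CORES} cores: ${BANDWIDTH} MByte/s"
    done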

  • Loop over the memory size used, to see the influence of the caches (L1: 32 kB, L2: 256 kB, L3: 25 MB)
    MEM_SIZE=2
    while (( MEM_SIZE <= 16*1024*1024 ))
    do
        likwid-bench \
            -t stream_mem_avx_fma \
            -i 100 \
            -w M0:${MEM_SIZE}KB:7:1:2 \
            -w M1:${MEM_SIZE}KB:7:1:2 \
            -w M2:${MEM_SIZE}KB:7:1:2 \
            -w M3:${MEM_SIZE}KB:7:1:2
        let MEM_SIZE*=2
    done 2>/dev/null |
       grep -e 'Size (Byte):' -e 'MByte/s:'
    
The first column is the total working-set size reported by likwid-bench (the "Size (Byte)" line, divided by 1024); the stream_avx_fma column is from an analogous run of the loop with -t stream_avx_fma.

Total size (kB)   stream_mem_avx_fma   stream_avx_fma   stream_mem_avx_fma   stream_avx_fma
                           (MByte/s)        (MByte/s)     / stream_avx_fma   / stream_mem_avx_fma
           10.5                18750            26672                 0.70                  1.42
           21.0                29354           121348                 0.24                  4.13
           52.5                49569           111199                 0.45                  2.24
          115.5                85302           310242                 0.27                  3.64
          241.5                84466           303598                 0.28                  3.59
          493.5                88457          1168594                 0.08                 13.21
          997.5                89229           897323                 0.10                 10.06
         1995.0               115272           866554                 0.13                  7.52
         3990.0               149004           849090                 0.18                  5.70
         7990.5               152373           385012                 0.40                  2.53
        15991.5               177235           231882                 0.76                  1.31
        31993.5               184341           221023                 0.83                  1.20
        63997.5               197247           361096                 0.55                  1.83
       127995.0               167116           105212                 1.59                  0.63
       255990.0               129356            99404                 1.30                  0.77
       511990.5               128616            99228                 1.30                  0.77
      1023991.5               128589            99264                 1.30                  0.77
      2047993.5               128013            99239                 1.29                  0.78
      4095997.5               124954            97184                 1.29                  0.78
      8191995.0               123248            95195                 1.29                  0.77
     16383990.0               124524            94926                 1.31                  0.76
     32767990.5               124579            94747                 1.31                  0.76
     65535991.5               124994            94375                 1.32                  0.76

=> stream_avx_fma does not use streaming stores; all stores go to the cache first. As long as the working set stays in the cache, stream_avx_fma is much faster than stream_mem_avx_fma.

=> stream_mem_avx_fma uses streaming (non-temporal) stores; all stores go directly to main memory. Once the working set no longer fits into the cache, stream_mem_avx_fma is faster than stream_avx_fma.
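
  • A quick way to see the crossover directly (a minimal sketch; the per-domain
    sizes 1MB and 2GB are example values chosen to be clearly cache-resident
    and clearly memory-resident on this machine):
    for TEST in stream_avx_fma stream_mem_avx_fma
    do
        for SIZE in 1MB 2GB
        do
            echo -n "${TEST} ${SIZE} per domain: "
            likwid-bench -t ${TEST} -i 100 \
                -w M0:${SIZE}:7:1:2 -w M1:${SIZE}:7:1:2 \
                -w M2:${SIZE}:7:1:2 -w M3:${SIZE}:7:1:2 \
                2>/dev/null | awk '/MByte\/s:/ {print $2 " MByte/s"}'
        done
    done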
