Example: likwid-perfctr performance group FLOPS_AVX on the dgemm benchmark

  • Build dgemm benchmark
    module add \
        compiler/intel/18.0 \
        numlib/mkl/2018
    icc \
        -O2 -qopenmp -xHost -ipo -DUSE_MKL \
        mt-dgemm.c -o dgemm \
        -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core
    
  • Set up OpenMP and MKL environment
    export MKL_NUM_THREADS=20
    export OMP_NUM_THREADS=20
    export KMP_AFFINITY=verbose,granularity=core,respect,scatter
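
When the benchmark is later pinned with `-C 0-19`, it is usually sensible for the thread count to equal the number of CPUs in that list. The following is a minimal sketch of such a consistency check; `count_cpus` is a hypothetical helper (not part of LIKWID) and only understands plain numbers and `N-M` ranges.

```shell
#!/bin/sh
# count_cpus LIST: number of CPUs in a list such as "0-19" or "0-3,8,10-11".
# Hypothetical helper; handles only plain numbers and N-M ranges.
count_cpus() {
    echo "$1" | tr ',' '\n' | awk -F- '
        NF == 1 { total += 1 }
        NF == 2 { total += $2 - $1 + 1 }
        END     { print total + 0 }'
}

export OMP_NUM_THREADS=20
if [ "$(count_cpus 0-19)" -eq "$OMP_NUM_THREADS" ]; then
    echo "thread count matches CPU list"
else
    echo "thread count does not match CPU list" >&2
fi
```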
    
  • List available performance groups
    likwid-perfctr -a
    
    ...
      FLOPS_AVX     Packed AVX MFLOP/s
       FLOPS_DP     Double Precision MFLOP/s
       FLOPS_SP     Single Precision MFLOP/s
    ...
    
  • Get detailed information on performance group FLOPS_AVX
    likwid-perfctr -H --group FLOPS_AVX
    
    Group FLOPS_AVX:
    Formula:
    Packed SP MFLOP/s = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE*8)/runtime
    Packed DP MFLOP/s = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4)/runtime
    -
    FLOP rates of 256 bit packed floating-point instructions
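
To make the formula concrete, the sketch below evaluates the Packed DP expression by hand with awk. The function name `packed_dp_mflops` and the counter value in the usage line are made up for illustration; they are not LIKWID output.

```shell
#!/bin/sh
# Evaluate the FLOPS_AVX "Packed DP MFLOP/s" formula by hand.
# packed_dp_mflops is a hypothetical helper, not part of LIKWID.
#   $1: value of FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE (instruction count)
#   $2: runtime in seconds
packed_dp_mflops() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 1.0e-06 * (n * 4) / t }'
}

# Made-up example: 1e9 packed DP instructions retired in 2 s
packed_dp_mflops 1000000000 2    # prints 2000.0
```

Note that the factor 4 is the number of doubles in a 256-bit register, so the formula counts one FLOP per element and instruction.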
    
  • Measure performance group FLOPS_AVX on CPUs 0 to 19
    likwid-perfctr \
        --group FLOPS_AVX \
        -C 0-19 \
        ./dgemm 6000
    
    --------------------------------------------------------------------------------
    CPU name:       Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
    CPU type:       Intel Xeon Haswell EN/EP/EX processor
    CPU clock:      2.60 GHz
    --------------------------------------------------------------------------------
    Matrix size input by command line: 6000
    Repeat multiply defaulted to 30
    Alpha =    1.000000
    Beta  =    1.000000
    Allocating Matrices...
    Allocation complete, populating with values...
    Performing multiplication...
    Calculating matrix check...
    
    ===============================================================
    Final Sum is:         6000.033333
     -> Solution check PASSED successfully.
    Memory for Matrices:  823.974609 MB
    Multiply time:        17.463024 seconds
    FLOPs computed:       12962160000000.000000
    GFLOP/s rate:         742.263190 GF/s
    ===============================================================
    
    ...
    +---------------------------+-------------+------------+------------+------------+
    |           Metric          |     Sum     |     Min    |     Max    |     Avg    |
    +---------------------------+-------------+------------+------------+------------+
    |  Runtime (RDTSC) [s] STAT |    365.4460 |    18.2723 |    18.2723 |    18.2723 |
    | Runtime unhalted [s] STAT |    365.5165 |    18.1943 |    19.3181 |    18.2758 |
    |      Clock [MHz] STAT     |  58021.3801 |  2900.0175 |  2920.9291 |  2901.0690 |
    |          CPI STAT         |      6.2042 |     0.3095 |     0.3139 |     0.3102 |
    |   Packed SP MFLOP/s STAT  | 717311.3153 | 35865.5655 | 35865.5708 | 35865.5658 |
    |   Packed DP MFLOP/s STAT  | 358655.6568 | 17932.7827 | 17932.7854 | 17932.7828 |
    +---------------------------+-------------+------------+------------+------------+
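
The benchmark's own figures can be cross-checked by hand. The sketch below assumes that mt-dgemm counts 2*N^3 + 2*N^2 floating-point operations per multiply. This is an inference that reproduces the printed "FLOPs computed" value for N = 6000 and 30 repeats, not something taken from the benchmark source. The function names are hypothetical.

```shell
#!/bin/sh
# dgemm_flops N REPEATS: assumed FLOP count, REPEATS * (2*N^3 + 2*N^2)
dgemm_flops() {
    awk -v n="$1" -v r="$2" 'BEGIN { printf "%.0f\n", r * (2*n*n*n + 2*n*n) }'
}

# gflops FLOPS SECONDS: rate in GFLOP/s, rounded to two decimals
gflops() {
    awk -v f="$1" -v t="$2" 'BEGIN { printf "%.2f\n", f / t / 1.0e9 }'
}

dgemm_flops 6000 30                          # prints 12962160000000
gflops "$(dgemm_flops 6000 30)" 17.463024    # prints 742.26
```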
    
  • Validity check
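    The FLOPS_AVX formula counts 4 FLOPs per 256-bit packed DP instruction,
    but a fused multiply-add (FMA) performs 8 (one multiply and one add per
    element), so the reported sum has to be doubled:

    Packed DP MFLOP/s STAT: 358655 Fused-Multiply-Add DP MFLOP/s
                          : 358655 * 2               DP MFLOP/s
                          : 717310                   DP MFLOP/s
                          : 717.310                  DP GFLOP/s
    
    GFLOP/s rate          : 742.263190               DP GFLOP/s (reported by dgemm)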
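
The arithmetic of this check can be scripted as follows. This is a minimal sketch: the function names are hypothetical, and the factor 2 assumes that essentially all counted 256-bit packed DP instructions are fused multiply-adds.

```shell
#!/bin/sh
# fma_corrected_gflops MFLOPS: double the LIKWID value (an FMA is 2 FLOPs per
# element, the group formula counts 1) and convert MFLOP/s to GFLOP/s.
fma_corrected_gflops() {
    awk -v m="$1" 'BEGIN { printf "%.3f\n", m * 2 / 1000 }'
}

# rel_diff_percent A B: how far A falls below B, in percent of B
rel_diff_percent() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) / b * 100 }'
}

corrected=$(fma_corrected_gflops 358655.6568)   # Sum column of the table
echo "$corrected"                               # prints 717.311
rel_diff_percent "$corrected" 742.263190        # prints 3.4
```

Part of the remaining gap may come from the different time bases: the LIKWID metric divides by the runtime of the whole measured run (18.27 s), while the benchmark divides by the multiply time only (17.46 s).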
    

=> To compute the correct FLOP/s rate, you have to know which AVX instructions were actually executed.

Last modified on Apr 5, 2018, 5:16:19 PM