Example perf: dgemm

  • Build dgemm benchmark
    module add compiler/intel/18.0
    module add numlib/mkl/2018
    icc -O2 -qopenmp -xHost -ipo \
        -DUSE_MKL -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core \
        mt-dgemm.c -o dgemm
    
  • Set up OpenMP and MKL environment
    export OMP_NUM_THREADS=20
    export KMP_AFFINITY=granularity=core,respect,scatter
    export MKL_NUM_THREADS=20
    export MKL_DYNAMIC=false
    
  • List available performance counters
    perf list
    
    ...
      avx_insts.all                                     
           [Approximate counts of AVX & AVX2 256-bit instructions, including non-arithmetic instructions, loads, and stores. May count non-AVX instructions that employ
            256-bit operations, including (but not necessarily limited to) rep string instructions that use 256-bit loads and stores for optimized performance, XSAVE* and
            XRSTOR*, and operations that transition the x87 FPU data registers between x87 and MMX]
    ...
    
  • Get performance statistics for benchmark dgemm
    perf stat \
        --event=avx_insts.all \
        --event=inst_retired.any \
        --event=cpu-clock \
        --event=cpu-cycles \
        --event=cpu-migrations \
        dgemm 8000
    
    Matrix size input by command line: 8000
    Repeat multiply defaulted to 30
    Alpha =    1.000000
    Beta  =    1.000000
    Allocating Matrices...
    Allocation complete, populating with values...
    Performing multiplication...
    Calculating matrix check...
    
    ===============================================================
    Final Sum is:         8000.033333
     -> Solution check PASSED successfully.
    Memory for Matrices:  1464.843750 MB
    Multiply time:        41.262948 seconds
    FLOPs computed:       30723840000000.000000
    GFLOP/s rate:         744.586644 GF/s
    ===============================================================
    
     Performance counter stats for 'dgemm 8000':
    
     6,171,915,951,880      avx_insts.all             # 7547.824 M/sec                  
     7,255,569,552,901      inst_retired.any          # 8873.057 M/sec                  
         817708.001572      cpu-clock (msec)          #   19.673 CPUs utilized          
     2,312,060,934,724      cpu-cycles                #    2.827 GHz                    
                    90      cpu-migrations            #    0.000 K/sec                  
    
          41.565604465 seconds time elapsed
    
    • avx_insts.all / cpu-cycles ~ 2.67 (AVX instructions per cycle)
    • inst_retired.any / cpu-cycles ~ 3.14 (retired instructions per cycle)
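The two ratios can be reproduced from the raw counter values. The following is a minimal sketch, assuming a POSIX shell with awk; the variable names are illustrative, and the AVX share of all retired instructions is an extra derived figure that is not part of the original measurements.

```shell
# Counter values copied from the perf stat output above.
avx_insts=6171915951880
inst_retired=7255569552901
cpu_cycles=2312060934724

# awk performs the floating-point division; LC_ALL=C forces "." as decimal point.
avx_per_cycle=$(LC_ALL=C awk -v a="$avx_insts" -v c="$cpu_cycles" 'BEGIN { printf "%.2f", a / c }')
ipc=$(LC_ALL=C awk -v i="$inst_retired" -v c="$cpu_cycles" 'BEGIN { printf "%.2f", i / c }')
# Extra figure (not in the original): percentage of retired instructions counted as AVX.
avx_share=$(LC_ALL=C awk -v a="$avx_insts" -v i="$inst_retired" 'BEGIN { printf "%.1f", 100 * a / i }')

echo "avx_insts.all / cpu-cycles        = $avx_per_cycle"
echo "inst_retired.any / cpu-cycles     = $ipc"
echo "AVX share of retired instructions = $avx_share %"
```

Keep in mind that avx_insts.all is documented as an approximate count, so the derived values are estimates.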
  • Record performance data of benchmark dgemm for use with perf report and perf annotate
    perf record dgemm 8000
    
    ===============================================================
    Final Sum is:         8000.033333
     -> Solution check PASSED successfully.
    Memory for Matrices:  1464.843750 MB
    Multiply time:        65.616777 seconds
    FLOPs computed:       30723840000000.000000
    GFLOP/s rate:         468.231471 GF/s
    ===============================================================
    
    [ perf record: Woken up 3 times to write data ]
    [ perf record: Captured and wrote 200.360 MB perf.data (5232441 samples) ]
    
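  • Reduce recording overhead by lowering the sampling frequency (with default settings, perf record increased the multiply time from about 41 s to about 66 s)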
    perf record --freq=100 dgemm 8000
    
    ===============================================================
    Final Sum is:         8000.033333
     -> Solution check PASSED successfully.
    Memory for Matrices:  1464.843750 MB
    Multiply time:        40.269861 seconds
    FLOPs computed:       30723840000000.000000
    GFLOP/s rate:         762.948748 GF/s
    ===============================================================
    
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 3.124 MB perf.data (81372 samples) ]
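The effect of --freq can be estimated from the numbers printed by the two perf record runs. The following is a rough sketch, assuming that all 20 threads were busy for the whole multiply time and that the multiply time approximates the total run time; est_rate is a hypothetical helper name, not a perf feature.

```shell
threads=20   # OMP_NUM_THREADS from the environment setup

# est_rate SAMPLES SECONDS: approximate samples per second per thread,
# under the assumption that every thread ran for the full multiply time.
est_rate() {
    LC_ALL=C awk -v s="$1" -v t="$2" -v n="$threads" 'BEGIN { printf "%.0f", s / (t * n) }'
}

rate_default=$(est_rate 5232441 65.616777)   # perf record with default settings
rate_freq100=$(est_rate 81372 40.269861)     # perf record --freq=100

# Slowdown of the default recording run relative to the perf stat run.
slowdown=$(LC_ALL=C awk 'BEGIN { printf "%.2f", 65.616777 / 41.262948 }')

echo "default:    ~$rate_default samples/s per thread"
echo "--freq=100: ~$rate_freq100 samples/s per thread"
echo "slowdown with default settings: ${slowdown}x"
```

The second estimate lands close to the requested 100 Hz, which suggests the simplifying assumptions are reasonable for this benchmark.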
    
  • Create performance report
    perf report
    
      94.08%  dgemm    libmkl_avx2.so          [.] mkl_blas_avx2_dgemm_kernel_0
       0.95%  dgemm    [kernel.kallsyms]       [k] native_queued_spin_lock_slowpath
       0.94%  dgemm    libmkl_avx2.so          [.] mkl_blas_avx2_dgemm_dcopy_down12_ea
       0.81%  dgemm    libmkl_avx2.so          [.] mkl_blas_avx2_dgemm_dcopy_right4_ea
    ...
    
    • Interactive navigation in performance report:
      • h: get help
      • a: jump to annotated assembler code
  • -> Most of the time (about 94% of the samples) is spent in a single MKL function, mkl_blas_avx2_dgemm_kernel_0
  • -> For MKL functions, perf annotate can only show assembler code
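To see how much of the sampled time falls into MKL as a whole, the report rows can be summed per shared object. The following is a minimal sketch that works only on the four rows shown in the excerpt above (the elided "..." rows are not included), assuming a POSIX shell with awk and sort.

```shell
# Rows copied from the perf report excerpt; elided rows are left out.
report='  94.08%  dgemm    libmkl_avx2.so          [.] mkl_blas_avx2_dgemm_kernel_0
   0.95%  dgemm    [kernel.kallsyms]       [k] native_queued_spin_lock_slowpath
   0.94%  dgemm    libmkl_avx2.so          [.] mkl_blas_avx2_dgemm_dcopy_down12_ea
   0.81%  dgemm    libmkl_avx2.so          [.] mkl_blas_avx2_dgemm_dcopy_right4_ea'

# Column 1 is the overhead percentage, column 3 the shared object.
# awk converts "94.08%" to the number 94.08 by ignoring the trailing "%".
per_dso=$(printf '%s\n' "$report" |
    LC_ALL=C awk '{ sum[$3] += $1 } END { for (d in sum) printf "%s %.2f\n", d, sum[d] }' |
    LC_ALL=C sort -k2,2nr)

echo "$per_dso"
```

For these four rows, libmkl_avx2.so accounts for 95.83% of the samples. On a live perf.data file, perf report can do a similar grouping itself through its --sort option (for example --sort=dso).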
Last modified on Apr 4, 2018, 11:06:44 AM