
Example: Intel Application Performance Snapshot (APS) with the dgemm benchmark

  • Build dgemm benchmark
    module add compiler/intel/18.0
    module add numlib/mkl/2018
    icc -O2 -qopenmp -xHost -ipo -DUSE_MKL \
        mt-dgemm.c -o dgemm \
        -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core
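
    Optionally, verify that the binary picked up the threaded MKL
    libraries and the Intel OpenMP runtime (assuming dynamic linking,
    as in the command above):
    ldd ./dgemm | grep -E 'libmkl|libiomp5'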
    
  • Set up the APS environment on clusters fh1 and fh2
    module add devel/APS
    
  • Set up the APS environment on cluster uc1
    source /opt/bwhpc/common/devel/aps/2019/apsvars.sh
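
    On either cluster you can then check that the tool is on your PATH:
    which aps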
    
  • Run the dgemm benchmark with APS
    export OMP_NUM_THREADS=20
    export MKL_NUM_THREADS=20
    export KMP_AFFINITY=verbose,granularity=core,respect,scatter
    
    aps ./dgemm 8000
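
    For batch use, the above can be collected into one script (a sketch
    built only from the commands shown on this page; pick the APS setup
    line that matches your cluster):

    #!/bin/bash
    module add compiler/intel/18.0
    module add numlib/mkl/2018
    module add devel/APS   # on uc1: source /opt/bwhpc/common/devel/aps/2019/apsvars.sh instead
    export OMP_NUM_THREADS=20
    export MKL_NUM_THREADS=20
    export KMP_AFFINITY=verbose,granularity=core,respect,scatter
    aps ./dgemm 8000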
    
  • Output
    Matrix size input by command line: 8000
    Repeat multiply defaulted to 30
    Alpha =    1.000000
    Beta  =    1.000000
    Allocating Matrices...
    Allocation complete, populating with values...
    Performing multiplication...
    Calculating matrix check...
    
    ===============================================================
    Final Sum is:         8000.033333
     -> Solution check PASSED successfully.
    Memory for Matrices:  1464.843750 MB
    Multiply time:        40.728484 seconds
    FLOPs computed:       30723840000000.000000
    GFLOP/s rate:         754.357566 GF/s
    ===============================================================
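
    (Aside: the reported figures are self-consistent. Each iteration of
    C = alpha*A*B + beta*C costs 2*N^3 FLOPs for the matrix product plus
    a lower-order 2*N^2 term for the beta*C update, so
    (2*8000^3 + 2*8000^2) * 30 = 30,723,840,000,000 FLOPs; divided by
    the multiply time, 30723840000000 / 40.728484 s = 754.36 GFLOP/s,
    matching the lines above.)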
    
    Emon collector successfully stopped.
    | Summary information
    |--------------------------------------------------------------------
      Application                : dgemm
      Report creation date       : 2018-03-27 16:57:06
      OpenMP threads number      : 20
      HW Platform                : Intel(R) Xeon(R) E5/E7 v3 Processor code named Haswell
      Logical core count per node: 40
      Collector type             : Event-based counting driver
      Used statistics            : aps_result_20180327
    |
    | Your application looks good.
    | Nothing suspicious has been detected.
    |
      Elapsed time:               41.07 sec
      CPI Rate:                    0.31
      Serial Time:                 0.79 sec             1.93%
      OpenMP Imbalance:            0.00 sec             0.01%
      Memory Stalls:                              16.90% of pipeline slots
        Cache Stalls:                             93.20% of cycles
    | A significant proportion of cycles are spent on data fetches from cache. Use
    | Intel(R) VTune(TM) Amplifier Memory Access analysis to see if accesses to L2
    | or L3 cache are problematic and consider applying the same performance tuning
    | as you would for a cache-missing workload. This may include reducing the data
    | working set size, improving data access locality, blocking or partitioning the
    | working set to fit in the lower cache levels, or exploiting hardware
    | prefetchers.
        NUMA: % of Remote Accesses:               32.10%
    | A significant amount of DRAM loads was serviced from remote DRAM. Wherever
    | possible, consistently use data on the same core, or at least the same
    | package, as it was allocated on.
        Average DRAM Bandwidth:                   35.56  GB/s
     Memory Footprint:
      Resident:       1494.34 MB
      Virtual:        2937.48 MB
    
    Graphical representation of this data is available in the HTML report: /pfs/data1/home/kit/scc/bq0742/cluster_performance_verification/src/mt-dgemm/aps_report_20180327_165750.html
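
  • Reduce remote NUMA accesses (optional follow-up)
    The summary above flags 32.10% remote DRAM accesses. A common
    mitigation is to bind threads and memory to one socket with numactl
    (a sketch; node 0 is an arbitrary choice, and with 40 logical cores
    this Haswell node likely has 10 physical cores per socket, so the
    thread counts are reduced accordingly):
    export OMP_NUM_THREADS=10
    export MKL_NUM_THREADS=10
    numactl --cpunodebind=0 --membind=0 ./dgemm 8000
    Re-running this under aps should show the remote-access percentage drop.

  • Regenerate the report
    If the HTML file is deleted, APS can recreate it from the result
    directory listed under "Used statistics" (assuming the APS
    environment from above is still loaded):
    aps --report=aps_result_20180327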
    