Example perf: dgemm
- Build dgemm benchmark
module add compiler/intel/18.0
module add numlib/mkl/2018
icc -O2 -qopenmp -xHost -ipo \
    -DUSE_MKL -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core \
    mt-dgemm.c -o dgemm
- Set up OpenMP and MKL environment
export OMP_NUM_THREADS=20
export KMP_AFFINITY=granularity=core,respect,scatter
export MKL_NUM_THREADS=20
export MKL_DYNAMIC=false
- List available performance counters
perf list
... avx_insts.all [Approximate counts of AVX & AVX2 256-bit instructions, including non-arithmetic instructions, loads, and stores. May count non-AVX instructions that employ 256-bit operations, including (but not necessarily limited to) rep string instructions that use 256-bit loads and stores for optimized performance, XSAVE* and XRSTOR*, and operations that transition the x87 FPU data registers between x87 and MMX] ...
- Get performance statistics for benchmark dgemm
perf stat \
    --event=avx_insts.all \
    --event=inst_retired.any \
    --event=cpu-clock \
    --event=cpu-cycles \
    --event=cpu-migrations \
    dgemm 8000
Matrix size input by command line: 8000
Repeat multiply defaulted to 30
Alpha = 1.000000
Beta = 1.000000
Allocating Matrices...
Allocation complete, populating with values...
Performing multiplication...
Calculating matrix check...
===============================================================
Final Sum is: 8000.033333
 -> Solution check PASSED successfully.
Memory for Matrices: 1464.843750 MB
Multiply time: 41.262948 seconds
FLOPs computed: 30723840000000.000000
GFLOP/s rate: 744.586644 GF/s
===============================================================

Performance counter stats for 'dgemm 8000':

 6,171,915,951,880      avx_insts.all       #  7547.824 M/sec
 7,255,569,552,901      inst_retired.any    #  8873.057 M/sec
     817708.001572      cpu-clock (msec)    #    19.673 CPUs utilized
 2,312,060,934,724      cpu-cycles          #     2.827 GHz
                90      cpu-migrations      #     0.000 K/sec

      41.565604465 seconds time elapsed
- avx_insts.all / cpu-cycles ~ 2.67
- inst_retired.any / cpu-cycles ~ 3.14
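The two ratios above can be reproduced from the counter values in the perf stat output. The following is a minimal sketch using awk for the floating-point division; the variable names are chosen here for illustration only. The counters are totals for the whole process, so the ratios are presumably averages over all threads.

```shell
# Counter values copied from the perf stat output above.
avx_insts=6171915951880
inst_retired=7255569552901
cpu_cycles=2312060934724

# awk performs the floating-point division (POSIX shell arithmetic is integer-only).
avx_per_cycle=$(awk -v a="$avx_insts" -v c="$cpu_cycles" 'BEGIN { printf "%.2f", a / c }')
ipc=$(awk -v i="$inst_retired" -v c="$cpu_cycles" 'BEGIN { printf "%.2f", i / c }')

# Share of retired instructions that were counted as AVX instructions, in percent.
avx_share=$(awk -v a="$avx_insts" -v i="$inst_retired" 'BEGIN { printf "%.1f", 100 * a / i }')

echo "avx_insts.all / cpu-cycles        = $avx_per_cycle"
echo "inst_retired.any / cpu-cycles     = $ipc"
echo "avx_insts.all / inst_retired.any  = $avx_share %"
```

The third value is not listed above; it is derived from the same counters and shows that most of the retired instructions were counted as AVX instructions.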
- Record performance data of benchmark dgemm for use with perf report and perf annotate
perf record dgemm 8000
===============================================================
Final Sum is: 8000.033333
 -> Solution check PASSED successfully.
Memory for Matrices: 1464.843750 MB
Multiply time: 65.616777 seconds
FLOPs computed: 30723840000000.000000
GFLOP/s rate: 468.231471 GF/s
===============================================================
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 200.360 MB perf.data (5232441 samples) ]
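Recording with default settings slows the benchmark down noticeably. As a rough sketch, the slowdown relative to the perf stat run can be computed from the two reported multiply times, and the GFLOP/s rates follow from the reported FLOP count. The variable names are illustrative, and the comparison assumes both runs are otherwise equivalent.

```shell
# Values copied from the benchmark output of the two runs above.
flops=30723840000000        # "FLOPs computed"
stat_time=41.262948         # multiply time under perf stat
record_time=65.616777       # multiply time under perf record (default settings)

# Slowdown factor of the recorded run relative to the perf stat run.
slowdown=$(awk -v r="$record_time" -v s="$stat_time" 'BEGIN { printf "%.2f", r / s }')

# GFLOP/s = FLOPs / seconds / 1e9, recomputed for both runs.
gflops_stat=$(awk -v f="$flops" -v t="$stat_time" 'BEGIN { printf "%.2f", f / t / 1e9 }')
gflops_record=$(awk -v f="$flops" -v t="$record_time" 'BEGIN { printf "%.2f", f / t / 1e9 }')

echo "slowdown under perf record: ${slowdown}x"
echo "GFLOP/s under perf stat:    $gflops_stat"
echo "GFLOP/s under perf record:  $gflops_record"
```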
- Reduce recording overhead
perf record --freq=100 dgemm 8000
===============================================================
Final Sum is: 8000.033333
 -> Solution check PASSED successfully.
Memory for Matrices: 1464.843750 MB
Multiply time: 40.269861 seconds
FLOPs computed: 30723840000000.000000
GFLOP/s rate: 762.948748 GF/s
===============================================================
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 3.124 MB perf.data (81372 samples) ]
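The effect of --freq=100 can be checked against the sample counts reported by perf record. The sketch below estimates the sampling rate per thread for both recordings. It rests on two assumptions: the multiply time stands in for the total runtime (the real elapsed time is slightly longer), and all 20 threads are sampled during that whole time. Variable names are illustrative.

```shell
threads=20                  # OMP_NUM_THREADS / MKL_NUM_THREADS from the setup step

# Sample counts and multiply times copied from the two perf record runs above.
default_samples=5232441
default_time=65.616777
freq100_samples=81372
freq100_time=40.269861

# Approximate samples per second and thread = samples / (seconds * threads).
default_rate=$(awk -v n="$default_samples" -v t="$default_time" -v p="$threads" \
    'BEGIN { printf "%.0f", n / (t * p) }')
freq100_rate=$(awk -v n="$freq100_samples" -v t="$freq100_time" -v p="$threads" \
    'BEGIN { printf "%.0f", n / (t * p) }')

# Factor by which the number of recorded samples shrinks.
reduction=$(awk -v a="$default_samples" -v b="$freq100_samples" 'BEGIN { printf "%.1f", a / b }')

echo "default run:    ~$default_rate samples/s per thread"
echo "--freq=100 run: ~$freq100_rate samples/s per thread"
echo "sample count reduced by a factor of $reduction"
```

Under these assumptions the second recording comes out close to the requested 100 samples per second and thread, and the multiply time is back in the range of the perf stat run.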
- Create performance report
perf report
94.08%  dgemm  libmkl_avx2.so     [.] mkl_blas_avx2_dgemm_kernel_0
 0.95%  dgemm  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
 0.94%  dgemm  libmkl_avx2.so     [.] mkl_blas_avx2_dgemm_dcopy_down12_ea
 0.81%  dgemm  libmkl_avx2.so     [.] mkl_blas_avx2_dgemm_dcopy_right4_ea
...
- Interactive navigation in performance report:
  - h: get help
  - a: jump to annotated assembler code
- -> Most of the time is spent in only one MKL function
- -> perf annotate can only show MKL assembler code
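To put a number on the first conclusion, the percentages of the report rows can be summed per shared object. The sketch below feeds the four rows shown in the perf report output into awk through a here-document. It only covers these visible rows, since the remaining rows are elided in the listing, and the variable names are illustrative.

```shell
# The four rows shown in the perf report output above (elided rows are not included).
report_rows=$(cat <<'EOF'
94.08% dgemm libmkl_avx2.so [.] mkl_blas_avx2_dgemm_kernel_0
0.95% dgemm [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
0.94% dgemm libmkl_avx2.so [.] mkl_blas_avx2_dgemm_dcopy_down12_ea
0.81% dgemm libmkl_avx2.so [.] mkl_blas_avx2_dgemm_dcopy_right4_ea
EOF
)

# Sum the overhead column (field 1) for all rows whose shared object (field 3) is an MKL library.
mkl_share=$(printf '%s\n' "$report_rows" | awk '
    $3 ~ /^libmkl/ { sub(/%/, "", $1); sum += $1 }
    END { printf "%.2f", sum }')

echo "share of samples in MKL (visible rows only): $mkl_share %"
```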
Last modified on Apr 4, 2018, 11:06:44 AM