Tools/likwid/example_perfctr_dgemm
Example: likwid-perfctr performance group FLOPS_AVX on the dgemm benchmark
Build dgemm benchmark

    module add \
        compiler/intel/2022 \
        numlib/mkl/2022
    icc \
        -O2 -qopenmp -xHost -ipo \
        -DUSE_MKL -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core \
        -o dgemm timing.c stats.c matrix_common.c dgemm.multithread.c
Set up OpenMP and MKL environment
    export MKL_NUM_THREADS=76
    export OMP_NUM_THREADS=76
    export KMP_AFFINITY=verbose,granularity=fine,respect,scatter
List available performance groups
    likwid-perfctr -a

    ...
    FLOPS_AVX   Packed AVX MFLOP/s
    FLOPS_DP    Double Precision MFLOP/s
    FLOPS_SP    Single Precision MFLOP/s
    ...
Get detailed information on performance group FLOPS_AVX

    likwid-perfctr -H --group FLOPS_AVX

    Group FLOPS_AVX:
    Formulas:
    Packed SP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE*8+FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE*16)/runtime
    Packed DP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE*8)/runtime
    -
    Packed 32b AVX FLOPs rates.
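To make the Packed DP formula concrete, the following sketch evaluates it with awk. The helper name `packed_dp_mflops` and the counter values are hypothetical and chosen only for illustration; they are not taken from a measurement.

```shell
# Packed DP [MFLOP/s] as defined by the FLOPS_AVX group:
#   1.0E-06 * (256B_PACKED_DOUBLE*4 + 512B_PACKED_DOUBLE*8) / runtime
# packed_dp_mflops is a hypothetical helper, not part of LIKWID.
packed_dp_mflops() {
    # $1: FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE count
    # $2: FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE count
    # $3: runtime in seconds
    awk -v c256="$1" -v c512="$2" -v t="$3" \
        'BEGIN { printf "%.1f\n", 1.0e-06 * (c256 * 4 + c512 * 8) / t }'
}

# Made-up counter values: no 256-bit instructions,
# 25e9 512-bit instructions, 6 s runtime.
packed_dp_mflops 0 25000000000 6.0    # prints 33333.3
```

Each 512-bit instruction is weighted with 8 double-precision operations and each 256-bit instruction with 4, regardless of how many vector lanes carry useful data.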
Measure performance group FLOPS_AVX on CPU hyperthreads 0 to 151

    likwid-perfctr \
        --group FLOPS_AVX \
        -c 0-151 \
        -n 8000 ./dgemm
    --------------------------------------------------------------------------------
    CPU name:   Intel(R) Xeon(R) Platinum 8368 CPU @ 2.40GHz
    CPU type:   Intel Icelake SP processor
    CPU clock:  2.39 GHz
    --------------------------------------------------------------------------------
    Number of repetitions set to 30. Overwrite with command line option -m.
    Matrix size: 8000
    Repeat multiply 30 times.
    Alpha = 1.000000
    Beta = 1.000000
    Allocating Matrices...
    Allocation complete, populating with values...
    Performing multiplication...
    Calculating matrix check...
    ===============================================================
    || E ||_∞: 0.000000E+00 -> Solution check PASSED successfully.
    Memory for Matrices: 1464.843750 MB
    Multiply time: 6.370304 seconds
    FLOPs computed: 30723840000000.000000
    Min GFLOP/s: 4473.591902 GF/s
    Max GFLOP/s: 4950.916234 GF/s
    Average GFLOP/s: 4825.537147 GF/s
    Std. dev. GFLOP/s: 597.602110 GF/s
    Median GFLOP/s: 4870.102613 GF/s
    MAD GFLOP/s: 38.695209 GF/s
    ===============================================================
    ...
    +---------------------------+--------------+--------------+------------+------------+
    | Metric                    | Sum          | Min          | Max        | Avg        |
    +---------------------------+--------------+--------------+------------+------------+
    | Runtime (RDTSC) [s] STAT  |     968.6808 |       6.3729 |     6.3729 |     6.3729 |
    | Runtime unhalted [s] STAT |     495.9688 | 1.188787e-05 |     6.6924 |     3.2630 |
    | Clock [MHz] STAT          |  405351.9133 |    2182.7358 |  3242.0543 |  2666.7889 |
    | CPI STAT                  |     467.1632 |       0.3788 |    16.9742 |     3.0734 |
    | Packed SP [MFLOP/s] STAT  |            0 |            0 |          0 |          0 |
    | Packed DP [MFLOP/s] STAT  | 4.826735e+06 |            0 | 66241.9113 | 31754.8333 |
    +---------------------------+--------------+--------------+------------+------------+
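The totals printed by the benchmark can be cross-checked from the matrix size and repetition count. The sketch below assumes a FLOP model of 2*n^3 + 2*n^2 per multiplication and three n x n matrices of 8-byte doubles. Both assumptions are inferred from the printed numbers, not taken from the dgemm source, and the function names are hypothetical.

```shell
n=8000      # matrix size reported by the benchmark
reps=30     # repetitions reported by the benchmark

# Assumed FLOP model per multiplication: 2*n^3 + 2*n^2.
# Inferred from the printed total, not from the dgemm source.
dgemm_flops() {
    local n=$1 reps=$2
    echo $(( reps * (2 * n * n * n + 2 * n * n) ))
}

# Assumed layout: three n x n matrices of 8-byte doubles,
# reported in units of 2^20 bytes.
matrix_mem_mb() {
    awk -v n="$1" 'BEGIN { printf "%.6f\n", 3 * n * n * 8 / (1024 * 1024) }'
}

dgemm_flops "$n" "$reps"    # prints 30723840000000
matrix_mem_mb "$n"          # prints 1464.843750
```

Both results match the "FLOPs computed" and "Memory for Matrices" lines in the output above, which suggests the assumed model is at least consistent with what the benchmark reports.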
Validity check
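    Packed DP [MFLOP/s] STAT (Sum): 4.826735e+06 MFLOP/s = 4826.735 GFLOP/s
    Average GFLOP/s (dgemm):        4825.537147 GFLOP/s

=> The FLOP rate summed over all hyperthreads by likwid-perfctr agrees closely with the average reported by the benchmark. In general, the counter-based FLOP/s may overestimate the actual FLOP/s: the formulas weight every packed instruction with the full vector width, but the AVX registers may not always be fully loaded.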
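The comparison can be put into numbers with a short awk calculation. This is a minimal sketch; the helper name `deviation_percent` is hypothetical, and the two inputs are the values quoted above.

```shell
# Relative deviation of the likwid-perfctr sum from the benchmark average.
# deviation_percent is a hypothetical helper, not part of LIKWID.
deviation_percent() {
    # $1: Packed DP sum in MFLOP/s (likwid-perfctr)
    # $2: average in GFLOP/s (dgemm)
    awk -v likwid_mflops="$1" -v bench_gflops="$2" 'BEGIN {
        likwid_gflops = likwid_mflops / 1000
        printf "%.3f\n", 100 * (likwid_gflops - bench_gflops) / bench_gflops
    }'
}

deviation_percent 4.826735e+06 4825.537147    # prints 0.025
```

For this run the counter-based rate is about 0.025 % higher than the rate the benchmark computes from its own timing.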