Example likwid-perfctr performance group FLOPS_AVX on benchmark dgemm
- Build dgemm benchmark
    module add \
        compiler/intel/18.0 \
        numlib/mkl/2018
    icc \
        -O2 -qopenmp -xHost -ipo \
        -DUSE_MKL -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core \
        mt-dgemm.c -o dgemm
- Set up OpenMP and MKL environment
    export MKL_NUM_THREADS=20
    export OMP_NUM_THREADS=20
    export KMP_AFFINITY=verbose,granularity=core,respect,scatter
- List available performance groups
    likwid-perfctr -a
    ...
    FLOPS_AVX   Packed AVX MFLOP/s
    FLOPS_DP    Double Precision MFLOP/s
    FLOPS_SP    Single Precision MFLOP/s
    ...
- Get detailed information on performance group FLOPS_AVX
    likwid-perfctr -H --group FLOPS_AVX
    Group FLOPS_AVX:
    Formula:
    Packed SP MFLOP/s = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE*8)/runtime
    Packed DP MFLOP/s = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4)/runtime
    - FLOP rates of 256 bit packed floating-point instructions
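The two formulas can be evaluated by hand from a raw counter value and the runtime. A minimal sketch in Python follows; the function names and the counter values in the usage line are hypothetical and chosen only for illustration, they are not part of likwid and not taken from the measurement.

```python
def packed_sp_mflops(packed_single_instructions: int, runtime_s: float) -> float:
    """Packed SP MFLOP/s as written in the FLOPS_AVX formula.

    Each 256-bit packed single-precision instruction is counted as 8 FLOPs.
    """
    return 1.0e-06 * (packed_single_instructions * 8) / runtime_s


def packed_dp_mflops(packed_double_instructions: int, runtime_s: float) -> float:
    """Packed DP MFLOP/s as written in the FLOPS_AVX formula.

    Each 256-bit packed double-precision instruction is counted as 4 FLOPs.
    """
    return 1.0e-06 * (packed_double_instructions * 4) / runtime_s


if __name__ == "__main__":
    # Made-up example: 5e9 packed double instructions retired in 2 seconds.
    print(packed_dp_mflops(5_000_000_000, 2.0))
```

Note that the formula assigns a fixed FLOP count per instruction, so it cannot distinguish a plain add or multiply from a fused multiply-add.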
- Measure performance group FLOPS_AVX on CPUs 0 to 19
    likwid-perfctr \
        --group FLOPS_AVX \
        -C 0-19 \
        ./dgemm 6000
    --------------------------------------------------------------------------------
    CPU name:  Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
    CPU type:  Intel Xeon Haswell EN/EP/EX processor
    CPU clock: 2.60 GHz
    --------------------------------------------------------------------------------
    Matrix size input by command line: 6000
    Repeat multiply defaulted to 30
    Alpha = 1.000000
    Beta = 1.000000
    Allocating Matrices...
    Allocation complete, populating with values...
    Performing multiplication...
    Calculating matrix check...
    ===============================================================
    Final Sum is: 6000.033333
    -> Solution check PASSED successfully.
    Memory for Matrices: 823.974609 MB
    Multiply time: 17.463024 seconds
    FLOPs computed: 12962160000000.000000
    GFLOP/s rate: 742.263190 GF/s
    ===============================================================
    ...
    +---------------------------+-------------+------------+------------+------------+
    |          Metric           |     Sum     |    Min     |    Max     |    Avg     |
    +---------------------------+-------------+------------+------------+------------+
    | Runtime (RDTSC) [s] STAT  |    365.4460 |    18.2723 |    18.2723 |    18.2723 |
    | Runtime unhalted [s] STAT |    365.5165 |    18.1943 |    19.3181 |    18.2758 |
    | Clock [MHz] STAT          |  58021.3801 |  2900.0175 |  2920.9291 |  2901.0690 |
    | CPI STAT                  |      6.2042 |     0.3095 |     0.3139 |     0.3102 |
    | Packed SP MFLOP/s STAT    | 717311.3153 | 35865.5655 | 35865.5708 | 35865.5658 |
    | Packed DP MFLOP/s STAT    | 358655.6568 | 17932.7827 | 17932.7854 | 17932.7828 |
    +---------------------------+-------------+------------+------------+------------+
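The figures the benchmark prints about itself can be reproduced from the matrix size and the repeat count. The sketch below is in Python. It assumes, as an inference from the printed numbers and not from the benchmark source, that each repetition is counted as 2*N^3 + 2*N^2 FLOPs and that three N x N matrices of 8-byte doubles are allocated. The function names are hypothetical.

```python
N = 6000                     # "Matrix size input by command line: 6000"
REPEATS = 30                 # "Repeat multiply defaulted to 30"
MULTIPLY_TIME_S = 17.463024  # "Multiply time: 17.463024 seconds"


def dgemm_flops(n: int, repeats: int) -> int:
    # Assumed FLOP model: 2*n^3 for the matrix product plus 2*n^2 for the
    # alpha/beta scaling, per repetition.
    return repeats * (2 * n**3 + 2 * n**2)


def matrix_memory_mb(n: int, matrices: int = 3, bytes_per_element: int = 8) -> float:
    # Assumed layout: three dense n x n matrices of doubles, reported in MiB.
    return matrices * n * n * bytes_per_element / (1024 * 1024)


flops = dgemm_flops(N, REPEATS)
gflops_rate = flops / MULTIPLY_TIME_S / 1e9

if __name__ == "__main__":
    print(f"FLOPs computed:      {flops}")
    print(f"Memory for matrices: {matrix_memory_mb(N):.6f} MB")
    print(f"GFLOP/s rate:        {gflops_rate:.3f}")
```

Under these assumptions the FLOP count and the memory figure match the benchmark output, and the rate agrees to within the rounding of the printed multiply time.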
- Validity check
    Packed DP MFLOP/s STAT        : 358655
    Fused-Multiply-Add DP MFLOP/s : 358655 * 2 DP MFLOP/s
                                  : 717310 DP MFLOP/s
                                  : 717.310 DP GFLOP/s
    GFLOP/s rate                  : 742.263190 DP GFLOP/s
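=> To compute the correct FLOP/s rate you have to know which AVX assembler instructions are executed. The FLOPS_AVX formula counts 4 FLOPs per 256-bit packed DP instruction, but a fused multiply-add performs 8, so for dgemm the measured rate has to be doubled.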
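The validity check can be written out as a short calculation. The sketch below is in Python and uses the numbers from the output above. The factor of 2 assumes that every counted packed instruction is a fused multiply-add, which is the assumption behind the check and is not something the counter confirms. The function name is hypothetical.

```python
PACKED_DP_MFLOPS_SUM = 358655.6568  # likwid "Packed DP MFLOP/s STAT", Sum column
BENCHMARK_GFLOPS = 742.263190       # "GFLOP/s rate" printed by dgemm
FMA_FACTOR = 2                      # assumption: all counted instructions are FMAs


def fma_corrected_gflops(packed_mflops: float, fma_factor: int = FMA_FACTOR) -> float:
    # Scale the counter-based rate by the FMA factor and convert MFLOP/s to GFLOP/s.
    return packed_mflops * fma_factor / 1000.0


likwid_gflops = fma_corrected_gflops(PACKED_DP_MFLOPS_SUM)
ratio = likwid_gflops / BENCHMARK_GFLOPS

if __name__ == "__main__":
    print(f"likwid (FMA corrected): {likwid_gflops:.3f} GFLOP/s")
    print(f"benchmark:              {BENCHMARK_GFLOPS:.3f} GFLOP/s")
    print(f"ratio:                  {ratio:.3f}")
```

The corrected likwid value comes out a few percent below the benchmark's own figure. The two rates use different time bases (likwid divides by its measured runtime of 18.2723 s, the benchmark by its multiply time of 17.463024 s), so exact agreement should not be expected.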
Last modified on Apr 5, 2018, 5:16:19 PM