Example
Intel Application Performance Snapshot: dgemm
Build dgemm benchmark
module add \
    compiler/intel/2022 \
    numlib/mkl/2022
icc -O2 -qopenmp -xHost -ipo \
    -DUSE_MKL -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core \
    -o dgemm timing.c stats.c matrix_common.c dgemm.multithread.c
Set up APS environment on cluster HoreKa
# Standalone
source /software/all/toolkit/Intel_OneAPI/vtune/latest/apsvars.sh
# or as part of Intel VTune
module add devel/vtune/2022
Set up APS environment on cluster BwUniCluster 2.0
source /software/all/toolkit/Intel_OneAPI/vtune/latest/apsvars.sh
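Before running the benchmark, it can help to confirm that the environment setup actually put the `aps` command on the `PATH`. The following is a minimal sketch; the helper name `have_cmd` is hypothetical and not part of APS or the module system.

```shell
# Hypothetical helper: succeeds if the named command is found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd aps; then
    echo "aps found: $(command -v aps)"
else
    echo "aps not found; source apsvars.sh or load the VTune module first" >&2
fi
```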
Run benchmark dgemm with APS
export OMP_NUM_THREADS=76
export MKL_NUM_THREADS=76
export KMP_AFFINITY=verbose,granularity=core,respect,scatter
aps dgemm -n 8000
Number of repetitions set to 30. Overwrite with command line option -m.
Matrix size: 8000
Repeat multiply 30 times.
Alpha = 1.000000
Beta = 1.000000
Allocating Matrices...
Allocation complete, populating with values...
Performing multiplication...
Calculating matrix check...
===============================================================
|| E ||_∞: 0.000000E+00 -> Solution check PASSED successfully.
Memory for Matrices: 1464.843750 MB
Multiply time: 6.406013 seconds
FLOPs computed: 30723840000000.000000
Min GFLOP/s: 4469.474308 GF/s
Max GFLOP/s: 4930.761834 GF/s
Average GFLOP/s: 4799.236530 GF/s
Std. dev. GFLOP/s: 657.393548 GF/s
Median GFLOP/s: 4846.325150 GF/s
MAD GFLOP/s: 28.129912 GF/s
===============================================================
Intel(R) oneAPI VTune(TM) Profiler 2022.0.0 collection completed successfully. Use the "aps --report <...>/aps_result_20220602" command to generate textual and HTML reports for the profiling session.
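The figures in the benchmark output can be cross-checked with a little arithmetic. The sketch below assumes three n x n double-precision matrices (8 bytes per element) and an operation count of 2n³ + 2n² per repetition. That formula is an assumption inferred from the reported total, not taken from the benchmark source; the variable names are illustrative.

```shell
n=8000             # matrix size from the run above (-n 8000)
reps=30            # repetitions reported by the benchmark
seconds=6.406013   # "Multiply time" from the output

# Memory for three n x n matrices of 8-byte doubles, in units of 1024*1024 bytes.
mem_mb=$(awk -v n="$n" 'BEGIN { printf "%.6f", 3 * n * n * 8 / (1024 * 1024) }')

# ASSUMPTION: 2*n^3 + 2*n^2 floating-point operations per repetition.
flops=$(awk -v n="$n" -v r="$reps" 'BEGIN { printf "%.0f", r * (2 * n * n * n + 2 * n * n) }')

# Aggregate rate: total operations divided by total multiply time.
gflops=$(awk -v f="$flops" -v t="$seconds" 'BEGIN { printf "%.2f", f / t / 1e9 }')

echo "Memory for matrices: $mem_mb MB"
echo "FLOPs computed:      $flops"
echo "Aggregate rate:      $gflops GF/s"
```

Under these assumptions the memory and FLOP figures match the benchmark output. The aggregate rate comes out at about 4796 GF/s, slightly below the reported average of 4799.24 GF/s, presumably because the benchmark averages per-repetition rates rather than dividing the total work by the total time.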
Generate APS report
aps --report <...>/aps_result_20220602
Loading 100.00%
| Summary information
|--------------------------------------------------------------------
  Application                       : dgemm
  Report creation date              : 2022-06-02 10:52:05
  OpenMP threads number per Process : 76
  HW Platform                       : Intel(R) Xeon(R) Processor code named Icelake
  Frequency                         : 2.39 GHz
  Logical core count per node       : 152
  Collector type                    : Driverless Perf per-process counting
  Used statistics                   : <...>/aps_result_20220602
|
| Your application looks good.
| Nothing suspicious has been detected.
|
  Elapsed Time:           6.47 s
  SP GFLOPS:              0.00
  DP GFLOPS:              4726.85
  Average CPU Frequency:  2.51 GHz
  IPC Rate:               2.59
  Serial Time:            0.06 s   0.94% of Elapsed Time
  OpenMP Imbalance:       0.00 s   0.03% of Elapsed Time
  Memory Stalls:          18.30% of Pipeline Slots
    Cache Stalls:           8.50% of Cycles
    DRAM Stalls:            3.30% of Cycles
    Average DRAM Bandwidth: N/A
    | Data for this metric is not collected since it requires system-wide
    | performance monitoring. Make sure the sampling driver is properly installed on
    | your system: https://software.intel.com/en-us/vtune-amplifier-help-sep-driver.
    | Otherwise, enable a driverless Perf-based sampling collection by setting the
    | /proc/sys/kernel/perf_even_paranoid value to 0 or less.
    NUMA:                   49.00% of Remote Accesses
    | A significant amount of DRAM loads was serviced from remote DRAM. Wherever
    | possible, consistently use data on the same core, or at least the same
    | package, as it was allocated on.
  Vectorization:          100.00%
    Instruction Mix:
      SP FLOPs:           0.00% of uOps
      DP FLOPs:           100.00% of uOps
        Packed:           100.00% from DP FP
          128-bit:        0.00%
          256-bit:        0.00%
          512-bit:        100.00%
        Scalar:           0.00% from DP FP
      Non-FP:             0.00% of uOps
    FP Arith/Mem Rd Instr. Ratio: 3.80
    FP Arith/Mem Wr Instr. Ratio: 434.14
  Memory Footprint:
    Resident:             1577.00 MB
    Virtual:              7031.00 MB
Graphical representation of this data is available in the HTML report: <...>/aps_report_20220602_105518.html
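To put the reported DP GFLOPS into perspective, the value can be compared with a rough theoretical peak. The sketch below rests on two assumptions that are not stated in the report: the 76 OpenMP threads each run on their own physical core, and every core can retire 32 double-precision operations per cycle (two AVX-512 FMA units, 8 doubles each, 2 operations per FMA). Treat the result as an estimate only.

```shell
cores=76            # ASSUMPTION: one OpenMP thread per physical core
freq_ghz=2.51       # "Average CPU Frequency" from the report
flop_per_cycle=32   # ASSUMPTION: 2 AVX-512 FMA units x 8 doubles x 2 ops
measured=4726.85    # "DP GFLOPS" from the report

# Estimated peak in GFLOP/s: cores x frequency (GHz) x operations per cycle.
peak=$(awk -v c="$cores" -v f="$freq_ghz" -v p="$flop_per_cycle" \
    'BEGIN { printf "%.2f", c * f * p }')

# Measured rate as a percentage of the estimated peak.
eff=$(awk -v m="$measured" -v pk="$peak" 'BEGIN { printf "%.1f", 100 * m / pk }')

echo "Estimated peak: $peak GFLOP/s"
echo "Efficiency:     $eff %"
```

With these assumptions the run reaches roughly three quarters of the estimated peak, which is in line with the report finding nothing suspicious.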