Example Intel Application Performance Snapshot: dgemm
- Build dgemm benchmark
  module add compiler/intel/18.0
  module add numlib/mkl/2018
  icc -O2 -qopenmp -xHost -ipo \
      -DUSE_MKL -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core \
      mt-dgemm.c -o dgemm
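Because the build defines USE_MKL and links the MKL libraries, the benchmark's matrix multiply is carried out by MKL's DGEMM. The stand-alone C sketch below shows what such a call path looks like; the matrix size, fill values, and use of dsecnd() are illustrative assumptions and are not taken from mt-dgemm.c.

  /* Minimal stand-alone sketch (assumption, not the mt-dgemm.c source):
   * time a few cblas_dgemm calls through MKL, similar in spirit to the
   * benchmark's USE_MKL code path. Build roughly as shown above. */
  #include <stdio.h>
  #include <mkl.h>

  int main(void)
  {
      const MKL_INT N = 2000;            /* placeholder matrix dimension */
      const int repeats = 5;             /* placeholder repeat count     */
      const double alpha = 1.0, beta = 1.0;

      double *A = (double *)mkl_malloc((size_t)N * N * sizeof(double), 64);
      double *B = (double *)mkl_malloc((size_t)N * N * sizeof(double), 64);
      double *C = (double *)mkl_malloc((size_t)N * N * sizeof(double), 64);
      if (!A || !B || !C) { fprintf(stderr, "allocation failed\n"); return 1; }

      for (size_t i = 0; i < (size_t)N * N; i++) { A[i] = 2.0; B[i] = 0.5; C[i] = 1.0; }

      double start = dsecnd();           /* MKL wall-clock timer */
      for (int r = 0; r < repeats; r++)
          cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                      N, N, N, alpha, A, N, B, N, beta, C, N);
      double elapsed = dsecnd() - start;

      printf("Multiply time: %f seconds\n", elapsed);
      mkl_free(A); mkl_free(B); mkl_free(C);
      return 0;
  }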
- Set up APS environment on clusters fh2 and fh1
  module add devel/APS
- Set up APS environment on cluster uc1
  source /opt/bwhpc/common/devel/aps/2019/apsvars.sh
- Run dgemm benchmark with APS
  export OMP_NUM_THREADS=20
  export MKL_NUM_THREADS=20
  export KMP_AFFINITY=verbose,granularity=core,respect,scatter
  aps dgemm 8000
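KMP_AFFINITY=verbose already reports the chosen thread binding at startup. As an optional, independent cross-check (not part of the benchmark), a small OpenMP program can print where each thread actually runs:

  /* Optional sketch: print the logical CPU each OpenMP thread runs on, to
   * cross-check the KMP_AFFINITY=verbose binding report. Linux-specific
   * (sched_getcpu); compile e.g. with: icc -qopenmp check_affinity.c */
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sched.h>
  #include <omp.h>

  int main(void)
  {
      #pragma omp parallel
      {
          printf("OpenMP thread %2d of %2d on logical CPU %d\n",
                 omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
      }
      return 0;
  }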
- Output
  Matrix size input by command line: 8000
  Repeat multiply defaulted to 30
  Alpha = 1.000000
  Beta  = 1.000000
  Allocating Matrices...
  Allocation complete, populating with values...
  Performing multiplication...
  Calculating matrix check...
  ===============================================================
  Final Sum is: 8000.033333
  -> Solution check PASSED successfully.
  Memory for Matrices:  1464.843750 MB
  Multiply time:        40.728484 seconds
  FLOPs computed:       30723840000000.000000
  GFLOP/s rate:         754.357566 GF/s
  ===============================================================
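The printed FLOP count is consistent with charging 2N³ + 2N² floating-point operations per multiply: for N = 8000 and 30 repeats this gives exactly 30,723,840,000,000 FLOPs, and dividing by the 40.728484 s multiply time reproduces the 754.36 GF/s rate. A small C check of that arithmetic (the operation-count formula is inferred from the output above, not taken from the benchmark source):

  /* Reproduce the reported FLOP count and GFLOP/s rate from the output above.
   * The per-multiply operation count 2*N^3 + 2*N^2 is inferred from the
   * printed numbers; it matches the reported value exactly. */
  #include <stdio.h>

  int main(void)
  {
      const double N = 8000.0;
      const double repeats = 30.0;
      const double multiply_time = 40.728484;          /* seconds, from the run */

      double flops = repeats * (2.0 * N * N * N + 2.0 * N * N);
      double gflops = flops / multiply_time / 1.0e9;

      printf("FLOPs computed: %.0f\n", flops);         /* 30723840000000     */
      printf("GFLOP/s rate:   %.6f GF/s\n", gflops);   /* approx. 754.357566 */
      return 0;
  }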
  Emon collector successfully stopped.
  | Summary information
  |--------------------------------------------------------------------
    Application                 : dgemm
    Report creation date        : 2018-03-27 16:57:06
    OpenMP threads number       : 20
    HW Platform                 : Intel(R) Xeon(R) E5/E7 v3 Processor code named Haswell
    Logical core count per node : 40
    Collector type              : Event-based counting driver
    Used statistics             : aps_result_20180327
  |
  | Your application looks good.
  | Nothing suspicious has been detected.
  |
  Elapsed time:                41.07 sec
  CPI Rate:                    0.31
  Serial Time:                 0.79 sec   1.93%
  OpenMP Imbalance:            0.00 sec   0.01%
  Memory Stalls:               16.90% of pipeline slots
  Cache Stalls:                93.20% of cycles
  | A significant proportion of cycles are spent on data fetches from cache. Use
  | Intel(R) VTune(TM) Amplifier Memory Access analysis to see if accesses to L2
  | or L3 cache are problematic and consider applying the same performance tuning
  | as you would for a cache-missing workload. This may include reducing the data
  | working set size, improving data access locality, blocking or partitioning the
  | working set to fit in the lower cache levels, or exploiting hardware
  | prefetchers.
  NUMA: % of Remote Accesses:  32.10%
  | A significant amount of DRAM loads was serviced from remote DRAM. Wherever
  | possible, consistently use data on the same core, or at least the same
  | package, as it was allocated on.
  Average DRAM Bandwidth:      35.56 GB/s
  Memory Footprint:
  Resident:                    1494.34 MB
  Virtual:                     2937.48 MB
  Graphical representation of this data is available in the HTML report:
  /pfs/data1/home/kit/scc/bq0742/cluster_performance_verification/src/mt-dgemm/aps_report_20180327_165750.html
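The cache-stall advisory above mentions blocking or partitioning the working set. No such change is needed for this benchmark, since MKL's DGEMM is already cache-blocked internally; purely as an illustration of the transformation the advisory refers to, a hand-written triple loop can be tiled as follows (BS is a hypothetical tile size):

  /* Generic illustration of the loop blocking the APS advisory refers to;
   * NOT a change required for mt-dgemm (MKL's DGEMM is already blocked).
   * Computes C += A * B on n x n row-major matrices using BS x BS tiles. */
  #include <stddef.h>

  #define BS 64   /* hypothetical tile size, chosen to keep tiles cache-resident */

  void dgemm_blocked(int n, const double *A, const double *B, double *C)
  {
      for (int ii = 0; ii < n; ii += BS)
          for (int kk = 0; kk < n; kk += BS)
              for (int jj = 0; jj < n; jj += BS)
                  /* multiply one tile; the bounds also handle n not divisible by BS */
                  for (int i = ii; i < ii + BS && i < n; i++)
                      for (int k = kk; k < kk + BS && k < n; k++) {
                          double a = A[(size_t)i * n + k];
                          for (int j = jj; j < jj + BS && j < n; j++)
                              C[(size_t)i * n + j] += a * B[(size_t)k * n + j];
                      }
  }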