Example
Intel Application Performance Snapshot: rank_league
Build
rank_league benchmark

module add \
  compiler/intel/2022 \
  mpi/impi/2021.5.1
mpicc -Ofast -xHost -ipo rank_league.c -o rank_league
Jobscript
rank_league.aps.job
#!/usr/bin/bash
#SBATCH --partition=<...>
#SBATCH --nodes=4
#SBATCH --tasks-per-node=1
#SBATCH --time=20

# Prepare environment
module purge
module add \
  compiler/intel/2022 \
  mpi/impi/2021.5.1

# Set up APS environment on cluster HoreKa
source /software/all/toolkit/Intel_OneAPI/vtune/latest/apsvars.sh

# rank_league options:
#   test_type (-t):   b - bandwidth
#   output_type (-o): s - statistics per rank (average, min, max)
#   loop_num (-l):    number of loops per round
RANK_LEAGUE_OPTIONS=( "-t=b" "-o=s" "-l=20000" )

MPIRUN_OPTIONS=( "-print-rank-map" "-binding" "domain=core" )

mpirun "${MPIRUN_OPTIONS[@]}" aps ./rank_league "${RANK_LEAGUE_OPTIONS[@]}"
Run benchmark
Run the rank_league benchmark with APS via the batch system:

sbatch < rank_league.aps.job
Job output
(hkn0802:0) (hkn0803:1) (hkn0804:2) (hkn0805:3)
****** Running bandwidth test ********
Total number of rounds:          3
Total number of loops per round: 20000
Message size:                    100000
**************************************
Round number 3
**************************************
RANK      MIN                 MAX             AVERAGE
      RESULT    RANK      RESULT    RANK
___________________________________________________________
0     9700.63    2       16436.13    1       13765.18
1     9539.31    0       16719.28    3       12048.05
2     9740.46    0       16543.57    3       14191.70
3     9552.40    2       16651.53    1       11945.14
___________________________________________________________
Global statistics:
MIN       9539.31  between 1 and 0
MAX      16719.28  between 1 and 3
AVERAGE  12987.51
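The "Global statistics" block is derived directly from the per-rank table above: the global MIN/MAX are the worst per-rank minimum and best per-rank maximum (with the rank pair they occurred on), and the global AVERAGE is the mean over ranks. A minimal sketch, using the printed values from this run (the reported 12987.51 differs in the last digit because rank_league averages unrounded internal values):

```python
# Per-rank results copied from the job output above:
# rank: (min_result, min_partner, max_result, max_partner, average)
per_rank = {
    0: (9700.63, 2, 16436.13, 1, 13765.18),
    1: (9539.31, 0, 16719.28, 3, 12048.05),
    2: (9740.46, 0, 16543.57, 3, 14191.70),
    3: (9552.40, 2, 16651.53, 1, 11945.14),
}

# Global MIN: worst per-rank minimum, plus the rank pair it occurred on
min_rank = min(per_rank, key=lambda r: per_rank[r][0])
g_min = per_rank[min_rank][0]
g_min_pair = (min_rank, per_rank[min_rank][1])

# Global MAX: best per-rank maximum, plus the rank pair it occurred on
max_rank = max(per_rank, key=lambda r: per_rank[r][2])
g_max = per_rank[max_rank][2]
g_max_pair = (max_rank, per_rank[max_rank][3])

# Global AVERAGE: mean of the per-rank averages
g_avg = sum(v[4] for v in per_rank.values()) / len(per_rank)

print(f"MIN     {g_min:9.2f}  between {g_min_pair[0]} and {g_min_pair[1]}")
print(f"MAX     {g_max:9.2f}  between {g_max_pair[0]} and {g_max_pair[1]}")
print(f"AVERAGE {g_avg:9.2f}")
```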
Intel(R) oneAPI VTune(TM) Profiler 2022.0.0 collection completed successfully. Use the "aps --report <...>/aps_result_20220602" command to generate textual and HTML reports for the profiling session.
Generate APS report:
aps --report <...>/aps_result_20220602
Loading 100.00%

| Summary information
|--------------------------------------------------------------------
  Application                 : rank_league
  Report creation date       : 2022-06-02 11:28:20
  Number of ranks            : 4
  Ranks per node             : 1
  HW Platform                : Intel(R) Xeon(R) Processor code named Icelake
  Frequency                  : 2.39 GHz
  Logical core count per node: 152
  Collector type             : Driverless Perf per-process counting
  Used statistics            : <...>/aps_result_20220602
|
| Your application is MPI bound.
| This may be caused by high busy wait time inside the library (imbalance),
| non-optimal communication schema or MPI library settings. Use MPI profiling
| tools like Intel(R) Trace Analyzer and Collector to explore performance
| bottlenecks.
|
  Elapsed Time:                              1.87 s
  SP GFLOPS:                                 0.00
  DP GFLOPS:                                 0.00
  Average CPU Frequency:                     3.19 GHz
  IPC Rate:                                  1.09
| Some of the individual values contributing to this average metric broke the
| issue threshold of the metric.
| Please use --counters or --metrics="Instructions Per Cycle Rate" reports for
| details.
  MPI Time:                                  1.73 s    93.63% of Elapsed Time
| Your application is MPI bound. This may be caused by high busy wait time
| inside the library (imbalance), non-optimal communication schema or MPI
| library settings. Explore the MPI Imbalance metric if it is available or use
| MPI profiling tools like Intel(R) Trace Analyzer and Collector to explore
| possible performance bottlenecks.
    MPI Imbalance:                           0.01 s     0.32% of Elapsed Time
    Top 5 MPI functions (avg time):
        MPI_Isend:                           0.67 s    36.21% of Elapsed Time
        MPI_Irecv:                           0.63 s    34.02% of Elapsed Time
        MPI_Init:                            0.28 s    15.04% of Elapsed Time
        MPI_Barrier:                         0.11 s     5.78% of Elapsed Time
        MPI_Waitall:                         0.04 s     2.39% of Elapsed Time
  Memory Stalls:                            47.15% of Pipeline Slots
| The metric value can indicate that a significant fraction of execution
| pipeline slots could be stalled due to demand memory load and stores. See the
| second level metrics to define if the application is cache- or DRAM-bound and
| the NUMA efficiency. Use Intel(R) VTune(TM) Profiler Memory Access analysis to
| review a detailed metric breakdown by memory hierarchy, memory bandwidth
| information, and correlation by memory objects.
    Cache Stalls:                           29.52% of Cycles
| A significant proportion of cycles are spent on data fetches from cache. Use
| Intel(R) VTune(TM) Profiler Memory Access analysis to see if accesses to L2 or
| L3 cache are problematic and consider applying the same performance tuning as
| you would for a cache-missing workload. This may include reducing the data
| working set size, improving data access locality, blocking or partitioning the
| working set to fit in the lower cache levels, or exploiting hardware
| prefetchers.
    DRAM Stalls:                            17.82% of Cycles
| Some of the individual values contributing to this average metric broke the
| issue threshold of the metric.
| Please use --counters or --metrics="DRAM Stalls" reports for details.
    Average DRAM Bandwidth:                 N/A
| Data for this metric is not collected since it requires system-wide
| performance monitoring. Make sure the sampling driver is properly installed on
| your system: https://software.intel.com/en-us/vtune-amplifier-help-sep-driver.
| Otherwise, enable a driverless Perf-based sampling collection by setting the
| /proc/sys/kernel/perf_event_paranoid value to 0 or less.
    NUMA:                                    0.38% of Remote Accesses
| Some of the individual values contributing to this average metric are
| statistical outliers that can significantly distort the average metric value.
| They can also be a cause of performance degradation.
| Please use --counters or --metrics="NUMA" reports for details.
  Vectorization:                             0.00%
    Instruction Mix:
      SP FLOPs:                              0.00% of uOps
      DP FLOPs:                              0.00% of uOps
      Non-FP:                              100.00% of uOps
    FP Arith/Mem Rd Instr. Ratio:            0.00
    FP Arith/Mem Wr Instr. Ratio:            0.00
  Disk I/O Bound:                            0.00 s
  Memory Footprint:
  Resident:
      Per node:
          Peak resident set size    :      120.00 MB (node hkn0804.localdomain)
          Average resident set size :      117.75 MB
      Per rank:
          Peak resident set size    :      120.00 MB (rank 2)
          Average resident set size :      117.75 MB
  Virtual:
      Per node:
          Peak memory consumption   :      261.00 MB (node hkn0802.localdomain)
          Average memory consumption:      259.00 MB
      Per rank:
          Peak memory consumption   :      261.00 MB (rank 0)
          Average memory consumption:      259.00 MB

Graphical representation of this data is available in the HTML report:
<...>/aps_report_20220602_113050.html
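The report's numbers are internally consistent and explain the "MPI bound" diagnosis: the top-5 MPI function times sum to the reported MPI Time, and that time dominates the elapsed time. A quick cross-check using the printed (rounded) values from the report above — note APS computes its 93.63% from unrounded internals, so the rounded inputs give roughly 92.5% instead:

```python
# Values copied from the APS summary above (rounded as printed)
elapsed = 1.87          # Elapsed Time, s
mpi_time = 1.73         # MPI Time, s

top5 = {                # Top 5 MPI functions, avg time in s
    "MPI_Isend":   0.67,
    "MPI_Irecv":   0.63,
    "MPI_Init":    0.28,
    "MPI_Barrier": 0.11,
    "MPI_Waitall": 0.04,
}

# The top 5 functions account for essentially all of the MPI time ...
top5_total = sum(top5.values())
print(f"top-5 total: {top5_total:.2f} s of {mpi_time:.2f} s MPI time")

# ... and MPI time dominates the run, which is why APS flags the
# application as MPI bound
mpi_share = mpi_time / elapsed
print(f"MPI share of elapsed time: {mpi_share:.1%}")
```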