Example Intel Application Performance Snapshot: rank_league
- Build the rank_league benchmark:

  module add compiler/intel/18.0
  module add mpi/impi/2018
  mpicc -Ofast -xHost -ipo rank_league.c -o rank_league
- Jobscript rank_league.aps.job:

  #!/usr/bin/bash
  #MSUB -l nodes=4:ppn=1
  #MSUB -l walltime=00:20:00

  # Prepare environment
  module purge
  module add compiler/intel/18.0
  module add mpi/impi/2018

  # Set up APS environment on cluster fh2, fh1
  module add devel/APS

  # rank_league options
  #   test_type:   b - bandwidth
  #   output_type: s - statistics per rank - average, min, max
  #   loop_num:    number of loops per every round
  RANK_LEAGUE_OPTIONS=( "-t=b" "-o=s" "-l=20000" )

  MPIRUN_OPTIONS=( "-print-rank-map" "-binding" "domain=core" )

  mpirun "${MPIRUN_OPTIONS[@]}" aps ./rank_league "${RANK_LEAGUE_OPTIONS[@]}"
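As a rough sanity check of what these options imply, the data volume exchanged per rank pair and round can be estimated with shell arithmetic. This is a sketch under stated assumptions: the loop count comes from `-l=20000` above, and the message size of 100000 bytes is the value reported in the job output below, not a jobscript parameter.

```shell
# Estimate data volume per rank pair per round.
# Assumptions: -l=20000 loops per round (from the jobscript) and a
# 100000-byte message size (as reported in the job output).
LOOPS=20000
MSG_SIZE=100000                      # bytes
BYTES=$(( LOOPS * MSG_SIZE ))
echo "Data per rank pair per round: $(( BYTES / 1000000 )) MB"
# → Data per rank pair per round: 2000 MB
```

At the roughly 5.9 GB/s rates the benchmark reports, moving 2000 MB takes well under a second per pair, which is consistent with the few-second elapsed time of the run.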
- Run the rank_league benchmark with APS via the batch system:

  msub < rank_league.aps.job
- Job output
(fhcn0003:0) (fhcn0002:1) (fhcn0001:2) (fhcn0004:3)
  ****** Running bandwidth test ********
  Total number of rounds: 3
  Total number of loops per round: 20000
  Message size: 100000
  **************************************
  **************************************
  RANK      MIN                 MAX                 AVERAGE
            RESULT    RANK      RESULT    RANK
  ___________________________________________________________
  0         5806.17   2         5916.02   3         5868.18
  1         5820.71   3         5961.95   0         5888.80
  2         5827.86   3         5927.72   1         5890.09
  3         5861.00   2         5913.43   1         5886.40
  ___________________________________________________________
  Global statistics:
  MIN       5806.17   between 0 and 2
  MAX       5961.95   between 1 and 0
  AVERAGE   5883.37
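The global statistics can be cross-checked directly from the per-rank rows: the global MIN is the smallest per-rank minimum, the global MAX the largest per-rank maximum, and the global AVERAGE the mean of the per-rank averages. A minimal awk sketch over the table data (values copied from the output above; column order assumed as rank, min result, min peer, max result, max peer, average):

```shell
# Recompute the global statistics from the per-rank rows.
# Columns: rank  min_result  min_peer  max_result  max_peer  average
awk '
  { if (NR == 1 || $2 < min) min = $2
    if (NR == 1 || $4 > max) max = $4
    sum += $6; n++ }
  END { printf "MIN %.2f MAX %.2f AVERAGE %.2f\n", min, max, sum / n }
' <<'EOF'
0 5806.17 2 5916.02 3 5868.18
1 5820.71 3 5961.95 0 5888.80
2 5827.86 3 5927.72 1 5890.09
3 5861.00 2 5913.43 1 5886.40
EOF
# → MIN 5806.17 MAX 5961.95 AVERAGE 5883.37
```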
  Emon collector successfully stopped.
  Emon collector successfully stopped.
  Emon collector successfully stopped.
  Emon collector successfully stopped.

  Intel(R) Application Performance Snapshot 2018 collection completed successfully.
  Use the "aps --report=/pfs/data1/home/kit/scc/bq0742/cluster_performance_verification/src/rank_league/aps_result_20180327"
  command to generate textual and HTML reports for the profiling session.
- APS text report:
  aps --report=/pfs/data1/home/kit/scc/bq0742/cluster_performance_verification/src/rank_league/aps_result_20180327

  | Summary information
  |--------------------------------------------------------------------
    Application                : rank_league
    Report creation date       : 2018-03-27 18:03:37
    Number of ranks            : 4
    Ranks per node             : 1
    HW Platform                : Intel(R) Xeon(R) E5/E7 v3 Processor code named Haswell
    Logical core count per node: 40
    Collector type             : Event-based counting driver
    Used statistics            : /pfs/data1/home/kit/scc/bq0742/cluster_performance_verification/src/rank_league/aps_result_20180327
  |
  | Your application is MPI bound.
  | This may be caused by high busy wait time inside the library (imbalance),
  | non-optimal communication schema or MPI library settings. Use MPI profiling
  | tools like Intel(R) Trace Analyzer and Collector to explore performance
  | bottlenecks.

  Elapsed time: 2.94 sec

  CPI Rate: 0.68

  MPI Time: 2.84 sec            96.56%
  | Your application is MPI bound. This may be caused by high busy wait time
  | inside the library (imbalance), non-optimal communication schema or MPI
  | library settings. Explore the MPI Imbalance metric if it is available or use
  | MPI profiling tools like Intel(R) Trace Analyzer and Collector to explore
  | possible performance bottlenecks.

  MPI Imbalance: 0.04 sec        1.31%

  Top 5 MPI functions (avg time):
    Waitall     2.67 sec (91.02 %)
    Isend       0.03 sec ( 0.86 %)
    Irecv       0.02 sec ( 0.62 %)
    Barrier     0.01 sec ( 0.24 %)
    Sendrecv    0.00 sec ( 0.01 %)

  Memory Stalls: 51.73% of pipeline slots
  | The metric value can indicate that a significant fraction of execution
  | pipeline slots could be stalled due to demand memory load and stores. See the
  | second level metrics to define if the application is cache- or DRAM-bound and
  | the NUMA efficiency. Use Intel(R) VTune(TM) Amplifier Memory Access analysis
  | to review a detailed metric breakdown by memory hierarchy, memory bandwidth
  | information, and correlation by memory objects.

    Cache Stalls: 82.30% of cycles
    | A significant proportion of cycles are spent on data fetches from cache. Use
    | Intel(R) VTune(TM) Amplifier Memory Access analysis to see if accesses to L2
    | or L3 cache are problematic and consider applying the same performance tuning
    | as you would for a cache-missing workload. This may include reducing the data
    | working set size, improving data access locality, blocking or partitioning the
    | working set to fit in the lower cache levels, or exploiting hardware
    | prefetchers.

    NUMA: % of Remote Accesses: 1.68%

  Average DRAM Bandwidth: 0.10 GB/s

  I/O Bound: 0.00 sec ( 0.00 %)
    Data read:    1.9 KB
    Data written: 4.7 KB

  Memory Footprint:
  Resident:
    Per node:
      Peak resident set size    : 35.08 MB (node fhcn0001.localdomain,s.*)
      Average resident set size : 35.04 MB
    Per rank:
      Peak resident set size    : 35.08 MB (rank 0)
      Average resident set size : 35.04 MB
  Virtual:
    Per node:
      Peak memory consumption    : 205.22 MB (node fhcn0003.localdomain,s.*)
      Average memory consumption : 205.22 MB
    Per rank:
      Peak memory consumption    : 205.22 MB (rank 0)
      Average memory consumption : 205.22 MB

  Graphical representation of this data is available in the HTML report:
  /pfs/data1/home/kit/scc/bq0742/cluster_performance_verification/src/rank_league/aps_report_20180327_180934.html
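When many such reports are collected (e.g. for the cluster verification runs this example belongs to), the MPI time share can be pulled out of the text report and checked against a threshold. A minimal sketch, assuming the `MPI Time:` line format shown above; the 10% threshold is an illustrative choice, not an APS default:

```shell
# Extract the MPI time percentage from an APS text-report line and flag
# an MPI-bound run. Sample line copied from the report above; the 10%
# threshold is a hypothetical cutoff chosen for illustration.
line="MPI Time: 2.84 sec            96.56%"
pct=$(echo "$line" | grep -o '[0-9.]*%' | tr -d '%')
if awk -v p="$pct" 'BEGIN { exit !(p > 10) }'; then
  echo "MPI bound: ${pct}% of elapsed time in MPI"
fi
# → MPI bound: 96.56% of elapsed time in MPI
```

In this run the check would fire, matching the report's own "Your application is MPI bound" advice.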
Last modified on Apr 9, 2019, 11:44:40 AM