
Example: Intel Application Performance Snapshot (APS) with the rank_league benchmark

  • Build the rank_league benchmark
    module add compiler/intel/18.0
    module add mpi/impi/2018
    mpicc -Ofast -xHost -ipo rank_league.c -o rank_league
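
    An optional quick check of the binary before submitting (a minimal sketch:
    -l=100 is only an illustrative small loop count, and the run has to happen
    where short interactive MPI jobs are allowed; the options are the same ones
    used in the jobscript below):
    mpirun -np 2 ./rank_league -t=b -o=s -l=100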
    
  • Jobscript rank_league.aps.job
    #!/usr/bin/bash
    #MSUB -l nodes=4:ppn=1
    #MSUB -l walltime=00:20:00
    
    # Prepare environment
    module purge
    module add compiler/intel/18.0
    module add mpi/impi/2018
    
    # Set up the APS environment (clusters fh1 and fh2)
    module add devel/APS
    
    # rank_league options
    # test_type:   b - bandwidth
    # output_type: s - statistics per rank (average, min, max)
    # loop_num:    number of loops per round
    RANK_LEAGUE_OPTIONS=( "-t=b" "-o=s" "-l=20000" )
    
    MPIRUN_OPTIONS=( "-print-rank-map" "-binding" "domain=core" )
    
    mpirun "${MPIRUN_OPTIONS[@]}" aps ./rank_league "${RANK_LEAGUE_OPTIONS[@]}"
    
  • Run the rank_league benchmark with APS via the batch system
    msub < rank_league.aps.job
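
    While the job is queued or running it can be monitored with the usual Moab
    client commands (assuming they are available on fh1/fh2). After it finishes,
    the aps wrapper leaves a result directory named aps_result_<date> in the
    working directory, as referenced in the job output below:
    showq -u $USER          # list your queued and running jobs
    checkjob <jobid>        # detailed status of a single job
    ls -d aps_result_*      # APS result directory created by the run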
    
  • Job output
    (fhcn0003:0)
    (fhcn0002:1)
    (fhcn0001:2)
    (fhcn0004:3)
    
    ****** Running bandwidth test ********
    Total number of rounds:          3
    Total number of loops per round: 20000
    Message size:                    100000
    **************************************
    
    **************************************
    RANK           MIN                MAX               AVERAGE
               RESULT  RANK        RESULT  RANK
    ___________________________________________________________
     0        5806.17   2          5916.02   3          5868.18
     1        5820.71   3          5961.95   0          5888.80
     2        5827.86   3          5927.72   1          5890.09
     3        5861.00   2          5913.43   1          5886.40
    ___________________________________________________________
    Global statistics:
     MIN     5806.17 between 0 and 2
     MAX     5961.95 between 1 and 0
     AVERAGE 5883.37
    
    Emon collector successfully stopped.
    Emon collector successfully stopped.
    Emon collector successfully stopped.
    Emon collector successfully stopped.
    Intel(R) Application Performance Snapshot 2018 collection completed successfully. Use the "aps --report=/pfs/data1/home/kit/scc/bq0742/cluster_performance_verification/src/rank_league/aps_result_20180327" command to generate textual and HTML reports for the profiling session.
    
  • APS text report:
    aps --report=/pfs/data1/home/kit/scc/bq0742/cluster_performance_verification/src/rank_league/aps_result_20180327
    | Summary information
    |--------------------------------------------------------------------
      Application                : rank_league
      Report creation date       : 2018-03-27 18:03:37
      Number of ranks            : 4
      Ranks per node             : 1
      HW Platform                : Intel(R) Xeon(R) E5/E7 v3 Processor code named Haswell
      Logical core count per node: 40
      Collector type             : Event-based counting driver
      Used statistics            : /pfs/data1/home/kit/scc/bq0742/cluster_performance_verification/src/rank_league/aps_result_20180327
    |
    | Your application is MPI bound.
    | This may be caused by high busy wait time inside the library (imbalance), non-optimal communication schema or MPI library settings. Use MPI profiling tools like Intel(R) Trace Analyzer and Collector to explore performance bottlenecks.
    |
      Elapsed time:                2.94 sec
      CPI Rate:                    0.68
      MPI Time:                    2.84 sec            96.56%
    | Your application is MPI bound. This may be caused by high busy wait time
    | inside the library (imbalance), non-optimal communication schema or MPI
    | library settings. Explore the MPI Imbalance metric if it is available or use
    | MPI profiling tools like Intel(R) Trace Analyzer and Collector to explore
    | possible performance bottlenecks.
        MPI Imbalance:             0.04 sec             1.31%
        Top 5 MPI functions (avg time):
            Waitall                      2.67 sec  (91.02 %)
            Isend                        0.03 sec  ( 0.86 %)
            Irecv                        0.02 sec  ( 0.62 %)
            Barrier                      0.01 sec  ( 0.24 %)
            Sendrecv                     0.00 sec  ( 0.01 %)
      Memory Stalls:                              51.73% of pipeline slots
    | The metric value can indicate that a significant fraction of execution
    | pipeline slots could be stalled due to demand memory load and stores. See the
    | second level metrics to define if the application is cache- or DRAM-bound and
    | the NUMA efficiency. Use Intel(R) VTune(TM) Amplifier Memory Access analysis
    | to review a detailed metric breakdown by memory hierarchy, memory bandwidth
    | information, and correlation by memory objects.
        Cache Stalls:                             82.30% of cycles
    | A significant proportion of cycles are spent on data fetches from cache. Use
    | Intel(R) VTune(TM) Amplifier Memory Access analysis to see if accesses to L2
    | or L3 cache are problematic and consider applying the same performance tuning
    | as you would for a cache-missing workload. This may include reducing the data
    | working set size, improving data access locality, blocking or partitioning the
    | working set to fit in the lower cache levels, or exploiting hardware
    | prefetchers.
        NUMA: % of Remote Accesses:                1.68%
        Average DRAM Bandwidth:                    0.10  GB/s
      I/O Bound:                  0.00 sec ( 0.00 %)
           Data read:             1.9  KB
           Data written:          4.7  KB
     Memory Footprint:
     Resident:
           Per node:
                Peak resident set size    :           35.08 MB (node fhcn0001.localdomain)
               Average resident set size :           35.04 MB
           Per rank:
               Peak resident set size    :           35.08 MB (rank 0)
               Average resident set size :           35.04 MB
     Virtual:
           Per node:
                Peak memory consumption    :          205.22 MB (node fhcn0003.localdomain)
               Average memory consumption :          205.22 MB
           Per rank:
               Peak memory consumption    :          205.22 MB (rank 0)
               Average memory consumption :          205.22 MB
    
    Graphical representation of this data is available in the HTML report: /pfs/data1/home/kit/scc/bq0742/cluster_performance_verification/src/rank_league/aps_report_20180327_180934.html
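
    The HTML report can be copied to a local workstation and opened in any
    browser (a sketch; <user> and <fh2-login-node> are placeholders, not part
    of the original output):
    scp <user>@<fh2-login-node>:/pfs/data1/home/kit/scc/bq0742/cluster_performance_verification/src/rank_league/aps_report_20180327_180934.html .
    firefox aps_report_20180327_180934.html   # or any other browser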
    