Example: Intel Application Performance Snapshot (APS) with the STREAM benchmark

  • Build the STREAM benchmark
    module add compiler/intel/18.0
    icc -std=c11 -Ofast -xHost -ipo -qopenmp \
        stream.c -o stream
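    A minimal sketch of the same build with the compiler flags annotated
    (module and compiler versions as above; the flag descriptions follow the
    Intel compiler documentation):
      module avail compiler/intel        # list the Intel compiler modules on the cluster
      module add compiler/intel/18.0
      which icc && icc --version         # confirm the compiler is on PATH
      # -Ofast    enables -O3 plus aggressive floating-point optimizations
      # -xHost    targets the highest instruction set of the build host
      # -ipo      enables interprocedural optimization across source files
      # -qopenmp  enables OpenMP, which parallelizes the STREAM kernels
      icc -std=c11 -Ofast -xHost -ipo -qopenmp stream.c -o stream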
    
  • Set up the APS environment on clusters fh1 and fh2 (a quick verification check follows the uc1 step below)
    module add devel/APS
    
  • Set up the APS environment on cluster uc1
    source /opt/bwhpc/common/devel/aps/2019/apsvars.sh
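    Either setup step should put the aps launcher on PATH. A quick check
    (plain commands; aps --help printing the usage text is an assumption
    about this APS version):
      which aps     # should resolve inside the APS installation
      aps --help    # prints the usage text if the environment is complete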
    
  • Run the stream benchmark with APS
    export OMP_NUM_THREADS=20
    export KMP_AFFINITY=verbose,granularity=core,respect,scatter
    aps ./stream -n 2500000000
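    The same run as an annotated script. The thread count matches the number
    of physical cores per node, presumably 20 physical cores with
    Hyper-Threading (an inference from the 40 logical cores reported below);
    the -n argument is specific to this stream build, which takes the array
    size at run time:
      #!/bin/bash
      # one OpenMP thread per physical core
      export OMP_NUM_THREADS=20
      # verbose: log the thread binding; granularity=core: pin threads to
      # cores, not hardware threads; scatter: distribute threads across
      # sockets so both memory controllers are used
      export KMP_AFFINITY=verbose,granularity=core,respect,scatter
      # aps samples hardware counters while the application runs and writes
      # an aps_result_* directory plus an HTML report (see the output below)
      aps ./stream -n 2500000000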
    
  • Output
    -------------------------------------------------------------
    STREAM version $Revision: 5.10 $
    -------------------------------------------------------------
    This system uses 8 bytes per array element.
    -------------------------------------------------------------
    Array size = 2500000000 (elements)
    Memory per array = 19073.5 MiB (= 18.6 GiB).
    Total memory required = 57220.5 MiB (= 55.9 GiB).
    Each kernel will be executed 10 times.
     The *best* time for each kernel (excluding the first iteration)
     will be used to compute the reported bandwidth.
    -------------------------------------------------------------
    Number of Threads requested = 20
    Number of Threads counted = 20
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 1 microseconds.
    Each test below will take on the order of 384517 microseconds.
       (= 384517 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function    Best Rate MB/s  Med time     Min time     Max time
    Copy:          108408.0     0.459804     0.368976     0.489267
    Scale:         108619.0     0.466340     0.368260     0.504614
    Add:           109616.3     0.615987     0.547364     0.647817
    Triad:          98889.5     0.623797     0.606738     0.636934
    -------------------------------------------------------------
    Solution Validates: avg error less than 1.000000e-13 on all three arrays
    -------------------------------------------------------------
    
    Emon collector successfully stopped.
    | Summary information
    |--------------------------------------------------------------------
      Application                : stream
      Report creation date       : 2018-03-27 14:12:32
      OpenMP threads number      : 20
      HW Platform                : Intel(R) Xeon(R) E5/E7 v3 Processor code named Haswell
      Logical core count per node: 40
      Collector type             : Event-based counting driver
      Used statistics            : aps_result_20180327
    |
    | Your application is memory bound.
    | Use memory access analysis tools like Intel(R) VTune(TM) Amplifier for a detailed metric breakdown by memory hierarchy, memory bandwidth, and correlation by memory objects.
    |
      Elapsed time:               32.23 sec
      CPI Rate:                    3.44
    | The CPI value may be too high.
    | This could be caused by such issues as memory stalls, instruction starvation,
    | branch misprediction, or long latency instructions.
    | Use Intel(R) VTune(TM) Amplifier General Exploration analysis to specify
    | particular reasons of high CPI.
      Serial Time:                 0.01 sec             0.03%
      OpenMP Imbalance:            4.04 sec            12.52%
    | The metric value can indicate significant time spent by threads waiting at
    | barriers. Consider using dynamic work scheduling to reduce the imbalance where
    | possible. Use Intel(R) VTune(TM) Amplifier HPC Performance Characterization
    | analysis to review imbalance data distributed by barriers of different lexical
    | regions.
      Memory Stalls:                              84.50% of pipeline slots
    | The metric value can indicate that a significant fraction of execution
    | pipeline slots could be stalled due to demand memory load and stores. See the
    | second level metrics to define if the application is cache- or DRAM-bound and
    | the NUMA efficiency. Use Intel(R) VTune(TM) Amplifier Memory Access analysis
    | to review a detailed metric breakdown by memory hierarchy, memory bandwidth
    | information, and correlation by memory objects.
        Cache Stalls:                             32.70% of cycles
    | A significant proportion of cycles are spent on data fetches from cache. Use
    | Intel(R) VTune(TM) Amplifier Memory Access analysis to see if accesses to L2
    | or L3 cache are problematic and consider applying the same performance tuning
    | as you would for a cache-missing workload. This may include reducing the data
    | working set size, improving data access locality, blocking or partitioning the
    | working set to fit in the lower cache levels, or exploiting hardware
    | prefetchers.
        NUMA: % of Remote Accesses:                0.50%
        Average DRAM Bandwidth:                   80.00  GB/s
    | The system spent significant time heavily utilizing DRAM bandwidth. Improve
    | data accesses to reduce cacheline transfers from/to memory using these
    | possible techniques: 1) consume all bytes of each cacheline before it is
    | evicted (for example, reorder structure elements and split non-hot ones); 2)
    | merge compute-limited and bandwidth-limited loops; 3) use NUMA optimizations
    | on a multi-socket system. You can also allocate data structures that induce
    | DRAM traffic to High Bandwidth Memory (HBM), if available. Use Intel(R)
    | VTune(TM) Amplifier XE Memory Access analysis to learn more on possible
    | reasons and next steps in optimization.
     Memory Footprint:
      Resident:      57225.25 MB
      Virtual:       58548.68 MB
    
    Graphical representation of this data is available in the HTML report: /pfs/data1/home/kit/scc/bq0742/cluster_performance_verification/src/stream-5.10/aps_report_20180327_141312.html
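    The rate table above can be sanity-checked by hand: STREAM derives Best
    Rate from the bytes each kernel moves divided by its minimum time, with
    MB = 10^6 bytes. Copy touches two arrays of 2.5e9 doubles, i.e.
    2 * 2500000000 * 8 bytes = 40 GB per iteration, and Triad touches three
    arrays (standard STREAM byte counting):
      echo "2 * 2500000000 * 8 / 0.368976 / 10^6" | bc -l   # ~108408 MB/s, the Copy row
      echo "3 * 2500000000 * 8 / 0.606738 / 10^6" | bc -l   # ~98890 MB/s, the Triad row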
    
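    The text summary can be regenerated later from the stored result directory
    (a sketch, assuming the aps --report option of this APS version; the HTML
    file can simply be copied to a workstation and opened in a browser):
      aps --report=aps_result_20180327   # re-create the report from the collected data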