Example Intel Application Performance Snapshot: stream
- Build stream benchmark
module add compiler/intel/18.0
icc -std=c11 -Ofast -xHost -ipo -qopenmp \
  stream.c -o stream
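For orientation, the kernel that this compile line builds looks roughly like the sketch below. This is not the actual stream.c used in the run (the array size and setup are reduced for illustration); it only shows the shape of the OpenMP-parallel Triad loop whose bandwidth appears in the output further down.

/* Illustrative sketch only -- not the actual stream.c from this run.
 * It shows the shape of the STREAM Triad kernel: three arrays of 8-byte
 * doubles and an OpenMP-parallel loop; bandwidth = bytes moved / time.
 * N is reduced here; the real run uses 2500000000 elements per array. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 10000000UL   /* assumption: small size, for illustration only */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double scalar = 3.0;

    #pragma omp parallel for
    for (size_t i = 0; i < N; i++) { a[i] = 0.0; b[i] = 2.0; c[i] = 1.0; }

    double t = omp_get_wtime();
    #pragma omp parallel for
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];   /* Triad: 2 reads + 1 write per element */
    t = omp_get_wtime() - t;

    /* 3 * 8 = 24 bytes of traffic per element, reported in MB/s (1 MB = 1e6 B). */
    printf("Triad: %.1f MB/s\n", 3.0 * sizeof(double) * N / t / 1.0e6);

    free(a); free(b); free(c);
    return 0;
}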
- Set up APS environment on clusters fh2 and fh1
module add devel/APS
- Set up APS environment on cluster uc1
source /opt/bwhpc/common/devel/aps/2019/apsvars.sh
- Run stream benchmark with APS
export OMP_NUM_THREADS=20
export KMP_AFFINITY=verbose,granularity=core,respect,scatter
aps ./stream -n 2500000000
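KMP_AFFINITY=verbose makes the Intel OpenMP runtime print its thread-to-core bindings at startup. If you want to double-check the placement from inside a program, a tiny helper like the following can be used. This is a hypothetical helper, not part of the benchmark or the APS workflow; it assumes Linux/glibc (for sched_getcpu) and the same icc -qopenmp compile as above.

/* check_affinity.c -- hypothetical helper, not part of the stream benchmark.
 * Each OpenMP thread reports the logical CPU it is currently running on,
 * which can be compared against the KMP_AFFINITY=verbose output. */
#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        printf("thread %2d of %2d runs on logical CPU %3d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}

Compile with icc -qopenmp check_affinity.c -o check_affinity and run it with the same OMP_NUM_THREADS and KMP_AFFINITY settings as the benchmark.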
- Output
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 2500000000 (elements)
Memory per array = 19073.5 MiB (= 18.6 GiB).
Total memory required = 57220.5 MiB (= 55.9 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 20
Number of Threads counted = 20
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 384517 microseconds.
   (= 384517 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s   Med time     Min time     Max time
Copy:           108408.0     0.459804     0.368976     0.489267
Scale:          108619.0     0.466340     0.368260     0.504614
Add:            109616.3     0.615987     0.547364     0.647817
Triad:           98889.5     0.623797     0.606738     0.636934
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Emon collector successfully stopped.
| Summary information
|--------------------------------------------------------------------
  Application                : stream
  Report creation date       : 2018-03-27 14:12:32
  OpenMP threads number      : 20
  HW Platform                : Intel(R) Xeon(R) E5/E7 v3 Processor code named Haswell
  Logical core count per node: 40
  Collector type             : Event-based counting driver
  Used statistics            : aps_result_20180327
|
| Your application is memory bound.
| Use memory access analysis tools like Intel(R) VTune(TM) Amplifier for a detailed
| metric breakdown by memory hierarchy, memory bandwidth, and correlation by
| memory objects.
|
  Elapsed time: 32.23 sec
  CPI Rate: 3.44
 | The CPI value may be too high.
 | This could be caused by such issues as memory stalls, instruction starvation,
 | branch misprediction, or long latency instructions.
 | Use Intel(R) VTune(TM) Amplifier General Exploration analysis to specify
 | particular reasons of high CPI.
  Serial Time: 0.01 sec 0.03%
  OpenMP Imbalance: 4.04 sec 12.52%
 | The metric value can indicate significant time spent by threads waiting at
 | barriers. Consider using dynamic work scheduling to reduce the imbalance where
 | possible. Use Intel(R) VTune(TM) Amplifier HPC Performance Characterization
 | analysis to review imbalance data distributed by barriers of different lexical
 | regions.
  Memory Stalls: 84.50% of pipeline slots
 | The metric value can indicate that a significant fraction of execution
 | pipeline slots could be stalled due to demand memory load and stores. See the
 | second level metrics to define if the application is cache- or DRAM-bound and
 | the NUMA efficiency. Use Intel(R) VTune(TM) Amplifier Memory Access analysis
 | to review a detailed metric breakdown by memory hierarchy, memory bandwidth
 | information, and correlation by memory objects.
  Cache Stalls: 32.70% of cycles
 | A significant proportion of cycles are spent on data fetches from cache. Use
 | Intel(R) VTune(TM) Amplifier Memory Access analysis to see if accesses to L2
 | or L3 cache are problematic and consider applying the same performance tuning
 | as you would for a cache-missing workload. This may include reducing the data
 | working set size, improving data access locality, blocking or partitioning the
 | working set to fit in the lower cache levels, or exploiting hardware
 | prefetchers.
  NUMA: % of Remote Accesses: 0.50%
  Average DRAM Bandwidth: 80.00 GB/s
 | The system spent significant time heavily utilizing DRAM bandwidth. Improve
 | data accesses to reduce cacheline transfers from/to memory using these
 | possible techniques: 1) consume all bytes of each cacheline before it is
 | evicted (for example, reorder structure elements and split non-hot ones); 2)
 | merge compute-limited and bandwidth-limited loops; 3) use NUMA optimizations
 | on a multi-socket system. You can also allocate data structures that induce
 | DRAM traffic to High Bandwidth Memory (HBM), if available. Use Intel(R)
 | VTune(TM) Amplifier XE Memory Access analysis to learn more on possible
 | reasons and next steps in optimization.
  Memory Footprint:
  Resident: 57225.25 MB
  Virtual:  58548.68 MB
Graphical representation of this data is available in the HTML report:
/pfs/data1/home/kit/scc/bq0742/cluster_performance_verification/src/stream-5.10/aps_report_20180327_141312.html
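As a sanity check, the headline numbers in the output follow directly from the array size used in this run. The short sketch below reproduces them; it is illustrative only, with the constants copied from the STREAM output above.

/* Cross-check of the figures reported above (illustrative sketch only;
 * the constants are copied from the STREAM output of this run). */
#include <stdio.h>

int main(void)
{
    const double n     = 2500000000.0;  /* elements per array                   */
    const double elem  = 8.0;           /* bytes per array element (double)     */
    const double t_min = 0.606738;      /* best (minimum) Triad time in seconds */

    /* Memory footprint: 2.5e9 * 8 B = 19073.5 MiB per array, three arrays total. */
    printf("per array : %.1f MiB\n", n * elem / (1024.0 * 1024.0));
    printf("total     : %.1f MiB (= %.1f GiB)\n",
           3.0 * n * elem / (1024.0 * 1024.0),
           3.0 * n * elem / (1024.0 * 1024.0 * 1024.0));

    /* Triad moves 24 B per element (read b, read c, write a), so the best rate
       is 24 * 2.5e9 B / 0.606738 s = 98889.5 MB/s (1 MB = 1e6 B), as reported. */
    printf("Triad rate: %.1f MB/s\n", 3.0 * elem * n / t_min / 1.0e6);
    return 0;
}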