Tools/APS/example_stream
Example
Intel Application Performance Snapshot: stream
Build stream benchmark
module add compiler/intel/2022
icc -std=c11 -Ofast -xHost -ipo -qopenmp \
    -o stream stream.c
Set up APS environment on cluster HoreKa
# Standalone
source /software/all/toolkit/Intel_OneAPI/vtune/latest/apsvars.sh
# or as part of Intel VTune
module add devel/vtune/2022
Set up APS environment on cluster BwUniCluster 2.0
source /software/all/toolkit/Intel_OneAPI/vtune/latest/apsvars.sh
Run benchmark stream with APS
export OMP_NUM_THREADS=76
export KMP_AFFINITY=verbose,granularity=core,respect,scatter
aps ./stream -n 2500000000
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 2499999936 (elements)
Memory per array = 19073.5 MiB (= 18.6 GiB).
Total memory required = 57220.5 MiB (= 55.9 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
OpenMP version (yyyymm): 201611
Number of Threads requested = 76
Number of Threads counted = 76
-------------------------------------------------------------
Your clock granularity appears to be 1000 ticks per microseconds.
Each test below will take on the order of 139607 microseconds.
   (= 139607275 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Med time     Min time     Max time
Copy:           312887.1     0.128112     0.127842     0.131924
Scale:          313607.5     0.127720     0.127548     0.132112
Add:            317174.0     0.189375     0.189171     0.193327
Triad:          317797.3     0.189005     0.188800     0.195389
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Intel(R) oneAPI VTune(TM) Profiler 2022.0.0 collection completed successfully.
Use the "aps --report <...>/aps_result_20220602" command to generate textual and HTML reports for the profiling session.
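The "Best Rate" column in the STREAM output above can be cross-checked by hand. The sketch below assumes the usual STREAM accounting: Copy and Scale touch two arrays per element, Add and Triad touch three, each element is 8 bytes, and the rate is bytes moved divided by the "Min time" in units of 10^6 bytes per second. The function name stream_bandwidth_mbs is a hypothetical helper, not part of STREAM or APS.

```shell
#!/bin/sh
# Recompute a STREAM "Best Rate" (MB/s) from array size and minimum kernel time.
# $1: number of arrays the kernel reads or writes per element
#     (assumed: Copy and Scale = 2, Add and Triad = 3)
# $2: array size in elements
# $3: "Min time" in seconds
stream_bandwidth_mbs() {
    awk -v arrays="$1" -v elements="$2" -v seconds="$3" \
        'BEGIN { printf "%.1f\n", arrays * 8 * elements / seconds / 1e6 }'
}

# Values taken from the run above.
stream_bandwidth_mbs 2 2499999936 0.127842   # Copy
stream_bandwidth_mbs 3 2499999936 0.188800   # Triad
```

The results differ from the printed rates by about 1 MB/s because the "Min time" column is rounded to six decimal places.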
Generate APS report:
aps --report <...>/aps_result_20220602
Loading 100.00%
| Summary information
|--------------------------------------------------------------------
  Application                       : stream
  Report creation date              : 2022-06-02 10:17:22
  OpenMP threads number per Process : 76
  HW Platform                       : Intel(R) Xeon(R) Processor code named Icelake
  Frequency                         : 2.39 GHz
  Logical core count per node       : 152
  Collector type                    : Driverless Perf per-process counting
  Used statistics                   : <...>/aps_result_20220602
|
| Your application is memory bound.
| Use memory access analysis tools like Intel(R) VTune(TM) Profiler for a detailed metric breakdown by memory hierarchy, memory bandwidth, and correlation by memory objects.
|
 Elapsed Time:              8.14 s
 SP GFLOPS:                 0.00
 DP GFLOPS:                15.65
 Average CPU Frequency:     3.15 GHz
 IPC Rate:                  0.12
 | The IPC value may be too low.
 | This could be caused by issues such as memory stalls, instruction starvation,
 | branch misprediction or long latency instructions.
 | Use Intel(R) VTune(TM) Profiler Microarchitecture Exploration analysis to
 | specify particular reasons of low IPC.
 Serial Time:               0.07 s    0.86% of Elapsed Time
 OpenMP Imbalance:          0.16 s    1.97% of Elapsed Time
 Memory Stalls:            89.60% of Pipeline Slots
 | The metric value can indicate that a significant fraction of execution
 | pipeline slots could be stalled due to demand memory load and stores. See the
 | second level metrics to define if the application is cache- or DRAM-bound and
 | the NUMA efficiency. Use Intel(R) VTune(TM) Profiler Memory Access analysis to
 | review a detailed metric breakdown by memory hierarchy, memory bandwidth
 | information, and correlation by memory objects.
    Cache Stalls:           1.40% of Cycles
    DRAM Stalls:           87.50% of Cycles
    | The metric value indicates that a significant fraction of cycles could be
    | stalled on the main memory (DRAM) because of demand loads or stores. Use
    | Intel(R) VTune(TM) Profiler Memory Access Analysis to get more details if the
    | code is latency- or bandwidth-bound and what can be done to increase memory
    | access efficiency.
    Average DRAM Bandwidth: N/A
    | Data for this metric is not collected since it requires system-wide
    | performance monitoring. Make sure the sampling driver is properly installed on
    | your system: https://software.intel.com/en-us/vtune-amplifier-help-sep-driver.
    | Otherwise, enable a driverless Perf-based sampling collection by setting the
    | /proc/sys/kernel/perf_even_paranoid value to 0 or less.
    NUMA:                   0.00% of Remote Accesses
 Vectorization:           100.00%
    Instruction Mix:
       SP FLOPs:            0.00% of uOps
       DP FLOPs:           15.10% of uOps
          Packed:         100.00% from DP FP
             128-bit:       0.00%
             256-bit:     100.00%
             | A significant fraction of floating point arithmetic vector instructions
             | executed with partial vector load. A possible reason is compilation with
             | legacy instruction set. Check the compiler options. Another possible reason is
             | compiler code generation specifics. Use Intel(R) Advisor to learn more.
             512-bit:       0.00%
          Scalar:           0.00% from DP FP
       Non-FP:             84.90% of uOps
    FP Arith/Mem Rd Instr. Ratio: 0.53
    FP Arith/Mem Wr Instr. Ratio: 1.00
 Memory Footprint:
    Resident:           58602.00 MB
    Virtual:            63853.00 MB

Graphical representation of this data is available in the HTML report: <...>/aps_report_20220602_102205.html
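The report leaves "Average DRAM Bandwidth" as N/A because only per-process counting was available. The Linux kernel file that controls this is /proc/sys/kernel/perf_event_paranoid (the report text spells it without the "t"). A minimal sketch for checking the current setting follows; check_paranoid is a hypothetical helper, and the "0 or less" threshold is taken from the hint in the report.

```shell
#!/bin/sh
# Report whether perf_event_paranoid permits system-wide counting.
# $1 (optional): file to read, defaults to the kernel setting; the argument
# exists only so the check can be tried against a test file.
check_paranoid() {
    file="${1:-/proc/sys/kernel/perf_event_paranoid}"
    if [ ! -r "$file" ]; then
        echo "cannot read $file"
        return 2
    fi
    level=$(cat "$file")
    if [ "$level" -le 0 ]; then
        echo "perf_event_paranoid=$level: system-wide collection should be possible"
    else
        echo "perf_event_paranoid=$level: per-process counting only"
        return 1
    fi
}

check_paranoid || true
```

Changing the value requires administrator rights, so on a shared cluster this check mainly explains why the metric is missing.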