Example
Intel Application Performance Snapshot: stream
Prepare environment
module purge
module add compiler/intel/2022
Build stream benchmark
icc -std=c11 -Ofast -xHost -ipo -qopenmp \
    -o stream stream.c
Set up APS environment on cluster HoreKa
# Standalone
source /software/all/toolkit/Intel_OneAPI/vtune/latest/apsvars.sh
# or as part of Intel VTune
module add devel/vtune/2023
Set up APS environment on cluster BwUniCluster 2.0
source /software/all/toolkit/Intel_OneAPI/vtune/latest/apsvars.sh
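After sourcing `apsvars.sh` or loading the VTune module, it can be worth confirming that the `aps` launcher is actually found before starting a long run. The following is a minimal sketch, assuming a POSIX-compatible shell; the helper name `require_tool` is hypothetical and not part of APS.

```shell
# Hypothetical helper (not part of APS): report whether a command is on PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1 -> $(command -v "$1")"
    else
        echo "missing: $1 (environment not set up?)" >&2
        return 1
    fi
}

# Check the tools used on this page; keep going even if one is missing.
require_tool aps || true
require_tool icc || true
```

If `aps` is reported as missing, the environment setup above has most likely not been applied in the current shell.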
Run benchmark stream with APS
export OMP_NUM_THREADS=76
export KMP_AFFINITY=verbose,granularity=core,respect,scatter
aps ./stream -n 2500000000
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 2499999936 (elements)
Memory per array = 19073.5 MiB (= 18.6 GiB).
Total memory required = 57220.5 MiB (= 55.9 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
OpenMP version (yyyymm): 201611
Number of Threads requested = 76
Number of Threads counted = 76
-------------------------------------------------------------
Your clock granularity appears to be 1000 ticks per microseconds.
Each test below will take on the order of 128429 microseconds.
   (= 128429532 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Med time     Min time     Max time
Copy:           309238.3     0.129884     0.129350     0.130081
Scale:          309508.1     0.129582     0.129237     0.130091
Add:            311823.7     0.192909     0.192416     0.193625
Triad:          312619.3     0.192425     0.191927     0.192974
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Intel(R) VTune(TM) Profiler 2023.1.0 collection completed successfully. Use the "aps --report <...>/aps_result_20230526" command to generate textual and HTML reports for the profiling session.
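The figures printed by the run above can be cross-checked by hand. The sketch below assumes the usual STREAM accounting: Copy and Scale move two arrays per iteration, Add and Triad move three, each element is 8 bytes (as reported by this run), and MB/s means 10^6 bytes per second. The helper names `stream_footprint_mib` and `stream_rate_mbs` are hypothetical and exist only for this illustration.

```shell
# Hypothetical helpers for sanity-checking STREAM output.
# Assumption: 8 bytes per array element, as reported by the run above.

# Memory footprint in MiB of the three STREAM arrays with n elements each.
stream_footprint_mib() {
    awk -v n="$1" 'BEGIN { printf "%.1f\n", 3 * 8 * n / (1024 * 1024) }'
}

# Bandwidth in MB/s (10^6 bytes/s).
# Arguments: arrays moved per iteration, elements per array, best time in s.
stream_rate_mbs() {
    awk -v a="$1" -v n="$2" -v t="$3" \
        'BEGIN { printf "%.1f\n", a * 8 * n / t / 1e6 }'
}

stream_footprint_mib 2499999936          # total memory for all three arrays
stream_rate_mbs 2 2499999936 0.129350    # Copy: one array read, one written
stream_rate_mbs 3 2499999936 0.191927    # Triad: two arrays read, one written
```

The footprint reproduces the 57220.5 MiB shown in the output, and the recomputed rates agree with the reported Copy and Triad values to within 1 MB/s; the small difference comes from the minimum times being printed with only six decimal places.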
Generate APS report:
aps --report <...>/aps_result_20230526
Loading 100.00%
| Summary information
|--------------------------------------------------------------------
  Application                       : stream
  Report creation date              : 2023-05-26 10:45:19
  OpenMP threads number per Process : 76
  HW Platform                       : Intel(R) Xeon(R) Processor code named Icelake
  Frequency                         : 2.39 GHz
  Logical core count per node       : 152
  Collector type                    : Driverless Perf per-process counting
  Used statistics                   : <...>/aps_result_20230526
|
| Your application might underutilize the available logical CPU cores
| because of insufficient parallel work, blocking on synchronization, or too much I/O. Perform function or source line-level profiling with tools like Intel(R) VTune(TM) Profiler to discover why the CPU is underutilized.
|
  Elapsed Time:                          8.12 s
  SP GFLOPS:                             0.00
  DP GFLOPS:                            15.66
  Average CPU Frequency:                 3.15 GHz
  IPC Rate:                              0.11
 | The IPC value may be too low.
 | This could be caused by issues such as memory stalls, instruction starvation,
 | branch misprediction or long latency instructions.
 | Use Intel(R) VTune(TM) Profiler Microarchitecture Exploration analysis to
 | specify particular reasons of low IPC.
  Serial Time:                           0.05 s      0.57% of Elapsed Time
  OpenMP Imbalance:                      0.13 s      1.63% of Elapsed Time
  Physical Core Utilization:            92.70%
  Average Physical Core Utilization:    70.44 out of 76 Physical Cores
  Memory Stalls:                        90.00% of Pipeline Slots
 | The metric value can indicate that a significant fraction of execution
 | pipeline slots could be stalled due to demand memory load and stores. See the
 | second level metrics to define if the application is cache- or DRAM-bound and
 | the NUMA efficiency. Use Intel(R) VTune(TM) Profiler Memory Access analysis to
 | review a detailed metric breakdown by memory hierarchy, memory bandwidth
 | information, and correlation by memory objects.
    Cache Stalls:                        1.30% of Cycles
    DRAM Stalls:                        88.20% of Cycles
 | The metric value indicates that a significant fraction of cycles could be
 | stalled on the main memory (DRAM) because of demand loads or stores. Use
 | Intel(R) VTune(TM) Profiler Memory Access Analysis to get more details if the
 | code is latency- or bandwidth-bound and what can be done to increase memory
 | access efficiency.
    Average DRAM Bandwidth:               N/A
 | Data for this metric is not collected since it requires system-wide
 | performance monitoring. Make sure the sampling driver is properly installed on
 | your system: https://software.intel.com/en-us/vtune-amplifier-help-sep-driver.
 | Otherwise, enable a driverless Perf-based sampling collection by setting the
 | /proc/sys/kernel/perf_event_paranoid value to 0 or less.
    NUMA:                                0.00% of Remote Accesses
  Vectorization:                       100.00%
    Instruction Mix:
      SP FLOPs:                          0.00% of uOps
      DP FLOPs:                         16.30% of uOps
        Packed:                        100.00% from DP FP
          128-bit:                       0.00%
          256-bit:                     100.00%
 | A significant fraction of floating point arithmetic vector instructions
 | executed with partial vector load. A possible reason is compilation with
 | legacy instruction set. Check the compiler options. Another possible reason is
 | compiler code generation specifics. Use Intel(R) Advisor to learn more.
          512-bit:                       0.00%
        Scalar:                          0.00% from DP FP
      Non-FP:                           83.70% of uOps
    FP Arith/Mem Rd Instr. Ratio:        0.57
    FP Arith/Mem Wr Instr. Ratio:        1.02
  Memory Footprint:
    Resident:                        58605.00 MB
    Virtual:                         63899.00 MB

Graphical representation of this data is available in the HTML report: <...>/aps_report_20230526_104752.html
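For scripted comparisons between runs, individual metrics can be pulled out of the textual report. The sketch below assumes that each metric appears on its own line in the `Name: value` form seen in the report above; the helper `aps_metric` is hypothetical and not an APS feature.

```shell
# Hypothetical helper (not part of APS): print the value following "<name>:"
# on the first matching line of an APS text report read from stdin.
aps_metric() {
    awk -v name="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)           # drop leading indentation
            prefix = name ":"
            if (index(line, prefix) == 1) {    # line starts with "<name>:"
                value = substr(line, length(prefix) + 1)
                sub(/^[ \t]+/, "", value)      # drop padding before the value
                print value
                exit
            }
        }'
}

# Demo on three lines copied from the report above.
sample='  Elapsed Time:     8.12 s
  Memory Stalls:    90.00% of Pipeline Slots
    DRAM Stalls:    88.20% of Cycles'
printf '%s\n' "$sample" | aps_metric "DRAM Stalls"
printf '%s\n' "$sample" | aps_metric "Elapsed Time"
```

Assuming the summary is written to standard output as in the listing, the same helper could be fed directly, for example `aps --report <...>/aps_result_20230526 | aps_metric "DRAM Stalls"`.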