Example: likwid-perfctr performance group MEM on benchmark stream
- Build the stream benchmark with the GNU compiler
module purge
module add compiler/gnu/7
gcc -std=c11 -Ofast -march=native -flto -fopenmp \
    stream.c -o stream
- Set up OpenMP environment
export OMP_NUM_THREADS=20
- List available performance groups
likwid-perfctr -a
...
MEM     Main memory bandwidth in MBytes/s
MEM_DP  Overview of arithmetic and main memory performance
MEM_SP  Overview of arithmetic and main memory performance
NUMA    Local and remote data transfers
...
- Get detailed information on performance groups
likwid-perfctr -H --group MEM
Group MEM:
Formulas:
Memory read bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC0))*64.0/time
Memory read data volume [GBytes] = 1.0E-09*(SUM(MBOXxC0))*64.0
Memory write bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC1))*64.0/time
Memory write data volume [GBytes] = 1.0E-09*(SUM(MBOXxC1))*64.0
Memory bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC0)+SUM(MBOXxC1))*64.0/time
Memory data volume [GBytes] = 1.0E-09*(SUM(MBOXxC0)+SUM(MBOXxC1))*64.0
-
Profiling group to measure memory bandwidth drawn by all cores of a socket.
Since this group is based on Uncore events it is only possible to measure on a
per socket base. Also outputs total data volume transferred from main memory.
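The formulas above can be checked by hand against the measurement output shown further down. The following is a minimal sketch, not part of LIKWID: it takes the CAS_COUNT_RD and CAS_COUNT_WR values reported for core 0 (socket 0) on the four active channels and the printed Runtime (RDTSC). The function names are made up for illustration, and because the runtime is only printed to four decimals, the bandwidths match the tool output only approximately.

```python
# Sketch: evaluate the MEM group formulas for socket 0 by hand.
# Counter values are copied from the measurement output in this document;
# function names are hypothetical helpers, not LIKWID API.

CACHE_LINE_BYTES = 64.0  # each CAS event transfers one 64-byte cache line


def data_volume_gbytes(cas_counts):
    """Data volume [GBytes] = 1.0E-09 * SUM(counts) * 64.0"""
    return 1.0e-09 * sum(cas_counts) * CACHE_LINE_BYTES


def bandwidth_mbytes_per_s(cas_counts, time_s):
    """Bandwidth [MBytes/s] = 1.0E-06 * SUM(counts) * 64.0 / time"""
    return 1.0e-06 * sum(cas_counts) * CACHE_LINE_BYTES / time_s


# Core 0 columns of MBOX0, MBOX1, MBOX4 and MBOX5 (the other channels read 0).
read_counts = [149185857, 152486380, 149851128, 149850079]   # CAS_COUNT_RD
write_counts = [69185518, 69626367, 69262252, 69420339]      # CAS_COUNT_WR
runtime_s = 1.4661  # Runtime (RDTSC) [s], rounded as printed

read_volume = data_volume_gbytes(read_counts)
write_volume = data_volume_gbytes(write_counts)
read_bw = bandwidth_mbytes_per_s(read_counts, runtime_s)
write_bw = bandwidth_mbytes_per_s(write_counts, runtime_s)

print(f"read volume  [GBytes]   : {read_volume:.4f}")
print(f"write volume [GBytes]   : {write_volume:.4f}")
print(f"read bandwidth [MBytes/s] : {read_bw:.1f}")
print(f"write bandwidth [MBytes/s]: {write_bw:.1f}")
```

The data volumes should agree with the core 0 column of the metric table; the bandwidths differ slightly from the printed values because of the rounded runtime.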
- Measure performance group MEM for benchmark stream on CPUs 0 to 19
likwid-perfctr \
    --group MEM \
    -C 0-19 \
    ./stream -n 100000000
--------------------------------------------------------------------------------
CPU name:  Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
CPU type:  Intel Xeon Haswell EN/EP/EX processor
CPU clock: 2.30 GHz
--------------------------------------------------------------------------------
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 100000000 (elements) (elements)
Memory per array = 762.9 MiB (= 0.7 GiB).
Total memory required = 2288.8 MiB (= 2.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 20
Number of Threads counted = 20
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 15955 microseconds.
   (= 15955 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           102387.3    0.016643     0.015627     0.020727
Scale:           72944.7    0.022326     0.021934     0.024904
Add:             81663.2    0.029859     0.029389     0.032404
Triad:           81578.8    0.029487     0.029419     0.029520
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
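The "Best Rate" column can be reproduced from the "Min time" column. The sketch below assumes the usual STREAM accounting, in which Copy and Scale touch two arrays and Add and Triad touch three, and in which the extra read caused by write-allocate is not counted. The dictionary and function names are illustrative only. Because the printed minimum times are rounded to six decimals, the recomputed rates agree with the printed ones only to within a few MB/s.

```python
# Sketch: recompute STREAM's "Best Rate MB/s" from the printed "Min time".
# Assumption: STREAM counts 2 arrays for Copy/Scale and 3 for Add/Triad.

ELEMENTS = 100_000_000      # array size passed via -n
BYTES_PER_ELEMENT = 8       # "This system uses 8 bytes per array element."

ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# Values copied from the STREAM output above.
MIN_TIME_S = {"Copy": 0.015627, "Scale": 0.021934,
              "Add": 0.029389, "Triad": 0.029419}
REPORTED_MB_S = {"Copy": 102387.3, "Scale": 72944.7,
                 "Add": 81663.2, "Triad": 81578.8}


def best_rate_mbytes_per_s(kernel):
    """Bytes moved per kernel invocation divided by the best (minimum) time."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_ELEMENT * ELEMENTS
    return 1.0e-06 * bytes_moved / MIN_TIME_S[kernel]


for kernel in ARRAYS_TOUCHED:
    computed = best_rate_mbytes_per_s(kernel)
    print(f"{kernel:6s} computed {computed:10.1f} MB/s, "
          f"reported {REPORTED_MB_S[kernel]:10.1f} MB/s")
```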
-------------------------------------------------------------------------------- Group 1: MEM +-----------------------+---------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+ | Event | Counter | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7 | Core 8 | Core 9 | Core 10 | Core 11 | Core 12 | Core 13 | Core 14 | Core 15 | Core 16 | Core 17 | Core 18 | Core 19 | +-----------------------+---------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 938295039 | 860131666 | 844442515 | 862490745 | 855282747 | 863090338 | 869367586 | 862849791 | 851971039 | 866514428 | 851549344 | 851361100 | 844113575 | 864512780 | 851730705 | 859556446 | 843886084 | 860724500 | 872205627 | 872122155 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 2618345331 | 2536800836 | 2551076323 | 2534122499 | 2547936397 | 2548491438 | 2539820912 | 2532567385 | 2542849578 | 2546347750 | 2551713389 | 2551333284 | 2537147110 | 2529932248 | 2551593906 | 2542084452 | 2541590035 | 2546862115 | 2644565960 | 2550826845 | | CPU_CLK_UNHALTED_REF | FIXC2 | 2308912178 | 2244121316 | 2256404949 | 2240746434 | 2253571004 | 2254085192 | 2246398592 | 2239977222 | 2248706734 | 2251983751 | 2257285021 | 2256941125 | 2243232274 | 2237576712 | 2256949681 | 2247969906 | 2246809234 | 2252129709 | 2334964324 | 2255982393 | | CAS_COUNT_RD | MBOX0C0 | 149185857 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 147906384 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | CAS_COUNT_WR | MBOX0C1 | 69185518 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 69109259 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | 
CAS_COUNT_RD | MBOX1C0 | 152486380 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 147870827 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | CAS_COUNT_WR | MBOX1C1 | 69626367 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 69075586 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | CAS_COUNT_RD | MBOX2C0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | CAS_COUNT_WR | MBOX2C1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | CAS_COUNT_RD | MBOX3C0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | CAS_COUNT_WR | MBOX3C1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | CAS_COUNT_RD | MBOX4C0 | 149851128 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 147883074 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | CAS_COUNT_WR | MBOX4C1 | 69262252 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 69133632 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | CAS_COUNT_RD | MBOX5C0 | 149850079 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 147845659 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | CAS_COUNT_WR | MBOX5C1 | 69420339 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 69101844 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | CAS_COUNT_RD | MBOX6C0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | CAS_COUNT_WR | MBOX6C1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | CAS_COUNT_RD | MBOX7C0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | CAS_COUNT_WR | MBOX7C1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | +-----------------------+---------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+ +----------------------------+---------+-------------+------------+------------+--------------+ | Event | Counter | 
Sum | Min | Max | Avg | +----------------------------+---------+-------------+------------+------------+--------------+ | INSTR_RETIRED_ANY STAT | FIXC0 | 17246198210 | 843886084 | 938295039 | 8.623099e+08 | | CPU_CLK_UNHALTED_CORE STAT | FIXC1 | 51046007793 | 2529932248 | 2644565960 | 2.552300e+09 | | CPU_CLK_UNHALTED_REF STAT | FIXC2 | 45134747751 | 2237576712 | 2334964324 | 2.256737e+09 | | CAS_COUNT_RD STAT | MBOX0C0 | 297092241 | 0 | 149185857 | 1.485461e+07 | | CAS_COUNT_WR STAT | MBOX0C1 | 138294777 | 0 | 69185518 | 6.914739e+06 | | CAS_COUNT_RD STAT | MBOX1C0 | 300357207 | 0 | 152486380 | 1.501786e+07 | | CAS_COUNT_WR STAT | MBOX1C1 | 138701953 | 0 | 69626367 | 6.935098e+06 | | CAS_COUNT_RD STAT | MBOX2C0 | 0 | 0 | 0 | 0 | | CAS_COUNT_WR STAT | MBOX2C1 | 0 | 0 | 0 | 0 | | CAS_COUNT_RD STAT | MBOX3C0 | 0 | 0 | 0 | 0 | | CAS_COUNT_WR STAT | MBOX3C1 | 0 | 0 | 0 | 0 | | CAS_COUNT_RD STAT | MBOX4C0 | 297734202 | 0 | 149851128 | 1.488671e+07 | | CAS_COUNT_WR STAT | MBOX4C1 | 138395884 | 0 | 69262252 | 6.919794e+06 | | CAS_COUNT_RD STAT | MBOX5C0 | 297695738 | 0 | 149850079 | 1.488479e+07 | | CAS_COUNT_WR STAT | MBOX5C1 | 138522183 | 0 | 69420339 | 6.926109e+06 | | CAS_COUNT_RD STAT | MBOX6C0 | 0 | 0 | 0 | 0 | | CAS_COUNT_WR STAT | MBOX6C1 | 0 | 0 | 0 | 0 | | CAS_COUNT_RD STAT | MBOX7C0 | 0 | 0 | 0 | 0 | | CAS_COUNT_WR STAT | MBOX7C1 | 0 | 0 | 0 | 0 | +----------------------------+---------+-------------+------------+------------+--------------+ +-----------------------------------+------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ | Metric | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7 | Core 8 | Core 9 | Core 10 | Core 11 | Core 12 | Core 13 | Core 14 | Core 15 | Core 16 | Core 17 | Core 18 | Core 19 | 
+-----------------------------------+------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ | Runtime (RDTSC) [s] | 1.4661 | 1.4661 | 1.4661 | 1.4661 | 1.4661 | 1.4661 | 1.4661 | 1.4661 | 1.4661 | 1.4661 | 1.4661 | 1.4661 | 1.4661 | 1.4661 | 1.4661 | 1.4661 | 1.4661 | 1.4661 | 1.4661 | 1.4661 | | Runtime unhalted [s] | 1.1384 | 1.1030 | 1.1092 | 1.1018 | 1.1078 | 1.1081 | 1.1043 | 1.1011 | 1.1056 | 1.1071 | 1.1095 | 1.1093 | 1.1031 | 1.1000 | 1.1094 | 1.1053 | 1.1051 | 1.1073 | 1.1498 | 1.1091 | | Clock [MHz] | 2608.1950 | 2599.9236 | 2600.3210 | 2601.0904 | 2600.3864 | 2600.3596 | 2600.3801 | 2600.3868 | 2600.8087 | 2600.5967 | 2599.9563 | 2599.9651 | 2601.3091 | 2600.4680 | 2600.2208 | 2600.8783 | 2601.7158 | 2600.9535 | 2604.9219 | 2600.5537 | | CPI | 2.7905 | 2.9493 | 3.0210 | 2.9381 | 2.9791 | 2.9528 | 2.9215 | 2.9351 | 2.9847 | 2.9386 | 2.9966 | 2.9968 | 3.0057 | 2.9264 | 2.9958 | 2.9574 | 3.0118 | 2.9590 | 3.0320 | 2.9249 | | Memory read bandwidth [MBytes/s] | 26251.9756 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 25821.2260 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | Memory read data volume [GBytes] | 38.4879 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 37.8564 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | Memory write bandwidth [MBytes/s] | 12113.5682 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12066.6777 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | Memory write data volume [GBytes] | 17.7596 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.6909 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | Memory bandwidth [MBytes/s] | 38365.5438 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 37887.9037 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | Memory data volume [GBytes] | 56.2475 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 55.5473 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
+-----------------------------------+------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +----------------------------------------+------------+-----------+------------+-----------+ | Metric | Sum | Min | Max | Avg | +----------------------------------------+------------+-----------+------------+-----------+ | Runtime (RDTSC) [s] STAT | 29.3220 | 1.4661 | 1.4661 | 1.4661 | | Runtime unhalted [s] STAT | 22.1943 | 1.1000 | 1.1498 | 1.1097 | | Clock [MHz] STAT | 52023.3908 | 2599.9236 | 2608.1950 | 2601.1695 | | CPI STAT | 59.2171 | 2.7905 | 3.0320 | 2.9609 | | Memory read bandwidth [MBytes/s] STAT | 52073.2016 | 0 | 26251.9756 | 2603.6601 | | Memory read data volume [GBytes] STAT | 76.3443 | 0 | 38.4879 | 3.8172 | | Memory write bandwidth [MBytes/s] STAT | 24180.2459 | 0 | 12113.5682 | 1209.0123 | | Memory write data volume [GBytes] STAT | 35.4505 | 0 | 17.7596 | 1.7725 | | Memory bandwidth [MBytes/s] STAT | 76253.4475 | 0 | 38365.5438 | 3812.6724 | | Memory data volume [GBytes] STAT | 111.7948 | 0 | 56.2475 | 5.5897 | +----------------------------------------+------------+-----------+------------+-----------+
- All memory-related performance counters (MBOX) are socket-wide and are therefore reported only on the first CPU core of each socket (here core 0 and core 10); all other cores show 0
- Validity check
Socket 0:
Memory read bandwidth:    26251.9756 MBytes/s
Memory write bandwidth: + 12113.5682 MBytes/s
                          ----------
                          38365.5438 MBytes/s
Memory bandwidth:         38365.5438 MBytes/s

Memory write data volume socket 0:        17.7596 GB
Memory write data volume socket 1:      + 17.6909 GB
                                          -------
                                          35.4505 GB
Memory write data volume [GBytes] STAT:   35.4505 GB

Memory read data volume socket 0:         38.4879 GB
Memory read data volume socket 1:       + 37.8564 GB
                                          -------
                                          76.3443 GB
Memory read data volume [GBytes] STAT:    76.3443 GB
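The same consistency check can be scripted. This is a small sketch using only numbers printed in the metric tables: per socket, read plus write bandwidth should equal the total bandwidth, and the two per-socket values should add up to the Sum column of the STAT table. The data layout and the function name are chosen for illustration.

```python
# Sketch: check that the per-socket MEM metrics are internally consistent.
import math

# Values from the metric table (core 0 = socket 0, core 10 = socket 1), MBytes/s.
per_socket = {
    0: {"read": 26251.9756, "write": 12113.5682, "total": 38365.5438},
    1: {"read": 25821.2260, "write": 12066.6777, "total": 37887.9037},
}
# Sum column of the STAT table, MBytes/s.
stat_sum = {"read": 52073.2016, "write": 24180.2459, "total": 76253.4475}


def is_consistent(tol=1e-3):
    """Return True if read + write = total per socket and sockets add up to STAT."""
    for values in per_socket.values():
        if not math.isclose(values["read"] + values["write"],
                            values["total"], abs_tol=tol):
            return False
    for key, expected in stat_sum.items():
        across_sockets = sum(values[key] for values in per_socket.values())
        if not math.isclose(across_sockets, expected, abs_tol=tol):
            return False
    return True


print("metrics consistent:", is_consistent())
```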
#Elements/vec  = 100,000,000
#Bytes/element = 8
#Bytes/vec     = 100,000,000 * 8 = 800,000,000
#Repetitions   = 10

Copy:  1 vec. read, 1 vec. write
Scale: 1 vec. read, 1 vec. write
Add:   2 vec. read, 1 vec. write
Triad: 2 vec. read, 1 vec. write

4 vec. write * 10 repetitions * 800,000,000 Bytes/vec = 32 GB  ~ 35.4505 GB (Memory write data volume [GBytes] STAT)
6 vec. read * 10 repetitions * 800,000,000 Bytes/vec = 48 GB !~ 76.3443 GB (Memory read data volume [GBytes] STAT)
(6 vec. read + 4 vec. write) * 10 repetitions * 800,000,000 Bytes/vec = 80 GB  ~ 76.3443 GB (Memory read data volume [GBytes] STAT)
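- Each store to memory triggers an extra read from memory (write-allocate), so the read volume matches the 6 read vectors plus the 4 written vectors. => The binary built with the GNU compiler does not use non-temporal stores, which write directly to memory and avoid this extra read.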
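The back-of-the-envelope estimate can be written out as follows. This is a sketch under the assumptions stated in the text: 10 repetitions, 800,000,000 bytes per vector, and one additional read per written vector when write-allocate applies. All names are illustrative, and the measured values are the STAT sums from the MEM run.

```python
# Sketch: expected STREAM memory traffic with and without write-allocate.

BYTES_PER_VECTOR = 100_000_000 * 8   # elements * bytes per element
REPETITIONS = 10

# (vectors read, vectors written) per kernel
KERNELS = {"Copy": (1, 1), "Scale": (1, 1), "Add": (2, 1), "Triad": (2, 1)}

vectors_read = sum(reads for reads, _ in KERNELS.values())       # 6
vectors_written = sum(writes for _, writes in KERNELS.values())  # 4


def volume_gb(vectors):
    """Traffic in GB (1e9 bytes) for the given number of vectors per repetition."""
    return vectors * REPETITIONS * BYTES_PER_VECTOR / 1.0e9


write_gb = volume_gb(vectors_written)
read_gb_no_allocate = volume_gb(vectors_read)
read_gb_write_allocate = volume_gb(vectors_read + vectors_written)

measured_write_gb = 35.4505   # Memory write data volume [GBytes] STAT
measured_read_gb = 76.3443    # Memory read data volume [GBytes] STAT

print(f"expected write volume            : {write_gb:.1f} GB "
      f"(measured {measured_write_gb})")
print(f"expected read, no write-allocate : {read_gb_no_allocate:.1f} GB")
print(f"expected read, with write-allocate: {read_gb_write_allocate:.1f} GB "
      f"(measured {measured_read_gb})")
```

The measured read volume is much closer to the write-allocate estimate than to the plain one, which is the observation the bullet above relies on.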
Example: likwid-perfctr performance group NUMA on benchmark stream
- Build the stream benchmark with the Intel compiler
module purge
module add compiler/intel/18.0
icc -std=c11 -Ofast -xHost -ipo -qopenmp \
    stream.c -o stream
- Set up OpenMP environment
export OMP_NUM_THREADS=20
- List available performance groups
likwid-perfctr -a
...
MEM     Main memory bandwidth in MBytes/s
MEM_DP  Overview of arithmetic and main memory performance
MEM_SP  Overview of arithmetic and main memory performance
NUMA    Local and remote data transfers
...
- Get detailed information on performance groups
likwid-perfctr -H --group NUMA
Group NUMA:
Formula:
CPI = CPU_CLK_UNHALTED_CORE/INSTR_RETIRED_ANY
Local DRAM data volume [GByte] = 1.E-09*OFFCORE_RESPONSE_0_LOCAL_DRAM*64
Local DRAM bandwidth [MByte/s] = 1.E-06*(OFFCORE_RESPONSE_0_LOCAL_DRAM*64)/time
Remote DRAM data volume [GByte] = 1.E-09*OFFCORE_RESPONSE_1_REMOTE_DRAM*64
Remote DRAM bandwidth [MByte/s] = 1.E-06*(OFFCORE_RESPONSE_1_REMOTE_DRAM*64)/time
Memory data volume [GByte] = 1.E-09*(OFFCORE_RESPONSE_0_LOCAL_DRAM+OFFCORE_RESPONSE_1_REMOTE_DRAM)*64
Memory bandwidth [MByte/s] = 1.E-06*((OFFCORE_RESPONSE_0_LOCAL_DRAM+OFFCORE_RESPONSE_1_REMOTE_DRAM)*64)/time
--
This performance group measures the data traffic of CPU cores to local and remote memory.
- Measure performance group NUMA for benchmark stream on CPUs 0 to 19 with locally allocated memory
likwid-perfctr --group NUMA -C 0-19 \
    numactl --localalloc \
    ./stream -n 100000000
...
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           104573.5    0.015537     0.015300     0.015842
Scale:          105859.6    0.015214     0.015114     0.015308
Add:            108120.1    0.022280     0.022198     0.022395
Triad:          109300.7    0.021987     0.021958     0.022040
-------------------------------------------------------------
...
+--------------------------------------+------------+--------------+-----------+-----------+
| Metric                               | Sum        | Min          | Max       | Avg       |
+--------------------------------------+------------+--------------+-----------+-----------+
| Runtime (RDTSC) [s] STAT             | 37.9020    | 1.8951       | 1.8951    | 1.8951    |
| Runtime unhalted [s] STAT            | 18.2252    | 0.8769       | 1.3908    | 0.9113    |
| Clock [MHz] STAT                     | 58088.7700 | 2899.9413    | 2989.1844 | 2904.4385 |
| CPI STAT                             | 128.2410   | 1.0635       | 7.0634    | 6.4120    |
| Local DRAM data volume [GByte] STAT  | 13.3097    | 0.6477       | 0.6756    | 0.6655    |
| Local DRAM bandwidth [MByte/s] STAT  | 7023.3007  | 341.7686     | 356.5150  | 351.1650  |
| Remote DRAM data volume [GByte] STAT | 0.0063     | 2.496000e-05 | 0.0008    | 0.0003    |
| Remote DRAM bandwidth [MByte/s] STAT | 3.2454     | 0.0132       | 0.4004    | 0.1623    |
| Memory data volume [GByte] STAT      | 13.3158    | 0.6481       | 0.6758    | 0.6658    |
| Memory bandwidth [MByte/s] STAT      | 7026.5460  | 342.0066     | 356.5833  | 351.3273  |
+--------------------------------------+------------+--------------+-----------+-----------+
- Remote DRAM data volume and Remote DRAM bandwidth are very low (0.0063 GByte and 3.2454 MByte/s in total): almost all accesses go to local memory
- Measure performance group NUMA for benchmark stream on CPUs 0 to 19 with all memory allocated in NUMA domain 0
likwid-perfctr --group NUMA -C 0-19 \
    numactl --membind=0 \
    ./stream -n 100000000
...
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            50143.6    0.031936     0.031908     0.031993
Scale:           49960.4    0.032053     0.032025     0.032086
Add:             56319.0    0.042653     0.042614     0.042680
Triad:           56425.9    0.042577     0.042534     0.042612
-------------------------------------------------------------
...
+--------------------------------------+------------+--------------+-----------+-----------+
| Metric                               | Sum        | Min          | Max       | Avg       |
+--------------------------------------+------------+--------------+-----------+-----------+
| Runtime (RDTSC) [s] STAT             | 53.4480    | 2.6724       | 2.6724    | 2.6724    |
| Runtime unhalted [s] STAT            | 34.9158    | 1.6857       | 2.1850    | 1.7458    |
| Clock [MHz] STAT                     | 58063.6648 | 2899.9915    | 2963.3627 | 2903.1832 |
| CPI STAT                             | 167.8990   | 1.3344       | 14.4638   | 8.3950    |
| Local DRAM data volume [GByte] STAT  | 6.5933     | 7.744000e-06 | 0.6628    | 0.3297    |
| Local DRAM bandwidth [MByte/s] STAT  | 2467.1862  | 0.0029       | 248.0175  | 123.3593  |
| Remote DRAM data volume [GByte] STAT | 6.6188     | 0            | 0.6689    | 0.3309    |
| Remote DRAM bandwidth [MByte/s] STAT | 2476.7374  | 0            | 250.3028  | 123.8369  |
| Memory data volume [GByte] STAT      | 13.2118    | 0.6343       | 0.6689    | 0.6606    |
| Memory bandwidth [MByte/s] STAT      | 4943.9239  | 237.3654     | 250.3130  | 247.1962  |
+--------------------------------------+------------+--------------+-----------+-----------+
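- Remote DRAM data volume and Remote DRAM bandwidth are very high: about half of the measured traffic (6.6188 of 13.2118 GByte) is now remote
- The memory bandwidth reported by stream is roughly halved (Triad: 56425.9 MB/s instead of 109300.7 MB/s)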
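To put numbers on the two observations, the following sketch computes the remote share of the DRAM traffic for both runs and the ratio of the stream Triad rates. It uses only values from the STAT tables and stream outputs above; the function name is illustrative.

```python
# Sketch: compare the --localalloc and --membind=0 runs of the NUMA group.

def remote_share(remote_gbyte, total_gbyte):
    """Fraction of the measured DRAM traffic that went to the remote NUMA domain."""
    return remote_gbyte / total_gbyte


# Sum column of "Remote DRAM data volume" and "Memory data volume" [GByte].
share_localalloc = remote_share(0.0063, 13.3158)
share_membind0 = remote_share(6.6188, 13.2118)

# Best Rate MB/s of the Triad kernel reported by stream in each run.
triad_localalloc = 109300.7
triad_membind0 = 56425.9
triad_ratio = triad_membind0 / triad_localalloc

print(f"remote share, --localalloc: {share_localalloc:.4%}")
print(f"remote share, --membind=0 : {share_membind0:.2%}")
print(f"Triad rate ratio (membind=0 / localalloc): {triad_ratio:.3f}")
```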
Last modified on Apr 5, 2019, 10:17:06 AM