Example: likwid-bench on Intel Xeon Ice Lake

#Cores   stream_mem_avx_fma (MByte/s)   % of Max
1 18932.74 6%
2 37606.15 12%
4 71658.90 22%
8 128337.42 40%
12 178572.55 55%
16 221021.84 68%
32 303688.83 94%
50 323435.61 100%
64 320327.16 99%
76 316941.00 98%

=> 1 core can only get about 6% of peak memory bandwidth
=> 12 cores are needed for 1/2 of peak memory bandwidth
=> Peak bandwidth is reached with 50 cores, after which the bandwidth slowly drops
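
The "% of Max" column and the three observations above can be re-derived from the measured values. The following is a minimal sketch, not part of the original measurement workflow; the variable names (bandwidth, percent_of_max, half_peak_cores) are illustrative assumptions, and only the numbers are taken from the table.

```python
# Measured stream_mem_avx_fma bandwidth (MByte/s) per core count,
# copied from the table above.
bandwidth = {
    1: 18932.74, 2: 37606.15, 4: 71658.90, 8: 128337.42, 12: 178572.55,
    16: 221021.84, 32: 303688.83, 50: 323435.61, 64: 320327.16, 76: 316941.00,
}

# Core count with the highest measured bandwidth and that bandwidth.
peak_cores = max(bandwidth, key=bandwidth.get)
peak = bandwidth[peak_cores]

# Percentage of the peak for every measured core count, rounded to whole percent.
percent_of_max = {cores: round(100 * bw / peak) for cores, bw in bandwidth.items()}

# Smallest measured core count that delivers at least half of the peak.
half_peak_cores = min(c for c, bw in bandwidth.items() if bw >= 0.5 * peak)

for cores, pct in percent_of_max.items():
    print(f"{cores:>3} cores: {bandwidth[cores]:>10.2f} MByte/s  {pct:>3}%")
print(f"peak at {peak_cores} cores, half of peak first reached at {half_peak_cores} cores")
```

Only the measured core counts are considered, so "12 cores for half of peak" means the smallest measured configuration above 50%, not an interpolated value.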

MEM_SIZE (Byte)   stream_mem_avx_fma (MByte/s)   stream_avx_fma (MByte/s)   stream_mem_avx_fma / stream_avx_fma   stream_avx_fma / stream_mem_avx_fma
29184 77263.25 171384.46 45 % 222 %
58368 137188.23 405930.71 34 % 296 %
116736 238430.74 775374.46 31 % 325 %
233472 269033.89 1249811.39 22 % 465 %
496128 274972.44 1928920.73 14 % 701 %
1021440 266466.69 2478547.23 11 % 930 %
2042880 286880.57 3080760.17 9 % 1074 %
4085760 665429.01 3708201.59 18 % 557 %
8171520 684543.73 4277030.97 16 % 625 %
16372224 689832.27 5382984.20 13 % 780 %
32744448 480203.78 6254640.60 8 % 1302 %
65518080 405960.67 5563724.35 7 % 1371 %
131065344 640713.77 895231.74 72 % 140 %
262130688 607510.76 546060.93 111 % 90 %
524261376 405670.12 362572.25 112 % 89 %
1048551936 345566.80 332700.65 104 % 96 %
2097133056 329452.87 315844.12 104 % 96 %
4194295296 322903.94 308891.59 105 % 96 %
8388590592 319732.19 310216.19 103 % 97 %
16777210368 319405.49 308877.21 103 % 97 %
33554420736 318836.42 308883.68 103 % 97 %

=> stream_avx_fma does not use streaming stores. All stores go to the cache first. As long as the working set fits in the cache, stream_avx_fma is much faster than stream_mem_avx_fma

=> stream_mem_avx_fma uses streaming stores. All stores go directly to main memory. Once the working set no longer fits in the cache, stream_mem_avx_fma is faster than stream_avx_fma (by about 3% to 12% in the table above)

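=> Intel Xeon Ice Lake SpecI2M optimization: the hardware uses streaming stores when the memory subsystem is heavily loaded (see: HotChips 2020: New 3rd Gen Intel Xeon Scalable Processor). This likely explains why stream_avx_fma stays within a few percent of stream_mem_avx_fma for large working sets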
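
The crossover between the two kernels described above can be located directly from the table. The following is a minimal sketch under stated assumptions: the helper name first_crossover and the tuple layout are illustrative, the numbers are copied from the table, and the value in the 4194295296 row is read as 308891.59.

```python
# (MEM_SIZE in bytes, stream_mem_avx_fma MByte/s, stream_avx_fma MByte/s),
# copied from the table above.
rows = [
    (29184, 77263.25, 171384.46),
    (58368, 137188.23, 405930.71),
    (116736, 238430.74, 775374.46),
    (233472, 269033.89, 1249811.39),
    (496128, 274972.44, 1928920.73),
    (1021440, 266466.69, 2478547.23),
    (2042880, 286880.57, 3080760.17),
    (4085760, 665429.01, 3708201.59),
    (8171520, 684543.73, 4277030.97),
    (16372224, 689832.27, 5382984.20),
    (32744448, 480203.78, 6254640.60),
    (65518080, 405960.67, 5563724.35),
    (131065344, 640713.77, 895231.74),
    (262130688, 607510.76, 546060.93),
    (524261376, 405670.12, 362572.25),
    (1048551936, 345566.80, 332700.65),
    (2097133056, 329452.87, 315844.12),
    (4194295296, 322903.94, 308891.59),  # assumed reading of the source value
    (8388590592, 319732.19, 310216.19),
    (16777210368, 319405.49, 308877.21),
    (33554420736, 318836.42, 308883.68),
]


def first_crossover(rows):
    """Smallest MEM_SIZE at which the streaming-store kernel is at least as fast."""
    for size, streaming, cached in rows:
        if streaming >= cached:
            return size
    return None


crossover = first_crossover(rows)

# Working-set size where the cached-store kernel has its largest advantage.
best_cached_size = max(rows, key=lambda r: r[2] / r[1])[0]

print(f"stream_mem_avx_fma overtakes stream_avx_fma at {crossover} bytes "
      f"(about {crossover / 2**20:.0f} MiB)")
print(f"stream_avx_fma has its largest advantage at {best_cached_size} bytes")
```

The crossover is only resolved to the granularity of the measured sizes; the true crossover lies somewhere between the last row where stream_avx_fma wins and the first row where it loses.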