Example perf: stream

  • Build the stream benchmark (-g keeps the debug information that perf annotate needs to show source lines)
    module purge
    module add compiler/gnu/8
    gcc -std=c11 -Ofast -march=native -flto -fopenmp \
        -g \
        stream.c -o stream
    
  • Set up OpenMP environment
    export OMP_NUM_THREADS=20
    export OMP_DISPLAY_ENV=VERBOSE
    export OMP_PLACES=cores
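
With OMP_PLACES=cores the runtime builds one place per core, and the OMP_DISPLAY_ENV output of the run below prints the resulting list. The following minimal Python sketch shows one way to read that list. The helper name parse_places is hypothetical, it only handles the explicit {a,b} form printed here (not interval notation), and reading CPUs i and i+20 as the two hardware threads of one core is an assumption based on the printed list.

```python
import re

# OMP_PLACES value copied from the OMP_DISPLAY_ENV output of the run below.
PLACES_STR = (
    "{0,20},{1,21},{2,22},{3,23},{4,24},{5,25},{6,26},{7,27},{8,28},{9,29},"
    "{10,30},{11,31},{12,32},{13,33},{14,34},{15,35},{16,36},{17,37},{18,38},{19,39}"
)


def parse_places(text):
    """Return a list of places, each a list of logical CPU numbers.

    Hypothetical helper: handles only explicit lists such as {0,20},{1,21}.
    """
    return [
        [int(cpu) for cpu in group.split(",")]
        for group in re.findall(r"\{([^}]*)\}", text)
    ]


places = parse_places(PLACES_STR)
print(len(places))             # number of places
print(places[0], places[-1])   # first and last place
```

The list holds 20 places with two logical CPUs each, so OMP_NUM_THREADS=20 gives every thread a place of its own.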
    
  • Record performance data of the stream benchmark (written to perf.data) for use with perf report and perf annotate
    perf record ./stream -n 2500000000
    
    OPENMP DISPLAY ENVIRONMENT BEGIN
      _OPENMP = '201511'
      OMP_DYNAMIC = 'FALSE'
      OMP_NESTED = 'FALSE'
      OMP_NUM_THREADS = '20'
      OMP_SCHEDULE = 'DYNAMIC'
      OMP_PROC_BIND = 'TRUE'
      OMP_PLACES = '{0,20},{1,21},{2,22},{3,23},{4,24},{5,25},{6,26},{7,27},{8,28},{9,29},{10,30},{11,31},{12,32},{13,33},{14,34},{15,35},{16,36},{17,37},{18,38},{19,39}'
      OMP_STACKSIZE = '0'
      OMP_WAIT_POLICY = 'PASSIVE'
      OMP_THREAD_LIMIT = '4294967295'
      OMP_MAX_ACTIVE_LEVELS = '2147483647'
      OMP_CANCELLATION = 'FALSE'
      OMP_DEFAULT_DEVICE = '0'
      OMP_MAX_TASK_PRIORITY = '0'
      GOMP_CPU_AFFINITY = ''
      GOMP_STACKSIZE = '0'
      GOMP_SPINCOUNT = '300000'
    OPENMP DISPLAY ENVIRONMENT END
    
    -------------------------------------------------------------
    STREAM version $Revision: 5.10 $
    -------------------------------------------------------------
    This system uses 8 bytes per array element.
    -------------------------------------------------------------
    Array size = 2500000000 (elements) (elements)
    Memory per array = 19073.5 MiB (= 18.6 GiB).
    Total memory required = 57220.5 MiB (= 55.9 GiB).
    Each kernel will be executed 10 times.
     The *best* time for each kernel (excluding the first iteration)
     will be used to compute the reported bandwidth.
    -------------------------------------------------------------
    Number of Threads requested = 20
    Number of Threads counted = 20
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 1 microseconds.
    Each test below will take on the order of 386554 microseconds.
       (= 386554 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function    Best Rate MB/s  Med time     Min time     Max time
    Copy:           99789.3     0.484283     0.400845     0.519110
    Scale:          74203.0     0.608462     0.539061     0.636535
    Add:            77438.6     0.790049     0.774808     0.801701
    Triad:          79069.4     0.788340     0.758827     0.814521
    -------------------------------------------------------------
    Solution Validates: avg error less than 1.000000e-13 on all three arrays
    -------------------------------------------------------------
    
    [ perf record: Woken up 85 times to write data ]
    [ perf record: Captured and wrote 94.379 MB perf.data (2467598 samples) ]
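
The memory and bandwidth figures in the output above can be checked by hand. The sketch below is a rough cross-check in Python, assuming the tuned binary keeps the reference STREAM byte counting (Copy and Scale touch two arrays per element, Add and Triad touch three). The names ARRAYS_TOUCHED, array_mib and best_rate_mb_s are made up for this illustration.

```python
ELEMENTS = 2_500_000_000      # array size from the run above
BYTES_PER_ELEMENT = 8         # "This system uses 8 bytes per array element."

# Assumption: reference STREAM counting of arrays read plus written per kernel.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# "Min time" column (seconds) copied from the result table above.
MIN_TIME = {"Copy": 0.400845, "Scale": 0.539061, "Add": 0.774808, "Triad": 0.758827}


def array_mib(elements, bytes_per_element=BYTES_PER_ELEMENT):
    """Size of one array in MiB."""
    return elements * bytes_per_element / 2**20


def best_rate_mb_s(kernel):
    """Bandwidth in MB/s (10^6 bytes per second) from the minimum time."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_ELEMENT * ELEMENTS
    return 1e-6 * bytes_moved / MIN_TIME[kernel]


print(f"Memory per array: {array_mib(ELEMENTS):.1f} MiB")
print(f"Total (3 arrays): {3 * array_mib(ELEMENTS):.1f} MiB")
for kernel in MIN_TIME:
    print(f"{kernel}: {best_rate_mb_s(kernel):.1f} MB/s")
```

The rates agree with the "Best Rate MB/s" column to within the rounding of the printed minimum times.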
    
  • Create performance report
    perf report
    
      23.90%  stream   stream             [.] tuned_STREAM_Triad._omp_fn.14
      23.78%  stream   stream             [.] tuned_STREAM_Add._omp_fn.13
      17.90%  stream   stream             [.] tuned_STREAM_Scale._omp_fn.12
      13.40%  stream   libc-2.17.so       [.] __memcpy_ssse3
    ...
    
    • Interactive navigation in performance report:
      • h: get help
      • a: jump to annotated assembler code
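
The same table can also be printed as plain text with perf report --stdio and processed by a script. A minimal Python sketch follows, assuming the five-column layout shown above (overhead, command, shared object, symbol type, symbol). The function name parse_report is hypothetical and the input is just the four lines listed in this example.

```python
# Lines copied from the perf report excerpt above; "..." marks the elided rest.
REPORT = """\
  23.90%  stream   stream             [.] tuned_STREAM_Triad._omp_fn.14
  23.78%  stream   stream             [.] tuned_STREAM_Add._omp_fn.13
  17.90%  stream   stream             [.] tuned_STREAM_Scale._omp_fn.12
  13.40%  stream   libc-2.17.so       [.] __memcpy_ssse3
...
"""


def parse_report(text):
    """Return (percent, shared_object, symbol) tuples for each data row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip anything that is not a full data row (for example "...").
        if len(fields) < 5 or not fields[0].endswith("%"):
            continue
        percent = float(fields[0].rstrip("%"))
        rows.append((percent, fields[2], fields[4]))
    return rows


rows = parse_report(REPORT)
total = sum(percent for percent, _, _ in rows)
for percent, shared_object, symbol in rows:
    print(f"{percent:6.2f}  {shared_object:14s} {symbol}")
print(f"listed symbols together: {total:.2f}%")
```

The four listed symbols account for 78.98% of the samples; the remainder belongs to the entries elided by "...". The __memcpy_ssse3 entry plausibly stems from the Copy kernel being turned into a memcpy call, but the excerpt alone does not confirm that.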
  • Create report with annotated source
    perf annotate
    
    ...
           │592              // Instructs the compiler to use non-temporal (that is, streaming) stores
           │593              #pragma vector nontemporal
           │594          #endif
           │595          #pragma omp simd aligned (a, b, c : alignment_bytes)
           │596          for (long int j = 0; j < STREAM_ARRAY_SIZE_thread; j++)
           │597              a[j] = b[j] + scalar * c[j];
    ...
      0.49 │ e0:┌─→vmovup ymm0,YMMWORD PTR [r11+rax*1]
     44.81 │    │  vfmadd ymm0,ymm2,YMMWORD PTR [r13+rax*1+0x0]
     49.67 │    │  add    rcx,0x1
      1.79 │    │  vmovup YMMWORD PTR [rdx+rax*1],ymm0
      2.82 │    │  add    rax,0x20
      0.42 │    │  cmp    rcx,r12
      0.00 │    └──jb     e0
    ...
    
    • Interactive navigation in annotated assembler code:
      • h: get help
      • H: jump to the hottest instruction (the one with the most samples)
      • k: toggle line number view on/off
      • s: toggle source code view on/off
    • Assembler code (the listing above cuts off long mnemonics):
      • vfmadd: Vector fused multiply-add
      • vmovupd (shown as vmovup): Vector move unaligned packed double-precision floating-point values
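
The annotated loop can be read with a little arithmetic. The Python sketch below is only a reading aid under stated assumptions: the percentages are copied from the listing above, a ymm register is taken as 256 bits wide, and the names LOOP and BYTES_PER_TRIP are made up for this illustration.

```python
# (percent, instruction) pairs copied from the annotated Triad loop above.
LOOP = [
    (0.49, "vmovup (load)"),
    (44.81, "vfmadd"),
    (49.67, "add rcx,0x1"),
    (1.79, "vmovup (store)"),
    (2.82, "add rax,0x20"),
    (0.42, "cmp rcx,r12"),
    (0.00, "jb e0"),
]

YMM_BYTES = 256 // 8                      # one ymm register holds 32 bytes
DOUBLE_BYTES = 8
DOUBLES_PER_TRIP = YMM_BYTES // DOUBLE_BYTES

# The byte offset in rax advances by 0x20 per trip, matching one ymm register.
OFFSET_STEP = 0x20

# Each trip reads 32 bytes of b and of c and writes 32 bytes of a.
BYTES_PER_TRIP = 3 * YMM_BYTES

loop_share = sum(percent for percent, _ in LOOP)
fma_and_next = LOOP[1][0] + LOOP[2][0]

print(f"doubles per loop trip: {DOUBLES_PER_TRIP}")
print(f"bytes moved per trip:  {BYTES_PER_TRIP}")
print(f"loop share of samples: {loop_share:.2f}%")
print(f"vfmadd plus next add:  {fma_and_next:.2f}%")
```

The seven instructions add up to 100.00% of the samples in this function. Samples often land on the instruction after the one that actually stalled, so the 49.67% on add rcx,0x1 plausibly belongs to the memory operand of the preceding vfmadd; this is a common reading of such listings, not something the listing itself proves.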

Why does this example use the GNU compiler instead of the Intel compiler?

  • The Intel compiler inlines all functions, so perf attributes all of the time to the main function
  • The Intel compiler optimizes the generated code heavily, which makes it hard to match assembler instructions to lines of C source code
Last modified on Apr 5, 2019, 1:25:56 PM