Kernel optimization tips

Dear ncu users,

I’m trying to optimize the following kernel, which takes 9% of the total GPU time (it is the most expensive GPU kernel in my code):

  subroutine add2s2_omp(a,b,c1,n)
  real a(n),b(n)
    do i=1,n

This is the compiler output:

  8, !$omp target teams loop
      8, Generating "nvkernel_foo_add2s2_omp__F1L8_1" GPU kernel
         Generating Tesla code
        9, Loop parallelized across teams, threads(128) ! blockidx%x threadidx%x
      8, Generating Multicore code
        9, Loop parallelized across threads
  8, Generating implicit map(tofrom:b(:),a(:)) 
  9, Generated vector simd code for the loop
     FMA (fused multiply-add) instruction(s) generated

I tried TEAMS DISTRIBUTE PARALLEL DO and TEAMS LOOP, with no performance improvement. I have attached a small example with the dimensions of the real test case, along with the NCU profiling. Since the kernel is very small and simply updates each element of an array, I hope its performance can be improved. Thanks.

add2s2_omp.f (2.3 KB)

The kernel is very simple and memory bound, so it is not clear that you can do much to improve it. You might try calling cuBLAS saxpy, which may have a highly tuned implementation of this operation.
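
One way to check the "memory bound" point is to time the loop and convert the result into an effective bandwidth, then compare that number with the memory throughput NCU reports for the device. The sketch below is only an illustration under assumptions: the loop body a(i) = a(i) + c1*b(i) is a guess (the snippet in the question is truncated, and the guess is based on the saxpy suggestion), and the array size n and repeat count nrep are hypothetical values, not the dimensions of the attached test case.

```fortran
! Sketch: effective bandwidth of an AXPY-style update with OpenMP offload.
! ASSUMPTION: the loop body a(i) = a(i) + c1*b(i) is a guess; the loop in
! the original post is truncated. n and nrep are hypothetical values.
program axpy_bandwidth
  use omp_lib
  implicit none
  integer, parameter :: n = 10000000   ! hypothetical array length
  integer, parameter :: nrep = 100     ! number of timed repetitions
  real, allocatable :: a(:), b(:)
  real :: c1
  double precision :: t0, t1, secs, gbytes
  integer :: i, k, bytes_per_real

  allocate(a(n), b(n))
  a = 1.0
  b = 2.0
  c1 = 0.5

  ! Keep both arrays resident on the device so that only the kernel is timed.
  !$omp target data map(tofrom: a) map(to: b)
  do k = 0, nrep
     if (k == 1) t0 = omp_get_wtime()  ! k = 0 is an untimed warm-up pass
     !$omp target teams loop
     do i = 1, n
        a(i) = a(i) + c1*b(i)
     end do
  end do
  t1 = omp_get_wtime()
  !$omp end target data

  ! Each element needs two loads (a, b) and one store (a).
  bytes_per_real = storage_size(a) / 8
  secs   = (t1 - t0) / dble(nrep)
  gbytes = 3.0d0 * dble(bytes_per_real) * dble(n) / 1.0d9

  print *, 'time per call (s):          ', secs
  print *, 'effective bandwidth (GB/s): ', gbytes / secs
  print *, 'a(1) after all passes:      ', a(1)
end program axpy_bandwidth
```

With the NVIDIA compiler this would be built with something like nvfortran -mp=gpu. If the printed bandwidth is already close to what the device can sustain, neither a different directive nor a library call is likely to change much, because the kernel performs only one fused multiply-add for every three memory accesses.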