Dear ncu users,
I’m trying to optimize the following kernel, which takes 9% of total GPU time (it is the most expensive GPU kernel in my code):
      subroutine add2s2_omp(a,b,c1,n)
      integer n, i
      real a(n), b(n), c1
!$OMP TARGET TEAMS LOOP
      do i = 1, n
         a(i) = a(i) + c1*b(i)
      enddo
      return
      end
This is the compiler output:
add2s2_omp:
      8, !$omp target teams loop
         8, Generating "nvkernel_foo_add2s2_omp__F1L8_1" GPU kernel
            Generating Tesla code
            9, Loop parallelized across teams, threads(128) ! blockidx%x threadidx%x
         8, Generating Multicore code
            9, Loop parallelized across threads
      8, Generating implicit map(tofrom:b(:),a(:))
      9, Generated vector simd code for the loop
         FMA (fused multiply-add) instruction(s) generated
I tried TEAMS DISTRIBUTE PARALLEL DO and TEAMS LOOP with no performance improvement. Attached is a small example with the dimensions of the real test case, together with the NCU profile. Since the kernel is very small and only performs an element-wise update of an array, I hope its performance can be improved. Thanks.
add2s2_omp.f (2.3 KB)