Dear Nvidia users,
I'm working on a target region that uses only about 25% of the GPU (an A100). I have already optimized it with the TEAMS LOOP and COLLAPSE directives, but I would like to know whether it is possible to do better:
!$OMP TARGET DATA MAP(ALLOC:dudr,duds,dudt)
!$OMP& MAP(TO:g1m1,g2m1,g3m1,g4m1,g5m1,g6m1)
!$OMP& MAP(TO:dxm1,dxtm1,u,helm1,helm2,bm1)
!$OMP& MAP(TOFROM:au)
!$OMP TARGET TEAMS LOOP COLLAPSE(4) PRIVATE(tmph1,tmpu1,tmpu2,tmpu3,l)
      do e=1,nelt                                  ! line 610
         do k=1,lz1
            do j=1,ly1
               do i=1,lx1
                  tmph1 = helm1(i,j,k,e)
                  tmpu1 = 0.0
                  tmpu2 = 0.0
                  tmpu3 = 0.0
                  do l=1,lx1                       ! line 618
                     tmpu1 = tmpu1 + dxm1(i,l)*u(l,j,k,e)
                     tmpu2 = tmpu2 + dxm1(j,l)*u(i,l,k,e)
                     tmpu3 = tmpu3 + dxm1(k,l)*u(i,j,l,e)
                  enddo
                  wr = g1m1(i,j,k,e)*tmpu1
     $               + g4m1(i,j,k,e)*tmpu2
     $               + g5m1(i,j,k,e)*tmpu3
                  ws = g2m1(i,j,k,e)*tmpu2
     $               + g4m1(i,j,k,e)*tmpu1
     $               + g6m1(i,j,k,e)*tmpu3
                  wt = g3m1(i,j,k,e)*tmpu3
     $               + g5m1(i,j,k,e)*tmpu1
     $               + g6m1(i,j,k,e)*tmpu2
                  dudr(i,j,k,e) = wr * tmph1
                  duds(i,j,k,e) = ws * tmph1
                  dudt(i,j,k,e) = wt * tmph1
               enddo
            enddo
         enddo
      enddo
!$OMP TARGET TEAMS LOOP COLLAPSE(4) PRIVATE(tmph1,tmpu1,tmpu2,tmpu3,l)
      do e=1,nelt                                  ! line 646
         do k=1,lz1
            do j=1,ly1
               do i=1,lx1
                  tmpu1 = 0.0
                  tmpu2 = 0.0
                  tmpu3 = 0.0
                  do l=1,lx1
                     tmpu1 = tmpu1 + dxtm1(i,l)*dudr(l,j,k,e)
                     tmpu2 = tmpu2 + dxtm1(j,l)*duds(i,l,k,e)
                     tmpu3 = tmpu3 + dxtm1(k,l)*dudt(i,j,l,e)
                  enddo
                  au(i,j,k,e) = tmpu1 + tmpu2 + tmpu3
               enddo
            enddo
         enddo
      enddo
!$OMP END TARGET DATA
The main problem is the temporary values in the innermost loop, like tmpu1 and tmpu2, which prevent collapsing all 5 loop levels. Is there a way to optimize this further? (A rough, untested sketch of the kind of restructuring I am asking about is included after the loop bounds at the end of this post.) This is the compiler output; the line numbers correspond to the source lines marked in the code above:
609, !$omp target teams loop
609, Generating "nvkernel_axhelm_omp__F1L609_1" GPU kernel
Generating Tesla code
610, Loop parallelized across teams, threads(128) collapse(4) ! blockidx%x threadidx%x
611, ! blockidx%x threadidx%x collapsed
612, ! blockidx%x threadidx%x collapsed
613, ! blockidx%x threadidx%x collapsed
618, Loop run sequentially
609, Generating Multicore code
610, Loop parallelized across threads
609, Generating implicit map(tofrom:g6m1(:,:,:,:),dudt(:,:,:,:),u(:,:,:,:),helm1(:,:,:,:),dxm1(:,:),g3m1(:,:,:,:),g4m1(:,:,:,:),g1m1(:,:,:,:),g5m1(:,:,:,:),g2m1(:,:,:,:),dudr(:,:,:,:),duds(:,:,:,:))
610, Loop not vectorized/parallelized: too deeply nested
611, Loop not vectorized/parallelized: too deeply nested
613, Loop distributed: 3 new loops
Loop interchange produces reordered loop nest: 618,613
Loop unrolled 8 times (completely unrolled)
Generated vector simd code for the loop
FMA (fused multiply-add) instruction(s) generated
618, Loop is parallelizable
645, !$omp target teams loop
645, Generating "nvkernel_axhelm_omp__F1L645_2" GPU kernel
Generating Tesla code
646, Loop parallelized across teams, threads(128) collapse(4) ! blockidx%x threadidx%x
647, ! blockidx%x threadidx%x collapsed
648, ! blockidx%x threadidx%x collapsed
649, ! blockidx%x threadidx%x collapsed
653, Loop run sequentially
645, Generating Multicore code
646, Loop parallelized across threads
645, Generating implicit map(tofrom:dxtm1(:,:),dudt(:,:,:,:),au(:,:,:,:),dudr(:,:,:,:),duds(:,:,:,:))
646, Loop not vectorized/parallelized: too deeply nested
647, Loop not vectorized/parallelized: too deeply nested
649, Loop distributed: 3 new loops
Loop interchange produces reordered loop nest: 653,649
Loop unrolled 8 times (completely unrolled)
Generated vector simd code for the loop
FMA (fused multiply-add) instruction(s) generated
653, Loop is parallelizable
And these are the loop bounds:
nelt: 9120
lz1: 8
ly1: 8
lx1: 8
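For reference, here is the rough, untested sketch I mentioned above of the kind of restructuring I am asking about: fusing the two kernels into a single TARGET TEAMS DISTRIBUTE over the elements and replacing dudr/duds/dudt with team-private scratch arrays (I call them ur, us, ut here; those names and the inner PARALLEL DO mapping are only my guess), so the intermediate results never go through global memory. I have no idea whether nvfortran maps this pattern efficiently, which is really part of my question:
! Sketch only, not compiled or benchmarked. ur, us, ut would be
! declared with the other locals; lx1, ly1, lz1 are compile-time
! parameters here. The ALLOC of dudr/duds/dudt in the enclosing
! TARGET DATA region would no longer be needed.
      real ur(lx1,ly1,lz1), us(lx1,ly1,lz1), ut(lx1,ly1,lz1)
!$OMP TARGET TEAMS DISTRIBUTE PRIVATE(ur,us,ut)
      do e=1,nelt
!$OMP PARALLEL DO COLLAPSE(3)
!$OMP& PRIVATE(tmph1,tmpu1,tmpu2,tmpu3,l,wr,ws,wt)
         do k=1,lz1
            do j=1,ly1
               do i=1,lx1
                  tmph1 = helm1(i,j,k,e)
                  tmpu1 = 0.0
                  tmpu2 = 0.0
                  tmpu3 = 0.0
                  do l=1,lx1
                     tmpu1 = tmpu1 + dxm1(i,l)*u(l,j,k,e)
                     tmpu2 = tmpu2 + dxm1(j,l)*u(i,l,k,e)
                     tmpu3 = tmpu3 + dxm1(k,l)*u(i,j,l,e)
                  enddo
                  wr = g1m1(i,j,k,e)*tmpu1 + g4m1(i,j,k,e)*tmpu2
     $               + g5m1(i,j,k,e)*tmpu3
                  ws = g2m1(i,j,k,e)*tmpu2 + g4m1(i,j,k,e)*tmpu1
     $               + g6m1(i,j,k,e)*tmpu3
                  wt = g3m1(i,j,k,e)*tmpu3 + g5m1(i,j,k,e)*tmpu1
     $               + g6m1(i,j,k,e)*tmpu2
! store the derivatives in team-private scratch instead of dudr/duds/dudt
                  ur(i,j,k) = wr * tmph1
                  us(i,j,k) = ws * tmph1
                  ut(i,j,k) = wt * tmph1
               enddo
            enddo
         enddo
! the implicit barrier at the end of the PARALLEL DO should make sure
! ur/us/ut are complete before the second phase reads them
!$OMP PARALLEL DO COLLAPSE(3) PRIVATE(tmpu1,tmpu2,tmpu3,l)
         do k=1,lz1
            do j=1,ly1
               do i=1,lx1
                  tmpu1 = 0.0
                  tmpu2 = 0.0
                  tmpu3 = 0.0
                  do l=1,lx1
                     tmpu1 = tmpu1 + dxtm1(i,l)*ur(l,j,k)
                     tmpu2 = tmpu2 + dxtm1(j,l)*us(i,l,k)
                     tmpu3 = tmpu3 + dxtm1(k,l)*ut(i,j,l)
                  enddo
                  au(i,j,k,e) = tmpu1 + tmpu2 + tmpu3
               enddo
            enddo
         enddo
      enddo
My thinking, possibly wrong: with lx1 = ly1 = lz1 = 8 each element has 512 points, so one team per element with 128 threads would give each thread 4 points per phase, and the element-local scratch might stay in fast memory. I don't know whether that is how the compiler would actually map it.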
Thanks.