Is it possible to use 4 nested loops with OpenACC?

I am trying to run 4 nested loops on the GPU with OpenACC. Here is a simplified example:

Subroutine indexed_copy_4d( &
   arr_dst, arr_src, &
   i0,i1,is, j0,j1,js, k0,k1,ks, m0,m1,ms, &
   ki_dst, kj_dst, kk_dst, km_dst, kc_dst, &
   ki_src, kj_src, kk_src, km_src, kc_src )

Implicit None

Real, Intent(out), Dimension(1:) :: arr_dst
Real, Intent(in), Dimension(1:) :: arr_src

Integer, Intent(in) :: &
   i0,i1,is, j0,j1,js, k0,k1,ks, m0,m1,ms, &
   ki_dst, kj_dst, kk_dst, km_dst, kc_dst, &
   ki_src, kj_src, kk_src, km_src, kc_src

Integer :: i,j,k,m

!$acc kernels present(arr_dst,arr_src)
!$acc loop independent
do i=i0,i1,is
!$acc loop independent
do j=j0,j1,js
!$acc loop independent
do k=k0,k1,ks

   !$acc loop seq              ! $$$$
   do m=m0,m1,ms          ! $$$$

      arr_dst(ki_dst*i+kj_dst*j+kk_dst*k+kc_dst) = arr_src(ki_src*i+kj_src*j+kk_src*k+kc_src)

   enddo             ! $$$$

enddo
enddo
enddo
!$acc end kernels

End Subroutine indexed_copy_4d

Eventually m needs to be included in the calculated index, but that’s irrelevant here. The problem is that the compiler always fails with an internal error:

PGF90-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unknown variable reference (nestedloop.f90: 23)
PGF90-S-0000-Internal compiler error. gen_aili: unrec. ili opcode:     345 (nestedloop.f90: 29)
pgf90-Fatal-/home/lluo6/pgi/linux86-64/13.8/bin/pgf902 TERMINATED by signal 11
Arguments to /home/lluo6/pgi/linux86-64/13.8/bin/pgf902
/home/lluo6/pgi/linux86-64/13.8/bin/pgf902 /tmp/pgf90RsHcbFp-ah1T.ilm -fn nestedloop.f90 -opt 2 -terse 1 -inform warn -x 51 0x20 -x 119 0xa10000 -x 122 0x40 -x 123 0x1000 -x 127 4 -x 127 17 -x 19 0x400000 -x 28 0x40000 -x 120 0x10000000 -x 70 0x8000 -x 122 1 -x 125 0x20000 -quad -x 59 4 -x 59 4 -tp istanbul -x 120 0x1000 -x 124 0x1400 -y 15 2 -x 57 0x3b0000 -x 58 0x48000000 -x 49 0x100 -x 120 0x200 -astype 0 -x 70 0x40000000 -x 124 1 -accel nvidia -accel host -x 186 0x80000 -x 180 0x400 -x 180 0x4000000 -x 163 0x1 -x 189 8 -x 176 0x140000 -x 177 0x0202007f -x 176 0x100 -x 186 0x10000 -x 176 0x100 -x 186 0x20000 -x 176 0x100 -x 176 0x100 -x 189 4 -y 70 0x40000000 -cmdline '+pgf90 nestedloop.f90 -acc -c' -asm /tmp/pgf90ZsHczs4J7LZr.s

I tried using the parallel construct, changing the loop order, and so on. I always get an internal error like the one above.

However, if I remove all the lines marked with “! $$$$” (that is, remove the innermost loop), the compilation finishes without any problem.

It would be straightforward to implement equivalent code in CUDA, so I really don’t know why a sequential loop inside a kernel thread would cause any trouble like this.

Comments are welcome.

Hi rikisyo,

This is a compiler bug that appears to have started with release 13.3, when we increased the loop analysis level depth. The error is caused by the skip count (the stride) in the “m” loop, so the workaround is to remove “,ms”.
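
If ms is not always 1, simply deleting “,ms” changes which values of m are visited. One possible variation is to loop over a unit-stride counter and compute m from it. The sketch below only illustrates that index arithmetic on the host. The module and subroutine names are hypothetical, and it has not been checked against 13.8 that this form avoids the internal compiler error.

```fortran
module unit_stride_sketch
  implicit none
contains

  ! Hypothetical helper (not from the original post): records the values
  ! that "do m = m0, m1, ms" would visit, using a loop with no stride.
  ! The caller must make "visited" large enough to hold n values.
  subroutine visit_strided(m0, m1, ms, visited, n)
    integer, intent(in)  :: m0, m1, ms
    integer, intent(out) :: visited(:)
    integer, intent(out) :: n
    integer :: mm, m

    ! Trip count Fortran uses for "do m = m0, m1, ms" (ms must be nonzero)
    n = max((m1 - m0 + ms) / ms, 0)

    ! In the kernel from the question, this would be the "loop seq" loop
    do mm = 0, n - 1
      m = m0 + mm*ms          ! recover the strided index
      visited(mm + 1) = m
    end do
  end subroutine visit_strided

end module unit_stride_sketch
```

In the original subroutine the same idea would mean computing the trip count before the kernels region and replacing “do m=m0,m1,ms” with a loop over the counter, assuming the extra integer arithmetic is acceptable.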

I added TPR#19579 and sent it to engineering. Since we’re in the late stages of 13.9 release testing, I doubt any fix will make it into 13.9. Possible, but more likely this would go into 13.10.

  • Mat

Problem solved.

Thank you!


This has been fixed in the 13.10 release.

thanks,
dave

Thanks for the update!