I am trying to offload four nested loops to the GPU with OpenACC. Here is a simplified example:
Subroutine indexed_copy_4d( &
    arr_dst, arr_src, &
    i0,i1,is, j0,j1,js, k0,k1,ks, m0,m1,ms, &
    ki_dst, kj_dst, kk_dst, km_dst, kc_dst, &
    ki_src, kj_src, kk_src, km_src, kc_src )
  Implicit None
  Real, Intent(out), Dimension(1:) :: arr_dst
  Real, Intent(in), Dimension(1:) :: arr_src
  Integer, Intent(in) :: &
    i0,i1,is, j0,j1,js, k0,k1,ks, m0,m1,ms, &
    ki_dst, kj_dst, kk_dst, km_dst, kc_dst, &
    ki_src, kj_src, kk_src, km_src, kc_src
  Integer :: i,j,k,m
  !$acc kernels present(arr_dst,arr_src)
  !$acc loop independent
  do i=i0,i1,is
    !$acc loop independent
    do j=j0,j1,js
      !$acc loop independent
      do k=k0,k1,ks
        !$acc loop seq ! $$$$
        do m=m0,m1,ms ! $$$$
          arr_dst(ki_dst*i+kj_dst*j+kk_dst*k+kc_dst) = arr_src(ki_src*i+kj_src*j+kk_src*k+kc_src)
        enddo ! $$$$
      enddo
    enddo
  enddo
  !$acc end kernels
End Subroutine indexed_copy_4d
Eventually m needs to be included in the calculated index, but that is irrelevant here. The problem is that the compiler always fails with an internal error:
PGF90-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unknown variable reference (nestedloop.f90: 23)
PGF90-S-0000-Internal compiler error. gen_aili: unrec. ili opcode: 345 (nestedloop.f90: 29)
pgf90-Fatal-/home/lluo6/pgi/linux86-64/13.8/bin/pgf902 TERMINATED by signal 11
Arguments to /home/lluo6/pgi/linux86-64/13.8/bin/pgf902
/home/lluo6/pgi/linux86-64/13.8/bin/pgf902 /tmp/pgf90RsHcbFp-ah1T.ilm -fn nestedloop.f90 -opt 2 -terse 1 -inform warn -x 51 0x20 -x 119 0xa10000 -x 122 0x40 -x 123 0x1000 -x 127 4 -x 127 17 -x 19 0x400000 -x 28 0x40000 -x 120 0x10000000 -x 70 0x8000 -x 122 1 -x 125 0x20000 -quad -x 59 4 -x 59 4 -tp istanbul -x 120 0x1000 -x 124 0x1400 -y 15 2 -x 57 0x3b0000 -x 58 0x48000000 -x 49 0x100 -x 120 0x200 -astype 0 -x 70 0x40000000 -x 124 1 -accel nvidia -accel host -x 186 0x80000 -x 180 0x400 -x 180 0x4000000 -x 163 0x1 -x 189 8 -x 176 0x140000 -x 177 0x0202007f -x 176 0x100 -x 186 0x10000 -x 176 0x100 -x 186 0x20000 -x 176 0x100 -x 176 0x100 -x 189 4 -y 70 0x40000000 -cmdline '+pgf90 nestedloop.f90 -acc -c' -asm /tmp/pgf90ZsHczs4J7LZr.s
I tried using the parallel construct instead, changing the loop order, and so on, but I always get an internal error like the one above.
However, if I remove all the lines marked with “! $$$$” (that is, the innermost loop), the compilation finishes without any problem.
It would be straightforward to implement equivalent code in CUDA, so I really don’t know why a sequential loop inside a kernel thread would cause any trouble like this.
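For reference, here is a minimal host-only sketch of what the copy would compute once m enters the index. It assumes m contributes km_dst*m and km_src*m in the same way as the other loop variables, which the example above does not confirm. The module name indexed_copy_ref_mod, the subroutine name indexed_copy_4d_ref, and the small driver program are hypothetical. The sketch contains no OpenACC directives, so it only pins down the intended result and does not reproduce the compiler error.

```fortran
! Host-only reference sketch (no OpenACC). All names here are illustrative.
module indexed_copy_ref_mod
  implicit none
contains
  subroutine indexed_copy_4d_ref( &
      arr_dst, arr_src, &
      i0,i1,is, j0,j1,js, k0,k1,ks, m0,m1,ms, &
      ki_dst, kj_dst, kk_dst, km_dst, kc_dst, &
      ki_src, kj_src, kk_src, km_src, kc_src )
    real, intent(out), dimension(1:) :: arr_dst
    real, intent(in), dimension(1:) :: arr_src
    integer, intent(in) :: &
      i0,i1,is, j0,j1,js, k0,k1,ks, m0,m1,ms, &
      ki_dst, kj_dst, kk_dst, km_dst, kc_dst, &
      ki_src, kj_src, kk_src, km_src, kc_src
    integer :: i,j,k,m
    do i = i0, i1, is
      do j = j0, j1, js
        do k = k0, k1, ks
          do m = m0, m1, ms
            ! Assumption: m enters the index with strides km_dst and km_src.
            arr_dst(ki_dst*i + kj_dst*j + kk_dst*k + km_dst*m + kc_dst) = &
              arr_src(ki_src*i + kj_src*j + kk_src*k + km_src*m + kc_src)
          enddo
        enddo
      enddo
    enddo
  end subroutine indexed_copy_4d_ref
end module indexed_copy_ref_mod

program demo_indexed_copy
  use indexed_copy_ref_mod
  implicit none
  real :: src(16), dst(16)
  integer :: n
  src = (/ (real(n), n = 1, 16) /)
  ! Treat both arrays as 2x2x2x2 with zero-based loop indices.
  ! Destination strides 1,2,4,8 and source strides 8,4,2,1 reverse
  ! the dimension order: dst(i,j,k,m) = src(m,k,j,i).
  call indexed_copy_4d_ref(dst, src, &
      0,1,1, 0,1,1, 0,1,1, 0,1,1, &
      1, 2, 4, 8, 1, &
      8, 4, 2, 1, 1)
  print '(16F5.0)', dst
end program demo_indexed_copy
```

The module wrapper is only there to give the assumed-shape dummy arguments an explicit interface. A host version like this can serve as a baseline to check the accelerated version against once it compiles.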
Comments are welcome.