OpenACC: Array Create = "unsupported statement type&amp

Are local arrays supported in OpenACC? If I change ‘flux_x_temp’ to an allocatable it works, but it feels rather weird to allocate the array on the host only to forget about it completely and let OpenACC use it on the device instead. What’s the best practice for local arrays?

Tested on SUSE Linux and OS X; same results.
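
For reference, a minimal sketch of the allocatable variant that does compile for me (the allocate/deallocate placement here is just my guess; only the declaration changes otherwise):

subroutine kernel()
	implicit none
	real(8), allocatable :: flux_x_temp(:,:,:)
	integer(4) :: i, j

	! allocated on the host, but only ever written on the device
	allocate(flux_x_temp(5,5,1))
!$acc kernels create(flux_x_temp)
!$acc loop independent
	do j=1,5
!$acc loop independent
		do i = 1,5
			flux_x_temp(i,j,1) = 2.0d0
		end do
	end do
!$acc end kernels
	deallocate(flux_x_temp)
end subroutine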

The entire code:

module test
implicit none
contains
subroutine wrapper()
	implicit none
	call kernel ()
end subroutine

subroutine kernel()
	implicit none
	real(8) :: flux_x_temp(5,5,1)
	integer(4) :: i, j

!$acc kernels create(flux_x_temp)
!$acc loop independent
	do j=1,5
!$acc loop independent
		do i = 1,5
			flux_x_temp(i,j,1) = 2.0d0
		end do
	end do
!$acc end kernels
end subroutine
end module

program asuca
use test, only: wrapper
implicit none
call wrapper()
stop
end program

Result:

pgf90 -Minfo=accel,inline -Mneginfo -ta=nvidia test_openacc.f90
PGF90-S-0155-Accelerator region ignored; see -Minfo messages (test_openacc.f90: 16)
kernel:
16, Accelerator region ignored
18, Accelerator restriction: loop contains unsupported statement type
19, Accelerator restriction: unsupported statement type
0 inform, 0 warnings, 1 severes, 0 fatal for kernel
pgf90 -v
Export PGI=/usr/apps.sp3/isv/pgi/14.7
pgf90-Warning-No files to process

Hi MuellerM,

The problem here is that the dead-code elimination optimization removes the assignment, since the “flux_x_temp” array is never used. Compiling the code without optimization (-O0), or actually using the values in the array, will allow the region to be accelerated.

% pgf90 -acc test.f90 -Minfo=accel -O0
kernel:
     14, Generating create(flux_x_temp(:,:,:))
         Generating Tesla code
     16, Loop is parallelizable
     18, Loop is parallelizable
         Accelerator kernel generated
         16, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
         18, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
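
For the second option, a minimal sketch of what “using the values” could look like: switching create to copyout so the result comes back to the host, and consuming it there (the print is just for illustration):

!$acc kernels copyout(flux_x_temp)
!$acc loop independent
	do j=1,5
!$acc loop independent
		do i = 1,5
			flux_x_temp(i,j,1) = 2.0d0
		end do
	end do
!$acc end kernels
	! consuming the result on the host keeps the assignment live,
	! so the region is accelerated even at the default optimization level
	print *, sum(flux_x_temp)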

Hope this helps,
Mat

Thanks Mat, I didn’t expect this to kick in before the OpenACC parallelization. Wouldn’t it be better to run these optimizations after the CUDA (or PTX) code has been generated, to avoid this sort of error?

Edit: Ah, I think this might come from the optimizer touching the code tree only after the Fortran has been parsed, but not again after the CUDA C has been generated? I think I can understand the reasoning behind this: the speedups I sometimes see for compute-bound code versus naive CUDA C code wouldn’t be possible otherwise, unless you made another pass over the generated C code.