device arrays may not be automatic

I have problems with a kernel that uses arrays of a size that is not known at compile time:

     attributes(global) SUBROUTINE LWTT_GPU (KLON)
      USE CUDAFOR

      REAL :: ZTTNC(KLON)
 
      ZTTNC(1)=1
      
      END

Compilation leads to ‘PGF90-S-0155-device arrays may not be automatic’.
What alternatives do I have?

Thanks
AClimate

Hi AClimate,

Automatic arrays have an implicit allocation. When CUDA Fortran was first developed, on-device allocation was not available, hence the restriction. Later versions of CUDA do support on-device allocation, but we kept the restriction because having many thousands of threads all allocating memory severely hurts performance.

Instead, you can make these arrays “shared” and pass the size of the shared memory, in bytes, as the third parameter of your kernel launch configuration. This is a good method to use so long as the total does not exceed the amount of shared memory available to a block. Since “shared” memory is shared by all threads in a block, if each thread needs its own memory, add an extra dimension to ZTTNC.

For example, something like

     attributes(global) SUBROUTINE LWTT_GPU (KLON) 
      USE CUDAFOR 
      INTEGER, VALUE :: KLON
      REAL, SHARED :: ZTTNC(blockdim%x,KLON) 
       
      ZTTNC(threadidx%x,1)=1 
      
      END

and on the host side:

     shdsize = 4*KLON*threads   ! bytes: 4-byte REAL x KLON x threads per block
     call LWTT_GPU<<<blocks,threads,shdsize>>>(KLON)

Note that KLON should be passed by value (hence the VALUE attribute), since Fortran passes arguments by reference by default. You could instead make KLON a device scalar, but that would mean keeping a second copy of the value alongside the host one.
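
On the shared memory limit mentioned above, a host-side check before the launch might look like the sketch below. The sizes for klon and threads are made-up values for illustration, and the factor of 4 assumes the default 4-byte REAL (it would change if you compile with something like -r8).

```fortran
program check_shared
  use cudafor
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: istat, shdsize
  ! Hypothetical sizes, for illustration only
  integer, parameter :: klon = 16, threads = 128

  ! Query the properties of device 0
  istat = cudaGetDeviceProperties(prop, 0)
  if (istat /= cudaSuccess) then
    print *, cudaGetErrorString(istat)
    stop
  end if

  ! Bytes of dynamic shared memory needed per block:
  ! 4-byte REAL x klon elements x one row per thread
  shdsize = 4 * klon * threads

  if (shdsize > prop%sharedMemPerBlock) then
    print *, 'need ', shdsize, ' bytes per block, device offers ', &
             prop%sharedMemPerBlock
  else
    print *, 'shared memory request fits: ', shdsize, ' bytes per block'
  end if
end program check_shared
```

If the request does not fit, reducing the number of threads per block is the simplest knob, since the requirement scales with it.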

Also, make sure LWTT_GPU has an explicit interface or is in a module.
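
To show how the pieces fit together, here is a minimal sketch with the kernel placed in a module so that the host gets the explicit interface automatically. The module name, program name, and the values of klon, threads, and blocks are assumptions for illustration; the kernel body only fills the per-thread row so there is something to run.

```fortran
module lwtt_mod
  use cudafor
  implicit none
contains
  attributes(global) subroutine lwtt_gpu(klon)
    integer, value :: klon                    ! passed by value from the host
    ! Dynamic shared memory: one row of klon elements per thread in the block
    real, shared :: zttnc(blockdim%x, klon)
    integer :: k

    do k = 1, klon
      zttnc(threadidx%x, k) = 1.0
    end do
  end subroutine lwtt_gpu
end module lwtt_mod

program launch_lwtt
  use cudafor
  use lwtt_mod                                ! provides the explicit interface
  implicit none
  ! Hypothetical sizes, for illustration only
  integer, parameter :: klon = 16, threads = 128, blocks = 4
  integer :: shdsize, istat

  ! Shared memory per block in bytes, assuming the default 4-byte REAL
  shdsize = 4 * klon * threads

  call lwtt_gpu<<<blocks, threads, shdsize>>>(klon)

  ! Wait for the kernel and report any error
  istat = cudaDeviceSynchronize()
  if (istat /= cudaSuccess) print *, cudaGetErrorString(istat)
end program launch_lwtt
```

Keeping the kernel in a module also lets the compiler check the argument list at the call site, which would catch a missing VALUE attribute mismatch early.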

Hope this helps,
Mat