I have problems with a kernel that uses arrays of a size that is not known at compile time:
attributes(global) SUBROUTINE LWTT_GPU (KLON)
USE CUDAFOR
REAL :: ZTTNC(KLON)
ZTTNC(1)=1
END
Compilation leads to ‘PGF90-S-0155-device arrays may not be automatic’.
What alternatives do I have?
Thanks
AClimate
Hi AClimate,
Automatic arrays are allocated implicitly. When CUDA Fortran was first developed, on-device allocation was not available, hence the restriction. While later versions of CUDA do support on-device allocation, we kept the restriction since having many thousands of threads all allocating memory has a severe negative impact on performance.
Instead, you can make these arrays “shared” and pass the size of the shared memory, in bytes, as the third parameter of your kernel launch configuration. So long as the total does not exceed the amount of shared memory available per block, this is a good method to use. Since “shared” memory is shared by all threads in a block, if each thread needs its own memory, add an extra dimension to ZTTNC.
For example, something like
attributes(global) SUBROUTINE LWTT_GPU (KLON)
USE CUDAFOR
INTEGER, VALUE :: KLON
REAL, SHARED :: ZTTNC(blockdim%x,KLON)
ZTTNC(threadidx%x,1)=1
END
! Host side: shared memory per block in bytes (4 bytes per default REAL)
shdsize = 4*KLON*threads
call LWTT_GPU<<<blocks,threads,shdsize>>>(KLON)
Note that KLON should be passed by value, since Fortran defaults to passing by reference. You could instead make KLON a device scalar, but that would mean maintaining a second copy of it in addition to the host copy.
Also, make sure LWTT_GPU has an explicit interface or is in a module.
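Putting the pieces together, here is a minimal sketch of how the kernel and its launch might look with the kernel in a module. The module name lwtt_mod, the program name, and the values chosen for KLON, blocks, and threads are placeholders I made up for illustration, and the check against sharedMemPerBlock is just one way to confirm the request fits on your device. Treat it as a starting point rather than tested code for your application.

```fortran
module lwtt_mod            ! hypothetical module name
  use cudafor
  implicit none
contains
  attributes(global) subroutine lwtt_gpu(klon)
    integer, value :: klon                    ! passed by value from the host
    ! Dynamic shared array: one row of KLON values per thread in the block
    real, shared :: zttnc(blockdim%x, klon)
    zttnc(threadidx%x, 1) = 1.0
  end subroutine lwtt_gpu
end module lwtt_mod

program launch_lwtt        ! hypothetical driver
  use cudafor
  use lwtt_mod             ! the module supplies the explicit interface
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: klon, blocks, threads, shdsize, istat

  klon    = 16             ! illustrative sizes only
  blocks  = 4
  threads = 64

  ! Bytes of shared memory per block, assuming a 4-byte default REAL
  shdsize = 4 * klon * threads

  ! Check the request against the per-block limit of device 0
  istat = cudaGetDeviceProperties(prop, 0)
  if (shdsize > prop%sharedMemPerBlock) then
    print *, 'Shared memory request too large: ', shdsize
    stop
  end if

  call lwtt_gpu<<<blocks, threads, shdsize>>>(klon)

  ! Wait for the kernel and report any launch or runtime error
  istat = cudaDeviceSynchronize()
  if (istat /= cudaSuccess) print *, cudaGetErrorString(istat)
end program launch_lwtt
```

With these placeholder values the launch requests 4*16*64 = 4096 bytes of shared memory per block. Since the request grows with both KLON and the number of threads per block, a large KLON may force you to use fewer threads per block.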
Hope this helps,
Mat