CUDA Fortran strange behaviour using automatic arrays and kinds in kernels

Hello,

we are experiencing serious issues when trying to use some Fortran features in CUDA Fortran kernels.

(1) Kinds defined via preprocessor #define cannot be applied to literal constants.

(2) Automatic arrays whose sizes are passed as arguments fail unpredictably at run time, depending on their own size or the size of other variables.

We attach a reproducing example below.

#define RTYPE 8

module kernels_test
  use cudafor
contains
  attributes(global) subroutine mykernel(nx, ny, ng, v)
    integer, value :: nx, ny, ng
    real(RTYPE), device :: v(1-ng:nx+ng, 1-ng:ny+ng)

    ! [1] It works
    real(RTYPE) :: v_auto(-1:62)

    ! [1] It does NOT work (at run time)
    !real(RTYPE) :: v_auto(1-ng:nx+ng)

    !real(RTYPE) :: v_auto(-1:62)
    !real(RTYPE) :: v_auto(10,10,10,10)
    integer, parameter :: rtypepar = RTYPE
    integer :: i, j

    i = blockDim%x*(blockIdx%x-1) + threadIdx%x
    j = blockDim%y*(blockIdx%y-1) + threadIdx%y
    if (i<=nx .and. j<=ny) then
      v_auto = 20.

      ! [2] It works
      v(i,j) = 1000._rtypepar + v_auto(1)

      ! [2] It does NOT work (at compile time)
      !v(i,j) = 1000._RTYPE + v_auto(1)
    endif
  end subroutine mykernel
end module kernels_test

module const
  use cudafor
  integer :: nx, ny, ng, iermpi
  real(RTYPE), allocatable, dimension(:,:) :: v, v_gpu
  attributes(device) :: v_gpu
end module const

program main
  use const
  use cudafor
  use kernels_test
  type(dim3) :: dimGrid, dimBlock
  integer :: iercuda
  iermpi = cudaSetDevice(0)
  nx = 60
  ny = 400
  ng = 2
  allocate(v(1-ng:nx+ng, 1-ng:ny+ng))
  allocate(v_gpu(1-ng:nx+ng, 1-ng:ny+ng))
  call random_number(v)
  !print*,'before: ',v
  v_gpu = v

  dimBlock = dim3(16,16,1)
  dimGrid = dim3((nx+dimBlock%x-1)/dimBlock%x, (ny+dimBlock%y-1)/dimBlock%y, 1)
  call mykernel<<<dimGrid, dimBlock>>>(nx, ny, ng, v_gpu)
  iercuda = cudaDeviceSynchronize()
  print*,'cuda error: ',cudaGetErrorString(iercuda)
  v = v_gpu
  !print*,'after: ',v
end program main

Thanks
Francesco

Hi Francesco,

! [2] It does NOT work (at compilation time)
!v(i,j) = 1000._RTYPE + v_auto(1)

A preprocessor substitutes whole tokens only, and “1000._RTYPE” is a single token, so no preprocessor (not just PGI’s Fortran preprocessor) will replace the RTYPE inside it — this is invalid. Instead, I would suggest something like the following:

#define RTYPE 8
#define RT(N) N##_8
...
v(i,j) = RT(1000.) + v_auto(1)
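
If you want the suffix to track RTYPE automatically instead of hardcoding `_8` in the macro, the standard two-level expansion trick for C-style preprocessors should also work here; a sketch, assuming the file is run through a cpp-style preprocessor (e.g. a `.CUF` source):

! Sketch: make the pasted kind suffix follow RTYPE automatically.
! PASTE forwards to PASTE2 so that RTYPE is macro-expanded *before*
! the ## pasting happens (the usual cpp two-level idiom).
#define RTYPE 8
#define PASTE2(a,b) a##_##b
#define PASTE(a,b) PASTE2(a,b)
#define RT(N) PASTE(N, RTYPE)

! RT(1000.) then expands to 1000._8, and changing RTYPE to 4
! changes every literal to 1000._4 with no other edits.
v(i,j) = RT(1000.) + v_auto(1)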



(2) Automatic arrays whose sizes are passed as arguments fail unpredictably at run time, depending on their own size or the size of other variables.

It is highly recommended not to use automatic arrays in kernels. The problem is that automatics are implicitly allocated on the device heap. Besides allocation being very slow when performed from within device code, the device heap is very small (default ~8MB), so as the arrays get larger or the thread count grows, the allocations overflow the heap and produce a runtime error. This is most likely what’s happening here. Although you can increase the heap size via a call to cudaDeviceSetLimit (see: https://www.pgroup.com/resources/docs/20.1/x86/cuda-fortran-prog-guide/index.htm#dm-cudadevicesetlimit), there’s still a maximum heap size (~32MB), so it’s not the best solution.
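
For completeness, raising the heap limit is a one-line call from host code; a sketch (the exact ceiling is device and driver dependent, and the call must precede the first kernel launch):

! Sketch: enlarge the device heap used for in-kernel allocations.
! Must be issued before the first kernel launch that allocates.
integer :: istat
istat = cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32_8*1024*1024)
if (istat /= cudaSuccess) print *, cudaGetErrorString(istat)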

The two recommended options are either to use fixed-size arrays, or to make the arrays shared and pass the size of the shared memory as the third argument of the kernel’s launch configuration.

Examples on using dynamic shared memory can be found at: https://devblogs.nvidia.com/using-shared-memory-cuda-fortran/
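
Applied to the reproducer above, the dynamic shared-memory variant would look roughly like this (a sketch; note that a shared array is per thread block, so each thread has to index its own slice of it):

! Sketch: replace the automatic array with dynamic shared memory.
attributes(global) subroutine mykernel(nx, ny, ng, v)
  integer, value :: nx, ny, ng
  real(RTYPE), device :: v(1-ng:nx+ng, 1-ng:ny+ng)
  real(RTYPE), shared :: v_auto(*)   ! size comes from the launch config
  ! ... kernel body as before, each thread using its own slice ...
end subroutine mykernel

! Host side: the third launch-config argument is shared bytes per block.
nbytes = 8*(nx+2*ng)   ! 8 bytes per element for real(8)
call mykernel<<<dimGrid, dimBlock, nbytes>>>(nx, ny, ng, v_gpu)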

Hope this helps,
Mat

Hi Mat,

thanks for the reply.

As for the preprocessor substring replacement, it was my fault, sorry.

As for the automatic arrays, thanks for the clarification on the limits involved. Unfortunately, we are developing a multi-block code where the size of the matrices depends on the computational block, so we cannot use fixed-size arrays. We will consider using dynamic shared arrays (even though we do not need to share memory between CUDA threads), but I am afraid the amount of shared memory per SM is not enough to store our arrays. Maybe defining global arrays in global memory could be an alternative solution.

Best,
Francesco

Maybe defining global arrays in global memory could be an alternative solution.

Yes, that would work. Though you’d want to manually privatize them by adding another dimension sized to the block size you’ll be using.
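
A sketch of that approach, with hypothetical names (`work`, `nthreads`): add a thread dimension to a device array in global memory so each thread owns a private slice, sized from the launch configuration:

! Sketch: per-thread private scratch in global memory (hypothetical names).
module scratch
  use cudafor
  real(8), device, allocatable :: work(:,:)   ! (element, thread)
end module scratch

! Host side: allocate one column per thread in the launch grid.
nthreads = (dimGrid%x*dimBlock%x)*(dimGrid%y*dimBlock%y)
allocate(work(1-ng:nx+ng, nthreads))

! Device side: each thread computes a unique global id and uses
! work(:, tid) wherever the automatic array appeared before.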

-Mat