CUDA Fortran strange behaviour using automatic arrays and kinds in kernels

Hello,

we are experiencing serious issues when trying to use some Fortran features in CUDA Fortran kernels.

(1) Kinds defined via preprocessor #define cannot be applied to literal constants.

(2) Automatic arrays whose sizes are passed as arguments fail unpredictably at run time, depending on their own size or the size of other variables.

We attach a reproducing example below.

#define RTYPE 8

module kernels_test
  use cudafor
contains
  attributes(global) subroutine mykernel(nx, ny, ng, v)
    integer, value :: nx, ny, ng
    real(RTYPE), device :: v(1-ng:nx+ng, 1-ng:ny+ng)

    ! [1] It works
    real(RTYPE) :: v_auto(-1:62)

    ! [1] It does NOT work (at run time)
    !real(RTYPE) :: v_auto(1-ng:nx+ng)

    !real(RTYPE) :: v_auto(-1:62)
    !real(RTYPE) :: v_auto(10,10,10,10)
    integer, parameter :: rtypepar = RTYPE
    integer :: i, j

    i = blockDim%x*(blockIdx%x-1) + threadIdx%x
    j = blockDim%y*(blockIdx%y-1) + threadIdx%y
    if (i<=nx .and. j<=ny) then
      v_auto = 20.

      ! [2] It works
      v(i,j) = 1000._rtypepar + v_auto(1)

      ! [2] It does NOT work (at compile time)
      !v(i,j) = 1000._RTYPE + v_auto(1)
    endif
  end subroutine mykernel
end module kernels_test

module const
  use cudafor
  integer :: nx, ny, ng, iermpi
  real(RTYPE), allocatable, dimension(:,:) :: v, v_gpu
  attributes(device) :: v_gpu
end module const

program main
  use const
  use cudafor
  use kernels_test
  type(dim3) :: dimGrid, dimBlock
  integer :: iercuda
  iermpi = cudaSetDevice(0)
  nx = 60
  ny = 400
  ng = 2
  allocate(v(1-ng:nx+ng, 1-ng:ny+ng))
  allocate(v_gpu(1-ng:nx+ng, 1-ng:ny+ng))
  call random_number(v)
  !print*,'before: ',v
  v_gpu = v

  dimBlock = dim3(16,16,1)
  dimGrid = dim3((nx+dimBlock%x-1)/dimBlock%x, (ny+dimBlock%y-1)/dimBlock%y, 1)
  call mykernel<<<dimGrid, dimBlock>>>(nx, ny, ng, v_gpu)
  iercuda = cudaDeviceSynchronize()
  print*,'cuda error: ',cudaGetErrorString(iercuda)
  v = v_gpu
  !print*,'after: ',v
end program main

Thanks
Francesco

Hi Francesco,

! [2] It does NOT work (at compilation time)
!v(i,j) = 1000._RTYPE + v_auto(1)

A preprocessor substitutes whole tokens only, and “1000._RTYPE” is a single token, so no preprocessor (not just PGI’s Fortran preprocessor) will replace the RTYPE inside it — this is invalid. Instead, I would suggest something like the following:

#define RTYPE 8
#define RT(N) N##_8
...
v(i,j) = RT(1000.) + v_auto(1)
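
If you want the suffix to track RTYPE automatically instead of hardcoding `_8` in the macro, the standard two-level expansion trick for C-style preprocessors should also work here; a sketch, assuming the file is run through a cpp-style preprocessor (e.g. a `.CUF` source):

! Sketch: make the pasted kind suffix follow RTYPE automatically.
! PASTE forwards to PASTE2 so that RTYPE is macro-expanded *before*
! the ## pasting happens (the usual cpp two-level idiom).
#define RTYPE 8
#define PASTE2(a,b) a##_##b
#define PASTE(a,b) PASTE2(a,b)
#define RT(N) PASTE(N, RTYPE)

! RT(1000.) then expands to 1000._8, and changing RTYPE to 4
! changes every literal to 1000._4 with no other edits.
v(i,j) = RT(1000.) + v_auto(1)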



(2) Automatic arrays whose sizes are passed as arguments fail unpredictably at run time, depending on their own size or the size of other variables.

It is highly recommended not to use automatic arrays in kernels. The problem is that automatics are implicitly allocated on the device heap. Besides allocation being very slow when performed from within device code, the device heap is very small (default ~8MB), so as the arrays get larger or the thread count grows, the allocations overflow the heap and produce a runtime error. This is most likely what’s happening here. Although you can increase the heap size via a call to cudaDeviceSetLimit (see: https://www.pgroup.com/resources/docs/20.1/x86/cuda-fortran-prog-guide/index.htm#dm-cudadevicesetlimit), there’s still a maximum heap size (~32MB), so it’s not the best solution.
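
For completeness, raising the heap limit is a one-line call from host code; a sketch (the exact ceiling is device and driver dependent, and the call must precede the first kernel launch):

! Sketch: enlarge the device heap used for in-kernel allocations.
! Must be issued before the first kernel launch that allocates.
integer :: istat
istat = cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32_8*1024*1024)
if (istat /= cudaSuccess) print *, cudaGetErrorString(istat)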

The two recommended options are either to use fixed-size arrays, or to make the arrays shared and pass the size of the shared memory as the third argument of the kernel’s launch configuration.

Examples on using dynamic shared memory can be found at: https://devblogs.nvidia.com/using-shared-memory-cuda-fortran/
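
Applied to the reproducer above, the dynamic shared-memory variant would look roughly like this (a sketch; note that a shared array is per thread block, so each thread has to index its own slice of it):

! Sketch: replace the automatic array with dynamic shared memory.
attributes(global) subroutine mykernel(nx, ny, ng, v)
  integer, value :: nx, ny, ng
  real(RTYPE), device :: v(1-ng:nx+ng, 1-ng:ny+ng)
  real(RTYPE), shared :: v_auto(*)   ! size comes from the launch config
  ! ... kernel body as before, each thread using its own slice ...
end subroutine mykernel

! Host side: the third launch-config argument is shared bytes per block.
nbytes = 8*(nx+2*ng)   ! 8 bytes per element for real(8)
call mykernel<<<dimGrid, dimBlock, nbytes>>>(nx, ny, ng, v_gpu)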

Hope this helps,
Mat

Hi Mat,

thanks for the reply.

As for the preprocessor substring replacement, it was my fault, sorry.

As for the automatic arrays, thanks for the clarification on the limits involved. Unfortunately, we are developing a multi-block code where the size of the matrices depends on the computational block, so we cannot use fixed-size arrays. We will consider using dynamic shared arrays (even though we do not need to share memory between CUDA threads), but I am afraid the amount of shared memory per SM is not enough to store our arrays. Maybe defining global arrays in global memory could be an alternative solution.

Best,
Francesco

Maybe defining global arrays in global memory could be an alternative solution.

Yes, that would work. Though you’d want to manually privatize them by adding another dimension sized to the block size you’ll be using.
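
A sketch of that approach, with hypothetical names (`work`, `nthreads`): add a thread dimension to a device array in global memory so each thread owns a private slice, sized from the launch configuration:

! Sketch: per-thread private scratch in global memory (hypothetical names).
module scratch
  use cudafor
  real(8), device, allocatable :: work(:,:)   ! (element, thread)
end module scratch

! Host side: allocate one column per thread in the launch grid.
nthreads = (dimGrid%x*dimBlock%x)*(dimGrid%y*dimBlock%y)
allocate(work(1-ng:nx+ng, nthreads))

! Device side: each thread computes a unique global id and uses
! work(:, tid) wherever the automatic array appeared before.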

-Mat