Device arrays may not be automatic

Hi,

I am wondering how to have an array that is used in a device kernel that has array bounds that are not known prior to run time.

I have made a little example where A and B are vectors of nTotal elements. nTotal is not known until runtime. A is a variable that gets allocated and assigned in the main program. B is intended to be a device array that stays on the device only. B needs to be an array because the elements of B are unique:

Module Kernel

Use cudafor

Contains

	Attributes(Global) Subroutine NotPassedAutomatic(A,nTotal)

	Implicit None

	Integer:: i
	Integer, Value:: nTotal
	Integer, Device:: A(nTotal), B(nTotal)

	B = 2

	i = (blockIdx%x-1)*blockDim%x + threadIdx%x

	If (i >= 1 .and. i <= nTotal) Then
		A(i) = A(i) + B(i)
	End If

	End Subroutine NotPassedAutomatic

End Module Kernel

Program Main

	Use cudafor
	Use Kernel
	
	Integer:: nTotal
	Integer, Allocatable:: A(:)
	
	Integer, Device, Allocatable:: A_d(:)

	nTotal = 1024

	Allocate(A(nTotal),A_d(nTotal))
	A = 1

	A_d = A

	Call NotPassedAutomatic<<<ceiling(real(nTotal)/128),128>>>(A_d,nTotal)

	A = A_d

	If (any(A /= 3)) Then
		Write(*,*) "Failed A not equal to 3"
	Else
		Write(*,*) "Passed A equal to 3"
	End If

	Deallocate(A,A_d)

End Program Main

The program will compile and run if you change B to a scalar value in the device kernel.

Can someone help me understand how to do this properly.

Thank you,

Kirk

To give a bit more context, B would typically be a array to hold a state variable that is unique for each point in a continuum. The state variable is needed to advance the solution of the simulation, but is not needed to be displayed in the results.

Kirk

Hi Kirk,

It looks to me that you meant for “B” to be global with only “B(i)” to be accessed by an individual thread. In this case, it’s best to move “B” to the module data section, make it an allocatable device array, then assign it’s value from the host.

% cat test.cuf
Module Kernel

Use cudafor

   integer, allocatable, device, dimension(:) :: B
Contains

   Attributes(Global) Subroutine NotPassedAutomatic(A,nTotal)

   Implicit None

   Integer:: i
   Integer, Value:: nTotal
   Integer, Device:: A(nTotal)

   i = (blockIdx%x-1)*blockDim%x + threadIdx%x

   If (i >= 1 .and. i <= nTotal) Then
      A(i) = A(i) + B(i)
   End If

   End Subroutine NotPassedAutomatic

End Module Kernel

Program Main

   Use cudafor
   Use Kernel

   Integer:: nTotal
   Integer, Allocatable:: A(:)

   Integer, Device, Allocatable:: A_d(:)

   nTotal = 1024

   Allocate(A(nTotal),A_d(nTotal),B(nTotal))
   A = 1
   B = 2
   A_d = A

   Call NotPassedAutomatic<<<ceiling(real(nTotal)/128),128>>>(A_d,nTotal)

   A = A_d

   If (any(A /= 3)) Then
      Write(*,*) "Failed A not equal to 3"
   Else
      Write(*,*) "Passed A equal to 3"
   End If

   Deallocate(A,A_d,B)

End Program Main
% pgfortran test.cuf ; a.out
 Passed A equal to 3

Note the failure in your original code is because you were trying to use automatic arrays. Automatics are only supported when used as shared arrays who’s size in bytes are passed as the fourth argument in the kernel launch configuration.

For later version of the PGI compiler and NVIDIA devices with support for relocatable device code (RDC), you could also have each thread allocate a temp array. However, I wouldn’t recommend do this since having all threads allocate memory from the device can hurt performance.

  • Mat

That makes sense, thank you!

Kirk