Hi,
I am wondering how to have an array that is used in a device kernel that has array bounds that are not known prior to run time.
I have made a little example where A and B are vectors of nTotal elements. nTotal is not known until runtime. A is a variable that gets allocated and assigned in the main program. B is intended to be a device array that stays on the device only. B needs to be an array because the elements of B are unique:
Module Kernel
Use cudafor
Contains
Attributes(Global) Subroutine NotPassedAutomatic(A,nTotal)
Implicit None
Integer:: i
Integer, Value:: nTotal
Integer, Device:: A(nTotal), B(nTotal)
B = 2
i = (blockIdx%x-1)*blockDim%x + threadIdx%x
If (i >= 1 .and. i <= nTotal) Then
A(i) = A(i) + B(i)
End If
End Subroutine NotPassedAutomatic
End Module Kernel
Program Main
Use cudafor
Use Kernel
Integer:: nTotal
Integer, Allocatable:: A(:)
Integer, Device, Allocatable:: A_d(:)
nTotal = 1024
Allocate(A(nTotal),A_d(nTotal))
A = 1
A_d = A
Call NotPassedAutomatic<<<ceiling(real(nTotal)/128),128>>>(A_d,nTotal)
A = A_d
If (any(A /= 3)) Then
Write(*,*) "Failed A not equal to 3"
Else
Write(*,*) "Passed A equal to 3"
End If
Deallocate(A,A_d)
End Program Main
The program will compile and run if you change B to a scalar value in the device kernel.
Can someone help me understand how to do this properly.
Thank you,
Kirk
To give a bit more context, B would typically be a array to hold a state variable that is unique for each point in a continuum. The state variable is needed to advance the solution of the simulation, but is not needed to be displayed in the results.
Kirk
Hi Kirk,
It looks to me that you meant for “B” to be global with only “B(i)” to be accessed by an individual thread. In this case, it’s best to move “B” to the module data section, make it an allocatable device array, then assign it’s value from the host.
% cat test.cuf
Module Kernel
Use cudafor
integer, allocatable, device, dimension(:) :: B
Contains
Attributes(Global) Subroutine NotPassedAutomatic(A,nTotal)
Implicit None
Integer:: i
Integer, Value:: nTotal
Integer, Device:: A(nTotal)
i = (blockIdx%x-1)*blockDim%x + threadIdx%x
If (i >= 1 .and. i <= nTotal) Then
A(i) = A(i) + B(i)
End If
End Subroutine NotPassedAutomatic
End Module Kernel
Program Main
Use cudafor
Use Kernel
Integer:: nTotal
Integer, Allocatable:: A(:)
Integer, Device, Allocatable:: A_d(:)
nTotal = 1024
Allocate(A(nTotal),A_d(nTotal),B(nTotal))
A = 1
B = 2
A_d = A
Call NotPassedAutomatic<<<ceiling(real(nTotal)/128),128>>>(A_d,nTotal)
A = A_d
If (any(A /= 3)) Then
Write(*,*) "Failed A not equal to 3"
Else
Write(*,*) "Passed A equal to 3"
End If
Deallocate(A,A_d,B)
End Program Main
% pgfortran test.cuf ; a.out
Passed A equal to 3
Note the failure in your original code is because you were trying to use automatic arrays. Automatics are only supported when used as shared arrays who’s size in bytes are passed as the fourth argument in the kernel launch configuration.
For later version of the PGI compiler and NVIDIA devices with support for relocatable device code (RDC), you could also have each thread allocate a temp array. However, I wouldn’t recommend do this since having all threads allocate memory from the device can hurt performance.
That makes sense, thank you!
Kirk