My problem is that I want to replace some existing Fortran subroutines with CUDA C functions to accelerate the calculation. Mixed-language programming is not a big deal. The issue is that I want to minimize data transfer between host and device memory, and I don’t know whether that is possible.
The logical flow looks like this: “Fortran” --> “CUDA initialization” --> “Fortran” --> “CUDA kernel functions” --> “Fortran”.
There are two huge arrays that remain constant during the calculation. I want to copy them from host memory to device memory in the “CUDA initialization” stage, so that the “CUDA kernel functions” can access them. Since the arrays are large, it is not possible to store them in device constant memory (which can be seen from both host and device sides). So, the first question is: is it possible to write code this way? The second question is: if I store them in device global memory, how can I access them in kernel functions (by defining device global variables?)? Thanks for any input!
Yes, device global variables persist between kernel calls, and it is indeed a good idea to save PCIe bandwidth by keeping large arrays on the device rather than copying them back and forth between host and device.
If the arrays have a fixed size known at compile time, just declare them on the device (as __device__ variables) and use cudaMemcpyToSymbol() to copy the data from the host to the device.
Otherwise, use cudaMalloc() and cudaMemcpy() to dynamically allocate device memory and fill it from the copy on the host, and pass the pointer to it as an argument to your kernel.
By the way, constant memory cannot be seen from the host side. You have to copy the contents using cudaMemcpy()/cudaMemcpyToSymbol(), just as with any other device memory.
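To illustrate the fixed-size case described above, here is a minimal sketch. The array size N, the array name d_table, the kernel scale, and the helpers upload_table and run_scale are all hypothetical names chosen for illustration; only the CUDA runtime calls are real API.

```cuda
#include <cuda_runtime.h>

#define N 1024  // hypothetical compile-time array size

// Statically declared array in device global memory. It keeps its contents
// across kernel launches for as long as the CUDA context lives.
__device__ float d_table[N];

// Hypothetical kernel: refers to the persistent array directly by name.
__global__ void scale(float *out, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) out[i] = factor * d_table[i];
}

// Copy the host data into the __device__ symbol once (initialization stage).
int upload_table(const float *h_table)
{
    cudaError_t err = cudaMemcpyToSymbol(d_table, h_table, N * sizeof(float));
    return err == cudaSuccess ? 0 : -1;
}

// Later stage: launch the kernel without re-sending the table.
int run_scale(float *h_out, float factor)
{
    float *d_out = NULL;
    if (cudaMalloc((void **)&d_out, N * sizeof(float)) != cudaSuccess) return -1;
    scale<<<(N + 255) / 256, 256>>>(d_out, factor);
    // cudaMemcpy on the default stream waits for the kernel to finish first.
    cudaError_t err = cudaMemcpy(h_out, d_out, N * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : -1;
}
```

Only the result array crosses the PCIe bus on each call to run_scale; the table is uploaded once. For arrays whose size is only known at run time, the cudaMalloc() route is the one to use instead.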
subroutine vadd( A, B, C )
real(4), dimension(:) :: A, B, C
real(4), device, allocatable:: Ad(:), Bd(:), Cd(:) ! << declaring arrays on the GPU
integer :: N
N = size( A, 1 )
allocate( Ad(N), Bd(N), Cd(N) ) ! << allocating arrays on the GPU
Ad = A(1:N) ! << copies the data to the GPU
Bd = B(1:N)
call vaddkernel<<<(N+31)/32,32>>>( Ad, Bd, Cd, N ) ! << calling a kernel and telling it the arrays to work on
! could call other kernels here
C(1:N) = Cd ! << copy results back from GPU
deallocate( Ad, Bd, Cd ) ! << free up GPU memory
end subroutine vadd
You will need a glue layer in C anyway (e.g., Fortran subroutine calls C function which then invokes the kernel). So just keep the device pointers within that glue layer and don’t expose them to the Fortran side.
You don’t need extra files for the glue layer, though: just put it into the .cu file. And don’t forget to declare the glue layer functions as extern "C"!
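A glue layer along these lines might look like the following sketch. The function names cuda_init_, cuda_compute_, and cuda_finalize_, the kernel combine, and the element-wise sum it computes are all placeholders. The trailing underscore and the pass-by-reference arguments assume gfortran's default calling convention; other compilers may mangle names differently.

```cuda
#include <cuda_runtime.h>

// Device pointers stay private to this .cu file; Fortran never sees them.
static float *d_a = NULL;
static float *d_b = NULL;
static int    d_n = 0;

// Hypothetical kernel standing in for the real computation.
__global__ void combine(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

// "CUDA initialization" stage: allocate and upload the two constant arrays.
// Assumed Fortran call: call cuda_init(a, b, n, ierr)
extern "C" void cuda_init_(const float *a, const float *b,
                           const int *n, int *ierr)
{
    size_t bytes = (size_t)(*n) * sizeof(float);
    d_n = *n;
    *ierr = 1;  // nonzero means failure
    if (cudaMalloc((void **)&d_a, bytes) != cudaSuccess) return;
    if (cudaMalloc((void **)&d_b, bytes) != cudaSuccess) return;
    if (cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice) != cudaSuccess) return;
    if (cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice) != cudaSuccess) return;
    *ierr = 0;
}

// "CUDA kernel functions" stage: reuses the arrays uploaded earlier.
// Assumed Fortran call: call cuda_compute(out, ierr)
extern "C" void cuda_compute_(float *out, int *ierr)
{
    float *d_out = NULL;
    size_t bytes = (size_t)d_n * sizeof(float);
    *ierr = 1;
    if (d_a == NULL || d_b == NULL || d_n <= 0) return;  // init not called
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) return;
    combine<<<(d_n + 255) / 256, 256>>>(d_a, d_b, d_out, d_n);
    // cudaMemcpy on the default stream waits for the kernel to finish first.
    if (cudaMemcpy(out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess)
        *ierr = 0;
    cudaFree(d_out);
}

// Release the persistent device arrays at the end of the run.
extern "C" void cuda_finalize_(void)
{
    cudaFree(d_a);
    cudaFree(d_b);
    d_a = d_b = NULL;
    d_n = 0;
}
```

On the Fortran side, under the same assumptions, the program would call cuda_init once, run its intermediate Fortran code, and then call cuda_compute as often as needed. The two large arrays are transferred only once.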
I am aware of the PGI CUDA Fortran compiler but currently have no budget to buy it. That is why I want to do mixed-language programming, which I have successfully tested several times using the gcc compilers.