I am attempting to create a Fortran library that automatically combines small allocations and data transfers into a single large allocation and a single large data transfer. The code I am porting to the GPU has a lot of small arrays. Transferring these arrays to the GPU takes a considerable amount of time because it performs many small transfers instead of a small number of large ones.
    ...
    real, allocatable :: a(:), b(:)
    real, device, allocatable :: ad(:), bd(:)

    allocate(a(N), b(N))
    allocate(ad(N), bd(N))
    ...
    ad = a
    bd = b
    ...
    real, pointer :: a(:), b(:)
    real, device, pointer :: ad(:), bd(:)

    call dbAllocCopy( a, ad, N )
    call dbAllocCopy( b, bd, N )
    call dbFlushAlloc()
    call dbFlushCopy()
    ...
The routine dbAllocCopy in my library would take the requests and store them in a list. Once dbFlushAlloc is called, all of the items in the list would be allocated as one large memory allocation on the CPU and one large memory allocation on the GPU. The pointers a, b, ad, and bd would then be set to the proper sub-arrays of these larger arrays. dbFlushCopy would then copy the data as one large data transfer. The code could then proceed as normal.
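The host half of this queue-then-flush idea can be sketched in standard Fortran. This is only an illustration under stated assumptions: the names pool_request, pool_flush, and pool_get are hypothetical and are not the dbAllocCopy interface above, the request list is a fixed-size table, and requests return an integer handle because a module cannot easily keep a reference to the caller's pointer variable. The device side is left out on purpose, since that is the part in question.

```fortran
module pool_sketch
  ! Hypothetical host-only sketch: queue size requests, then allocate once.
  implicit none
  private
  public :: pool_request, pool_flush, pool_get

  integer, parameter :: max_requests = 64      ! assumed upper bound
  integer :: n_requests = 0
  integer :: first(max_requests), last(max_requests)
  real, allocatable, target :: pool(:)         ! the single large allocation

contains

  ! Record a request for n elements and return a handle to it.
  function pool_request(n) result(id)
    integer, intent(in) :: n
    integer :: id, start
    if (n_requests >= max_requests) error stop "pool_sketch: too many requests"
    start = 1
    if (n_requests > 0) start = last(n_requests) + 1
    n_requests = n_requests + 1
    id = n_requests
    first(id) = start
    last(id)  = start + n - 1
  end function pool_request

  ! Perform one allocation that covers every queued request.
  subroutine pool_flush()
    integer :: total
    total = 0
    if (n_requests > 0) total = last(n_requests)
    allocate(pool(total))
    pool = 0.0
  end subroutine pool_flush

  ! Associate p with the sub-array that belongs to handle id.
  subroutine pool_get(id, p)
    integer, intent(in) :: id
    real, pointer, intent(out) :: p(:)
    p => pool(first(id):last(id))
  end subroutine pool_get

end module pool_sketch

program pool_demo
  use pool_sketch
  implicit none
  integer, parameter :: n = 4
  real, pointer :: a(:), b(:)
  integer :: ia, ib

  ia = pool_request(n)
  ib = pool_request(n)
  call pool_flush()          ! one allocation for both arrays
  call pool_get(ia, a)
  call pool_get(ib, b)

  a = 3.0
  b = 0.8
  print *, "sizes:", size(a), size(b)
  print *, "a(1)*b(1) =", a(1) * b(1)
end program pool_demo
```

Because a and b are sections of one contiguous array, a single assignment from the whole pool to a matching device array would move both in one transfer.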
I could easily do something like this in C with pointers and pointer arithmetic, but Fortran is a bit limited in these features (intentionally so, from what I understand). If device pointers worked, I could implement something like this easily, but the documentation indicates that device pointers are not supported yet.
The following simple program compiles, but the kernel fails to run.
    module kernelModule
    contains
      attributes(global) subroutine gg(ad, bd, cd, N)
        real, pointer, intent(in) :: ad(:), bd(:)
        integer, value, intent(in) :: N
        real, pointer, intent(out) :: cd(:)
        integer :: i
        i = threadidx%x + (blockidx%x - 1) * blockdim%x
        if (i <= N) then
          cd(i) = ad(i) * bd(i)
        endif
      end subroutine
    end module

    program PointerTest
      use cudafor
      use kernelModule
      implicit none
      integer, parameter :: N = 1000
      real, allocatable, target :: aa(:)
      real, device, allocatable, target :: aad(:)
      real, pointer :: a(:), b(:), c(:)
      real, device, pointer :: ad(:), bd(:), cd(:)
      real :: m
      integer :: i
      integer :: err
      type(dim3) :: grid, block

      ! one large host allocation and one large device allocation
      allocate( aa(N*3) )
      allocate( aad(N*3) )

      ! host pointers into the large host array
      a => aa(1:N)
      b => aa(N+1:2*N)
      c => aa(2*N+1:3*N)

      ! device pointers into the large device array
      ad => aad(1:N)
      bd => aad(N+1:2*N)
      cd => aad(2*N+1:3*N)

      a = 3.0
      b = 0.8
      c = 0.0

      ! single host-to-device transfer
      aad = aa

      grid = dim3( (N+255)/256, 1, 1)
      block = dim3( 256, 1, 1)

      print *, "calling kernel"
      call gg<<<grid, block>>>(ad, bd, cd, N)
      err = cudaGetLastError()   ! fetch the launch status so the value printed below is defined
      print *, "error is ", err
      print *, "done calling kernel"

      c = cd

      ! check the results on the CPU
      print *, "checking results on CPU"
      do i = 1, N
        m = a(i) * b(i)
        if ( abs(c(i) - m) > 0.0001 ) then
          print *, "error at index ", i
          stop
        end if
      end do
      print *, "test ran successfully"

      deallocate(aa)
      deallocate(aad)
    end program PointerTest
I am assuming this is because I am attempting to use device pointers. Is there a different way to do this that is supported by PGI Fortran? I have been looking into c_ptr and c_devptr and possible cross-language solutions, but this seems messy and I would like to stay in Fortran if possible.
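For comparison, one pointer-free fallback is to pass the whole pooled device array to the kernel together with integer offsets. This is a minimal sketch, not a confirmed fix for the failure above: the names kernelOffsets, ggOffsets, ia, ib, and ic are hypothetical, the offsets are assumed to be zero-based, and it needs a CUDA Fortran compiler and a GPU to run.

```fortran
module kernelOffsets
contains
  ! Hypothetical variant of gg: one pooled array plus zero-based offsets.
  attributes(global) subroutine ggOffsets(pool, ia, ib, ic, n)
    real :: pool(*)                      ! whole device pool, assumed size
    integer, value :: ia, ib, ic, n
    integer :: i
    i = threadidx%x + (blockidx%x - 1) * blockdim%x
    if (i <= n) pool(ic + i) = pool(ia + i) * pool(ib + i)
  end subroutine ggOffsets
end module kernelOffsets

program OffsetTest
  use cudafor
  use kernelOffsets
  implicit none
  integer, parameter :: N = 1000
  real, allocatable :: aa(:)
  real, device, allocatable :: aad(:)
  integer :: err

  allocate(aa(3*N), aad(3*N))
  aa(1:N)       = 3.0    ! plays the role of a
  aa(N+1:2*N)   = 0.8    ! plays the role of b
  aa(2*N+1:3*N) = 0.0    ! plays the role of c

  aad = aa               ! single host-to-device transfer

  call ggOffsets<<<(N+255)/256, 256>>>(aad, 0, N, 2*N, N)
  err = cudaGetLastError()
  if (err /= cudaSuccess) print *, "launch failed: ", cudaGetErrorString(err)

  aa = aad               ! single device-to-host transfer
  print *, "max abs error:", maxval(abs(aa(2*N+1:3*N) - 3.0*0.8))

  deallocate(aa, aad)
end program OffsetTest
```

The cost of this approach is that every kernel has to carry the offsets explicitly, which is less convenient than separate array arguments.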
Thanks for your time,