Asynchronous copy conundrum

Hi there,

I’m having a few problems with setting up my asynchronous copies. The way in which I am trying to use them is to overlap host computation with the data transfers. The execution path is roughly:


main routine

call to asynchronous data transfer subroutine
.
.
.
call to host computation subroutine
.
.
.
call to device computation subroutine


I don’t get any error when the asynchronous data transfer are carried out but the results I’m getting are incorrect.

After running the program in emulation mode it became apparent that the device code is not using the data that was transferred. I think this may be due to the data transfers and the actual device code being in separate files so the data in the device code doesn’t point to the same location as when the data was initially transferred. I tried passing the device data variables as arguments in to the device code but this still didn’t rectify my problem…

Is there a way of passing a pointer to device memory between subroutines in host memory?

Or, does anyone have any other suggestions on the matter?

Cheers,
Crip_crop

Okay, I did a work around so that the async data transfers happen in the same module as the kernel. This seems to work, to a degree. It compiles but now when I run the executable I get this error:

0: ALLOCATE: 18446744073347163744 bytes requested; not enough memory

This seemed rather large to me so I calculated the amount of pinned, allocatable data I had declared in order to asynchronously copy my arrays. The actual number of bytes I’m attempting to request is more in the region of 70,000.

Is there a standard limit on how much pinned memory is available?

Also, has anyone seen a problem like this before?

Any help would really be appreciated.

Cheers,
Crip_crop

Also, I tried using the -Mlarge_arrays compiler flag and it gave me loads of these errors on compilation:

undefined reference to `pgf90_pinned_allocated_i8’

Crip_crop

Hi Crip_crop,

I’d need so example of what you’re doing to better help, but the very large allocate value seems to indicate that a junk value is being passed to the allocations.

Using asynchronous CUDA Fortran routines can be a bit tricky since your needing to work with CUDA C pointers and not the Fortran allocatable. Using ISO_C_BINDING routines can help. Here’s a basic example from one of our internal QA tests.

subroutine subi1
use cudafor
use test_symbols
use check_mod
integer, parameter :: N = 400
integer*4, device :: x(100)
integer*1, device, allocatable :: fx(:)  !phony x, equivalenced
integer*1  ec(N)
integer*1, allocatable, pinned :: c(:)
type(cudaSymbol) :: cs
logical running_emu
!
if (running_emu()) then
  cs = C_DEVLOC(sdev)
else
  cs = "_test_symbols_16"
endif
call c_f_pointer(C_DEVLOC(x),fx,N)
!
allocate(c(N))
c = z'44'
istat = cudaMemcpyToSymbolAsync(cs,c,N,0,cudaMemcpyHostToDevice,0)
if (istat .ne. 0) print *,"ToSymbolAsync 0, istat =",istat
istat = cudaMemcpyFromSymbolAsync(fx,cs,N,0,cudaMemcpyDeviceToDevice,0)
if (istat .ne. 0) print *,"FromSymbolAsync 1, istat =",istat
c = z'11'
istat = cudaMemcpyToSymbolAsync(cs,c,N,0,cudaMemcpyHostToDevice,0)
if (istat .ne. 0) print *,"ToSymbolAsync 2, istat =",istat
call s1 <<<1, 100>>> (x)
istat = cudaMemcpyToSymbolAsync(cs,fx,N,0,cudaMemcpyDeviceToDevice,0)
if (istat .ne. 0) print *,"ToSymbolAsync 3, istat =",istat
istat = cudaMemcpyFromSymbolAsync(c,cs,N,0,cudaMemcpyDeviceToHost,0)
if (istat .ne. 0) print *,"FromSymbolAsync 4, istat =",istat
ec = z'55'
istat = cudaThreadSynchronize()
deallocate(c)
return
end