Questions about "cudaMemcpyAsync"


I have a CUDA FORTRAN code which has the following structure:

host code

arrays copy from host to device

do i = 1,n
   device code

   arrays copy from device to host (with cudaMemcpyAsync directive)
end do

other host code

In particular, I need to copy an array called t (which contains the time value), calculated on the device, to the host at each time step. The cudaMemcpyAsync call looks like this:

ierr = cudaMemcpyAsync(t,t_dev,1,stream_copy)


  • t and t_dev are 1-element arrays on the host and on the device. The one on the host is declared with the "pinned" attribute;
  • stream_copy is a CUDA stream, declared and created as follows:
integer :: stream_copy

ierr = cudaStreamCreate(stream_copy)

The device's kernels are executed on a stream called "stream_work", created in the same way as the previous one.
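Pulling the pieces described above together, the setup looks roughly like this (a minimal sketch assembled from the post; the loop count and kernel are placeholders, and the stream handles are declared with the cuda_stream_kind kind expected by recent CUDA Fortran compilers):

```fortran
program async_copy_sketch
  use cudafor
  implicit none
  real, pinned, allocatable :: t(:)      ! 1-element pinned host array
  real, device, allocatable :: t_dev(:)  ! 1-element device array
  integer(kind=cuda_stream_kind) :: stream_copy, stream_work
  integer :: ierr, i, n

  allocate(t(1), t_dev(1))
  ierr = cudaStreamCreate(stream_copy)
  ierr = cudaStreamCreate(stream_work)

  n = 100
  do i = 1, n
     ! ... kernel launches on stream_work ...
     ! copy issued on a DIFFERENT stream: no ordering guarantee
     ! with respect to the kernels, so t may hold a stale value
     ierr = cudaMemcpyAsync(t, t_dev, 1, stream_copy)
  end do

  ierr = cudaStreamDestroy(stream_copy)
  ierr = cudaStreamDestroy(stream_work)
end program async_copy_sketch
```

Because the copy and the kernels are on different streams, the runtime is free to overlap them in any order, which is consistent with the repeated values described below.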

What I get is a lot of time values that are identical in chunks of 8-10, since the copy operation takes more time than the computation of the data. Is there any way to improve this? The output I obtain right now is useless, and if I copy a time value from the device to the host at each step with a synchronous copy instead, the execution gets really slow (much slower than the CPU code).

Thanks in advance for your help,


Hi Nicola,

"The device's kernels are executed on a stream called 'stream_work', created in the same way as the previous one."

Your kernel needs to be on the same stream as the copy, otherwise there is no guarantee that the data is on the device when the copy runs.

Asynchronous copies are only useful if you're able to interleave them with your kernel launches. In other words, you launch kernel 1, then asynchronously copy the data for kernel 2 while kernel 1 executes. Once kernel 2 starts, its data should be there. It might be useful to read the following article to see how asynchronous data transfers can help your application.
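The same-stream fix can be sketched as follows (kernel name, launch configuration, and loop count are assumed, not from your post). Operations on a single stream execute in issue order, so each copy is guaranteed to see the value the kernel just wrote, and the host only pays for one synchronization after the loop:

```fortran
! Hedged sketch: kernel and copy on the SAME stream.
do i = 1, n
   call timestep_kernel<<<blocks, threads, 0, stream_work>>>(t_dev)
   ierr = cudaMemcpyAsync(t, t_dev, 1, stream_work)  ! same stream as the kernel
end do
ierr = cudaStreamSynchronize(stream_work)            ! wait once, after the loop
```

If you need each t value on the host before the loop ends (e.g. to write it to a file every step), you would instead synchronize on stream_work inside the loop, which serializes the copy against the kernels but keeps the values correct.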

Best Regards,