cudaMemcpyAsync

cudaMemcpy operations issued in the same direction (e.g. host to device) will always serialize. The data will not be “copied in parallel”. This is due to the characteristics of the PCIe bus: only one transfer in a given direction can be in flight at a time.

It’s not really clear what you are trying to accomplish. The usual reason to use the async API is to overlap one of these pairs:

kernel - kernel
memcpy - kernel
memcpy - memcpy (one in one direction, the other in the opposite direction)
host - device (host code keeps running while the device works)
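
As a rough illustration of the memcpy - kernel and memcpy - memcpy cases, here is a minimal sketch that splits one buffer into chunks and gives each chunk its own stream. The kernel `scale`, the `CHECK` macro, and the sizes `kStreams` and `kChunk` are made up for this example. It assumes pinned host memory (`cudaMallocHost`), which async copies need in order to overlap with anything, and a GPU with at least one copy engine per direction. Whether the work actually overlaps depends on the device, so check the timeline in a profiler rather than taking it on faith.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print the error and bail out of main().
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical kernel: multiply each element by f.
__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int kStreams = 4;         // assumed chunk/stream count
    const int kChunk = 1 << 20;     // elements per chunk
    const int n = kStreams * kChunk;
    const size_t chunkBytes = kChunk * sizeof(float);

    float *h = nullptr, *d = nullptr;
    // Pinned host memory: pageable memory prevents real async overlap.
    CHECK(cudaMallocHost((void **)&h, n * sizeof(float)));
    CHECK(cudaMalloc((void **)&d, n * sizeof(float)));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t s[kStreams];
    for (int k = 0; k < kStreams; ++k) CHECK(cudaStreamCreate(&s[k]));

    // Work within one stream runs in order; work in different streams may
    // overlap. The H2D copies still serialize with each other, but the H2D
    // copy of chunk k+1 can run while the kernel of chunk k executes, and
    // while the D2H copy of an earlier chunk goes the other way.
    for (int k = 0; k < kStreams; ++k) {
        size_t off = (size_t)k * kChunk;
        CHECK(cudaMemcpyAsync(d + off, h + off, chunkBytes,
                              cudaMemcpyHostToDevice, s[k]));
        scale<<<(kChunk + 255) / 256, 256, 0, s[k]>>>(d + off, kChunk, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h + off, d + off, chunkBytes,
                              cudaMemcpyDeviceToHost, s[k]));
    }

    // The calls above only enqueue work, so the host is free to do
    // unrelated work here (the host - device case) before it waits.
    CHECK(cudaDeviceSynchronize());

    // Every element started at 1.0f and was scaled by 2.0f.
    int bad = 0;
    for (int i = 0; i < n; ++i) bad += (h[i] != 2.0f);
    printf("%s\n", bad == 0 ? "ok" : "mismatch");

    for (int k = 0; k < kStreams; ++k) CHECK(cudaStreamDestroy(s[k]));
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return bad != 0;
}
```

Note that the host must not read `h` until the synchronize returns, since the device-to-host copies may still be in progress before that point.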

There are many nuances to get this correct. I would suggest that you start by reading the section on asynchronous concurrency in the programming guide.
