cudaMemcpyAsync HtoD and DtoH blocking each other

I noticed performance problems when using cudaMemcpyAsync on different streams in different threads.

In Thread 1 I do a big DtoH cudaMemcpyAsync while starting a small HtoD cudaMemcpyAsync on Thread 2.

For some reason the small HtoD memcpy waits until the big DtoH memcpy is finished, which costs me about 20 ms without any benefit to the second thread.

I boiled it down to a quirk of my GPU, an RTX 3070 Mobile (PCIe 4.0 x8). Reading its device properties, it reports “deviceOverlap == 1” but also “asyncEngineCount == 1”. This seems to be the problem. Why does my GPU appear to be unable to do full-duplex PCIe transfers?

With only one asynchronous copy engine, data transfer is limited to one direction at a time, HtoD or DtoH. If this is problematic, then the two options are as follows:

  1. Write simple copy kernels to perform the DtoH or HtoD transfer. This requires that the host-side buffer be in pinned system memory (on Windows) or, on Linux, that the system supports UVM.
  2. Reduce the HtoD size to very small sizes, in which case the driver will use a different copy path than the asynchronous copy engine. The maximum size is not documented, but you can start experimenting around 10 KiB per HtoD.
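Option 1 above can be sketched as a grid-stride copy kernel that writes directly into pinned, mapped host memory. This is only an illustrative sketch under those assumptions; the kernel name, launch configuration, and helper function are mine, not a documented recipe:

```cuda
// Hypothetical sketch of option 1: a copy kernel reads device memory and
// writes to a pinned host buffer that is mapped into the device address space.
__global__ void copy_kernel(const char* __restrict__ src,
                            char* __restrict__ dst, size_t n)
{
    // Grid-stride loop so any launch configuration covers the whole buffer.
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        dst[i] = src[i];
}

// Host side: allocate pinned, mapped memory and launch on the given stream.
void dtoh_via_kernel(const char* d_src, size_t n, cudaStream_t stream)
{
    char* h_dst;
    // cudaHostAllocMapped makes the host buffer addressable from the device.
    cudaHostAlloc((void**)&h_dst, n, cudaHostAllocMapped);
    char* d_view;
    cudaHostGetDevicePointer((void**)&d_view, h_dst, 0);
    copy_kernel<<<256, 256, 0, stream>>>(d_src, d_view, n);
    // h_dst is valid after the stream synchronizes; free with cudaFreeHost().
}
```

Because this is an ordinary kernel launch rather than a copy-engine transfer, it does not serialize against DtoH/HtoD traffic on the single async engine.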

If latency is a problem then the recommended approach would be to ensure that no copy is so large as to hold off the other stream of copies. This can be done by limiting the size of each copy.
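One way to sketch that size-limiting idea is to split a large transfer into fixed-size chunks queued back to back on the same stream, so copies on other streams can interleave between chunks. The chunk size here is an assumption to tune, not a documented value:

```cuda
// Hypothetical sketch: break one large DtoH copy into chunks so that copies
// queued on other streams can slot in between the pieces.
const size_t CHUNK = 4 << 20;  // 4 MiB per piece (illustrative assumption)

void chunked_dtoh(void* h_dst, const void* d_src, size_t n, cudaStream_t s)
{
    for (size_t off = 0; off < n; off += CHUNK) {
        size_t len = (n - off < CHUNK) ? (n - off) : CHUNK;
        cudaMemcpyAsync((char*)h_dst + off, (const char*)d_src + off,
                        len, cudaMemcpyDeviceToHost, s);
    }
}
```

The chunks still execute in order on stream `s`, but the worst-case wait for a copy on another stream drops from the full transfer time to roughly one chunk's transfer time.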

All host memories are pinned. The big DtoH transfer is around 250 MB; the (blocked) small HtoD transfer is 800 bytes.

Yes, I could work around this delay. But that would be a patch for a single instance of an opaque problem. My main questions in this case are:

  • Why does the GPU report deviceOverlap true while having only one async copy engine?
  • Why does a 3070 have only one async copy engine? Shouldn’t it have more?

The definition for device overlap is:

Device can concurrently copy memory and execute a kernel.

Your GPU can do that.

There is no specification for the number of async engines that an RTX 3070 Mobile device has. Therefore there is no public definition for what it “should” have, other than what is reported by cudaGetDeviceProperties(). It is not the only GPU with a single async engine; other GPUs have had this characteristic as well.
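For reference, both properties discussed in this thread can be read with cudaGetDeviceProperties(); a minimal sketch for device 0:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // deviceOverlap: device can overlap a memcpy with kernel execution
    // asyncEngineCount: number of asynchronous copy engines
    printf("deviceOverlap    = %d\n", prop.deviceOverlap);
    printf("asyncEngineCount = %d\n", prop.asyncEngineCount);
    return 0;
}
```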

Ok, then I misunderstood the device overlap flag. Thank you for the correction.

I read somewhere that multiple async engines have been standard since Fermi. But if this isn’t the case, I have to find a suitable and general workaround.

My (blocked) HtoD transfer is 800 bytes. The driver still seems to use the async engine for this transfer. Can you tell me more about your second approach? Can I force the driver to use another copy method, or should I use a synchronous cudaMemcpy for small data?