cudaMemcpyAsync HtoD and DtoH blocking each other

I noticed performance problems when using cudaMemcpyAsync on different streams in different threads.

In Thread 1 I do a big DtoH cudaMemcpyAsync while starting a small HtoD cudaMemcpyAsync on Thread 2.

For some reason the small HtoD memcpy waits until the big DtoH memcpy is finished, which costs me about 20 ms without any benefit to the second thread.

I boiled it down to a quirk of my GPU, an RTX 3070 Mobile (PCIe 4.0 x8). Reading its device properties, it reports “deviceOverlap == 1” but also “asyncEngineCount == 1”. This seems to be the problem. Why does my GPU appear to be unable to do full-duplex PCIe transfers?

With only one asynchronous copy engine, data transfer is limited to one direction at a time, HtoD or DtoH. If this is problematic, then the two options are as follows:

  1. Write simple copy kernels to perform the DtoH or HtoD transfer. This requires that the host-side buffer be in pinned system memory (on Windows) or, on Linux, that the system supports UVM.
  2. Reduce the HtoD size to very small sizes, in which case the driver will use a different copy path than the asynchronous copy engine. The maximum size is not documented, but you can start experimenting around 10 KiB per HtoD.
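Option 1 above can be sketched as a grid-stride copy kernel that writes directly into pinned, mapped host memory. This is only an illustrative sketch under those assumptions; the kernel name, launch configuration, and helper function are mine, not a documented recipe:

```cuda
// Hypothetical sketch of option 1: a copy kernel reads device memory and
// writes to a pinned host buffer that is mapped into the device address space.
__global__ void copy_kernel(const char* __restrict__ src,
                            char* __restrict__ dst, size_t n)
{
    // Grid-stride loop so any launch configuration covers the whole buffer.
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        dst[i] = src[i];
}

// Host side: allocate pinned, mapped memory and launch on the given stream.
void dtoh_via_kernel(const char* d_src, size_t n, cudaStream_t stream)
{
    char* h_dst;
    // cudaHostAllocMapped makes the host buffer addressable from the device.
    cudaHostAlloc((void**)&h_dst, n, cudaHostAllocMapped);
    char* d_view;
    cudaHostGetDevicePointer((void**)&d_view, h_dst, 0);
    copy_kernel<<<256, 256, 0, stream>>>(d_src, d_view, n);
    // h_dst is valid after the stream synchronizes; free with cudaFreeHost().
}
```

Because this is an ordinary kernel launch rather than a copy-engine transfer, it does not serialize against DtoH/HtoD traffic on the single async engine.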

If latency is a problem then the recommended approach would be to ensure that no copy is so large as to hold off the other stream of copies. This can be done by limiting the size of each copy.
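One way to sketch that size-limiting idea is to split a large transfer into fixed-size chunks queued back to back on the same stream, so copies on other streams can interleave between chunks. The chunk size here is an assumption to tune, not a documented value:

```cuda
// Hypothetical sketch: break one large DtoH copy into chunks so that copies
// queued on other streams can slot in between the pieces.
const size_t CHUNK = 4 << 20;  // 4 MiB per piece (illustrative assumption)

void chunked_dtoh(void* h_dst, const void* d_src, size_t n, cudaStream_t s)
{
    for (size_t off = 0; off < n; off += CHUNK) {
        size_t len = (n - off < CHUNK) ? (n - off) : CHUNK;
        cudaMemcpyAsync((char*)h_dst + off, (const char*)d_src + off,
                        len, cudaMemcpyDeviceToHost, s);
    }
}
```

The chunks still execute in order on stream `s`, but the worst-case wait for a copy on another stream drops from the full transfer time to roughly one chunk's transfer time.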

All host memories are pinned. The big DtoH transfer is around 250 MB; the (blocked) small HtoD transfer is 800 bytes.

Yes, I could work around this delay. But that would be a patch for a single instance of an opaque problem. My main questions in this case are:

  • Why does the GPU report deviceOverlap true while having only one async copy engine?
  • Why does a 3070 have only one async copy engine? Shouldn’t it have more?

The definition for device overlap is:

Device can concurrently copy memory and execute a kernel.

Your GPU can do that.

There is no specification for the number of async engines that an RTX 3070 Mobile device has. Therefore there is no public definition for what it “should” have, other than what is reported by cudaGetDeviceProperties(). It is not the only GPU with a single async engine; other GPUs have had this characteristic as well.
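For reference, both properties discussed in this thread can be read with cudaGetDeviceProperties(); a minimal sketch for device 0:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // deviceOverlap: device can overlap a memcpy with kernel execution
    // asyncEngineCount: number of asynchronous copy engines
    printf("deviceOverlap    = %d\n", prop.deviceOverlap);
    printf("asyncEngineCount = %d\n", prop.asyncEngineCount);
    return 0;
}
```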

Ok, then I misunderstood the device overlap flag. Thank you for the correction.

I read somewhere that multiple async engines have been standard since Fermi. But if this isn’t the case, I have to find a suitable and general workaround.

My (blocked) HtoD transfer is 800 bytes. The driver still seems to use the async engine for this transfer. Can you tell me more about your second approach? Can I force the driver to use another copy method, or should I use a synchronous cudaMemcpy for small data?