Concurrent data transfer on RTX 2080 Ti

Hi all,

does the RTX 2080 Ti support concurrent data transfer (to and from the GPU at the same time)?
For some reason I do not observe transfer overlap in the Nsight Systems timeline, even though the transfers are issued in separate streams, data is transferred from/to paged memory, and there are no dependencies between the streams. deviceQuery.exe reports ‘3 copy engines’, so simultaneous sending and receiving should be possible on this GPU.

What might be going wrong?
I am on Windows 10.

Thank you

Compute/transfer overlap seems to be working OK (there are 3 streams in total: compute, read, write).
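For context, a minimal sketch of the three-stream pattern described above (stream names, buffer sizes, and the kernel are illustrative, not from the original post):

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real compute work.
__global__ void process(float *d, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const size_t N = 1 << 24;
    float *h_in, *h_out, *d_in, *d_out;

    // Pinned host memory is required for the copy engines to overlap transfers.
    cudaMallocHost(&h_in,  N * sizeof(float));
    cudaMallocHost(&h_out, N * sizeof(float));
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));

    cudaStream_t readStream, writeStream, computeStream;
    cudaStreamCreate(&readStream);     // H2D transfers
    cudaStreamCreate(&writeStream);    // D2H transfers
    cudaStreamCreate(&computeStream);  // kernels

    // Independent work in each stream; no cross-stream dependencies,
    // so in principle all three can overlap.
    cudaMemcpyAsync(d_in, h_in, N * sizeof(float),
                    cudaMemcpyHostToDevice, readStream);
    process<<<(N + 255) / 256, 256, 0, computeStream>>>(d_out, N);
    cudaMemcpyAsync(h_out, d_out, N * sizeof(float),
                    cudaMemcpyDeviceToHost, writeStream);

    cudaDeviceSynchronize();
    return 0;
}
```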

Windows 10 command batching in WDDM mode can get in the way of observing the desired concurrency scenarios. Whether or not it is impacting your case, I cannot say.

Hi Robert,

thank you for your helpful reply. Indeed, it seems that calling cuStreamQuery() on the stream where the copy to the GPU takes place, right after enqueuing the copy operations, helps to achieve read/write overlap.

Any recommendations on when I should force software queue flushing by calling cuStreamQuery() to improve performance?
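To make the workaround concrete, here is a hedged sketch of what I mean (function and variable names are mine). cuStreamQuery() / cudaStreamQuery() does not block; it returns immediately with success or not-ready, but on WDDM the query has the side effect of submitting the batched command buffer:

```cpp
#include <cuda_runtime.h>

// Illustrative helper: enqueue an async H2D copy, then query the stream so
// that the WDDM driver submits its batched commands immediately instead of
// waiting for the batching heuristic to trigger.
void copyAndFlush(void *dst, const void *src, size_t bytes,
                  cudaStream_t stream) {
    cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamQuery(stream);  // non-blocking; nudges WDDM to flush the queue
}
```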

My recommendations:
Switch to Linux.
Or switch to a GPU that can be placed in TCC mode on Windows.

OK, thank you. I wish I could…

Regarding using a stream query to flush the command queue, the only thing I can suggest is what has already been commented on. Since you came up with the idea on your own in this thread, I assume you may have already read some of Greg’s description. I can’t offer any advice beyond that.

despite the fact that transfers are done in separate streams, data is transferred from/to paged memory, and there are no dependencies between streams.

If the host memory buffer is pageable, then the CUDA driver has to do one of the following:

  1. If the transfer is H2D and the size is small, the driver copies the data into the command buffer and uses the GPU front end to perform the copy.
  2. If the transfer is H2D and the size is larger than in (1), the driver copies the data from the pageable buffer into a driver-owned pinned buffer and uses the copy engine to transfer it. If the size exceeds the driver-owned buffer, this is repeated multiple times.
  3. If the transfer is D2H, the driver uses the copy engine to copy from the device into a driver-owned pinned system memory buffer and, once that completes, uses the CPU to copy the data to the final host buffer. If the size is larger than the driver-owned buffer, this is repeated multiple times.

In order to get concurrent transfers, the recommended strategy is to pin the host buffer. It may be possible to use two different CPU threads to get the driver to overlap pageable transfers; however, I believe the CUDA driver still has a critical section and will only allow one such memory copy to be active in the driver per context.
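As a sketch of the pinning strategy recommended above (error checking omitted; sizes are placeholders), there are two common ways to obtain a pinned host buffer with the CUDA runtime API:

```cpp
#include <cuda_runtime.h>
#include <stdlib.h>

int main() {
    const size_t bytes = 1 << 20;

    // Option 1: allocate page-locked memory directly.
    float *buf1;
    cudaMallocHost(&buf1, bytes);

    // Option 2: pin an existing host allocation in place.
    float *buf2 = (float *)malloc(bytes);
    cudaHostRegister(buf2, bytes, cudaHostRegisterDefault);

    // ... issue cudaMemcpyAsync() from/to buf1 or buf2; with pinned buffers
    // the driver can hand the transfer straight to a copy engine, without
    // the intermediate staging copies described in (2) and (3) above ...

    cudaHostUnregister(buf2);
    free(buf2);
    cudaFreeHost(buf1);
    return 0;
}
```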

Thank you Greg. It was a typo in my original question. Data is transferred from/to pinned memory, not paged. It seems that calling cuStreamQuery() every now and then sufficiently improves overlapping. The timeline is still not as neat as it would be in TCC mode, but I guess we have to live with it.

How about a dedicated ‘cuFlush’ API function? OpenCL has one (clFlush); is there a good reason CUDA does not? And are there any plans to extend TCC support to RTX Ti-series GPUs? I believe this is a sales/marketing decision rather than a technical one.