Maxwell: overlapping data transfers

Hi everyone. I'm a newbie on the forum and I hope you can help me with my question. I've recently developed an application that uses CUDA streams with the aim of overlapping computation and data transfers, and I've run it on an NVIDIA GPU with the Maxwell architecture. With the Visual Profiler tool I've observed that some HostToDevice data transfers occur at the same time. Maxwell GPUs only have 2 copy engines: one copy engine handles the HostToDevice transfers and the other handles the DeviceToHost transfers, right? With this in mind, I'd think that two HostToDevice transfers can't occur at the same time. However, I've observed with Visual Profiler that this behaviour appears in my application. So my question is: on this architecture, is it possible for two HostToDevice (or two DeviceToHost) data transfers to occur at the same time?
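
The structure of my application is roughly the following sketch (simplified, not my actual code; the kernel, buffer sizes and stream count are just placeholders):

```
// simplified sketch of the streamed pattern described above (not the actual
// application): per stream, one async HtoD copy, a kernel, one async DtoH copy
#include <cuda_runtime.h>

__global__ void scale(float *d, int n)          // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int nStreams = 2;
    const int n = 1 << 22;                      // placeholder problem size
    const size_t bytes = n * sizeof(float);

    float *h[nStreams], *d[nStreams];
    cudaStream_t s[nStreams];

    for (int i = 0; i < nStreams; ++i) {
        cudaMallocHost((void **)&h[i], bytes);  // pinned host memory, needed for truly async copies
        cudaMalloc((void **)&d[i], bytes);
        cudaStreamCreate(&s[i]);
    }

    for (int i = 0; i < nStreams; ++i) {
        cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]);
        scale<<<(n + 255) / 256, 256, 0, s[i]>>>(d[i], n);
        cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; ++i) {
        cudaStreamDestroy(s[i]);
        cudaFree(d[i]);
        cudaFreeHost(h[i]);
    }
    return 0;
}
```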

Thank you so much.

i do not know the direct answer to your question

initially it seems like a trivial question, but the more i think about it, the more it appears otherwise

“I've observed with Visual Profiler that this behaviour appears in my application”

if one ‘attacks’ this from the vantage point of hypothetical cases:
a) the visual profiler is wrong, and this does not actually occur
b) your eyes are deceiving you
c) neither a) nor b); this is actually occurring

a) seems unlikely; b) seems possible; c) seems likely
a lot can be said about c) as well

memory copies may imply moving bytes from source to destination, but they may equally involve a lot more than that; hence, the actual time spent moving the bytes may be only a fraction of the total time reported under a memory copy

a) if the host memory is not pinned, some intervention must occur at some point to ensure it is resident (see the pinned-memory sketch below)

b) the memory copies may involve mutex locks or similar mechanisms to ensure data synchronization/ consistency on the host

c) the memory copies would share a common pci bus, but more importantly, a common pci bridge/ hub/ controller; i am not sure how that would react to multiple requests to access host memory at different locations; the transfers are likely interleaved, or ‘serialized’/ scheduled, i would think

hence, the transfers are shown to occur concurrently, but do they truly occur concurrently - i.e. to the last byte?
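
as a rough sketch of point a), the difference between pageable and pinned host memory for async copies would look something like this (buffer names and sizes are illustrative only):

```
// sketch for point a): an async copy from pageable memory has to be staged
// through a driver-side pinned buffer, and that extra work is billed to the
// memcpy; pinning the host buffer up front avoids the staging step
#include <cuda_runtime.h>
#include <cstdlib>

int main()
{
    const size_t bytes = 64 << 20;              // 64 MB, just an example
    float *d;
    cudaMalloc((void **)&d, bytes);

    // pageable host memory: cudaMemcpyAsync falls back to a staged,
    // effectively synchronous copy
    float *pageable = (float *)malloc(bytes);

    // pinned host memory: the copy engine can DMA directly from the buffer,
    // so the copy can run asynchronously with respect to the host
    float *pinned;
    cudaMallocHost((void **)&pinned, bytes);

    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMemcpyAsync(d, pageable, bytes, cudaMemcpyHostToDevice, s);
    cudaMemcpyAsync(d, pinned,   bytes, cudaMemcpyHostToDevice, s);
    cudaStreamSynchronize(s);

    cudaStreamDestroy(s);
    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(d);
    return 0;
}
```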

As far as I know, although PCIe transfers are packetized, PCIe does not allow more than one transfer in one direction at any one time. If you have excluded the possibility of misinterpreting the output of the visual profiler, an indication of multiple simultaneous transfers in the same direction would appear to point to a possible bug in the profiler. It may be a GUI issue; have you checked the actual time stamps?

Thank you so much for your help. I will keep all your advice in mind.

Hello! I've checked the time stamps of the different memcpy operations (both HTD and DTH), and I've written a simpler test code to rule out possible errors in my own code. I've run this code under the Visual Profiler and observed the same behaviour. For example, there are two HTD transfers which occur at the same time: the first HTD transfer starts at 2.632 ms and ends at 7.871 ms, and the second HTD transfer starts at 3.689 ms and ends at 11.851 ms. The total execution time reported by the Visual Profiler is the same as when I run my application from the command line, so I think the timing information reported by the profiler is correct. However, I still don't understand why this behaviour appears. I've also run this test with the CUDA streams sample code from the CUDA SDK, and the same thing happens. Could there be some wrong configuration on my machine or my GPU? Thanks

during debugging, i have empirically observed enough to be quite willing to second the hypothesis that same-direction memory copies get interleaved

you may be able to examine this further by comparing the time it takes to complete the memory copies sequentially (one after the other, in the same stream) with the time it takes to complete the memory copies concurrently (as is generally your current case)

i would think you might find that, when issued sequentially, the individual memory copies each complete faster, but that the overall time is comparable to when the memory copies are issued concurrently
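
something along these lines (a rough sketch; buffer sizes and the event-based timing are just one way to set up the comparison, and it assumes the legacy default stream):

```
// sketch of the suggested comparison: two HtoD copies issued into one stream
// versus the same two copies issued into two streams (assumes the legacy
// default stream, so events recorded on stream 0 fence the other streams)
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t bytes = 64 << 20;              // 64 MB per buffer, just an example
    float *h0, *h1, *d0, *d1;
    cudaMallocHost((void **)&h0, bytes);        // pinned host buffers
    cudaMallocHost((void **)&h1, bytes);
    cudaMalloc((void **)&d0, bytes);
    cudaMalloc((void **)&d1, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msSameStream = 0.0f, msTwoStreams = 0.0f;

    // case 1: both copies back-to-back in the same stream
    cudaEventRecord(start, 0);
    cudaMemcpyAsync(d0, h0, bytes, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(d1, h1, bytes, cudaMemcpyHostToDevice, s0);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msSameStream, start, stop);

    // case 2: the same two copies issued into two different streams
    cudaEventRecord(start, 0);
    cudaMemcpyAsync(d0, h0, bytes, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(d1, h1, bytes, cudaMemcpyHostToDevice, s1);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msTwoStreams, start, stop);

    // expectation: comparable overall time, since both copies go through the
    // single HtoD copy engine and share one PCIe link
    printf("same stream: %.3f ms, two streams: %.3f ms\n", msSameStream, msTwoStreams);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFreeHost(h0);
    cudaFreeHost(h1);
    cudaFree(d0);
    cudaFree(d1);
    return 0;
}
```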

If you are using streams, you are likely using the Async versions of the cudaMemcpy API. VP reports the beginning of the operation as the point at which the Async call was made. VP reports the end of the operation as the point at which the transfer actually completed.

There is no actual transfer overlap on the PCIe bus. What overlaps is the pending operations, from the point at which they were issued to the point at which they completed.

You can issue cudaStreamSynchronize ahead of the cudaMemcpyAsync call. The stream sync will absorb all of the pending time, the Async memcpy call will then shrink down to its actual transfer duration, and there will be no observed overlap between the Async memcpy calls in VP.
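
A minimal sketch of that pattern (the buffer names and the helper function are illustrative, not from any particular codebase):

```
// sketch: a stream sync issued right before the async copy absorbs the
// pending/queued time, so the profiled span of the memcpy itself shrinks
// to roughly the actual transfer duration
#include <cuda_runtime.h>

void copyWithoutPendingOverlap(float *d_buf, const float *h_buf,
                               size_t bytes, cudaStream_t stream)
{
    cudaStreamSynchronize(stream);   // wait until the stream is idle
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
}
```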