Possibility of parallelizing H2D transfers with streams

I have read in multiple places that most GPUs have only one H2D and one D2H copy engine. If I have two matrices to transfer from the host to the device, does this make it impossible to transfer them in parallel?

I did try it out, and there were no errors. Both matrices were transferred in different streams with cudaMemcpyAsync (but I could not visualize the transfers in nvvp or nvprof due to some unknown error).

Does this mean they were transferred in parallel by the same copy engine, or did the memory copies occur serially even though they were issued in two different streams?
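For reference, here is a stripped-down sketch of what I tried (buffer names and sizes are placeholders, error checking omitted):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    float *hA, *hB, *dA, *dB;
    cudaMallocHost(&hA, bytes);   // pinned host buffers, required for
    cudaMallocHost(&hB, bytes);   // cudaMemcpyAsync to be truly asynchronous
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Issue the two H2D copies in different streams.
    cudaMemcpyAsync(dA, hA, bytes, cudaMemcpyHostToDevice, s1);
    cudaMemcpyAsync(dB, hB, bytes, cudaMemcpyHostToDevice, s2);

    cudaDeviceSynchronize();
    printf("done\n");

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(dA); cudaFree(dB);
    cudaFreeHost(hA); cudaFreeHost(hB);
    return 0;
}
```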

Generally it means that they were transferred serially.

I understand. Is there any way to parallelize data transfers across streams?

None that I know of.

What purpose or benefit would it serve to “parallelize” data transfer streams? It’s not going to make any set of transfers go faster. In the general case (without further controls or specifics), it’s not going to make anything finish quicker.

Suppose I had a garden hose. Suppose I had 2 tanks of water, one of which is colored red, and one of which is green. Suppose I empty the red tank of water through the hose, then the green tank of water. Suppose all that takes 10 seconds. Now suppose that I let both tanks empty into the hose at the same time. (Let’s imagine that both tanks of water are side-by-side, at the same height, for you physics majors.) It will not take any less than 10 seconds to empty both tanks (right?). And we can also observe that if both tanks empty into the hose at the same time, the red tank will not empty any quicker than it would in the first scenario.

I run into this question from time to time. It puzzles me.

The one scenario that I could imagine possibly being beneficial is this: if I have a long/large transfer that is going through the “pipe”, and I suddenly discover that I have a short transfer, then if the short transfer could start now, it might finish sooner than it would if it waited for the long transfer to finish. That’s the only scenario where I could see a possible benefit, and it assumes that the short transfer is somehow higher priority than the long transfer. If the longer transfer were of higher priority, this behavior would be detrimental. Therefore, to properly condition such behavior, a sense of transfer priority would be needed. AFAIK CUDA has no such concept. (Interruptible transfers would also be needed.)

We don’t have much trouble visualizing the idea that the flow of water out of a hose tends to “saturate” the underlying resource - the water flows as fast as it can, roughly speaking. Likewise, it only takes a small mental leap to get to the idea that an H2D (or D2H) transfer probably saturates the underlying resource - the flow of bytes is “as fast as possible”, i.e. as fast as the underlying bus allows.

IMO, we can extend this idea to kernels’ use of compute resources. I often get the question about concurrent kernels: “Why aren’t these kernels running ‘in parallel’?” But if we apply the same general concept, to a first-order approximation, it is neither sensible nor beneficial to expect that two kernels that saturate the underlying resource either could, or would, or should run “in parallel”. To derive any benefit or sense from the question, it is first necessary to demonstrate that neither kernel saturates any relevant resource. Otherwise the question is mostly illogical, in my view. For the exceptional case I previously outlined, in the case of kernel execution there is CUDA stream priority.
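As a rough sketch of that kernel-side mechanism (illustrative names, no error checking; this applies to kernel scheduling, not to copies):

```cpp
#include <cuda_runtime.h>

__global__ void work(float *p) { p[threadIdx.x] = 1.0f; }

int main() {
    int leastPri, greatestPri;
    // Query the valid priority range; numerically lower values mean higher priority.
    cudaDeviceGetStreamPriorityRange(&leastPri, &greatestPri);

    cudaStream_t lowPri, highPri;
    cudaStreamCreateWithPriority(&lowPri,  cudaStreamNonBlocking, leastPri);
    cudaStreamCreateWithPriority(&highPri, cudaStreamNonBlocking, greatestPri);

    float *d;
    cudaMalloc(&d, 1024 * sizeof(float));

    // Pending blocks from the high-priority stream are scheduled ahead of
    // pending blocks from the low-priority stream (running blocks are not preempted).
    work<<<1024, 256, 0, lowPri>>>(d);
    work<<<8, 256, 0, highPri>>>(d);

    cudaDeviceSynchronize();
    return 0;
}
```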

The GPU is a garden hose. It has a fixed capacity.

Right, thank you for the detailed explanation. I believe those asking this question meant to ask whether, in your words, there are two garden hoses available, the hoses being the buses rather than the GPUs. I failed to realize that one transfer would completely saturate the hose, so it makes sense to me now. But if more hoses, i.e. transfer buses, were available, it would obviously make a difference. It’s a hardware restriction, so there is no point in questioning it further. Again, thank you for your explanation!

To stay with the analogy: since the PCIe interconnect is full duplex, it is the equivalent of a two-hose connection. But there is only one hose in each direction (device->host, host->device).

Technical progress consists of providing “larger-diameter hoses”. From PCIe 3 with about 12 GB/sec per direction we progressed to PCIe 4 with about 25 GB/sec per direction, and soon we will have PCIe 5, which doubles throughput yet again.
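To make the full-duplex point concrete, here is a minimal sketch (buffer names and sizes are illustrative, error checking omitted): on a device with separate H2D and D2H copy engines, a host-to-device copy and a device-to-host copy issued in different streams can overlap, one per “hose”.

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;   // 64 MB each way

    float *hIn, *hOut, *dIn, *dOut;
    cudaMallocHost(&hIn,  bytes);    // pinned host memory is required for
    cudaMallocHost(&hOut, bytes);    // asynchronous, overlappable copies
    cudaMalloc(&dIn,  bytes);
    cudaMalloc(&dOut, bytes);

    cudaStream_t up, down;
    cudaStreamCreate(&up);
    cudaStreamCreate(&down);

    // One copy in each direction, each in its own stream:
    cudaMemcpyAsync(dIn,  hIn,  bytes, cudaMemcpyHostToDevice, up);
    cudaMemcpyAsync(hOut, dOut, bytes, cudaMemcpyDeviceToHost, down);

    cudaDeviceSynchronize();

    cudaStreamDestroy(up);
    cudaStreamDestroy(down);
    cudaFree(dIn); cudaFree(dOut);
    cudaFreeHost(hIn); cudaFreeHost(hOut);
    return 0;
}
```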

