None that I know of.
For what purpose or benefit would it serve to “parallelize” data transfer streams? It’s not going to make any set of transfers go faster. In the general case (absent further controls or specifics), it’s not going to make anything finish quicker.
Suppose I had a garden hose. Suppose I had 2 tanks of water, one of which is colored red, and one of which is green. Suppose I empty the red tank of water thru the hose, then the green tank of water. Suppose all that takes 10 seconds. Now suppose that I let both tanks empty into the hose at the same time. (Let’s imagine that both tanks of water are side-by-side, at the same height, for you physics majors.) It will not take any less than 10 seconds to empty both tanks (right?). And we can also observe that if both tanks can empty into the hose at the same time, the red tank will not empty any quicker than it would in the first scenario.
I run into this question from time to time. It puzzles me.
The one scenario that I could imagine possibly being beneficial: if I have a long/large transfer going thru the “pipe”, and I suddenly discover that I have a short transfer, then if the short transfer could start now, it might finish quicker than if it waited for the long transfer to finish. That’s the only scenario where I can see a possible benefit, and it assumes that the short transfer is somehow higher priority than the long transfer. If the long transfer were the higher-priority one, this behavior would be detrimental. Therefore, to properly condition such behavior, a notion of transfer priority would be needed. AFAIK CUDA has no such concept (interruptible transfers would also be needed).
We don’t have much trouble visualizing the idea that the flow of water out of a hose tends to “saturate” the underlying resource - the water flows as fast as it can, roughly speaking. Likewise, it only takes a small mental leap to get to the idea that an H2D (or D2H) transfer probably saturates the underlying resource - the flow of bytes is “as fast as possible”, i.e. as fast as the underlying bus allows.
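If you want to convince yourself, here is a minimal timing sketch (my own illustration, not anything from the question) that compares two H2D copies issued back-to-back in one stream against the same two copies issued in two streams. It assumes pinned host memory (required for `cudaMemcpyAsync` to behave asynchronously) and uses the fact that events recorded in the legacy default stream serialize with work in other blocking streams. On typical hardware both timings come out roughly the same, because both copies contend for the same H2D link:

```cpp
// Sketch: do two concurrent H2D copies beat two serialized ones?
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t N = 256 << 20;                 // 256 MiB per buffer
    char *h1, *h2, *d1, *d2;
    cudaHostAlloc(&h1, N, cudaHostAllocDefault); // pinned host memory
    cudaHostAlloc(&h2, N, cudaHostAllocDefault);
    cudaMalloc(&d1, N);
    cudaMalloc(&d2, N);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    // Case 1: both copies in one stream (serialized, "hose" emptied twice).
    cudaEventRecord(start);
    cudaMemcpyAsync(d1, h1, N, cudaMemcpyHostToDevice, s1);
    cudaMemcpyAsync(d2, h2, N, cudaMemcpyHostToDevice, s1);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("serialized:  %.2f ms\n", ms);

    // Case 2: one copy per stream ("parallel"). Both copies still share
    // the same H2D bus, so expect roughly the same total time as case 1.
    cudaEventRecord(start);
    cudaMemcpyAsync(d1, h1, N, cudaMemcpyHostToDevice, s1);
    cudaMemcpyAsync(d2, h2, N, cudaMemcpyHostToDevice, s2);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("two streams: %.2f ms\n", ms);
    return 0;
}
```

(Where overlapping copies *do* pay off is overlapping a copy with compute, or an H2D copy with a D2H copy on a GPU with two copy engines - different resources, so no contention.)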
IMO, we can extend this idea to kernels’ use of compute resources. I often get the question about concurrent kernels: “Why aren’t these kernels running ‘in parallel’?” But if we apply the same general concept, to a first-order approximation it is neither sensible nor beneficial to expect that two kernels that saturate the underlying resource could, would, or should run “in parallel”. To derive any benefit or sense from the question, it is first necessary to demonstrate that neither kernel saturates any relevant resource. Otherwise the question is mostly illogical, in my view. For the exceptional case I previously outlined, in the case of kernel execution there is CUDA stream priority.
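For completeness, here is a sketch of that stream-priority mechanism, using the actual runtime API calls (`cudaDeviceGetStreamPriorityRange`, `cudaStreamCreateWithPriority`). The kernel names in the comments are hypothetical placeholders. Note that priorities influence kernel block scheduling, not copy-engine transfers - which is exactly why there is no analogous knob for the transfer scenario above:

```cpp
// Sketch: a high-priority stream for a short, urgent kernel alongside a
// low-priority stream for a long-running bulk kernel.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int least, greatest;
    // In CUDA, numerically lower values mean higher priority;
    // this query returns the valid range for the current device.
    cudaDeviceGetStreamPriorityRange(&least, &greatest);
    printf("priority range: least=%d greatest=%d\n", least, greatest);

    cudaStream_t bulk, urgent;
    cudaStreamCreateWithPriority(&bulk,   cudaStreamNonBlocking, least);
    cudaStreamCreateWithPriority(&urgent, cudaStreamNonBlocking, greatest);

    // Hypothetical kernels: as blocks of the low-priority kernel drain,
    // waiting blocks of the high-priority kernel are scheduled first.
    // long_kernel<<<grid, block, 0, bulk>>>(...);
    // short_kernel<<<grid, block, 0, urgent>>>(...);

    cudaStreamDestroy(bulk);
    cudaStreamDestroy(urgent);
    return 0;
}
```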
The GPU is a garden hose. It has a fixed capacity.