Why would multiple compute streams be very inefficient?

I am working on a CFD problem that requires computing the halo, transferring the new halo, and computing the internal halo, all concurrently. I have tried to accelerate this with multiple compute streams to process the internal halo, but the streams are not as efficient as I hoped.
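For context, the overlap pattern I am aiming for looks roughly like this. This is a minimal sketch, not my actual code; the kernel names, grid sizes, and buffer names are placeholders:

```cuda
// Sketch: overlap the halo transfer with internal computation using two streams.
cudaStream_t haloStream, interiorStream;
cudaStreamCreate(&haloStream);
cudaStreamCreate(&interiorStream);

// 1. Compute the halo cells first, in their own stream.
computeHalo<<<haloGrid, block, 0, haloStream>>>(d_field);

// 2. As soon as the halo kernel finishes, copy the new halo out
//    asynchronously in the same stream (stream order guarantees
//    the copy starts only after computeHalo completes).
cudaMemcpyAsync(h_halo, d_halo, haloBytes,
                cudaMemcpyDeviceToHost, haloStream);

// 3. Meanwhile, compute the internal halo in a second stream; this
//    kernel can overlap with the transfer in haloStream.
computeInternal<<<interiorGrid, block, 0, interiorStream>>>(d_field);

// 4. Wait for both streams before the next time step.
cudaStreamSynchronize(haloStream);
cudaStreamSynchronize(interiorStream);
```

Note that for the `cudaMemcpyAsync` to actually overlap with the kernel, `h_halo` must be pinned host memory (allocated with `cudaHostAlloc` or `cudaMallocHost`).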

With 8 compute streams, the 1st stream computes its part of the internal halo quickly, but the 2nd stream takes twice as long, and the remaining streams each take 10 times longer, with very little overlap. The result is little if any acceleration, even though the 1st stream performs well.

Why could this be so?

Is this a synch problem? Or is there a wait for resources to be released from one stream to the next?

If I could upload a screenshot from nvvp…

This problem has been resolved: it was related to the physical problem being simulated, which led to warp divergence in the streams that executed the longest.
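For anyone who hits something similar: warp divergence means threads within the same 32-thread warp take different branches, and the warp executes both paths serially. A hypothetical illustration of how data-dependent physics can cause it (this is not the actual CFD code):

```cuda
// Sketch: data-dependent branching causes warp divergence.
// If neighbouring cells within one warp fall on different sides of
// the condition, the warp pays for BOTH branches.
__global__ void update(float *field, const int *cellType, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (cellType[i] == 0) {
        // Cheap branch: simple explicit update.
        field[i] *= 0.5f;
    } else {
        // Expensive branch: an iterative solve. Warps whose threads
        // disagree on cellType execute this serially after the
        // cheap branch, which stretches the kernel's duration.
        float x = fmaxf(field[i], 1.0f);
        for (int k = 0; k < 100; ++k)
            x = 0.5f * (x + field[i] / x);
        field[i] = x;
    }
}
```

If the expensive cells cluster in the sub-domains assigned to particular streams, those streams' kernels will run much longer than the others, which matches the 2x / 10x pattern described above.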

I’ve been trying to understand why using multiple streams to process data, without any async copies, gives no acceleration.

I’ve used nvvp to time a kernel split across 8, 16, and 32 streams and got this table:

Streams   Avg Duration   Avg Start Diff
8         0.000696       0.0004075
16        0.00052        0.000218
32        0.00041        0.000108

(Times in microseconds)
The difference in start times is very close to linear (double the streams, halve the start-time difference), but although the per-stream duration decreases with the number of streams, the decrease is nowhere near linear, so overall the code actually decelerates.

And even if the duration was linear it wouldn’t be worth it.

Why would this be so?

I was hoping to chop a relatively long computation into multiple compute streams and run those streams in parallel to decrease overall computation time, but this is looking increasingly futile.
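The pattern I am testing is essentially the following. This is a sketch with a placeholder kernel; `N` is the stream count (8, 16, or 32):

```cuda
// Sketch: split one long computation across N streams.
const int N = 8;                        // number of streams
cudaStream_t streams[32];
int chunk = totalElems / N;             // elements per stream

for (int s = 0; s < N; ++s)
    cudaStreamCreate(&streams[s]);

for (int s = 0; s < N; ++s) {
    int offset = s * chunk;
    // Each stream gets 1/N of the work. If a single full-size
    // launch already saturates the GPU, these N smaller launches
    // mostly queue behind each other rather than run concurrently.
    process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, chunk);
}

for (int s = 0; s < N; ++s)
    cudaStreamSynchronize(streams[s]);
```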

Why does the difference in start times between streams seem to depend on the number of streams?

Why is this not a constant?

The machine has a limited capacity. If you run up against a capacity limit, it’s not reasonable to assume that exposing further parallel work will make things run faster.

Furthermore, resource contention will tend to slow parallel work down. If I have a particular code that consumes some memory cycles, and I run two instances of that code concurrently, it’s not reasonable to assume that the memory requests of one are completely unhindered by the memory requests of the other.
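Concretely: if a single launch already fills every SM, splitting the work into N streams just produces N launches that queue behind one another, plus per-launch overhead. You can check this with the occupancy API; a rough sketch (kernel name and block size are illustrative):

```cuda
// Sketch: check whether one launch already saturates the device.
// If blocksPerLaunch >= maxConcurrentBlocks, extra streams cannot
// expose additional parallelism; they only add launch overhead and
// contention for memory bandwidth.
int device, numSMs, blocksPerSM;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

// Ask the runtime how many blocks of `process` fit per SM
// at this block size.
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &blocksPerSM, process, 256 /* threads per block */, 0 /* dyn. smem */);

int maxConcurrentBlocks = numSMs * blocksPerSM;
int blocksPerLaunch = (totalElems / 8 + 255) / 256;  // one of 8 streams

if (blocksPerLaunch >= maxConcurrentBlocks) {
    // The GPU is already full: additional streams will serialize.
}
```

This also explains why the start-time difference shrinks with more streams: smaller launches drain from the queue faster, but the total amount of work, and the bandwidth it needs, is unchanged.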