Kernel execution time increases 4x when using streams

Hello.

On a rather complex CUDA framework that we are working on, I have migrated our code to use streams.
We achieve very good concurrency, but the execution time of the concurrent kernels increases dramatically, negating the streamification effort.

As can be seen from the following Nsight image, concurrency is good, but each kernel takes around ~400 µs.

In the serial case we see a gap between each kernel invocation (it is actually performing stream synchronization, with all kernels queued in stream 1), but the kernel execution time is only ~100 µs.

The kernels have the exact same configuration (except the stream argument) in both scenarios.
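
To make the comparison concrete, here is a minimal sketch of the two launch patterns, using the launch configuration given further down in the thread. The kernel body, its arguments, and the helper names are placeholders, not the framework's actual gridSample kernel; only the configuration and the stream argument mirror what is described here.

```
#include <cuda_runtime.h>

// Placeholder kernel: stands in for the real BVH-traversal work.
__global__ void gridSampleKernel(const float* in, float* out, int n)
{
    // flatten the {32,4,1} grid / {1,128,1} block into one linear index
    int i = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.y + threadIdx.y;
    if (i < n)
        out[i] = in[i];
}

// Serial case: every launch goes into the same stream, synchronizing in between.
void launchSerial(const float* dIn, float* dOut, int n, int count, cudaStream_t s)
{
    dim3 grid(32, 4, 1), block(1, 128, 1);
    for (int k = 0; k < count; ++k) {
        gridSampleKernel<<<grid, block, 0, s>>>(dIn, dOut, n);
        cudaStreamSynchronize(s);
    }
}

// Streamed case: identical configuration, only the stream argument changes,
// so up to four kernels can be resident on the device at once.
void launchStreamed(const float* dIn, float* dOut, int n, int count, cudaStream_t streams[4])
{
    dim3 grid(32, 4, 1), block(1, 128, 1);
    for (int k = 0; k < count; ++k)
        gridSampleKernel<<<grid, block, 0, streams[k % 4]>>>(dIn, dOut, n);
    for (int s = 0; s < 4; ++s)
        cudaStreamSynchronize(streams[s]);
}
```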

The kernel is quite memory intensive (traversing a BVH), so my hypothesis is that the shared cache is thrashed when the kernels execute concurrently.

This is on a dual-GPU system with a Quadro K6000 and a Tesla K40, using driver 353.62 and CUDA 7.0.

Has anyone observed similar behavior? Is there another explanation for the effect?

- Saturation of the memory controller.

- Thrashing of the L1 caches (assuming concurrent operation of several thread blocks on the same multiprocessor).

- Lots of contention when using global atomic accesses (for that last case, see the sketch below).
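
If that last point applies, the usual mitigation is to keep the contended atomics local and touch global memory only once per block. A purely illustrative sketch (this is not the OP's kernel; all names are made up):

```
#include <cuda_runtime.h>

// Illustrative only: when many threads hammer a single global counter, the
// atomics serialize. Pre-reducing inside the block and issuing one global
// atomic per block limits the contention.
__global__ void countHits(const int* flags, int n, unsigned int* globalCount)
{
    __shared__ unsigned int blockCount;
    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        atomicAdd(&blockCount, 1u);         // contention confined to one block

    __syncthreads();
    if (threadIdx.x == 0)
        atomicAdd(globalCount, blockCount); // one global atomic per block
}
```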

Show us the code - or at least some Nsight kernel metrics for the serial case ;)

From what I can gather, you have 4 streams, yet the device(s) refuse to seat more than 3 kernels concurrently at any time.

These must be ‘fat’ kernels, then.

I also think it might be misleading to take the kernel execution time at face value when kernels run concurrently;
as long as the block execution time, or the average execution time, is comparable, you should be fine.

439 µs / 3 is more or less 146 µs.

These are fat kernels indeed. I might not have been totally clear in my initial message, but the total execution time is higher when I enable four streams than in the serial case. (It is roughly 20% higher, both for a block of nine launches such as the one shown in my images and for the entire experiment as a whole.)

I am unable to show the code, but metrics should be okay. Which Nsight kernel metrics are most important?

How many kernels do you run in both cases?

For both cases, what are your kernel dimensions?

The kernel dimensions are grid = {32,4,1} and block = {1,128,1} for both scenarios, yielding 56.25% occupancy.
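
For reference, 56.25% occupancy on Kepler corresponds to 1152 of the 2048 resident threads an SM can hold, i.e. 9 resident blocks of 128 threads, so something (presumably registers or shared memory) caps residency at 9 blocks. A quick way to double-check what the runtime computes, sketched here with a stub in place of the real kernel:

```
#include <cstdio>
#include <cuda_runtime.h>

// Stub standing in for the real gridSample kernel; point this check at the
// actual kernel to reproduce the 56.25% figure.
__global__ void gridSampleKernel(const float*, float*, int) {}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blocksPerSM = 0;
    const int threadsPerBlock = 128;   // block = {1,128,1}
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, gridSampleKernel, threadsPerBlock, 0 /* dynamic smem */);

    float occupancy = float(blocksPerSM * threadsPerBlock)
                    / float(prop.maxThreadsPerMultiProcessor);
    printf("%d resident blocks/SM -> %.2f%% occupancy\n",
           blocksPerSM, occupancy * 100.0f);
    return 0;
}
```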

For both scenarios I run ~1000 invocations of the gridSample-kernel.

(I also execute ~30K datatype-conversion kernels, and there is also copying to GL textures and copying of data between my two GPUs.)

This is running on a complex visualization and computation framework, where which kernels are executed on which GPU is not deterministic. Breaking it down into a smaller benchmark is meaningless, as it is the effect of introducing streams into the framework as a whole that I am measuring.

That memory-intensive kernels are not suitable for streams, and should not be queued as such, is certainly meaningful information for us. We are happy with the explanation that the memory controller is saturated.

“That memory-intensive kernels are not suitable for streams”

On what grounds? Perhaps you should substantiate that.

“the memory controller is saturated”

I fail to see the difference, in terms of impact on the memory controller, between multiple blocks of a single kernel running on an SM and multiple blocks of multiple kernels running on an SM.
Perhaps someone can explain this to me.

There is no difference. The proof of that is in the execution time - essentially unchanged.

One kernel executes in time X.

Four kernels executing simultaneously complete in a total time of 4X.

The net throughput is the same.

There is no difference.

The problem was that OP was expecting “infinite machine capacity”, which is a surprisingly common expectation with GPUs, for some reason: if I can expose four times as much parallelism, the machine should run four times faster. That is true roughly up until you hit one of the machine limits; then it flatlines (plotting machine throughput vs. exposed parallelism).

It’s not been proven beyond a shadow of a doubt in this case, but I think it’s a likely explanation for the observation (and in light of OP’s statement: “The kernel is quite memory intensive”).

Contrary to the view that memory-intensive kernels are not suitable for streams, I would think that a certain class of memory-intensive kernels is indeed very suitable for streams, particularly when the device memory footprint is significant and memory transfers are significant.
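
For example, overlapping transfers with kernel execution is exactly where streams pay off for memory-heavy work. A sketch with made-up names and sizes (this is not the OP's framework; hIn/hOut are assumed to be pinned host buffers):

```
#include <algorithm>
#include <cuda_runtime.h>

// Illustrative copy/compute overlap across four streams; processChunk stands
// in for any memory-heavy kernel.
__global__ void processChunk(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

void runOverlapped(const float* hIn, float* hOut, int total)  // hIn/hOut: pinned host memory
{
    const int nStreams = 4;
    const int chunk = (total + nStreams - 1) / nStreams;

    cudaStream_t streams[nStreams];
    float *dIn = 0, *dOut = 0;
    cudaMalloc(&dIn,  total * sizeof(float));
    cudaMalloc(&dOut, total * sizeof(float));
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        int n   = std::min(chunk, total - off);
        if (n <= 0) break;
        // within one stream: copy in -> compute -> copy out, in order;
        // across streams: the chunks overlap with each other
        cudaMemcpyAsync(dIn + off, hIn + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        processChunk<<<(n + 127) / 128, 128, 0, streams[s]>>>(dIn + off, dOut + off, n);
        cudaMemcpyAsync(hOut + off, dOut + off, n * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
    cudaFree(dIn);
    cudaFree(dOut);
}
```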

A rough interpretation of the serial-case profile output is that the device only spends 75% of the available time on compute, due to synchronization calls and memory transfers.

The stream-case profile output seems to have a broken memory-transfer pattern and, perhaps as a result, a broken kernel pattern, which may point to a synchronization method that is not ‘optimal’.
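
If the pattern really is being broken up by synchronization, one alternative (sketched here under the assumption that the framework currently forces ordering with device- or stream-wide synchronization between dependent steps) is to express cross-stream dependencies with events, so only the dependent stream waits:

```
#include <cuda_runtime.h>

// Placeholder kernels; in the real framework these would be the dependent steps.
__global__ void producerKernel(float* buf, int n)       { /* placeholder */ }
__global__ void consumerKernel(const float* buf, int n) { /* placeholder */ }

void chainWithEvent(float* dBuf, int n, cudaStream_t producer, cudaStream_t consumer)
{
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    producerKernel<<<(n + 127) / 128, 128, 0, producer>>>(dBuf, n);
    cudaEventRecord(done, producer);

    // only the consumer stream waits for the producer; other streams keep running
    cudaStreamWaitEvent(consumer, done, 0);
    consumerKernel<<<(n + 127) / 128, 128, 0, consumer>>>(dBuf, n);

    cudaEventDestroy(done);
}
```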