Kernel execution time increases 4x when using streams

Hello.

On a rather complex CUDA framework that we are working on, I have migrated our code to use streams.
We achieve very good concurrency, but the execution time of the concurrent kernels increases dramatically, negating the streamification effort.

As can be seen from the following Nsight image, concurrency is good, but each kernel takes around ~400 µs.

In the serial case we see a gap between each kernel invocation (it is actually performing stream synchronization, with all kernels queued in stream 1), but the kernel execution time is only ~100 µs.

The kernels have the exact same configuration (except the stream argument) in both scenarios.
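
To make the comparison concrete, here is a minimal sketch of the two launch patterns, using the launch configuration given further down in the thread. The kernel body, its arguments, and the helper names are placeholders, not the framework's actual gridSample kernel; only the configuration and the stream argument mirror what is described here.

```
#include <cuda_runtime.h>

// Placeholder kernel: stands in for the real BVH-traversal work.
__global__ void gridSampleKernel(const float* in, float* out, int n)
{
    // flatten the {32,4,1} grid / {1,128,1} block into one linear index
    int i = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.y + threadIdx.y;
    if (i < n)
        out[i] = in[i];
}

// Serial case: every launch goes into the same stream, synchronizing in between.
void launchSerial(const float* dIn, float* dOut, int n, int count, cudaStream_t s)
{
    dim3 grid(32, 4, 1), block(1, 128, 1);
    for (int k = 0; k < count; ++k) {
        gridSampleKernel<<<grid, block, 0, s>>>(dIn, dOut, n);
        cudaStreamSynchronize(s);
    }
}

// Streamed case: identical configuration, only the stream argument changes,
// so up to four kernels can be resident on the device at once.
void launchStreamed(const float* dIn, float* dOut, int n, int count, cudaStream_t streams[4])
{
    dim3 grid(32, 4, 1), block(1, 128, 1);
    for (int k = 0; k < count; ++k)
        gridSampleKernel<<<grid, block, 0, streams[k % 4]>>>(dIn, dOut, n);
    for (int s = 0; s < 4; ++s)
        cudaStreamSynchronize(streams[s]);
}
```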

The kernel is quite memory intensive (traversing a BVH), so my hypothesis is that the shared cache is thrashed when the kernels execute concurrently.

This is on a dual-GPU system with a Quadro K6000 and a Tesla K40, using driver 353.62 and CUDA 7.0.

Has anyone observed similar behavior? Is there another explanation for the effect?

- Saturation of the memory controller.

- Thrashing of the L1 caches (assuming concurrent operation of several thread blocks on the same multiprocessor).

- Lots of contention when using global atomic accesses (for that last case, see the sketch below).
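
If that last point applies, the usual mitigation is to keep the contended atomics local and touch global memory only once per block. A purely illustrative sketch (this is not the OP's kernel; all names are made up):

```
#include <cuda_runtime.h>

// Illustrative only: when many threads hammer a single global counter, the
// atomics serialize. Pre-reducing inside the block and issuing one global
// atomic per block limits the contention.
__global__ void countHits(const int* flags, int n, unsigned int* globalCount)
{
    __shared__ unsigned int blockCount;
    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        atomicAdd(&blockCount, 1u);         // contention confined to one block

    __syncthreads();
    if (threadIdx.x == 0)
        atomicAdd(globalCount, blockCount); // one global atomic per block
}
```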

Show us the code - or at least some Nsight kernel metrics for the serial case ;)

From what I can gather, you have 4 streams, yet the device(s) refuse to seat more than 3 kernels concurrently at any time.

These must be ‘fat’ kernels, then.

I also think it might be misleading to take the kernel execution time at face value when kernels run concurrently;
as long as the block execution time, or the average execution time, is comparable, you should be fine.

439 µs / 3 is more or less 146 µs.

These are fat kernels indeed. I might not have been totally clear in my initial message, but the total execution time is higher when I enable four streams than in the serial case. (It is roughly 20% higher, both for a block of nine launches such as the one shown in my images and for the entire experiment as a whole.)

I am unable to show the code, but metrics should be okay. Which Nsight kernel metrics are most important?

How many kernels do you run in both cases?

For both cases, what are your kernel dimensions?

The kernel dimensions are grid = {32,4,1} and block = {1,128,1} for both scenarios, yielding 56.25% occupancy.
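
For reference, 56.25% occupancy on Kepler corresponds to 1152 of the 2048 resident threads an SM can hold, i.e. 9 resident blocks of 128 threads, so something (presumably registers or shared memory) caps residency at 9 blocks. A quick way to double-check what the runtime computes, sketched here with a stub in place of the real kernel:

```
#include <cstdio>
#include <cuda_runtime.h>

// Stub standing in for the real gridSample kernel; point this check at the
// actual kernel to reproduce the 56.25% figure.
__global__ void gridSampleKernel(const float*, float*, int) {}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blocksPerSM = 0;
    const int threadsPerBlock = 128;   // block = {1,128,1}
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, gridSampleKernel, threadsPerBlock, 0 /* dynamic smem */);

    float occupancy = float(blocksPerSM * threadsPerBlock)
                    / float(prop.maxThreadsPerMultiProcessor);
    printf("%d resident blocks/SM -> %.2f%% occupancy\n",
           blocksPerSM, occupancy * 100.0f);
    return 0;
}
```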

For both scenarios I run ~1000 invocations of the gridSample-kernel.

(I also execute ~30K datatype-conversion kernels, and there is also copying to GL textures and copying of data between my two GPUs.)

This is running on a complex visualization and computation framework, where which kernels are executed on which GPU is not deterministic. Breaking it down into a smaller benchmark is meaningless, as it is the effect of introducing streams into the framework as a whole that I am measuring.

That memory-intensive kernels are not suitable for streams, and should not be queued as such, is certainly meaningful information for us. We are happy with the explanation that the memory controller is saturated.

“That memory-intensive kernels are not suitable for streams”

On what grounds? Perhaps you should substantiate that.

“the memory controller is saturated”

I fail to see the difference, in terms of impact on the memory controller, between multiple blocks of a single kernel running on an SM and multiple blocks of multiple kernels running on an SM.
Perhaps someone can explain this to me.

There is no difference. The proof of that is in the execution time - essentially unchanged.

One kernel executes in time X.

Four kernels executing simultaneously complete in a total time of 4X.

The net throughput is the same.

There is no difference.

The problem was that OP was expecting “infinite machine capacity”, which is a surprisingly common expectation with GPUs, for some reason: if I can expose four times as much parallelism, the machine should run four times faster. That is true roughly up until you hit one of the machine limits; then it flatlines (plotting machine throughput vs. exposed parallelism).

It’s not been proven beyond a shadow of a doubt in this case, but I think it’s a likely explanation for the observation (and in light of OP’s statement: “The kernel is quite memory intensive”).

Contrary to the view that memory-intensive kernels are not suitable for streams, I would think that a certain class of memory-intensive kernels is indeed very suitable for streams, particularly when the device memory footprint is significant and memory transfers are significant.
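
For example, overlapping transfers with kernel execution is exactly where streams pay off for memory-heavy work. A sketch with made-up names and sizes (this is not the OP's framework; hIn/hOut are assumed to be pinned host buffers):

```
#include <algorithm>
#include <cuda_runtime.h>

// Illustrative copy/compute overlap across four streams; processChunk stands
// in for any memory-heavy kernel.
__global__ void processChunk(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

void runOverlapped(const float* hIn, float* hOut, int total)  // hIn/hOut: pinned host memory
{
    const int nStreams = 4;
    const int chunk = (total + nStreams - 1) / nStreams;

    cudaStream_t streams[nStreams];
    float *dIn = 0, *dOut = 0;
    cudaMalloc(&dIn,  total * sizeof(float));
    cudaMalloc(&dOut, total * sizeof(float));
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        int n   = std::min(chunk, total - off);
        if (n <= 0) break;
        // within one stream: copy in -> compute -> copy out, in order;
        // across streams: the chunks overlap with each other
        cudaMemcpyAsync(dIn + off, hIn + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        processChunk<<<(n + 127) / 128, 128, 0, streams[s]>>>(dIn + off, dOut + off, n);
        cudaMemcpyAsync(hOut + off, dOut + off, n * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
    cudaFree(dIn);
    cudaFree(dOut);
}
```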

A rough interpretation of the serial-case profile output is that the device only spends 75% of the available time on compute, due to synchronization calls and memory transfers.

The stream-case profile output seems to have a broken memory-transfer pattern and, perhaps as a result, a broken kernel pattern, which may point to a synchronization method that is not ‘optimal’.
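
If the pattern really is being broken up by synchronization, one alternative (sketched here under the assumption that the framework currently forces ordering with device- or stream-wide synchronization between dependent steps) is to express cross-stream dependencies with events, so only the dependent stream waits:

```
#include <cuda_runtime.h>

// Placeholder kernels; in the real framework these would be the dependent steps.
__global__ void producerKernel(float* buf, int n)       { /* placeholder */ }
__global__ void consumerKernel(const float* buf, int n) { /* placeholder */ }

void chainWithEvent(float* dBuf, int n, cudaStream_t producer, cudaStream_t consumer)
{
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    producerKernel<<<(n + 127) / 128, 128, 0, producer>>>(dBuf, n);
    cudaEventRecord(done, producer);

    // only the consumer stream waits for the producer; other streams keep running
    cudaStreamWaitEvent(consumer, done, 0);
    consumerKernel<<<(n + 127) / 128, 128, 0, consumer>>>(dBuf, n);

    cudaEventDestroy(done);
}
```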