Using multiple streams with multiple host threads takes longer?

Hi all,

Let me first post question, and then explain the problem:

Q1. Why does cudaLaunch take longer (~100us instead of ~14us) when using multiple streams with multiple host threads?
Q2. Why is there such a long delay (~70us) between the kernel launch command and the actual execution, even when apparently nothing else is going on (no memory copies, no other kernels executing, etc.)?

Using : GTX 580 (that is, Fermi), VS2008, Nsight 2.2

I would very much appreciate an answer on any part of the questions :)
Thanks in advance.

Problem description:

My initial problem was the long delay (~70us) between the kernel launch command and the actual execution (Q2 - see the diagram below).

But I thought maybe this was an intrinsic limitation of Fermi, so I looked for an alternative solution.

In order to hide this latency and increase the occupancy,
I tried to divide the workload into 2 streams (no inter-stream dependency).
Each stream was controlled by an independent host thread (2 threads invoked using OpenMP).

My expectation was that 2 threads would ‘fill’ the occupancy gap between kernel launch and execution, thus reducing the overall execution time.
I can see that naively launching kernels with 2 threads, without explicit kernel launch scheduling (as suggested in “CUDA C/C++ Streams and Concurrency”), would not give the best result, but I expected at least a 1.3x~1.5x speedup.

However, the overall execution time slightly increased, from 0.45 secs to 0.5 secs.
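For reference, the kind of setup described above could be sketched like this. This is a hedged reconstruction, not the original code: the kernel body, buffer sizes, iteration count, and names (`dummyKernel`, `d_buf`) are all placeholders.

```cuda
// Sketch: two OpenMP host threads, each launching independent work into its
// own CUDA stream. Requires a CUDA-capable GPU and nvcc with -Xcompiler -fopenmp.
#include <omp.h>
#include <cuda_runtime.h>

__global__ void dummyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder workload
}

int main() {
    const int n = 1 << 20;
    float* d_buf[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) {
        cudaMalloc(&d_buf[s], n * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }

    #pragma omp parallel num_threads(2)
    {
        int tid = omp_get_thread_num();   // one host thread per stream
        for (int iter = 0; iter < 100; ++iter)
            // each launch here is where the reported ~100us cudaLaunch shows up
            dummyKernel<<<(n + 255) / 256, 256, 0, stream[tid]>>>(d_buf[tid], n);
        cudaStreamSynchronize(stream[tid]);
    }

    for (int s = 0; s < 2; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(d_buf[s]);
    }
    return 0;
}
```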

The main cause was that cudaLaunch took an enormous amount of time when using multiple streams: it now takes about 100us (Q1 - see the diagram below), even while the other stream is doing nothing.

I wonder why the kernel launch process takes so much longer with more streams.

I would appreciate it if anyone could hint at what I am doing wrong.


This post is still relevant. I see this gap when opening new streams from new host threads.
The CUDA version is 10.2 and the GPU is an RTX 5000.

An explanation of how to use concurrent streams from multiple host threads would be appreciated.

One of the issues is that the CUDA runtime (i.e. the library used for all CUDA runtime API calls, kernel launches, etc.) often has a design based on locks that must be acquired to perform certain activities. I wouldn’t be able to spell this out in detail, but it’s not difficult to show with appropriate system debug and analysis tools that calls into the CUDA runtime library often acquire a host system lock of some sort.

This behavior necessarily results in serialization when multiple host threads are using the runtime library. The degree to which this affects a specific code is code-specific and hard to generalize, but you can find a variety of other questions on these forums with test cases demonstrating that CUDA runtime API calls very often take longer (have a longer latency before returning control to the host thread) in a multithreaded scenario.
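As a toy illustration of this serialization effect (plain host C++, emphatically not the actual runtime implementation; `fake_launch` and `run_contention` are made-up names), calls from multiple threads that must all pass through one lock queue up behind each other:

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical analogy: every "API call" grabs one process-wide lock, so
// concurrent host threads serialize even though each works on its own "stream".
static std::mutex runtime_lock;

// Stand-in for a runtime entry point such as a kernel launch.
static void fake_launch(long& issued) {
    std::lock_guard<std::mutex> guard(runtime_lock); // all threads queue here
    ++issued;
}

// Spawn `threads` host threads, each issuing `per_thread` launches into its
// own notional stream; returns the total number of launches that completed.
long run_contention(int threads, int per_thread) {
    long issued = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i)
                fake_launch(issued);   // serialized by runtime_lock
        });
    for (auto& w : workers) w.join();
    return issued;
}
```

Every call completes, but each one may first wait for the other thread to release the lock, which is the kind of added per-call latency described above.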

Of course we would prefer it if there were no such thing, but that would be like wishing for a lockless design. Such a thing is non-trivial to undertake and might not yield performance improvement if other trade-offs had to be made to achieve it.

This is clearly a hand-waving response. Whether or not the locking behavior I mention here is a small or large contributor to this particular case I cannot say. However I know it plays a role/is a factor in some cases where I have seen analysis performed.

You’re welcome to file bugs with behavior that you find disappointing. I’m sure our dev teams would like to have additional cases to study.


Thank you for the answer. Knowing that locks within the CUDA runtime library might be the cause of this gap is very helpful.