Let me first post the questions, and then explain the problem:
Q1. Why does cudaLaunch take so much longer (~100us vs. ~14us) when using multiple streams with multiple host threads?
Q2. Why is there so much delay (~70us) between the kernel launch command and the actual execution, even when apparently nothing else is going on (no memory copy, no other kernel execution, etc.)?
Using: GTX 580 (i.e., Fermi), VS2008, Nsight 2.2
I would very much appreciate an answer on any part of the questions :)
Thanks in advance.
My initial problem was that there is too much delay (~70us) between the kernel launch command and the actual execution (Q2 - see diagram below).
But I thought this might be an intrinsic limitation of Fermi, so I looked for an alternative solution.
In order to hide this latency and increase occupancy,
I tried to divide the workload into 2 streams (with no inter-stream dependency).
Each stream was controlled by an independent host thread (2 threads invoked using OpenMP).
My expectation was that the 2 threads would 'fill' the occupancy gap between kernel launch and execution, thus reducing the overall execution time.
I can see that naively launching kernels with 2 threads without explicit kernel launch scheduling (as suggested in CUDA C/C++ Streams and Concurrency) would not give the best result, but I expected at least 1.3X~1.5X speedup.
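To make the setup concrete, here is a minimal sketch of the launch pattern I am describing; the kernel body, array sizes, and iteration count are placeholders, not my actual workload:

```cuda
#include <cuda_runtime.h>
#include <omp.h>

__global__ void workKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;   // placeholder workload
}

int main()
{
    const int n = 1 << 20;
    float *d_data[2];
    cudaStream_t stream[2];

    for (int s = 0; s < 2; ++s) {
        cudaMalloc(&d_data[s], n * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }

    // Two host threads, each launching into its own stream.
    // There is no inter-stream dependency; on CUDA 4.0+ both
    // threads share the same context automatically.
    #pragma omp parallel num_threads(2)
    {
        int s = omp_get_thread_num();
        for (int iter = 0; iter < 1000; ++iter)
            workKernel<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_data[s], n);
        cudaStreamSynchronize(stream[s]);
    }

    for (int s = 0; s < 2; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(d_data[s]);
    }
    cudaDeviceReset();
    return 0;
}
```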
However, the overall execution time slightly increased from 0.45 secs to 0.5 secs.
The main cause was that cudaLaunch took an enormous amount of time when using multiple streams: cudaLaunch now takes about 100us (Q1 - see diagram below), even while the other stream is doing nothing.
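The ~100us and ~14us figures come from the Nsight timeline, but the per-call launch cost can also be seen with a simple host-side timer; a minimal sketch, using an empty kernel and omp_get_wtime() as the timer (both are simplifications, not my actual measurement code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <omp.h>

__global__ void emptyKernel() {}

int main()
{
    // Warm up: the first launch includes one-time initialization cost.
    emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    const int iters = 1000;
    double t0 = omp_get_wtime();
    for (int i = 0; i < iters; ++i)
        emptyKernel<<<1, 1>>>();   // asynchronous: returns right after enqueue
    double t1 = omp_get_wtime();
    cudaDeviceSynchronize();

    printf("average launch overhead: %.1f us\n", (t1 - t0) / iters * 1e6);
    return 0;
}
```

Since the launch is asynchronous, this measures only the host-side enqueue cost of cudaLaunch, which is exactly the number that balloons when two host threads launch into two streams.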
I wonder why the kernel launch process takes so much more time with more streams.
I would appreciate any hints about what I might be doing wrong.