Kernel launch latency and streams

duud · September 12, 2017, 11:52am

We are observing an increase in kernel launch latency when using streams. The increase is significant because the kernel runs for only about 0.3 ms, and there is no additional work we could add to make it run longer. This defeats the purpose of using streams to overlap data transfers. The effect is more noticeable on low-speed PCIe interfaces. Is this behaviour currently unavoidable?

tera · September 13, 2017, 11:34am

What device / compute capability?
On older devices (pre-3.5) the order in which kernels are launched into independent streams is significant.

duud · September 13, 2017, 11:48am

GTX 1080 Ti / compute capability 6.1

One stream is used for queuing up kernel launches.
The second stream is used for doing overlapped data transfers only.
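
For reference, the arrangement described above can be sketched roughly as follows. This is a minimal sketch under assumptions, not the actual application code: the kernel `scale`, the function `run_overlapped`, the double-buffering scheme and the buffer sizes are all hypothetical and chosen only to illustrate one stream queuing kernels while a second stream does nothing but transfers.

```cuda
#include <cuda_runtime.h>

// Returns 1 from the enclosing function if a CUDA runtime call fails.
#define CHECK(call) do { if ((call) != cudaSuccess) return 1; } while (0)

// Hypothetical stand-in for the real short-running kernel.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Queues `iters` kernels on one stream while a second stream uploads the
// input for the next kernel. Returns 0 on success, 1 on any CUDA error.
int run_overlapped(int n, int iters) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    cudaStream_t compute, copy;
    cudaEvent_t uploaded[2], consumed[2];
    float *h_in = nullptr;
    float *d_buf[2] = {nullptr, nullptr};

    CHECK(cudaStreamCreate(&compute));
    CHECK(cudaStreamCreate(&copy));
    // Pinned host memory, so that cudaMemcpyAsync can run asynchronously.
    CHECK(cudaMallocHost(&h_in, bytes));
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;
    for (int b = 0; b < 2; ++b) {
        CHECK(cudaMalloc(&d_buf[b], bytes));
        CHECK(cudaEventCreateWithFlags(&uploaded[b], cudaEventDisableTiming));
        CHECK(cudaEventCreateWithFlags(&consumed[b], cudaEventDisableTiming));
    }

    for (int it = 0; it < iters; ++it) {
        const int b = it % 2;
        // Do not overwrite buffer b until the kernel launched two
        // iterations ago has finished reading it.
        if (it >= 2) CHECK(cudaStreamWaitEvent(copy, consumed[b], 0));
        CHECK(cudaMemcpyAsync(d_buf[b], h_in, bytes,
                              cudaMemcpyHostToDevice, copy));
        CHECK(cudaEventRecord(uploaded[b], copy));

        // The kernel must not start before its input has arrived.
        CHECK(cudaStreamWaitEvent(compute, uploaded[b], 0));
        scale<<<(n + 255) / 256, 256, 0, compute>>>(d_buf[b], n, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaEventRecord(consumed[b], compute));
    }
    CHECK(cudaDeviceSynchronize());

    for (int b = 0; b < 2; ++b) {
        CHECK(cudaEventDestroy(uploaded[b]));
        CHECK(cudaEventDestroy(consumed[b]));
        CHECK(cudaFree(d_buf[b]));
    }
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaStreamDestroy(copy));
    CHECK(cudaStreamDestroy(compute));
    return 0;
}
```

Calling something like `run_overlapped(1 << 20, 8)` from `main` and inspecting the timeline in a profiler should show whether the upload for iteration i+1 actually overlaps the kernel for iteration i.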

tera · September 13, 2017, 8:01pm

None of my hypotheses seem to match then.

I’ve not noticed such behaviour on Pascal cards. So we’d need more concrete information, e.g. example code, to help you.

njuffa · September 13, 2017, 10:32pm

Can you state the actual launch latency you are observing? With a PCIe gen3 interface, on a non-Windows-WDDM platform, you should see about 5 usec, and this should be pretty much invariant of the features used by the kernel.
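
A rough way to get such a number is to time a batch of back-to-back empty kernel launches on a given stream and divide by the launch count. The sketch below is only an approximation under assumptions: `empty_kernel` and `avg_launch_us` are hypothetical names, and the result is the average host-observed cost per launch rather than a strict single-launch latency.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// A kernel that does no work, so the measurement is dominated by launch cost.
__global__ void empty_kernel() {}

// Average cost per launch, in microseconds, of `launches` back-to-back empty
// kernel launches on `stream`. Returns a negative value on a CUDA error.
double avg_launch_us(cudaStream_t stream, int launches) {
    // Warm-up launch so that context setup and module loading are not timed.
    empty_kernel<<<1, 1, 0, stream>>>();
    if (cudaStreamSynchronize(stream) != cudaSuccess) return -1.0;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; ++i)
        empty_kernel<<<1, 1, 0, stream>>>();
    // Wait for the last kernel so all launches are included in the interval.
    if (cudaStreamSynchronize(stream) != cudaSuccess) return -1.0;
    auto t1 = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::micro> elapsed = t1 - t0;
    return elapsed.count() / launches;
}
```

Running this once with the default stream (`0`) and once with a stream from `cudaStreamCreate` would show whether using a non-default stream by itself changes the per-launch cost on your system.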

Can you clarify “low speed PCIE interface”? As in PCIe gen 2? Ideally, you should not be using low-speed PCIe interfaces. As far as I am aware, the launch overhead of 5 usec is pretty much due to the overhead of the PCIe hardware itself.

What GPU are we talking about here, and what is the system platform?
