Kernel launch latency and streams

We are observing an increase in kernel launch latency when using streams. The increase is significant because the kernel running time is only about 0.3 ms, and there is no more work available to lengthen it. This defeats the whole point of using streams to overlap data transfers. The effect is more noticeable on low-speed PCIe interfaces. Is this behaviour currently unavoidable?

What device / compute capability?
On older devices (pre-3.5) the order in which kernels are launched into independent streams is significant.

1080Ti / 6.1

One stream is used for queuing up kernel launches.
The second stream is used for doing overlapped data transfers only.
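
For reference, a minimal sketch of the two-stream arrangement described above might look like the following. This is not the actual application code: the kernel `work`, the function `runOverlap`, the buffer names, and the sizes are all hypothetical placeholders. It assumes pinned host memory, since pageable memory generally prevents `cudaMemcpyAsync` from overlapping with kernel execution.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Report a failing CUDA call and bail out of the enclosing function.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the short kernel from the question.
__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Queues kernels on one stream and host-to-device copies on another.
// Returns 0 on success, 1 on any CUDA error.
int runOverlap(int iterations) {
    const int n = 1 << 20;                  // assumed problem size
    const size_t bytes = n * sizeof(float);

    float *hostBuf = nullptr, *devCompute = nullptr, *devStaging = nullptr;
    CHECK(cudaMallocHost(&hostBuf, bytes)); // pinned host memory
    CHECK(cudaMalloc(&devCompute, bytes));
    CHECK(cudaMalloc(&devStaging, bytes));
    CHECK(cudaMemset(devCompute, 0, bytes));
    for (int i = 0; i < n; ++i) hostBuf[i] = 1.0f;

    cudaStream_t computeStream, copyStream;
    CHECK(cudaStreamCreate(&computeStream));
    CHECK(cudaStreamCreate(&copyStream));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int it = 0; it < iterations; ++it) {
        // Stream 1: kernel launches only.
        work<<<blocks, threads, 0, computeStream>>>(devCompute, n);
        CHECK(cudaGetLastError());
        // Stream 2: data transfers only, into a separate buffer.
        CHECK(cudaMemcpyAsync(devStaging, hostBuf, bytes,
                              cudaMemcpyHostToDevice, copyStream));
    }
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaStreamDestroy(computeStream));
    CHECK(cudaStreamDestroy(copyStream));
    CHECK(cudaFree(devCompute));
    CHECK(cudaFree(devStaging));
    CHECK(cudaFreeHost(hostBuf));
    return 0;
}

int main() {
    return runOverlap(100);
}
```

Running something like this under a profiler timeline should show whether the copies and kernels actually overlap, and how the gaps between kernels compare with a run that issues no copies.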

None of my hypotheses seem to match then.

I’ve not noticed such behaviour on Pascal cards. So we’d need more concrete information, e.g. example code, to help you.

Can you state the actual launch latency you are observing? With a PCIe gen3 interface, on a non-Windows-WDDM platform, you should see about 5 usec, and this should be pretty much independent of the features used by the kernel.
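
One rough way to get such a number is to time a batch of back-to-back launches of an empty kernel and divide by the launch count. The sketch below is only an illustration under that assumption: `emptyKernel` and `meanLaunchMicroseconds` are made-up names, and the result is the average per-launch cost in a saturated queue, not the latency of a single isolated launch.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Empty kernel: any time measured is launch overhead, not compute.
__global__ void emptyKernel() {}

// Returns the mean wall-clock time per launch in microseconds,
// or a negative value if a CUDA call fails.
double meanLaunchMicroseconds(cudaStream_t stream, int launches) {
    // Warm-up launch so one-time initialization is not timed.
    emptyKernel<<<1, 1, 0, stream>>>();
    if (cudaGetLastError() != cudaSuccess) return -1.0;
    if (cudaStreamSynchronize(stream) != cudaSuccess) return -1.0;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; ++i) {
        emptyKernel<<<1, 1, 0, stream>>>();
    }
    if (cudaGetLastError() != cudaSuccess) return -1.0;
    if (cudaStreamSynchronize(stream) != cudaSuccess) return -1.0;
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / launches;
}

int main() {
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return 1;

    const int launches = 10000;  // assumed batch size
    printf("default stream: %.2f usec per launch\n",
           meanLaunchMicroseconds(nullptr, launches));
    printf("created stream: %.2f usec per launch\n",
           meanLaunchMicroseconds(stream, launches));

    cudaStreamDestroy(stream);
    return 0;
}
```

Comparing the figure with and without a concurrent transfer running in a second stream would show whether the copies really do inflate the launch cost.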

Can you clarify “low speed PCIE interface”? As in PCIe gen 2? Ideally, you should not be using low-speed PCIe interfaces. As far as I am aware, the launch overhead of 5 usec is pretty much due to the overhead of the PCIe hardware itself.

What GPU are we talking about here, and what is the system platform?