kernel launch latency

Is it possible to get kernel launch latency down to under 3 us? What's the lowest theoretical limit?

CUDA 9.2 release notes state:

  • Launch CUDA kernels up to 2X faster than CUDA 9 with new optimizations to the CUDA runtime

so try an upgrade to CUDA 9.2!

Also, use texture objects rather than texture references in your kernels, as each texture reference a kernel uses adds launch overhead.
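For anyone who has only used texture references, here is a minimal sketch of the texture-object route, assuming a plain linear float buffer and a GPU of compute capability 3.0 or later. The names `copyThroughTexture` and `textureRoundTripErrors` are made up for illustration; the runtime calls are the standard texture-object API.

```cuda
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// The texture object is an ordinary kernel argument, so there is no
// per-reference binding work at launch time.
__global__ void copyThroughTexture(cudaTextureObject_t tex, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch<float>(tex, i);
}

// Hypothetical helper: returns the number of mismatching elements, or -1 on a CUDA error.
int textureRoundTripErrors(int n) {
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Describe the linear device buffer that backs the texture.
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_in;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;  // return raw floats

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

    copyThroughTexture<<<(n + 255) / 256, 256>>>(tex, d_out, n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaError_t err = cudaGetLastError();

    cudaDestroyTextureObject(tex);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i]) ++errors;
    return errors;
}

int main() {
    printf("mismatches: %d\n", textureRoundTripErrors(256));
    return 0;
}
```

Build with something like `nvcc -std=c++11 tex.cu`. Whether this makes a measurable difference to launch time in a given application is something to measure rather than assume.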

The lowest launch latencies are on Linux, and with the TCC driver on Windows. With the default WDDM driver on Windows, you will likely see launch latencies fluctuating between 5 us and 20 us, as a consequence of design decisions made by Microsoft (basically, trying to impose greater OS control on the GPU).

For the past ten years, the lowest achievable launch latency for null kernels has been around 5 us, regardless of hardware, which made me believe that this is a lower limit caused by PCIe latencies, which have not really changed as PCIe matured.

So I was surprised to hear that CUDA 9.2 managed to magically reduce launch latency by up to a factor of two, and am wondering whether this also includes slashing the minimum achievable launch latency (previously 5 us). Should you experiment with CUDA 9.2, reporting your findings here would probably be of interest to a number of people.
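For anyone who wants to report numbers, a minimal sketch of a null-kernel timing test follows. It assumes host-side wall-clock timing with `std::chrono` is good enough; the function name `launchLatencyUs` and the iteration count are illustrative choices, not an established benchmark. It reports two figures: the average cost of launch plus synchronize (a full round trip), and the average cost per launch when launches are issued back to back with a single synchronize at the end.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Empty ("null") kernel: whatever time we measure is overhead, not compute.
__global__ void nullKernel() {}

// Average microseconds per launch.
// syncEachLaunch = true : launch + cudaDeviceSynchronize per iteration (round trip)
// syncEachLaunch = false: back-to-back launches, one synchronize at the end
double launchLatencyUs(int iterations, bool syncEachLaunch) {
    // Warm-up so one-time context and module setup is not included.
    nullKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        nullKernel<<<1, 1>>>();
        if (syncEachLaunch) cudaDeviceSynchronize();
    }
    cudaDeviceSynchronize();  // make sure all queued launches have completed
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / iterations;
}

int main() {
    const int iterations = 100000;
    double roundTrip = launchLatencyUs(iterations, true);
    double backToBack = launchLatencyUs(iterations, false);
    if (cudaGetLastError() != cudaSuccess) {
        printf("CUDA error during timing run\n");
        return 1;
    }
    printf("launch + sync   : %.2f us per launch\n", roundTrip);
    printf("back-to-back    : %.2f us per launch\n", backToBack);
    return 0;
}
```

Build with something like `nvcc -std=c++11 launch.cu` and run it on an otherwise idle machine; results will vary with CPU, driver, and OS, so it helps to state those alongside the numbers.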

Generally speaking, host-side overheads in CUDA will be minimized by use of a CPU with high single-thread performance. My recommendation is for CPUs with a base operating clock of >= 3.5 GHz.

Using persistent kernels and dynamic parallelism might help reduce latencies a bit.

It just makes giving new work to the GPU and retrieving results a bit more complicated, as data will then have to be passed around through unified or zero-copy memory in some sort of work queue.
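A minimal sketch of such a persistent kernel, assuming a single worker thread, a one-slot "queue" in zero-copy (mapped pinned) host memory, and an x86-64 Linux host. The `Mailbox` struct and all function names are hypothetical, and the doubling is a stand-in for real work. The kernel is launched once and then busy-waits on a flag that the host writes.

```cuda
#include <atomic>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical one-slot mailbox shared by CPU and GPU via mapped pinned memory.
struct Mailbox {
    volatile int   request;   // host writes a new sequence number, or -1 to quit
    volatile int   response;  // GPU writes the sequence number it has finished
    volatile float in;
    volatile float out;
};

__global__ void persistentWorker(Mailbox *mb) {
    int last = 0;
    for (;;) {
        int req = mb->request;       // volatile: re-read from memory every pass
        if (req == last) continue;   // busy-wait until the host posts new work
        if (req < 0) break;          // quit command
        __threadfence_system();      // read the payload only after seeing the flag
        mb->out = 2.0f * mb->in;     // stand-in for the real computation
        __threadfence_system();      // publish the result before signalling
        mb->response = req;
        last = req;
    }
}

// Host side: post one job and spin until the GPU reports it done.
float submit(Mailbox *mb, int seq, float x) {
    mb->in = x;
    std::atomic_thread_fence(std::memory_order_seq_cst);
    mb->request = seq;
    while (mb->response != seq) { /* spin */ }
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return mb->out;
}

// Returns the number of wrong results, or -1 on a CUDA error.
int runPersistentDemo(int jobs) {
    cudaSetDeviceFlags(cudaDeviceMapHost);  // harmless if mapping is already enabled
    Mailbox *h_mb = nullptr, *d_mb = nullptr;
    if (cudaHostAlloc((void **)&h_mb, sizeof(Mailbox), cudaHostAllocMapped) != cudaSuccess)
        return -1;
    h_mb->request = 0; h_mb->response = 0; h_mb->in = 0.0f; h_mb->out = 0.0f;
    cudaHostGetDevicePointer((void **)&d_mb, h_mb, 0);

    persistentWorker<<<1, 1>>>(d_mb);       // launched once, stays resident
    if (cudaGetLastError() != cudaSuccess) { cudaFreeHost(h_mb); return -1; }

    int wrong = 0;
    for (int seq = 1; seq <= jobs; ++seq)
        if (submit(h_mb, seq, (float)seq) != 2.0f * seq) ++wrong;

    h_mb->request = -1;                     // tell the kernel to exit
    cudaError_t err = cudaDeviceSynchronize();
    cudaFreeHost(h_mb);
    return err == cudaSuccess ? wrong : -1;
}

int main() {
    printf("wrong results: %d\n", runPersistentDemo(1000));
    return 0;
}
```

Two caveats: on a GPU that also drives a display, the kernel run-time limit can kill a long-running kernel, and under WDDM the launch may sit in a batch until something flushes it, so this sketch is best tried on Linux or with the TCC driver. A real version would use many threads and a multi-slot queue.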

As far as I know (haven’t played with it in a long time), dynamic parallelism (kernel launches initiated from the GPU) does not make a noticeable difference in terms of launch latency.

I know nothing about the internal hardware that enables dynamic parallelism, but from a design simplicity perspective it seems reasonable to assume that this simply piggy-backs onto the existing hardware interfaces and command queuing structures that handle commands sent from the CPU via PCIe, and is therefore subject to similar overhead.

I agree it may be worth a try if one is already pushing hard against the traditional kernel launch issue limit of 200,000 launches per second (that is, one launch every 5 us).

Persistent kernels, that's a new one. So it's like a busy-wait? That would remove the launch time, and most of the remaining time would be the memory transfers to and from the CPU.

Do you know the easiest way to do a busy-wait on a GPU?

Can I ask a follow-up question: what is the fastest way to get fresh data in and out of the device arrays? The size never changes; I am just replacing the data in the input array.

host array (fresh data) -> device array -> sgemm etc -> host array (result).

By the way, I have one other question: if I'm doing a matrix multiply, I will have to hand-code it, because there is no way to do sgemm with a persistent kernel, right? Launching sgemm is going to kill the response times.

Asynchronous copies using page-locked memory might be fastest. If you want to overlap memory transfers and computation, you will have to look into the CUDA streams API. You will also need multiple buffers to allow round-robin operation for the in-transfers, computation, and out-transfers on these buffers.

Overlapping does not reduce latency for a single matrix operation, but it might allow the GPU to do uninterrupted computation while the transfers in and out of the GPU are taking place. So this maximizes the overall arithmetic throughput.
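A minimal sketch of that round-robin scheme, assuming three buffer slots with one stream each and a trivial `scale` kernel standing in for the real computation (sgemm or otherwise). The function name `runPipeline`, the buffer count, and the sizes are all illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the real computation.
__global__ void scale(const float *in, float *out, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = factor * in[i];
}

// Pushes `batches` batches through kBuffers round-robin slots.
// Returns the number of batches with a wrong result, or -1 on a CUDA error.
int runPipeline(int batches) {
    const int kBuffers = 3;
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    cudaStream_t stream[kBuffers];
    float *h_in[kBuffers], *h_out[kBuffers], *d_in[kBuffers], *d_out[kBuffers];
    int inFlight[kBuffers];  // batch id currently using each slot, -1 if none
    for (int b = 0; b < kBuffers; ++b) {
        cudaStreamCreate(&stream[b]);
        cudaMallocHost((void **)&h_in[b], bytes);   // page-locked host memory
        cudaMallocHost((void **)&h_out[b], bytes);
        cudaMalloc((void **)&d_in[b], bytes);
        cudaMalloc((void **)&d_out[b], bytes);
        inFlight[b] = -1;
    }
    if (cudaGetLastError() != cudaSuccess) return -1;

    int wrong = 0;
    // The extra kBuffers iterations drain the slots still in flight at the end.
    for (int batch = 0; batch < batches + kBuffers; ++batch) {
        int b = batch % kBuffers;
        // Wait for this slot's previous batch before touching its host buffers.
        cudaStreamSynchronize(stream[b]);
        if (inFlight[b] >= 0 && h_out[b][n - 1] != 2.0f * inFlight[b]) ++wrong;
        inFlight[b] = -1;
        if (batch >= batches) continue;

        for (int i = 0; i < n; ++i) h_in[b][i] = (float)batch;  // fresh input
        cudaMemcpyAsync(d_in[b], h_in[b], bytes, cudaMemcpyHostToDevice, stream[b]);
        scale<<<(n + 255) / 256, 256, 0, stream[b]>>>(d_in[b], d_out[b], n, 2.0f);
        cudaMemcpyAsync(h_out[b], d_out[b], bytes, cudaMemcpyDeviceToHost, stream[b]);
        inFlight[b] = batch;
    }

    cudaError_t err = cudaDeviceSynchronize();
    for (int b = 0; b < kBuffers; ++b) {
        cudaFreeHost(h_in[b]); cudaFreeHost(h_out[b]);
        cudaFree(d_in[b]); cudaFree(d_out[b]);
        cudaStreamDestroy(stream[b]);
    }
    return err == cudaSuccess ? wrong : -1;
}

int main() {
    printf("wrong batches: %d\n", runPipeline(12));
    return 0;
}
```

While one slot's kernel runs, the other slots' copies can proceed, provided the GPU has copy engines free to do so. A profiler timeline is the easiest way to confirm that the overlap actually happens on a given system.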

I installed CUDA 9.2 on Linux.

cuBLAS sgemm takes 5.5 us total time with a 1x1 matrix, which is similar to the historic best latencies. Shame they didn't actually improve anything.

Thanks for closing the loop. I can’t say I am surprised, since I concluded years ago that the 5 usec minimum kernel launch latency is a function of the PCIe interface (as noted in #3 above).

There may be an improvement in launch times, just not a reduction of the minimum launch time.

PCIe 4.0 or 5.0 might improve it. It will be interesting to see over the next 12 months.

Possible, but unlikely. New PCIe versions tend to increase throughput without improving latency, thus confirming the old engineering adage: “Money can buy bandwidth, but latency is forever”.

10GbE network cards have a similar latency of about 5 us, but with special drivers they get below 1 microsecond. So there is no reason NVIDIA can’t do it. I just think they don’t care.

Presumably the latency is not just a function of the endpoint (here: GPU), but also the root complex (in the CPU). The fact that the PCIe signal path is going through a slot connector is probably not helping matters.

I don’t know what is technically possible with PCIe. The only paper I could find on the spot on Google Scholar with data on network adapters using a PCIe interface clearly shows 5 microseconds as the lower limit of latency in the scatter plot.

If the “true” technical limit on latency were lower, there could still be undesirable technical trade-offs in achieving it that may make sense for network adapters but not graphics cards. Also, there could be economic obstacles to achieving a lower latency: With products sold for profit, any feature that drives up cost without increasing sales could be a non-starter.

I found a site that claims users can configure the PCIe latency themselves (presumably with stability as a trade-off):

I am skeptical because PCI != PCIe, and PCIe is not a bus, it’s a point-to-point interconnect.

The network cards I refer to are primarily slot-connected (PCIe x8 or x16, e.g. Solarflare), so I think it is applicable to GPUs.

I wonder if the TCC driver on Windows has lower latency than CUDA on Linux? Possibly, since it doesn't support video, they can be more flexible on PCIe timings. Is there something comparable to TCC on Linux?

I have been told by multiple people who have used Windows with the TCC driver and Linux on the same hardware platform that Windows with the TCC driver offers slightly better performance due to slightly lower overall overhead. Some actual performance numbers I was shown indicated that the differences were tiny and not practically relevant. I consider performance differences under 2% noise.

I have never observed minimum kernel launch times below the 5 microsecond mark on any OS platform supported by CUDA, with any driver. Of course, you don’t have to rely on other people’s observations and can just give it a try yourself if you use a GPU that is supported by the TCC driver.

Maybe I am misunderstanding what Solarflare writes about their products, but on their website I read (emphasis mine):

I think it is unlikely that a kernel launch is strictly a send operation. At minimum, so blocking APIs can work and for the proper handling of (circular) command buffers, a GPU should send notifications indicating completed work units back to the CPU. So a launch is presumably of the form “process notifications received from GPU for completed previous work, then send data to GPU for new work”. The latency of a kernel launch would therefore be equivalent to the half round trip latency quoted for network cards. I admit this is (hopefully intelligent) speculation.

If you don’t want to rely on speculation, you could hook up a logic analyzer to the PCIe interconnect and study the traffic patterns as you are blasting a stream of kernel launches to the GPU.

Frankly, I don’t know if it’s transferable to GPUs, but they are delivering packets from the NIC over the PCIe bus to a user application in under 1 us. Mellanox is down around 600 ns. To get below this we keep application logic entirely on the NIC to avoid the bus. Notwithstanding, we are already under 2 us return trip to user space and have been for at least 7 years.

Re: Solarflare, as I said, you need to read about the various driver modes. E.g.:

Onload is a high-performance POSIX compliant network stack from Solarflare that dramatically reduces latency and x86 utilization, while increasing throughput and reducing latency. Onload supports TCP/UDP/IP network protocols by providing a standard Berkley Sockets Direct (BSD) Application Programming Interface (API), and requires no modifications to end user applications. Onload delivers half round trip latency in the 1,000 nanosecond range, by contrast the typical kernel stack latency is about 7,000 nanoseconds. It achieves these performance improvements in part by performing network processing in user, bypassing the OS kernel entirely, reducing the number of data copies, and kernel context switches.

Well, I grabbed the top-most advertised number from the website. If the marketing people at Solarflare know what they are doing, they will put the most relevant result there :-)

This is for receiving, not for sending. Therefore I am not sure how this relates to kernel launches. We know for sure that kernel launches must include a send operation, and they may include a receive operation by reasonable conjecture.

In any event, I don’t think we will resolve the issue of minimal kernel launch latency in this forum thread. You might consider filing a request for enhancement with NVIDIA, or if your application relies crucially on lowering the launch latency and you are (directly or indirectly) responsible for a purchasing decision on a good-sized batch of GPUs, maybe contact NVIDIA directly at one of their sales offices. Money talks.