Does saturating a stream hide kernel launch latency?

Let’s say I have a series of kernels to launch. They are called sequentially in host code and let’s just assume that they each take a good amount of time (time to complete > latency of kernel launch).

As I understand, if you request multiple kernels to the same stream, they are queued. Can you use this to hide the latency created by calling a kernel from the host?

arguing from the perspective of jit compilation, wouldn’t you always at least be left with the very first kernel call…?

can we further explore kernel launch latency, particularly in the context of dynamic parallelism

for example:

if i move my algorithm functionality into device functions, for readability and reusability and similar good reasons, encapsulated and controlled by a device kernel, i suspect the calling/ inlining of the functions carry little (hardly any) cost

if i move the algorithm functionality into device kernels, called by a device kernel via dynamic parallelism, i have the additional advantage of being able to adjust the kernel dimensions applied to the functionality, now in kernels as opposed to functions - in many cases it is desirable to adjust the dimensions across execution, to take advantage of certain characteristics
i would equally be able to consolidate the overhead performed by single threads of instances into blocks, which may further realize efficiencies
but would i not now suffer increased cost, because i am referencing kernels, as opposed to functions, implying/ requiring compilation, particularly when doing so on an iterative basis?
can the driver compile all child kernels referenced by a parent, and store these (on/ after the first child kernel reference), such that, should the parent repetitively call a child kernel, it only needs to be compiled once really (thus implying little kernel launch overhead overall)…?

You can avoid runtime kernel compilation altogether by specifying the proper device architecture to target. And even if you do runtime/JIT-compilation, this will happen once, at application startup, for all kernels in your code. It will not be repeated just because you are calling said kernels repeatedly.

Finally, there is a JIT cache, so repeated executions of the application may not experience the JIT delay at startup, even if you have not targetted code for the actual device in use.

There should be no kernel call overhead for JIT-compilation apart from the overhead incurred once, at startup. And you eliminate this as well by targetting the device in use with your compile command.

noted, thanks

txbob, isn’t there an (unlikely) exception to the only-JIT-compile once rule? As far as I understood the JIT-compiled code is cached, and there is a generous but limited amount of memory set aside for that. If a particular app with a very large number of JIT-compiled kernels exceeds the size of the code cache, the JIT compilation could run a second time. My memory is extremely hazy, but I seem to recall this scenario from an application with a huge number of kernels generated from templates.

A best practice is to ship CUDA applications or libraries with pre-compiled machine code for all supported architectures (e.g. 2.0, 3.0, 3.5, 5.0) , plus PTX for the most recent GPU architecture (5.0).

Wait, I just realized that I have no idea how the CUDA compiler works. I keep seeing JIT even though I thought it was all static.

Does nvcc create a static copy of the host code but then use JIT to compile the kernels?

OpenCL seems a lot more transparent about this but one thing I like about CUDA is that the parser checks kernel code at the static compilation phase while I think OpenCL leaves all simple syntactic kernel errors to the JIT compiler to complain about.

Is this right?

Because GPU instruction sets differ between architectures, i.e. do not maintain binary compatibility, the CUDA compiler supports two methods to deal with this situation.

(1) Embed kernel code in the executable in an intermediate format, specifically a portable assembly language called PTX. At executable run-time, this intermediate format is JIT-compiled into the machine code appropriate for the GPU detected by the driver.

(2) Embed one or several architecture-specific versions of binary machine code into the executable. This is referred to as a “fat binary”. At run time the driver simply copies the binary code suitable for the detected GPU to the device.

These two approaches are orthogonal to each other. At run time, the driver will first look for an appropriate binary version for a given kernel, if none is found it will then look for appropriate PTX code to be JIT-compiled, and report an error if both previous steps fail.

In practice, one would typically generate binary code for all shipping architecture one plans to support to eliminate any issues with JIT overhead, and in addition generate a single PTX variant for the most recent architecture to achieve forward compatibility with unknown future GPU architectures.

You can extract embedded PTX code by running cuobjdump --dump-ptx on the executable, and extract the embedded binary machine code with cuobjdump --dump-sass.

The compiler options -arch and -codegen allow to embedded one or several variants of SASS (machine code) or PTX, see the CUDA documentation.

njuffa, yes, if the cache is not functional for one reason or another, then repeated executions will experience the JIT-compile delay at startup, for each execution. This is why I said:

“so repeated executions of the application may not experience the JIT delay at startup”

instead of:

“so repeated executions of the application will not experience the JIT delay at startup”

I didn’t think it necessary to delve into it, as the question was really about whether JIT-recompile would be triggered on each execution of a kernel multiple times within the same application/run. It will not, regardless of the state of the JIT cache. At most, JIT-recompile occurs once, at startup.

My apologies for overlooking the “may” versus “will” difference and unnecessarily complicating the discussion. I admit that it actually took me a second to spot the “will” vs “may” difference even with the sentences placed side-by-side. I guess the brains reads what it wants to read :-)

Well I certainly didn’t make it very prominent. But as I said, that was because I felt the focus of the discussion was elsewhere. As you pointed out, the optimal solution (IMO) is to compile explicitly for the intended target device. BTW, I’m very pleased that you still visit, from time to time.

for clarity on my side: if i take the necessary steps to ensure jit compilation is limited to startup only, or eliminate it altogether, i should see no significant difference in overhead, when calling child kernels from a parent, encapsulating kernel, compared to calling ‘child’ functions…?

is there any other kernel launch overhead, in addition to compilation, that would be significant, that could really upset the cost comparison?

also, if i am not mistaken, and if i interpret the material correctly, dynamic parallelism has some overhead in the sense of storing the state of the parent kernel (in global memory), not so?
could this imply child kernels to be more costly than child functions…?

C. Memory Footprint
The device runtime system software reserves memory for various management
purposes, in particular one reservation which is used for saving parent-grid state
during synchronization, and a second reservation for tracking pending grid launches.
Configuration controls are available to reduce the size of these reservations in exchange
for certain launch limitations. See Configuration Options, below, for details.
The majority of reserved memory is allocated as backing-store for parent kernel state, for
use when synchronizing on a child launch. Conservatively, this memory must support
storing of state for the maximum number of live threads possible on the device. This
means that each parent generation at which cudaDeviceSynchronize() is callable
may require up to 150MB of device memory, depending on the device configuration,
which will be unavailable for program use even if it is not all consumed.

To address you earlier concern about function call overhead, the compiler typically inlines most functions in the same compilation unit. It cannot do that when the functions are in separately compiled compilation units, as this would require a linker with function-inlining capability which is linker functionality that doesn’t exist at this time. While function inlining is subject to generous compiler heuristics, programmers also have some additional control via forceinline and noinline function attributes. Note that massive inlining can lead to very large kernel code that takes a long time to compile. Very complex and large kernels could also have a negative impact on some compiler optimizations but this is no different from compilers for CPUs.

I don’t have any practical experience with dynamic parallelism. As far as I know, in CUDA 6.5, the basic latency of launching a minimal kernel from the host vs launching a minimal kernel from the device is about the same and on the order of 5 microseconds on Kepler-class GPUs. I would think it is reasonable to assume that the additional flexibility provided by a solution that uses dynamic parallelism instead of classic host-launched kernels comes at some price in terms of performance, but having never used dynamic parallelism I couldn’t say what the tradeoffs are. txbob or other forum participants may have detailed insights.

You could also simply measure the performance of different configurations relevant to your use case based on a prototype. I am advocating experimental approaches to performance questions rather than trying to predict performance due to this or that code modification as I have found that to be a useful approach in my own performance work. Use of the profiler should prove helpful for determining the first-order bottlenecks in the code.

i think you are starting to confirm my doubts…

consider this:

as you clearly understood, the question is about the cost of replacing within-kernel functions with within-kernel kernels, for a number of reasons - flexibility for one

thus we have:




with the potential to replace it with:




the overarching kernel may contain up to 10, or even more, functions

crucial is that the overarching kernel actually iterates - a lot; at present it iterates by physically looping, conditional on work remaining, but i doubt whether it would change what is implied, if it iterated based on grid dimensions - multiple blocks - for instance

if it indeed takes longer to reference a kernel than a function - to get a kernel going than a function, then replacing the within-kernel functions with within-kernel kernels, it would imply a slow-down, not so?
i equally doubt that whether the function is inlined or not, would make any difference; regardless of whether the function is inlined or not, is is still more expensive to get a within-kernel kernel going than a within-kernel function; agreed?

I am not sure what you are trying to accomplish. A device function call inside a kernel, even when not inlined, is going to be faster than a device-side kernel launch. The latter is not intended to be a replacement for the former, such redundancy wouldn’t make much sense.

Dynamic parallelism is designed to handle scenarios that cannot be handled by simple per-thread device function calls (e.g. a call to a function from CUDA’s standard math library) or host-side kernel launches.

For example, it enables NVIDIA to ship an implementation of CUBLAS whose functions are callable from kernels already running on the device. The implementation of CUBLAS APIs requires threads to co-operate in specific ways, and this can only be accomplished if the CUBLAS thread organization is independent of the thread organization of the calling code. This requires a new kernel to be launched.

Other scenarios appropriate to dynamic parallelism require local decision making where it would be too costly to transport the necessary information back to the host to start a new kernel from there, or where one may want to add some levels of recursion to the process. An example for that I have seen in presentations of dynamic parallelism is the use of an adaptive grid for simulations.


more background:

most, if not all of the within-kernel functions have 2 parts; a control section preceded and/ or succeeded by the actual work section

device function the_function
if threadIdx.x == 0)


if (threadIdx.x == 0)

i run anything between 130 - 190 ‘threads’/ processes/ instances - whatever you wish to call it - as 32-thread blocks at a time
to some extent, it is similar to an (analogy of an) apu, where you switch between the cpu and gpu ‘on the fly’
the control overhead across the ‘threads’/ processes/ instances, done by single threads are accumulating, and the idea is to consolidate this by grouping ‘threads’/ processes/ instances themselves in 32-thread blocks
one realization of the suggestion would be to use dynamic paralellism; although it increasingly seems impractical
i suppose another implementation may be to use bar fencing; i really need to give it more thought

i have given it more thought:

i cannot use bar synching to consolidate overhead, as this would mean that i would need to tie blocks together, implying synchronization across blocks, and that blocks no longer can function independently of each other, which really defies the initial goal

the other hypothetical alternative may be to ‘double launch’, to always have an additional (child) kernel in the queue, with the average execution time of a kernel longer than the overhead

could you break down the kernel launch overhead, kindly please. if overhead due to compilation can be avoided, why is there still a (5ms) overhead occurred with kernel launching, and exactly what resources are utilized/ tied up by kernel launching?
would this only involve device side resources, or host side resources too?
i cannot think that the SMs are heavily involved; perhaps the ‘memory transfer engine’; the whitepaper makes mention of hardware-based work queues and the grid management unit

if kernel launch overhead does not really affect or involve SMs, then i can likely think little of it, from a SM utility perspective, not so?

Note that I stated that the basic kernel launch overhead is on the order of 5 microseconds, not 5 milliseconds. I have no specifics insights what contributes to that overhead, but from what I understand the delay is primarily due to hardware and not due to driver software. Because there are real-life applications that launch many small kernels and are therefore exposed to launch overhead, it is probably safe to assume the driver folks are paying attention to anything on the software side that contributes to launch overhead.

Note: If you are using Windows 7/8 with the default WDDM driver, you may see significantly larger overhead for individual launches as WDDM inherently has a lot of overhead. The CUDA driver tries to mitigate that by batching launches. This reduces the average kernel launch overhead but can increase the launch overhead for a particular launch.

The 5 microseconds I mentioned apply to Linux or Windows 7/8 with TCC driver (also the old Windows XP drivers, but I don’t think those are supported anymore by CUDA at this time?).

“Note that I stated that the basic kernel launch overhead is on the order of 5 microseconds, not 5 milliseconds”

5us, noted; my apologies - i merely read: “more than desired”

“I have no specifics insights what contributes to that overhead, but from what I understand the delay is primarily due to hardware and not due to driver software.”

care to speculate…?

do you think it would significantly draw in and involve SMs? the only way i can imagine it spilling over from queuing hardware to SMs is that the SMs should actually (start to) prepare to schedule the new task’s blocks (at the warp level): collecting the new task’s program such that it can be warp-scheduled. that indeed may lead to a SM execution delay; perhaps also conditional on the warp blocks the SM still has seated

I know that the basic launch overhead has been essentially unchanged for years, so my approach would be to simply adjust software architecture to this basic parameter if necessary. Personally, I prefer an experimental approach to pondering hypotheticals. I do not have sufficient insight into the hardware to speculate intelligently about the contributing factors for the launch overhead; presumably one component is the end-to-end transmission latency of the PCIe interface but I don’t know what that is.

GPUs are designed as throughput devices. In my experience, for the vast majority of use cases the asynchronous nature of kernel launches which allows overlap of GPU and CPU work, as well as the ability to overlap asynchronous host<->device copies with kernel execution will make the kernel launch overhead issue moot. The multiple input queues provided by modern GPUs should eliminate issues with false inter-stream dependencies that could occur due to the single-input queue design of older GPUs.