Concurrent kernel/optix launch

So this might be a generic CUDA question, but my understanding is that a necessary condition for concurrent kernel execution is that there must be sufficient resources for the different kernels. Does that mean the most aggressive code generation for each kernel might not be optimal, in that each kernel could consume so many resources (e.g., registers) that concurrent kernel execution is prevented?

I am asking this because I have two back-to-back optixLaunch calls in two streams that I was expecting to be executed concurrently. From Nsight, I can see that the first launch has a grid size of [90, 1, 1] and a block size of [64, 1, 1], and the second launch has a grid size of [85, 1, 1] and a block size of [64, 1, 1]. So by all accounts they seem to be small kernels. But both kernels do consume 106 registers/thread as reported in Nsight, which I suspect is what’s preventing the two kernels from being executed concurrently.

Is this a correct understanding?

For concurrent kernel execution in OptiX, you must have a separate pipeline for each kernel; otherwise, if there is only one pipeline, the kernels will execute serially.

The register count per thread isn’t very likely to affect your ability to have concurrent kernel launches. The number of registers affects your occupancy on the CUDA cores, but doesn’t affect what happens when there are free CUDA cores waiting for work.

That said, CUDA scheduling doesn’t guarantee that concurrently launched kernels will actually execute in parallel. It’s common for the first kernel to occupy all of the cores (if you launched more threads than cores) until it’s near completion, before starting to execute work in a queued up kernel.

Are your small kernels expected to execute for long enough that you would be able to see concurrent execution? With kernels that small, if the workload is light, the first kernel launched may easily finish before the second kernel is even scheduled. The launch-time overhead depends on the CPU, PCIe version, GPU model, etc., but is normally in the tens of microseconds range.
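One way to check is to time a single launch with CUDA events (a sketch only; `pipeline`, `sbt`, `d_params`, and `Params` are placeholder names for resources built elsewhere):

```cpp
#include <cuda_runtime.h>
#include <optix.h>

// Times one launch on its stream to judge whether it runs long enough
// for any overlap to be visible at all.
float timeLaunchMs(OptixPipeline pipeline, cudaStream_t stream,
                   CUdeviceptr d_params, size_t paramsSize,
                   const OptixShaderBindingTable& sbt)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    // grid [90,1,1] x block [64,1,1] corresponds to 5760 total threads
    optixLaunch(pipeline, stream, d_params, paramsSize, &sbt, 90 * 64, 1, 1);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;  // if this is well under ~0.1 ms, launch overhead can hide any overlap
}
```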


Oh, that’s great to know. Thank you. I assume the kernels still have to be put in different streams, in addition to being in different pipelines? Also, do the different pipelines have to be in different contexts?

Yeah, correct, you’ll need separate streams as well. Kernels (and CUDA API calls) issued on a single stream will automatically serialize. You do not need separate contexts.
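Putting the two answers together, a minimal sketch looks like the following (all names here are placeholders; the pipelines, shader binding tables, and device-side `Params` buffers are assumed to have been built already, and error checking is omitted):

```cpp
#include <cuda_runtime.h>
#include <optix.h>

void launchConcurrently(OptixPipeline pipelineA, OptixPipeline pipelineB,
                        const OptixShaderBindingTable& sbtA,
                        const OptixShaderBindingTable& sbtB,
                        CUdeviceptr d_paramsA, CUdeviceptr d_paramsB,
                        size_t paramsSize)
{
    // Separate streams: launches on the same stream would serialize.
    cudaStream_t streamA, streamB;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);

    // Separate pipelines on separate streams, same context.
    // optixLaunch takes total launch dimensions (width * height * depth);
    // grid [90,1,1] x block [64,1,1] corresponds to 5760 threads.
    optixLaunch(pipelineA, streamA, d_paramsA, paramsSize, &sbtA, 90 * 64, 1, 1);
    optixLaunch(pipelineB, streamB, d_paramsB, paramsSize, &sbtB, 85 * 64, 1, 1);

    cudaStreamSynchronize(streamA);
    cudaStreamSynchronize(streamB);
    cudaStreamDestroy(streamA);
    cudaStreamDestroy(streamB);
}
```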


Explained here as well:

Thanks. Yeah, I was able to put different kernels into different pipelines, and they can be overlapped. One question: does creating and managing OptiX pipelines have a high overhead? One strategy in conventional CUDA programming seems to be oversubscribing streams to give kernels a chance to overlap and let the runtime system do the scheduling. Does that apply to OptiX pipelines?

Generally speaking, yes, the OptiX strategies are similar to CUDA strategies, and you can gain some performance by queueing up extra work in advance. This is especially true if you have gaps between serial kernel launches, dependencies on CPU-GPU I/O, or kernels with long-tail behavior where some threads last much longer than others. If unsure, you can identify whether you have GPU idle time, such as gaps between kernels, using Nsight Systems. In typical scenarios, 2-4 concurrent kernel launches is enough to fully saturate the GPU, and any more than that is not likely to help.
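The oversubscription idea can be sketched as a small round-robin pool of streams with one pipeline per slot (again, all names are placeholders for resources created elsewhere, and a pool size of 2-4 follows the saturation point mentioned above):

```cpp
#include <cuda_runtime.h>
#include <optix.h>

constexpr int kPoolSize = 4;  // 2-4 concurrent launches is usually enough

void launchRoundRobin(OptixPipeline pipelines[kPoolSize],
                      cudaStream_t streams[kPoolSize],
                      const OptixShaderBindingTable sbts[kPoolSize],
                      CUdeviceptr d_params[kPoolSize], size_t paramsSize,
                      int numLaunches, unsigned launchWidth)
{
    // Cycle launches over the pool so queued work has a chance to overlap;
    // the runtime scheduler fills idle SMs from whichever stream has work.
    for (int i = 0; i < numLaunches; ++i) {
        const int slot = i % kPoolSize;
        optixLaunch(pipelines[slot], streams[slot],
                    d_params[slot], paramsSize, &sbts[slot],
                    launchWidth, 1, 1);
    }
    cudaDeviceSynchronize();
}
```

Note that each slot needs its own `Params` buffer, since a launch may still be reading its buffer when the slot comes around again; synchronizing the slot's stream before reusing its buffer avoids that hazard.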