Concurrent kernel/optix launch

boringboringarsenal · July 14, 2021, 9:24pm

So this might be a generic CUDA question, but my understanding is that a necessary condition for concurrent kernel execution is that there must be sufficient resources for both kernels at once. Does that mean the most aggressive code generation for each kernel might not be optimal, in that each kernel would consume so many resources (e.g., registers) that concurrent kernel execution is prevented?

I am asking this because I have two back-to-back optixLaunch calls in two streams that I was expecting to execute concurrently. From Nsight, I can see that the first launch has a grid size of [90, 1, 1] and a block size of [64, 1, 1], and the second launch has a grid size of [85, 1, 1] and a block size of [64, 1, 1]. So by all accounts they seem to be small kernels. But both kernels consume 106 registers per thread as reported in Nsight, which I suspect is what’s preventing the two kernels from executing concurrently.

Is this a correct understanding?

dhart · July 14, 2021, 9:44pm

For concurrent kernel execution in OptiX, you must have a separate pipeline for each kernel. If there is only one pipeline, the kernels will execute serially.

The number of registers per thread isn’t very likely to affect your ability to have concurrent kernel launches. It affects your occupancy on the CUDA cores, but doesn’t affect what happens when there are free CUDA cores waiting for work.
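As a rough sanity check of the occupancy point, here is a minimal C++ sketch that estimates how many 64-thread blocks at 106 registers per thread could be resident on one SM. The per-SM limits (65,536 registers, 1,024 resident threads) are assumptions typical of recent NVIDIA GPUs, not values from this thread, and `residentBlocksPerSM` and `smsNeeded` are hypothetical helpers. The estimate ignores register allocation granularity, shared memory, and per-SM block-count limits, so treat the result as an upper bound.

```cpp
#include <algorithm>
#include <cassert>

// Assumed per-SM limits; query your own device for the real values.
constexpr int kRegistersPerSM  = 65536;
constexpr int kMaxThreadsPerSM = 1024;

// Blocks that can be resident on one SM when only registers and
// thread slots are considered (a simplified occupancy estimate).
int residentBlocksPerSM(int regsPerThread, int threadsPerBlock) {
    assert(regsPerThread > 0 && threadsPerBlock > 0);
    const int byRegisters = kRegistersPerSM / (regsPerThread * threadsPerBlock);
    const int byThreads   = kMaxThreadsPerSM / threadsPerBlock;
    return std::min(byRegisters, byThreads);
}

// SMs required to hold totalBlocks at once (ceiling division).
int smsNeeded(int totalBlocks, int blocksPerSM) {
    assert(totalBlocks >= 0 && blocksPerSM > 0);
    return (totalBlocks + blocksPerSM - 1) / blocksPerSM;
}
```

Under these assumed limits, `residentBlocksPerSM(106, 64)` evaluates to 9, and `smsNeeded(90 + 85, 9)` evaluates to 20. In other words, the 175 blocks from both launches would fit at once on a GPU with 20 or more SMs, which is consistent with register count not being the obstacle here.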

That said, CUDA scheduling doesn’t guarantee that concurrently launched kernels will actually execute in parallel. It’s common for the first kernel to occupy all of the cores (if you launched more threads than cores) until it’s near completion, before starting to execute work in a queued up kernel.

Are your small kernels expected to execute for long enough that you would be able to see concurrent execution? With kernels that small, if the workload is light, the first kernel launched may easily finish before the second kernel is even scheduled. The launch time overhead depends on CPU, PCI version, GPU model, etc, but is normally in the tens of microseconds range.
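To make the timing argument concrete, here is a small C++ sketch of a best-case timeline model. It assumes kernel A starts at time 0 and kernel B starts exactly one launch gap later; a real scheduler may delay B further, so the result is an upper bound on the observable overlap. `overlapUs` is a hypothetical helper, and the durations used below are made-up illustration values, not measurements.

```cpp
#include <algorithm>
#include <cassert>

// Time in microseconds during which both kernels are running, assuming
// kernel A starts at t = 0 and kernel B starts launchGapUs later.
double overlapUs(double durationAUs, double durationBUs, double launchGapUs) {
    assert(durationAUs >= 0.0 && durationBUs >= 0.0 && launchGapUs >= 0.0);
    const double endA   = durationAUs;
    const double startB = launchGapUs;
    const double endB   = launchGapUs + durationBUs;
    // The overlap is the part of B's interval that begins before A ends.
    return std::max(0.0, std::min(endA, endB) - startB);
}
```

With a 20 microsecond launch gap, two 15 microsecond kernels give `overlapUs(15, 15, 20) == 0`, so no overlap can show up in a profile, while two 500 microsecond kernels give `overlapUs(500, 500, 20) == 480`.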

–
David.

boringboringarsenal · July 14, 2021, 10:09pm

Oh, that’s great to know. Thank you. I assume the kernels still have to be put in different streams, in addition to being in different pipelines? Also, do the different pipelines have to be in different contexts?

dhart · July 14, 2021, 10:20pm

Yeah, correct, you’ll need separate streams as well. Kernels (and CUDA API calls) issued on a single stream will automatically serialize. You do not need separate contexts.

–
David.

droettger · July 15, 2021, 6:45am

Explained here as well:

https://raytracing-docs.nvidia.com/optix7/guide/index.html#ray_generation_launches#ray-generation-launches

https://forums.developer.nvidia.com/t/optix7-0-could-i-use-two-streams-for-two-optixlaunch-operation-in-two-threads-for-speed-optimize/156764/2

boringboringarsenal · July 15, 2021, 12:35pm

Thanks. Yeah, I was able to put different kernels into different pipelines and they can be overlapped. One question: is creating and managing OptiX pipelines high overhead? One strategy in conventional CUDA programming seems to be oversubscribing streams to give kernels a chance to overlap and let the runtime system do the scheduling. Does that apply to OptiX pipelines?

dhart · July 15, 2021, 2:31pm

Generally speaking, yes, the OptiX strategies are similar to CUDA strategies, and you can gain some performance by queueing up extra work in advance. This is especially true if you have gaps in between serial kernel launches, dependencies on CPU-GPU I/O, or if you have kernels with long-tail behavior where some threads can last much longer than others. If unsure, you can use Nsight Systems to identify whether you have GPU idle time, such as gaps between kernels. In typical scenarios, 2-4 concurrent kernel launches are enough to fully saturate the GPU, and any more than that is not likely to help.
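One way to apply the "2-4 concurrent launches" guideline is to keep a small fixed pool of slots and hand out launches round-robin. The C++ sketch below shows only that bookkeeping. It assumes each slot stands for one (pipeline, stream) pair created up front, since launches sharing a pipeline or a stream serialize. `assignSlots` is a hypothetical helper, not part of OptiX or CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Map launch i to slot i % slotCount. Launches in the same slot share a
// pipeline and a stream and therefore serialize; launches in different
// slots are the ones that may overlap.
std::vector<std::size_t> assignSlots(std::size_t launchCount, std::size_t slotCount) {
    assert(slotCount > 0);
    std::vector<std::size_t> slots(launchCount);
    for (std::size_t i = 0; i < launchCount; ++i) {
        slots[i] = i % slotCount;
    }
    return slots;
}
```

For example, `assignSlots(5, 3)` yields the slot sequence 0, 1, 2, 0, 1, so at most three launches are in flight on distinct pipelines and streams at any time.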

–
David.
