Very long kernel launch overhead on Jetson Orin NX

Hello, we are developing a video image processing application running on Jetson Orin NX. The processing kernels we have written are running with good throughput once they are started, but it takes a long time for them to get started after they are launched.

Here are some results from nsys profiling. It is a small test application intended to single out a specific algorithm:
• It is the only process running on the Orin NX, the GPU does nothing else. We are running in headless mode.
• The Orin NX is set to full performance with nvpmodel -m 0 / jetson_clocks.
• We run 100 iterations before the profiling starts to get it properly warmed up, then we sample 4 iterations.
• The main loop comprises seven kernel calls, with cudaDeviceSynchronize() after each kernel (each kernel depends on the output of the previous one). The CPU does essentially nothing between the kernel calls; a minimal sketch of this loop structure is included after this list.
• The first kernel in the loop (illumination_kernel) and the last kernel (final_kernel) are quite long; the five intermediate kernels are very small (four of them are NPP API calls to nppiFilterBoxBorder()).
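For reference, here is a minimal, compilable sketch of the launch/synchronize pattern in the loop. The kernel bodies, buffer sizes and grid/block dimensions are placeholders and not our real algorithm; only the structure matches what is described above:

#include <cuda_runtime.h>

__global__ void illumination_kernel(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 1.01f;                       // placeholder work
}
__global__ void var_a_b_kernel(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 0.5f;                        // placeholder work
}
__global__ void final_kernel(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] -= 0.25f;                       // placeholder work
}

int main() {
    const int n = 1920 * 1080;
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));
    dim3 block(256), grid((n + block.x - 1) / block.x);

    for (int iter = 0; iter < 100; ++iter) {
        illumination_kernel<<<grid, block>>>(d_buf, n);
        cudaDeviceSynchronize();

        // In the real application there are five small kernels here, four of
        // them nppiFilterBoxBorder() calls; var_a_b_kernel stands in for one.
        var_a_b_kernel<<<grid, block>>>(d_buf, n);
        cudaDeviceSynchronize();

        final_kernel<<<grid, block>>>(d_buf, n);
        cudaDeviceSynchronize();
    }

    cudaFree(d_buf);
    return 0;
}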

The nsys command line is:
$ nsys profile --capture-range=cudaProfilerApi --force-overwrite true -o gainTest ./gainTest
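Since the profile uses --capture-range=cudaProfilerApi, nsys only records between cudaProfilerStart() and cudaProfilerStop(). As a sketch of how the test program brackets the sampled iterations (run_pipeline() is a placeholder for one iteration of the seven-kernel loop, not our real code):

#include <cuda_profiler_api.h>
#include <cuda_runtime.h>

void run_pipeline() { /* placeholder: one iteration of the seven-kernel loop */ }

int main() {
    for (int i = 0; i < 100; ++i) run_pipeline();  // warm-up, not captured
    cudaProfilerStart();                           // nsys capture begins here
    for (int i = 0; i < 4; ++i) run_pipeline();    // the 4 sampled iterations
    cudaProfilerStop();                            // nsys capture ends here
    return 0;
}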

Here is a screenshot from nsys-ui showing one full iteration of the main loop:

If one looks at one of the smaller kernels (var_a_b_kernel), one can see that the average kernel execution time (KAvg) is 8.984us, but the API time (AAvg) is 108.880us and the queue time (QAvg) is 395.032us, which gives a total average (TAvg) of 512.896us.

To us this overhead seems very long, and we do not understand why. When we run this exact same code on earlier hardware (a Jetson Xavier NX), the TAvg is around 110us.
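To try to separate the API-side launch cost from what nsys reports, one thing we could do is time the launch call itself on the CPU. A minimal sketch of that cross-check (the kernel here is a dummy, not one of ours):

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void dummy_kernel(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    dim3 block(256), grid((n + block.x - 1) / block.x);

    for (int i = 0; i < 100; ++i)                  // warm up
        dummy_kernel<<<grid, block>>>(d, n);
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    dummy_kernel<<<grid, block>>>(d, n);           // launch only, asynchronous
    auto t1 = std::chrono::steady_clock::now();
    cudaDeviceSynchronize();

    printf("launch API time: %.1f us\n",
           std::chrono::duration<double, std::micro>(t1 - t0).count());
    cudaFree(d);
    return 0;
}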

Here is a zoomed version for the var_a_b_kernel:

We do not really know how to investigate the cause of the long overhead. Any suggestions are very much appreciated. Please let me know if I should provide more details.

NVIDIA’s embedded products are sufficiently different from the discrete-GPU setups most generic CUDA users have experience with that questions about them receive faster / better / more numerous answers in the forums dedicated to them. In this case (this is a clickable link):

Thanks, I will repost in that forum.