Very long kernel launch overhead on Jetson Orin NX

Hello, we are developing a video image processing application running on Jetson Orin NX. The processing kernels we have written are running with good throughput once they are started, but it takes a long time for them to get started after they are launched.

Here are some results from nsys profiling. It is a small test application intended to single out a specific algorithm:
• It is the only process running on the Orin NX, the GPU does nothing else. We are running in headless mode.
• The Orin NX is set to full performance with nvpmodel -m 0 / jetson_clocks.
• We run 100 iterations before the profiling starts to get it properly warmed up, then we sample 4 iterations.
• The main loop comprises seven kernel calls, with cudaDeviceSynchronize() after each kernel (each kernel depends on the output of the previous one). The CPU does essentially nothing between the kernel calls; a minimal sketch of this loop structure is included after this list.
• The first kernel in the loop (illumination_kernel) and the last kernel (final_kernel) are quite long; the five intermediate kernels are very small (four of them are NPP API calls to nppiFilterBoxBorder()).
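For reference, here is a minimal, compilable sketch of the launch/synchronize pattern in the loop. The kernel bodies, buffer sizes and grid/block dimensions are placeholders and not our real algorithm; only the structure matches what is described above:

#include <cuda_runtime.h>

__global__ void illumination_kernel(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 1.01f;                       // placeholder work
}
__global__ void var_a_b_kernel(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 0.5f;                        // placeholder work
}
__global__ void final_kernel(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] -= 0.25f;                       // placeholder work
}

int main() {
    const int n = 1920 * 1080;
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));
    dim3 block(256), grid((n + block.x - 1) / block.x);

    for (int iter = 0; iter < 100; ++iter) {
        illumination_kernel<<<grid, block>>>(d_buf, n);
        cudaDeviceSynchronize();

        // In the real application there are five small kernels here, four of
        // them nppiFilterBoxBorder() calls; var_a_b_kernel stands in for one.
        var_a_b_kernel<<<grid, block>>>(d_buf, n);
        cudaDeviceSynchronize();

        final_kernel<<<grid, block>>>(d_buf, n);
        cudaDeviceSynchronize();
    }

    cudaFree(d_buf);
    return 0;
}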

The nsys command line is:
$ nsys profile --capture-range=cudaProfilerApi --force-overwrite true -o gainTest ./gainTest
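Since the profile uses --capture-range=cudaProfilerApi, nsys only records between cudaProfilerStart() and cudaProfilerStop(). As a sketch of how the test program brackets the sampled iterations (run_pipeline() is a placeholder for one iteration of the seven-kernel loop, not our real code):

#include <cuda_profiler_api.h>
#include <cuda_runtime.h>

void run_pipeline() { /* placeholder: one iteration of the seven-kernel loop */ }

int main() {
    for (int i = 0; i < 100; ++i) run_pipeline();  // warm-up, not captured
    cudaProfilerStart();                           // nsys capture begins here
    for (int i = 0; i < 4; ++i) run_pipeline();    // the 4 sampled iterations
    cudaProfilerStop();                            // nsys capture ends here
    return 0;
}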

Here is a screenshot from nsys-ui showing one full iteration of the main loop:

If one looks at one of the smaller kernels (var_a_b_kernel), one can see that the average kernel execution time (KAvg) is 8.984us, but the API time (AAvg) is 108.880us and the queue time (QAvg) is 395.032us, which gives a total average (TAvg) of 512.896us.

To us this overhead seems very long, and we do not understand why. When we run this exact same code on earlier hardware (a Jetson Xavier NX), the TAvg is around 110us.
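To try to separate the API-side launch cost from what nsys reports, one thing we could do is time the launch call itself on the CPU. A minimal sketch of that cross-check (the kernel here is a dummy, not one of ours):

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void dummy_kernel(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    dim3 block(256), grid((n + block.x - 1) / block.x);

    for (int i = 0; i < 100; ++i)                  // warm up
        dummy_kernel<<<grid, block>>>(d, n);
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    dummy_kernel<<<grid, block>>>(d, n);           // launch only, asynchronous
    auto t1 = std::chrono::steady_clock::now();
    cudaDeviceSynchronize();

    printf("launch API time: %.1f us\n",
           std::chrono::duration<double, std::micro>(t1 - t0).count());
    cudaFree(d);
    return 0;
}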

Here is a zoomed version for the var_a_b_kernel:

We do not really know how to investigate the cause of the long overhead. Any suggestions are very much appreciated. Please let me know if I should provide more details.

NVIDIA’s embedded products are sufficiently different from the discrete-GPU setups most generic CUDA users have experience with that questions about them receive faster / better / more numerous answers in the forums dedicated to them. In this case (this is a clickable link):

Thanks, I will repost in that forum.