Unstable CUDA timing on Jetson AGX Orin compared to Windows GPU

Hi everyone,
I am new to CUDA and Jetson. I have been doing some exercises to better understand how the whole thing behaves in practice.

I wrote a simple C/CUDA program that computes 256 FFTs of 1024 complex samples.
I use pinned host memory and repeat the operation 1000 times, discarding the first 100 iterations. At each iteration I execute 256 batched FFTs of 1024 complex samples using cuFFT (batch = 256). The code structure looks like the following:

fill source buffer with some data

for i = 0 to 1000
{
    copy data from Host to Device (H2D)
    compute FFT using cufftExecC2C
    copy data from Device to Host (D2H)
}

Using CUDA events, I measure:

  • H2D copy time

  • FFT execution time

  • D2H copy time

  • total time per iteration

I then ran the exact same code on my laptop and on a Jetson AGX Orin, and the results are very different.

Test configurations:
System 1: Laptop + Windows 10 + NVIDIA RTX A2000
System 2: Jetson AGX Orin + Jetson Linux R35 + Ubuntu 20.04.6 LTS + CUDA 11.4

Results on Windows (1000 iterations)

This is a partial output from my program:

…
H2D: 0.198816 ms, FFT: 0.030560 ms, D2H: 0.184608 ms, TOT: 0.413984 ms
H2D: 0.305504 ms, FFT: 0.027072 ms, D2H: 0.204800 ms, TOT: 0.537376 ms
H2D: 0.217920 ms, FFT: 0.025760 ms, D2H: 0.186112 ms, TOT: 0.429792 ms
H2D: 0.230336 ms, FFT: 0.026848 ms, D2H: 0.184800 ms, TOT: 0.441984 ms
H2D: 0.214368 ms, FFT: 0.027136 ms, D2H: 0.185472 ms, TOT: 0.426976 ms
H2D: 0.211008 ms, FFT: 0.026400 ms, D2H: 0.184896 ms, TOT: 0.422304 ms
H2D: 0.206048 ms, FFT: 0.025696 ms, D2H: 0.186016 ms, TOT: 0.417760 ms
H2D: 0.205152 ms, FFT: 0.026176 ms, D2H: 0.200416 ms, TOT: 0.431744 ms
H2D: 0.205664 ms, FFT: 0.026848 ms, D2H: 0.200960 ms, TOT: 0.433472 ms
H2D: 0.249184 ms, FFT: 0.036096 ms, D2H: 0.185632 ms, TOT: 0.470912 ms
H2D: 0.306688 ms, FFT: 0.025920 ms, D2H: 0.200992 ms, TOT: 0.533600 ms
H2D: 0.216608 ms, FFT: 0.027136 ms, D2H: 0.185056 ms, TOT: 0.428800 ms
…

H2D over last 1000 samples:
min = 0.198816 ms
mean = 0.220522 ms
max = 0.373600 ms

FFT over last 1000 samples:
min = 0.018720 ms
mean = 0.023816 ms
max = 0.061504 ms

D2H over last 1000 samples:
min = 0.179680 ms
mean = 0.187218 ms
max = 0.359936 ms

TOT over last 1000 samples:
min = 0.406240 ms
mean = 0.431556 ms
max = 0.723488 ms

Timing is quite stable and predictable.

Results on Jetson AGX Orin (1000 iterations)

Before running the test:

  1. Power mode set to MAXN

  2. Ran sudo jetson_clocks

Output:

H2D: 0.404704 ms, FFT: 0.030976 ms, D2H: 0.096416 ms, TOT: 0.532096 ms
H2D: 0.416032 ms, FFT: 1.371104 ms, D2H: 0.090624 ms, TOT: 1.877760 ms
H2D: 0.102432 ms, FFT: 0.026784 ms, D2H: 0.097472 ms, TOT: 0.226688 ms
H2D: 0.102400 ms, FFT: 0.026688 ms, D2H: 0.096960 ms, TOT: 0.226048 ms
H2D: 0.100768 ms, FFT: 0.024896 ms, D2H: 0.097568 ms, TOT: 0.223232 ms
H2D: 0.100608 ms, FFT: 0.025248 ms, D2H: 0.094848 ms, TOT: 0.220704 ms
H2D: 0.100896 ms, FFT: 0.025184 ms, D2H: 0.099168 ms, TOT: 0.225248 ms
H2D: 0.100448 ms, FFT: 0.025824 ms, D2H: 0.088064 ms, TOT: 0.214336 ms
H2D: 0.100640 ms, FFT: 0.027296 ms, D2H: 0.096544 ms, TOT: 0.224480 ms
H2D: 1.025632 ms, FFT: 0.025280 ms, D2H: 0.097728 ms, TOT: 1.148640 ms
H2D: 0.410688 ms, FFT: 0.024192 ms, D2H: 0.087328 ms, TOT: 0.522208 ms
H2D: 0.412448 ms, FFT: 0.025088 ms, D2H: 0.096032 ms, TOT: 0.533568 ms
H2D: 0.405408 ms, FFT: 0.024928 ms, D2H: 0.096736 ms, TOT: 0.527072 ms
H2D: 0.406368 ms, FFT: 0.024384 ms, D2H: 0.096064 ms, TOT: 0.526816 ms
H2D: 0.410208 ms, FFT: 1.406528 ms, D2H: 0.354144 ms, TOT: 2.170880 ms
H2D: 0.406048 ms, FFT: 0.024576 ms, D2H: 0.087072 ms, TOT: 0.517696 ms
H2D: 0.409952 ms, FFT: 0.024320 ms, D2H: 0.095840 ms, TOT: 0.530112 ms
H2D: 0.403712 ms, FFT: 0.024416 ms, D2H: 0.086720 ms, TOT: 0.514848 ms
H2D: 0.406080 ms, FFT: 1.375776 ms, D2H: 0.089536 ms, TOT: 1.871392 ms
H2D: 0.101344 ms, FFT: 0.028064 ms, D2H: 0.097344 ms, TOT: 0.226752 ms
H2D: 0.100000 ms, FFT: 0.024768 ms, D2H: 0.088224 ms, TOT: 0.212992 ms
H2D: 0.099264 ms, FFT: 0.027168 ms, D2H: 0.097824 ms, TOT: 0.224256 ms
H2D: 0.099136 ms, FFT: 0.026720 ms, D2H: 0.097312 ms, TOT: 0.223168 ms
H2D: 0.101344 ms, FFT: 0.025696 ms, D2H: 0.096800 ms, TOT: 0.223840 ms
H2D: 0.100896 ms, FFT: 0.024960 ms, D2H: 0.089856 ms, TOT: 0.215712 ms
H2D: 0.101760 ms, FFT: 0.025280 ms, D2H: 0.096864 ms, TOT: 0.223904 ms
H2D: 0.100320 ms, FFT: 0.025120 ms, D2H: 0.096672 ms, TOT: 0.222112 ms
H2D: 0.099488 ms, FFT: 0.025344 ms, D2H: 0.096096 ms, TOT: 0.220928 ms
H2D: 0.100384 ms, FFT: 0.024544 ms, D2H: 0.097856 ms, TOT: 0.222784 ms
H2D: 0.098720 ms, FFT: 0.025184 ms, D2H: 0.096384 ms, TOT: 0.220288 ms
H2D: 0.103328 ms, FFT: 0.026592 ms, D2H: 0.096672 ms, TOT: 0.226592 ms
H2D: 0.894496 ms, FFT: 0.025920 ms, D2H: 0.096416 ms, TOT: 1.016832 ms
H2D: 0.409568 ms, FFT: 0.024320 ms, D2H: 0.088320 ms, TOT: 0.522208 ms
H2D: 0.409536 ms, FFT: 0.024448 ms, D2H: 0.087392 ms, TOT: 0.521376 ms
H2D: 0.411200 ms, FFT: 0.025568 ms, D2H: 0.096896 ms, TOT: 0.533664 ms
H2D: 0.403136 ms, FFT: 0.025440 ms, D2H: 0.095680 ms, TOT: 0.524256 ms
H2D: 0.405376 ms, FFT: 0.026144 ms, D2H: 0.089312 ms, TOT: 0.520832 ms

H2D over last 1000 samples:
min = 0.098336 ms
mean = 0.346365 ms
max = 1.552352 ms

FFT over last 1000 samples:
min = 0.023328 ms
mean = 0.056104 ms
max = 2.405088 ms

D2H over last 1000 samples:
min = 0.086208 ms
mean = 0.128372 ms
max = 2.453664 ms

TOT over last 1000 samples:
min = 0.211200 ms
mean = 0.530842 ms
max = 3.456896 ms

As you can see, on Jetson the timing is extremely unstable.
For example, the total time per iteration ranges from ~0.2 ms up to more than 3.4 ms, with a mean around 0.53 ms.
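
Min, mean, and max alone do not show how often the spikes occur. A minimal sketch in plain C, assuming the per-iteration times (in ms) are already stored in a float array; `percentile_ms` and `count_spikes` are hypothetical helper names, not part of my original program, and the choice of threshold is up to the caller:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparison callback for floats (ascending). */
static int cmp_float(const void *a, const void *b)
{
    float x = *(const float *)a;
    float y = *(const float *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile of n timing samples, p in (0, 100].
 * Works on a sorted copy so the caller's array is left untouched.
 * Returns -1.0f if the temporary buffer cannot be allocated. */
float percentile_ms(const float *samples, size_t n, double p)
{
    assert(samples != NULL && n > 0 && p > 0.0 && p <= 100.0);

    float *sorted = (float *)malloc(n * sizeof *sorted);
    if (sorted == NULL)
        return -1.0f;
    memcpy(sorted, samples, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_float);

    /* Nearest rank: smallest 1-based rank with rank >= p/100 * n. */
    double pos = p * (double)n / 100.0;
    size_t rank = (size_t)pos;
    if ((double)rank < pos)
        rank++;
    if (rank < 1)
        rank = 1;
    if (rank > n)
        rank = n;

    float result = sorted[rank - 1];
    free(sorted);
    return result;
}

/* Number of samples strictly above threshold_ms. */
size_t count_spikes(const float *samples, size_t n, float threshold_ms)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (samples[i] > threshold_ms)
            count++;
    return count;
}
```

Applied to the TOT column of the first ten Jetson lines above with an arbitrary 1.0 ms threshold, `count_spikes` reports 2 samples, while the 50th percentile stays at 0.225248 ms. Reporting a high percentile next to the max makes it easier to compare runs with and without the spikes.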

I know that neither Windows nor Jetson Linux is a real-time operating system, but while the timing jitter on Windows is acceptable for my use case, the jitter on the Jetson is not.

My questions:

  1. Why is there such a large difference in timing stability between these two systems?

  2. What causes these large spikes on Jetson (especially on FFT and memcpy)?

  3. What can I do to make Jetson behave more like a soft real-time environment?

  4. Would installing the NVIDIA real-time kernel actually help in this case? (See "Installing real time kernel".)

  5. What other steps can I take to reduce jitter on Jetson?

Final note: I also tried increasing the workload further (for example 1024 FFTs of 1024 samples, 1024 x 2048, 2048 x 2048, and other batch/size combinations), and the behavior is the same.

Any help or insight would be greatly appreciated.

Thank you.

Hi,

Pinned memory is zero-copy memory, so you don’t need to copy the data manually.
The memory is already visible to the GPU.

Could you try our cuFFT sample below to see if the same behavior occurs?

Also, could you run a test with CUDA device memory and explicit memory copies?
This can help us know whether the jitter comes from the FFT or the memory.

Thanks.

Hi,

I have found a solution to the issue, but I am not sure whether it is the “correct” approach. In any case, the solution raises some additional questions. Let me go step by step.

In my code I use:

cudaHostAlloc(&hBuf, bytes, cudaHostAllocDefault);

In this case, as far as I understand, the buffer is pinned but not mapped, so I still need to use cudaMemcpy.

I built and ran the provided cuFFT sample. I modified it slightly by adding CUDA events and increasing the FFT size. The behavior is the same: I still observe large timing jitter.

If I understood correctly, the idea was to isolate the FFT from the memory transfers. I removed the cufftExec calls entirely and tested a loop that only performs cudaMemcpy (H2D and D2H). I still observe the same timing instability.

Now about the solution I found: When running the Jetson in headless mode, the jitter disappears and the timing becomes very stable.

By headless mode I mean either:

  • Powering on the Jetson with a monitor connected but not logging in locally, and instead accessing it via SSH
    or

  • Logging in locally and then stopping the GUI with
    sudo systemctl stop gdm3, and continuing via SSH

In both cases, the timing becomes stable.

This suggests that the graphical environment is interfering with GPU execution. It appears that while my application is running, the GUI/compositor is also using GPU and memory resources, causing contention and timing spikes.

So my questions are:

  1. Is this expected behavior on Jetson platforms?

  2. Is headless operation the recommended approach for deterministic timing?

  3. For performance-sensitive applications, is running without an active graphical session considered best practice?

Hi,

Using Nsight, I have identified what is happening. As you can see in the screenshots below, there is sometimes a significant gap between the CUDA API call that launches vector_FFT and the actual execution of vector_FFT on the GPU.

Under normal conditions, the launch latency is around 8 µs, but occasionally it increases to about 1 ms.

What could be causing this behavior? What is the GPU doing during that time?

Thank you

Hi,

Could you share the sample that you used for Nsight profiling with us?
(the modified cuFFT sample?)

We need to reproduce this locally to check it further.

Thanks.

simpleCUFFT_Mod.txt (9.9 KB)

Attached is the source code. Please rename the file from .txt to .cu.

Run the code until you see something like:

min: 0.05796800, max: 1.10243800 (times are in milliseconds)

Usually, the issue appears immediately, or you have to run it 2 or 3 times.

Hi,

We tested the sample on AGX Orin with JetPack 6.2.2.
We ran the app more than 20 times, and the min times are all around 0.028 ms:

$ ./simpleCUFFT 
[simpleCUFFT_Mod] is starting...
GPU Device 0: "Ampere" with compute capability 8.7

new_size: 1024
Transforming signal cufftExecC2C
Warm-up...50 
numCycles = 100000 
min: 0.02812800, max: 0.10124800

Thanks.

Hi,

I’ve upgraded my Jetson AGX Orin to JetPack 6.2.2, so I assume we have the same hardware, software, and drivers.

The command cat /etc/nv_tegra_release returns the following:

R36 (release), REVISION: 5.0, GCID: 43688277, BOARD: generic, EABI: aarch64, DATE: Fri Jan 16 03:50:45 UTC 2026

KERNEL_VARIANT: oot

TARGET_USERSPACE_LIB_DIR=nvidia
TARGET_USERSPACE_LIB_DIR_PATH=usr/lib/aarch64-linux-gnu/nvidia

Before running the app, I run jetson_clocks and set nvpmodel 0. That said, this is the output when I run the app:

[simpleCUFFT_Mod] is starting... 
GPU Device 0: "Ampere" with compute capability 8.7 

new_size: 1024 
Transforming signal cufftExecC2C 
Warm-up...50 
numCycles = 100000 
min: 0.02681600, max: 3.45811200

As you can see, the min value is more or less the same, but my max value is much higher.

How is this possible? We seem to be running the same application, and everything else appears to match, but I am experiencing these high peaks while they do not appear on your device.

Are there any additional checks I should perform on my system (for example GPU clocks, power modes, background services, or CUDA configuration) to understand why these peaks appear on my device but not on yours?

Also, how are you compiling the application? Are you using any particular compilation flags? This is my compile command:

nvcc simpleCUFFT.cu -I"PATH/TO/cuda-samples/Common" -o simpleCUFFT -lcufft

Hi,

Do you have other tasks running at the same time?
Also, does the issue reproduce only some of the time, or does it happen every time you run the app?

After setting the power mode on the device, the min/max range is even tighter.

[simpleCUFFT_Mod] is starting...
GPU Device 0: "Ampere" with compute capability 8.7

new_size: 1024
Transforming signal cufftExecC2C
Warm-up...50 
numCycles = 100000 
min: 0.02736000, max: 0.03977600

We checked out the v12.5 branch of cuda-samples, overwrote Samples/4_CUDA_Libraries/simpleCUFFT with your source, and compiled with make directly.

Thanks

Hi

I don’t have other tasks running, and I see the issue every time I run the app.

I rebuilt everything using the CUDA samples Makefile but I still observe high latency spikes.

More importantly, I verified that the issue is not related to cuFFT at all. I replaced the FFT call with an empty fake kernel, and the spikes are still present with the same behavior.

So it seems the problem is not in the FFT implementation but somewhere else in the system or in the CUDA launch/runtime.

Hi,

How did you set up the environment on your device?
Did you flash it with SDK Manager?

Also, are you using the AGX Orin 32GB or 64GB?
Our results were generated with a 32GB AGX Orin device.

Thanks.

Hi,

I’m using the AGX Orin 64GB Developer Kit.

The board originally came with Jetson Linux R35, and I was already observing the latency spikes there. Then I reflashed it using SDK Manager and upgraded to JetPack 6.2.2, and the spike issue is still present.

Hi,

Just to double-check:
Did you flash r36.5 (JetPack 6.2.2) directly?
Or did you flash an earlier r36 BSP and then upgrade to r36.5 via OTA (apt command)?

Thanks.

Yes, I flashed directly!

Hi,

Thanks for your update.
We will verify this issue on a 64GB kit and update.

Thanks.

Hi,

We tested on a 64GB device with r36.5 but were still not able to reproduce the issue.
Could you share the profiler data with us so we can check it further?

Thanks.

Hi

Attached is the report

report1.zip (6.4 MB)

Hi,

Thanks for your help.

We can reproduce this issue locally with a display connected.
Previously, we tested in headless mode, where the execution time is quite stable.
But once a display is connected (on either the 32GB or 64GB device), the issue is easily reproduced.

We are checking this issue with our internal team and will get back to you later.
Thanks.

Hi,

Just a small clarification: you can reproduce the issue even in headless mode, but it is less likely to occur. You need to increase the numCycles variable from 100,000 to a much larger value, for example 200,000,000, and then you should be able to observe the problem.
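
With a cycle count in the hundreds of millions, storing every sample is impractical, and min/max alone do not show how rare the spikes are. A minimal sketch in plain C of a constant-memory histogram, assuming each iteration's elapsed time in milliseconds is already available as a float (for example from cudaEventElapsedTime). The type `latency_hist`, its functions, and the bucket layout are hypothetical choices for illustration and are not part of simpleCUFFT_Mod:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BUCKETS 16

/* Fixed-size latency histogram; bucket k counts samples in
 * [base_ms * 2^k, base_ms * 2^(k+1)). The first bucket also takes
 * everything below base_ms and the last bucket everything above. */
typedef struct {
    float    base_ms;
    uint64_t counts[NUM_BUCKETS];
    uint64_t total;
    float    min_ms;
    float    max_ms;
} latency_hist;

void hist_init(latency_hist *h, float base_ms)
{
    assert(h != NULL && base_ms > 0.0f);
    h->base_ms = base_ms;
    h->total = 0;
    h->min_ms = 0.0f;
    h->max_ms = 0.0f;
    for (size_t k = 0; k < NUM_BUCKETS; k++)
        h->counts[k] = 0;
}

/* Index of the bucket a sample falls into. */
size_t hist_bucket(const latency_hist *h, float ms)
{
    size_t k = 0;
    float upper = h->base_ms * 2.0f;   /* exclusive upper edge of bucket 0 */
    while (k < NUM_BUCKETS - 1 && ms >= upper) {
        upper *= 2.0f;
        k++;
    }
    return k;
}

/* Record one sample without storing it: memory use stays constant
 * no matter how many cycles are measured. */
void hist_add(latency_hist *h, float ms)
{
    if (h->total == 0 || ms < h->min_ms) h->min_ms = ms;
    if (h->total == 0 || ms > h->max_ms) h->max_ms = ms;
    h->counts[hist_bucket(h, ms)]++;
    h->total++;
}
```

Calling `hist_init(&h, 0.02f)` once and `hist_add(&h, elapsed_ms)` in the measurement loop would, under these assumptions, put the typical ~0.027 ms samples in bucket 0 and a 3.458112 ms spike in bucket 7, so the counts in the upper buckets show directly how many spikes occurred over the whole run.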