Unstable CUDA timing on Jetson AGX Orin compared to Windows GPU

Hi everyone,
I am new to CUDA and Jetson. I have been doing some exercises to better understand how the whole thing behaves in practice.

I wrote a simple C/CUDA program that computes 256 FFTs of 1024 complex samples.
I use pinned host memory and repeat the operation 1000 times (discarding the first 100 iterations). The code structure looks like the following:

fill source buffer with some data

for i = 0 to 1000
{
    copy data from Host to Device (H2D)
    compute FFT using cufftExecC2C
    copy data from Device to Host (D2H)
}

Using CUDA events, I measure:

  • H2D copy time
  • FFT execution time
  • D2H copy time
  • total time per iteration

I then ran the exact same code on my laptop and on a Jetson AGX Orin, and the results are very different. Test configurations:
System 1: Laptop + Windows 10 + NVIDIA RTX A2000
System 2: Jetson AGX Orin + Jetson Linux R35 + Ubuntu 20.04.6 LTS + CUDA 11.4

Results on Windows (1000 iterations)

This is a partial output from my program:

…
H2D: 0.198816 ms, FFT: 0.030560 ms, D2H: 0.184608 ms, TOT: 0.413984 ms
H2D: 0.305504 ms, FFT: 0.027072 ms, D2H: 0.204800 ms, TOT: 0.537376 ms
H2D: 0.217920 ms, FFT: 0.025760 ms, D2H: 0.186112 ms, TOT: 0.429792 ms
H2D: 0.230336 ms, FFT: 0.026848 ms, D2H: 0.184800 ms, TOT: 0.441984 ms
H2D: 0.214368 ms, FFT: 0.027136 ms, D2H: 0.185472 ms, TOT: 0.426976 ms
H2D: 0.211008 ms, FFT: 0.026400 ms, D2H: 0.184896 ms, TOT: 0.422304 ms
H2D: 0.206048 ms, FFT: 0.025696 ms, D2H: 0.186016 ms, TOT: 0.417760 ms
H2D: 0.205152 ms, FFT: 0.026176 ms, D2H: 0.200416 ms, TOT: 0.431744 ms
H2D: 0.205664 ms, FFT: 0.026848 ms, D2H: 0.200960 ms, TOT: 0.433472 ms
H2D: 0.249184 ms, FFT: 0.036096 ms, D2H: 0.185632 ms, TOT: 0.470912 ms
H2D: 0.306688 ms, FFT: 0.025920 ms, D2H: 0.200992 ms, TOT: 0.533600 ms
H2D: 0.216608 ms, FFT: 0.027136 ms, D2H: 0.185056 ms, TOT: 0.428800 ms
…

H2D over last 1000 samples:
min = 0.198816 ms
mean = 0.220522 ms
max = 0.373600 ms

FFT over last 1000 samples:
min = 0.018720 ms
mean = 0.023816 ms
max = 0.061504 ms

D2H over last 1000 samples:
min = 0.179680 ms
mean = 0.187218 ms
max = 0.359936 ms

TOT over last 1000 samples:
min = 0.406240 ms
mean = 0.431556 ms
max = 0.723488 ms

Timing is quite stable and predictable.

Results on Jetson AGX Orin (1000 iterations)

Before running the test:

  1. Power mode set to MAXN
  2. Clocks locked by running sudo jetson_clocks

Output:

H2D: 0.404704 ms, FFT: 0.030976 ms, D2H: 0.096416 ms, TOT: 0.532096 ms
H2D: 0.416032 ms, FFT: 1.371104 ms, D2H: 0.090624 ms, TOT: 1.877760 ms
H2D: 0.102432 ms, FFT: 0.026784 ms, D2H: 0.097472 ms, TOT: 0.226688 ms
H2D: 0.102400 ms, FFT: 0.026688 ms, D2H: 0.096960 ms, TOT: 0.226048 ms
H2D: 0.100768 ms, FFT: 0.024896 ms, D2H: 0.097568 ms, TOT: 0.223232 ms
H2D: 0.100608 ms, FFT: 0.025248 ms, D2H: 0.094848 ms, TOT: 0.220704 ms
H2D: 0.100896 ms, FFT: 0.025184 ms, D2H: 0.099168 ms, TOT: 0.225248 ms
H2D: 0.100448 ms, FFT: 0.025824 ms, D2H: 0.088064 ms, TOT: 0.214336 ms
H2D: 0.100640 ms, FFT: 0.027296 ms, D2H: 0.096544 ms, TOT: 0.224480 ms
H2D: 1.025632 ms, FFT: 0.025280 ms, D2H: 0.097728 ms, TOT: 1.148640 ms
H2D: 0.410688 ms, FFT: 0.024192 ms, D2H: 0.087328 ms, TOT: 0.522208 ms
H2D: 0.412448 ms, FFT: 0.025088 ms, D2H: 0.096032 ms, TOT: 0.533568 ms
H2D: 0.405408 ms, FFT: 0.024928 ms, D2H: 0.096736 ms, TOT: 0.527072 ms
H2D: 0.406368 ms, FFT: 0.024384 ms, D2H: 0.096064 ms, TOT: 0.526816 ms
H2D: 0.410208 ms, FFT: 1.406528 ms, D2H: 0.354144 ms, TOT: 2.170880 ms
H2D: 0.406048 ms, FFT: 0.024576 ms, D2H: 0.087072 ms, TOT: 0.517696 ms
H2D: 0.409952 ms, FFT: 0.024320 ms, D2H: 0.095840 ms, TOT: 0.530112 ms
H2D: 0.403712 ms, FFT: 0.024416 ms, D2H: 0.086720 ms, TOT: 0.514848 ms
H2D: 0.406080 ms, FFT: 1.375776 ms, D2H: 0.089536 ms, TOT: 1.871392 ms
H2D: 0.101344 ms, FFT: 0.028064 ms, D2H: 0.097344 ms, TOT: 0.226752 ms
H2D: 0.100000 ms, FFT: 0.024768 ms, D2H: 0.088224 ms, TOT: 0.212992 ms
H2D: 0.099264 ms, FFT: 0.027168 ms, D2H: 0.097824 ms, TOT: 0.224256 ms
H2D: 0.099136 ms, FFT: 0.026720 ms, D2H: 0.097312 ms, TOT: 0.223168 ms
H2D: 0.101344 ms, FFT: 0.025696 ms, D2H: 0.096800 ms, TOT: 0.223840 ms
H2D: 0.100896 ms, FFT: 0.024960 ms, D2H: 0.089856 ms, TOT: 0.215712 ms
H2D: 0.101760 ms, FFT: 0.025280 ms, D2H: 0.096864 ms, TOT: 0.223904 ms
H2D: 0.100320 ms, FFT: 0.025120 ms, D2H: 0.096672 ms, TOT: 0.222112 ms
H2D: 0.099488 ms, FFT: 0.025344 ms, D2H: 0.096096 ms, TOT: 0.220928 ms
H2D: 0.100384 ms, FFT: 0.024544 ms, D2H: 0.097856 ms, TOT: 0.222784 ms
H2D: 0.098720 ms, FFT: 0.025184 ms, D2H: 0.096384 ms, TOT: 0.220288 ms
H2D: 0.103328 ms, FFT: 0.026592 ms, D2H: 0.096672 ms, TOT: 0.226592 ms
H2D: 0.894496 ms, FFT: 0.025920 ms, D2H: 0.096416 ms, TOT: 1.016832 ms
H2D: 0.409568 ms, FFT: 0.024320 ms, D2H: 0.088320 ms, TOT: 0.522208 ms
H2D: 0.409536 ms, FFT: 0.024448 ms, D2H: 0.087392 ms, TOT: 0.521376 ms
H2D: 0.411200 ms, FFT: 0.025568 ms, D2H: 0.096896 ms, TOT: 0.533664 ms
H2D: 0.403136 ms, FFT: 0.025440 ms, D2H: 0.095680 ms, TOT: 0.524256 ms
H2D: 0.405376 ms, FFT: 0.026144 ms, D2H: 0.089312 ms, TOT: 0.520832 ms

H2D over last 1000 samples:
min = 0.098336 ms
mean = 0.346365 ms
max = 1.552352 ms

FFT over last 1000 samples:
min = 0.023328 ms
mean = 0.056104 ms
max = 2.405088 ms

D2H over last 1000 samples:
min = 0.086208 ms
mean = 0.128372 ms
max = 2.453664 ms

TOT over last 1000 samples:
min = 0.211200 ms
mean = 0.530842 ms
max = 3.456896 ms

As you can see, on Jetson the timing is extremely unstable.
For example, the total time per iteration ranges from ~0.2 ms up to more than 3.4 ms, with a mean around 0.53 ms.
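
For reference, the min/mean/max figures above could be produced by something like the following. This is only a minimal sketch, not my actual program: the names `timing_stats`, `summarize` and `spike_factor` are made up for illustration, and the spike count (samples above a multiple of the minimum) is an extra metric that is not in the output shown above.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of per-iteration timings, all in milliseconds. */
typedef struct {
    double min_ms;
    double mean_ms;
    double max_ms;
    size_t spikes;   /* samples above spike_factor * min_ms */
} timing_stats;

/* Summarize samples[warmup .. n-1]; the first `warmup` entries are
 * ignored, like the discarded warm-up iterations. */
timing_stats summarize(const double *samples, size_t n,
                       size_t warmup, double spike_factor)
{
    assert(samples != NULL && n > warmup);

    timing_stats s = { samples[warmup], 0.0, samples[warmup], 0 };
    double sum = 0.0;

    for (size_t i = warmup; i < n; ++i) {
        if (samples[i] < s.min_ms) s.min_ms = samples[i];
        if (samples[i] > s.max_ms) s.max_ms = samples[i];
        sum += samples[i];
    }
    s.mean_ms = sum / (double)(n - warmup);

    /* Second pass: the minimum is only known after the first pass. */
    for (size_t i = warmup; i < n; ++i) {
        if (samples[i] > spike_factor * s.min_ms) s.spikes++;
    }
    return s;
}
```

Calling `summarize(tot, count, 100, 2.0)` on the recorded TOT values would report how many iterations took more than twice the best case, which describes the jitter better than the mean alone.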

I know that neither Windows nor Jetson Linux is a real-time operating system, but while the timing jitter on Windows is acceptable for my use case, the jitter on the Jetson is not.
My questions:

  1. Why is there such a large difference in timing stability between these two systems?
  2. What causes these large spikes on Jetson (especially on FFT and memcpy)?
  3. What can I do to make Jetson behave more like a soft real-time environment?
  4. Would installing the NVIDIA real-time kernel actually help in this case? (Installing real time kernel)
  5. What other steps can I take to reduce jitter on Jetson?

Any help or insight would be greatly appreciated.

Thank you.

Just a quick first answer: Computing a single 1D FFT of 1024 complex numbers per loop iteration massively underutilizes a GPU. I would switch to computing at least 1024 (or even more) of those in batch mode per loop iteration. Otherwise you get all kinds of measurement noise from memory transfers and kernel invocations.

Thanks for the comment. I think I didn’t explain that part clearly.

In my test I am not running a single 1024-point FFT.
At each iteration I execute 256 batched FFTs of 1024 complex samples using cuFFT (batch = 256).

I also tried increasing the workload further (for example 1024 FFTs of 1024 samples, 1024 x 2048, 2048 x 2048, and other batch/size combinations), and the behavior is the same.

I have no experience with NVIDIA’s embedded systems, but generally speaking, memory-intensive processes tend to exhibit a large amount of noise / jitter when benchmarking. This is due to the complex interaction between many different mechanisms inside the DRAM itself, the DRAM controllers, and the processor cores, coupled with the fact that it is impossible to replicate identical starting states across all these mechanisms. These effects tend to be more pronounced when the bandwidth needs of the process under test approach the maximum bandwidth of the DRAM subsystem. To visualize this, consider the stop-and-go waves propagating through traffic that approaches the maximum vehicle throughput of a highway.

A typical approach for getting reasonably stable measurements of memory-copy performance (the simplest kind of memory-intensive processing possible) is to run the test ten times and record the lowest time measured; the same approach should be useful for assessing FFT performance. This makes it possible to track overall performance as modifications to the code are made. Obviously, it does nothing to address the jitter problem itself. Frankly, the only way I know of addressing jitter issues is to massively simplify the hardware (and the software, where it is a contributing factor). This usually means using lower-end CPUs (*) with simple scalar pipelines and minimal or no caching, coupled with simple memory subsystems (maybe using SRAM instead of DRAM), and using a simple RTOS. Alternatively, one might consider the use of FPGAs.

(*) I am aware of a few exceptions, e.g. certain “deterministic” DSPs from Analog Devices operating at frequencies up to 1 GHz.
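
To illustrate the run-ten-times, keep-the-lowest approach mentioned above, here is a minimal sketch in plain C. Everything in it is an assumption made for illustration: the names `best_of`, `timed_run_fn`, `copy_ctx` and `timed_memcpy_ms` are hypothetical, and the workload is a host-side `memcpy` timed with the C11 `timespec_get` wall clock. For GPU work, the callback would instead return the elapsed time reported by CUDA events.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/* One measured run: returns elapsed milliseconds for a single execution. */
typedef double (*timed_run_fn)(void *ctx);

/* Repeat a measurement `runs` times and keep the lowest time seen. */
double best_of(int runs, timed_run_fn run, void *ctx)
{
    assert(runs > 0 && run != NULL);
    double best = run(ctx);
    for (int i = 1; i < runs; ++i) {
        double t = run(ctx);
        if (t < best) best = t;
    }
    return best;
}

/* Hypothetical workload: one host-side memcpy of `bytes` bytes. */
typedef struct {
    void *dst;
    const void *src;
    size_t bytes;
} copy_ctx;

double timed_memcpy_ms(void *p)
{
    copy_ctx *c = (copy_ctx *)p;
    struct timespec t0, t1;

    /* TIME_UTC is a wall clock, not a monotonic one; good enough for a sketch. */
    timespec_get(&t0, TIME_UTC);
    memcpy(c->dst, c->src, c->bytes);
    timespec_get(&t1, TIME_UTC);

    return (double)(t1.tv_sec - t0.tv_sec) * 1e3
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```

A call such as `best_of(10, timed_memcpy_ms, &ctx)` then yields the lowest of ten measurements. As noted, this stabilizes the reported number but does not reduce the jitter itself.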


Major technical expertise for the Jetson systems is found in the sub-forums dedicated to them, so it is best to ask questions about these systems there: