Hi everyone,
I am new to CUDA and Jetson. I have been doing some exercises to better understand how the whole thing behaves in practice.
I wrote a simple C/CUDA program that computes 256 FFTs of 1024 complex samples.
I use pinned host memory and repeat the operation 1000 times, discarding the first 100 iterations. At each iteration I execute 256 batched FFTs of 1024 complex samples using cuFFT (batch = 256). The code structure looks like the following:
fill source buffer with some data
for i = 0 to 1000
{
    copy data from Host to Device (H2D)
    compute FFT using cufftExecC2C
    copy data from Device to Host (D2H)
}
Using CUDA events, I measure:
- H2D copy time
- FFT execution time
- D2H copy time
- total time per iteration
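For readers who want to reproduce the summary figures, here is a minimal sketch in plain C of how per-phase min/mean/max values could be accumulated while skipping warm-up iterations. The names `phase_stats`, `stats_init`, `stats_add` and `stats_mean` are illustrative and not taken from the original program. In a real loop the `ms` argument would presumably be the value returned by `cudaEventElapsedTime` for the phase in question.

```c
#include <float.h>

/* Running min/mean/max for one timed phase (H2D, FFT, D2H or TOT). */
typedef struct {
    float  min_ms;
    float  max_ms;
    double sum_ms;   /* double accumulator to limit rounding error in the mean */
    int    count;    /* number of samples recorded after warm-up */
} phase_stats;

static void stats_init(phase_stats *s) {
    s->min_ms = FLT_MAX;
    s->max_ms = 0.0f;
    s->sum_ms = 0.0;
    s->count  = 0;
}

/* Record one sample; iterations with index below `warmup` are ignored. */
static void stats_add(phase_stats *s, int iter, int warmup, float ms) {
    if (iter < warmup) return;
    if (ms < s->min_ms) s->min_ms = ms;
    if (ms > s->max_ms) s->max_ms = ms;
    s->sum_ms += ms;
    s->count++;
}

/* Mean of the recorded samples, or 0.0 if nothing was recorded. */
static double stats_mean(const phase_stats *s) {
    return s->count > 0 ? s->sum_ms / s->count : 0.0;
}
```

One `phase_stats` instance per phase is enough: call `stats_init` once before the loop, `stats_add(&h2d, i, 100, h2d_ms)` inside it, and print `min_ms`, `stats_mean(...)` and `max_ms` at the end.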
I then ran the exact same code on my laptop and on a Jetson AGX Orin, and the results are very different. Test configurations:
System 1: Laptop + Windows 10 + NVIDIA RTX A2000
System 2: Jetson AGX Orin + Jetson Linux R35 + Ubuntu 20.04.6 LTS + CUDA 11.4
Results on Windows (1000 iterations)
This is a partial output from my program:
…
H2D: 0.198816 ms, FFT: 0.030560 ms, D2H: 0.184608 ms, TOT: 0.413984 ms
H2D: 0.305504 ms, FFT: 0.027072 ms, D2H: 0.204800 ms, TOT: 0.537376 ms
H2D: 0.217920 ms, FFT: 0.025760 ms, D2H: 0.186112 ms, TOT: 0.429792 ms
H2D: 0.230336 ms, FFT: 0.026848 ms, D2H: 0.184800 ms, TOT: 0.441984 ms
H2D: 0.214368 ms, FFT: 0.027136 ms, D2H: 0.185472 ms, TOT: 0.426976 ms
H2D: 0.211008 ms, FFT: 0.026400 ms, D2H: 0.184896 ms, TOT: 0.422304 ms
H2D: 0.206048 ms, FFT: 0.025696 ms, D2H: 0.186016 ms, TOT: 0.417760 ms
H2D: 0.205152 ms, FFT: 0.026176 ms, D2H: 0.200416 ms, TOT: 0.431744 ms
H2D: 0.205664 ms, FFT: 0.026848 ms, D2H: 0.200960 ms, TOT: 0.433472 ms
H2D: 0.249184 ms, FFT: 0.036096 ms, D2H: 0.185632 ms, TOT: 0.470912 ms
H2D: 0.306688 ms, FFT: 0.025920 ms, D2H: 0.200992 ms, TOT: 0.533600 ms
H2D: 0.216608 ms, FFT: 0.027136 ms, D2H: 0.185056 ms, TOT: 0.428800 ms
…
H2D over last 1000 samples:
min = 0.198816 ms
mean = 0.220522 ms
max = 0.373600 ms
FFT over last 1000 samples:
min = 0.018720 ms
mean = 0.023816 ms
max = 0.061504 ms
D2H over last 1000 samples:
min = 0.179680 ms
mean = 0.187218 ms
max = 0.359936 ms
TOT over last 1000 samples:
min = 0.406240 ms
mean = 0.431556 ms
max = 0.723488 ms
Timing is quite stable and predictable.
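As a rough sanity check on the copy times, the amount of data moved per transfer and the implied bandwidth can be estimated as below. This is only a sketch: it assumes single-precision complex samples of 8 bytes each (the `cufftComplex` type that `cufftExecC2C` operates on), and the helper names `transfer_bytes` and `bandwidth_gb_per_s` are made up for illustration.

```c
#include <stddef.h>

/* Bytes moved per copy: batch * fft_len samples of bytes_per_sample each. */
static size_t transfer_bytes(size_t batch, size_t fft_len, size_t bytes_per_sample) {
    return batch * fft_len * bytes_per_sample;
}

/* Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a copy taking `ms` milliseconds. */
static double bandwidth_gb_per_s(size_t bytes, double ms) {
    return ((double)bytes / 1e9) / (ms / 1e3);
}
```

With batch = 256 and 1024 samples per FFT, `transfer_bytes(256, 1024, 8)` gives 2,097,152 bytes (2 MiB) per direction. Plugging in the mean H2D time of 0.220522 ms yields roughly 9.5 GB/s, and the mean D2H time of 0.187218 ms roughly 11.2 GB/s.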
Results on Jetson AGX Orin (1000 iterations)
Before running the test:
- Power mode set to MAXN
- sudo jetson_clocks enabled
Output:
H2D: 0.404704 ms, FFT: 0.030976 ms, D2H: 0.096416 ms, TOT: 0.532096 ms
H2D: 0.416032 ms, FFT: 1.371104 ms, D2H: 0.090624 ms, TOT: 1.877760 ms
H2D: 0.102432 ms, FFT: 0.026784 ms, D2H: 0.097472 ms, TOT: 0.226688 ms
H2D: 0.102400 ms, FFT: 0.026688 ms, D2H: 0.096960 ms, TOT: 0.226048 ms
H2D: 0.100768 ms, FFT: 0.024896 ms, D2H: 0.097568 ms, TOT: 0.223232 ms
H2D: 0.100608 ms, FFT: 0.025248 ms, D2H: 0.094848 ms, TOT: 0.220704 ms
H2D: 0.100896 ms, FFT: 0.025184 ms, D2H: 0.099168 ms, TOT: 0.225248 ms
H2D: 0.100448 ms, FFT: 0.025824 ms, D2H: 0.088064 ms, TOT: 0.214336 ms
H2D: 0.100640 ms, FFT: 0.027296 ms, D2H: 0.096544 ms, TOT: 0.224480 ms
H2D: 1.025632 ms, FFT: 0.025280 ms, D2H: 0.097728 ms, TOT: 1.148640 ms
H2D: 0.410688 ms, FFT: 0.024192 ms, D2H: 0.087328 ms, TOT: 0.522208 ms
H2D: 0.412448 ms, FFT: 0.025088 ms, D2H: 0.096032 ms, TOT: 0.533568 ms
H2D: 0.405408 ms, FFT: 0.024928 ms, D2H: 0.096736 ms, TOT: 0.527072 ms
H2D: 0.406368 ms, FFT: 0.024384 ms, D2H: 0.096064 ms, TOT: 0.526816 ms
H2D: 0.410208 ms, FFT: 1.406528 ms, D2H: 0.354144 ms, TOT: 2.170880 ms
H2D: 0.406048 ms, FFT: 0.024576 ms, D2H: 0.087072 ms, TOT: 0.517696 ms
H2D: 0.409952 ms, FFT: 0.024320 ms, D2H: 0.095840 ms, TOT: 0.530112 ms
H2D: 0.403712 ms, FFT: 0.024416 ms, D2H: 0.086720 ms, TOT: 0.514848 ms
H2D: 0.406080 ms, FFT: 1.375776 ms, D2H: 0.089536 ms, TOT: 1.871392 ms
H2D: 0.101344 ms, FFT: 0.028064 ms, D2H: 0.097344 ms, TOT: 0.226752 ms
H2D: 0.100000 ms, FFT: 0.024768 ms, D2H: 0.088224 ms, TOT: 0.212992 ms
H2D: 0.099264 ms, FFT: 0.027168 ms, D2H: 0.097824 ms, TOT: 0.224256 ms
H2D: 0.099136 ms, FFT: 0.026720 ms, D2H: 0.097312 ms, TOT: 0.223168 ms
H2D: 0.101344 ms, FFT: 0.025696 ms, D2H: 0.096800 ms, TOT: 0.223840 ms
H2D: 0.100896 ms, FFT: 0.024960 ms, D2H: 0.089856 ms, TOT: 0.215712 ms
H2D: 0.101760 ms, FFT: 0.025280 ms, D2H: 0.096864 ms, TOT: 0.223904 ms
H2D: 0.100320 ms, FFT: 0.025120 ms, D2H: 0.096672 ms, TOT: 0.222112 ms
H2D: 0.099488 ms, FFT: 0.025344 ms, D2H: 0.096096 ms, TOT: 0.220928 ms
H2D: 0.100384 ms, FFT: 0.024544 ms, D2H: 0.097856 ms, TOT: 0.222784 ms
H2D: 0.098720 ms, FFT: 0.025184 ms, D2H: 0.096384 ms, TOT: 0.220288 ms
H2D: 0.103328 ms, FFT: 0.026592 ms, D2H: 0.096672 ms, TOT: 0.226592 ms
H2D: 0.894496 ms, FFT: 0.025920 ms, D2H: 0.096416 ms, TOT: 1.016832 ms
H2D: 0.409568 ms, FFT: 0.024320 ms, D2H: 0.088320 ms, TOT: 0.522208 ms
H2D: 0.409536 ms, FFT: 0.024448 ms, D2H: 0.087392 ms, TOT: 0.521376 ms
H2D: 0.411200 ms, FFT: 0.025568 ms, D2H: 0.096896 ms, TOT: 0.533664 ms
H2D: 0.403136 ms, FFT: 0.025440 ms, D2H: 0.095680 ms, TOT: 0.524256 ms
H2D: 0.405376 ms, FFT: 0.026144 ms, D2H: 0.089312 ms, TOT: 0.520832 ms
H2D over last 1000 samples:
min = 0.098336 ms
mean = 0.346365 ms
max = 1.552352 ms
FFT over last 1000 samples:
min = 0.023328 ms
mean = 0.056104 ms
max = 2.405088 ms
D2H over last 1000 samples:
min = 0.086208 ms
mean = 0.128372 ms
max = 2.453664 ms
TOT over last 1000 samples:
min = 0.211200 ms
mean = 0.530842 ms
max = 3.456896 ms
As you can see, on Jetson the timing is extremely unstable.
For example, the total time per iteration ranges from ~0.2 ms up to more than 3.4 ms, with a mean around 0.53 ms.
I know that neither Windows nor Jetson Linux is a real-time operating system, but while the timing jitter on Windows is acceptable for my use case, the jitter on Jetson is not.
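To put a number on "unstable", one simple way to compare the two systems is to look at how far the worst case sits from the best case and from the mean. The sketch below is only one possible metric, and the function names `spread_ratio` and `worst_over_mean_pct` are illustrative.

```c
/* Worst-case to best-case ratio: 1.0 would mean no jitter at all. */
static double spread_ratio(double min_ms, double max_ms) {
    return max_ms / min_ms;
}

/* How far the worst case exceeds the mean, as a percentage of the mean. */
static double worst_over_mean_pct(double mean_ms, double max_ms) {
    return 100.0 * (max_ms - mean_ms) / mean_ms;
}
```

Using the TOT figures above, the max/min ratio is about 1.8 on Windows (0.723488 / 0.406240) and about 16.4 on Jetson (3.456896 / 0.211200). The worst case exceeds the mean by about 68% on Windows and by about 551% on Jetson.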
My questions:
- Why is there such a large difference in timing stability between these two systems?
- What causes these large spikes on Jetson (especially on FFT and memcpy)?
- What can I do to make Jetson behave more like a soft real-time environment?
- Would installing the NVIDIA real-time kernel actually help in this case? (Installing real time kernel)
- What other steps can I take to reduce jitter on Jetson?
Final note: I also tried increasing the workload further (for example 1024 FFTs of 1024 samples, 1024 x 2048, 2048 x 2048, and other batch/size combinations), and the behavior is the same.
Any help or insight would be greatly appreciated.
Thank you.


