Hi,
I just transferred the code I have been working on from my laptop to the Jetson Xavier NX dev kit (in 20W 6-core mode).
I was expecting some slowdown, both on CPU and GPU code, but no more than a factor of 2-3 at worst.
On the sections of the code that use the GPU (CUDA-OpenCV functions), I noticed a far bigger slowdown than expected (factors of roughly 5 to 14), which I can’t explain.
Do you have any idea why GPU code is so slow on the Jetson?
The timings:

| Function | Avg time on laptop | Avg time on Jetson | Ratio |
|---|---|---|---|
| cv::cuda::remap (undistortion and stereo rectification) | 1.14 ms | 6.39 ms | 5.6 |
| optical flow (no up/downloads) | 0.66 ms | 6.72 ms | 10.2 |
| some uploads and downloads | 0.03 ms | 0.42 ms | 14.0 |
| cv::cuda::goodFeaturesToMatch | 2.23 ms | 11.03 ms | 4.9 |
The GPU configurations:

| GPU comparison | PC | Jetson |
|---|---|---|
| CUDA Driver Version / Runtime Version | 11.7 / 11.7 | 11.4 / 11.4 |
| CUDA Capability Major/Minor version number | 7.5 | 7.2 |
| Total amount of global memory | 3912 MB | 14908 MB |
| Number of CUDA Cores | 1024 | 384 |
| GPU Max Clock rate | 1.25 GHz | 1.1 GHz |
| Memory Clock rate | 3501 MHz | 1109 MHz |
| Memory Bus Width | 128-bit | 256-bit |
| L2 Cache Size | 1048576 bytes | 524288 bytes |
| Maximum Texture Dimension Size (x,y,z) | identical | identical |
| Maximum Layered 1D Texture Size, (num) layers | identical | identical |
| Maximum Layered 2D Texture Size, (num) layers | identical | identical |
| Total amount of constant memory | identical | identical |
| Total amount of shared memory per block | identical | identical |
| Total shared memory per multiprocessor | 65536 bytes | 98304 bytes |
| Total number of registers available per block | identical | identical |
| Warp size | identical | identical |
| Maximum number of threads per multiprocessor | 1024 | 2048 |
| Maximum number of threads per block | identical | identical |
| Max dimension size of a thread block (x,y,z) | identical | identical |
| Max dimension size of a grid size (x,y,z) | identical | identical |
| Maximum memory pitch | identical | identical |
| Texture alignment | identical | identical |
| Concurrent copy and kernel execution | Yes, 3 copy engine(s) | Yes, 1 copy engine(s) |
So if the speed is compute-limited, I would expect a slowdown by a factor of at most around 3 (about 3 times fewer cores, about 10% slower GPU clock).
If the speed is limited by memory access time, then I would expect a slowdown between 1.5 and 3 (maybe 6):
- the memory clock is 3 times slower
- but the bus is 2 times wider (so up to 2x faster) → I suppose it is probably 1.5 times slower when data is contiguous, 3 times slower for random access
- the cache is 2 times smaller, so maybe another factor of 2
So I would expect at most a factor of 6 slowdown for memory access (probably closer to 3 times slower).
And I suppose that the computation slowdown and the memory slowdown are not cumulative, so the overall slowdown should be at most a factor of 3 (worst case 6).
So how is it possible that I get a factor of 10.2 slowdown for the optical flow function?
Thanks a lot in advance