I just transferred the code I have been working on from my laptop to the Jetson Xavier NX dev kit (in 20W 6-core mode).
I was expecting some slowdown, on both CPU and GPU code, but no more than a factor of 2-3 at worst.
On the sections of the code that use the GPU (CUDA-OpenCV functions), I noticed a far bigger slowdown than expected (factors of roughly 5 to 14), which I can’t explain.
Do you have any idea why GPU code is so slow on the Jetson?
The timings:

| function | avg time on laptop | avg time on Jetson | ratio |
|---|---|---|---|
| cv::cuda::remap (undistortion and stereo rectification) | 1.14 ms | 6.39 ms | 5.6 |
| optical flow (no up/downloads) | 0.66 ms | 6.72 ms | 10.2 |
| some uploads and downloads | 0.03 ms | 0.42 ms | 14.0 |
| cv::cuda::goodFeaturesToMatch | 2.23 ms | 11.03 ms | 4.9 |
The GPU configurations:

| GPU comparison | PC | Jetson |
|---|---|---|
| CUDA Driver Version / Runtime Version | 11.7 / 11.7 | 11.4 / 11.4 |
| CUDA Capability Major/Minor version number | 7.5 | 7.2 |
| Total amount of global memory | 3912 MB | 14908 MB |
| Number of CUDA Cores | 1024 | 384 |
| GPU Max Clock rate | 1.25 GHz | 1.1 GHz |
| Memory Clock rate | 3501 MHz | 1109 MHz |
| Memory Bus Width | 128-bit | 256-bit |
| L2 Cache Size | 1048576 bytes | 524288 bytes |
| Maximum Texture Dimension Size (x,y,z) | identical | identical |
| Maximum Layered 1D Texture Size, (num) layers | identical | identical |
| Maximum Layered 2D Texture Size, (num) layers | identical | identical |
| Total amount of constant memory | identical | identical |
| Total amount of shared memory per block | identical | identical |
| Total shared memory per multiprocessor | 65536 bytes | 98304 bytes |
| Total number of registers available per block | identical | identical |
| Warp size | identical | identical |
| Maximum number of threads per multiprocessor | 1024 | 2048 |
| Maximum number of threads per block | identical | identical |
| Max dimension size of a thread block (x,y,z) | identical | identical |
| Max dimension size of a grid size (x,y,z) | identical | identical |
| Maximum memory pitch | identical | identical |
| Texture alignment | identical | identical |
| Concurrent copy and kernel execution | Yes, 3 copy engine(s) | Yes, 1 copy engine(s) |
So if the speed is compute-limited, I would expect a slowdown of at most a factor of about 3 (about 2.7 times fewer cores and a 12% lower GPU clock).
If the speed is limited by memory access time, then I would expect a slowdown between 1.5 and 3 (maybe 6):
- the memory clock is about 3 times slower
- but the bus is 2 times wider (so up to 2x faster) → I suppose probably 1.5 times slower when data is contiguous, 3 times slower for random access
- the L2 cache is half the size, so maybe another factor of 2
So I would expect at most a factor of 6 slowdown for memory access (probably closer to 3 times slower).
I also assume that the compute and memory slowdowns are not cumulative (whichever is the bottleneck dominates), so the overall slowdown should be at most a factor of 3 (6 in the worst case).
So how is it possible that I get a factor 10.2 slowdown for the optical flow function?
Thanks a lot in advance