CUDA ORB performance on Xavier compared to Quadro M2000M

I’m running a very simple ORB feature detection (detect & compute) on 2 images, followed by matching.

My code is very similar to what’s described in:
https://devtalk.nvidia.com/default/topic/1035448/cuda-programming-and-performance/surf-with-cuda-is-not-faster-by-a-noticeable-amount/post/5260064/#5260064
but using ORB instead of SURF.

My system has OpenCV 4.1.2 built with CUDA support (CUDA_ARCH_BIN 7.2, Release build).
I’m using 4K images for my measurements.
My Jetson Xavier runs in the 15 W power mode (4 cores @ 1190 MHz, GPU @ 318-675 MHz). Detecting the features on a single image takes ~310 ms.
When I run the same test on a PC with an Nvidia Quadro M2000M, on the same images, it takes 70 ms per image.

The Quadro M2000M has 768 CUDA cores while Xavier has only 512. Also, the Quadro M2000M is clocked at 1100 MHz while the Xavier GPU is clocked at 675 MHz max. I’m not sure these differences can account for the entire performance gap between the two.
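As a rough sanity check on that suspicion, cores times clock gives a crude proxy for the expected compute gap. The sketch below uses only the figures quoted in this post; the helper names (throughput_proxy, expected_gap, measured_gap) are made up for illustration, and the proxy ignores architectural differences and memory bandwidth, so it is back-of-the-envelope arithmetic rather than a performance model.

```cpp
#include <cassert>

// Crude compute proxy: CUDA cores multiplied by clock in MHz.
// Hypothetical helper for illustration; it ignores architecture,
// memory bandwidth, and any CPU-side work.
constexpr double throughput_proxy(int cuda_cores, double clock_mhz) {
    return cuda_cores * clock_mhz;
}

// Ratio a / b; b must be positive.
inline double ratio(double a, double b) {
    assert(b > 0.0);
    return a / b;
}

// Figures quoted in the post.
constexpr double kQuadroProxy = throughput_proxy(768, 1100.0);
constexpr double kXavierProxy = throughput_proxy(512, 675.0);

// Gap predicted by cores x clock alone (Quadro relative to Xavier).
inline double expected_gap() { return ratio(kQuadroProxy, kXavierProxy); }

// Gap actually measured: ~310 ms on Xavier vs 70 ms on the Quadro.
inline double measured_gap() { return ratio(310.0, 70.0); }
```

Under this proxy the expected gap is about 2.4x while the measured gap is about 4.4x, which is consistent with the suspicion that cores and clocks alone do not explain the difference.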

Is it possible to get better performance for ORB on Xavier?

Have you profiled the application? On the GPU side, is it compute bound, memory bound, or a mix thereof? Assuming it is a mix, you would also need to look at memory throughput of the two GPUs. I assume that like in other processors in the Jetson family, memory is shared between CPU and GPU, while they are separate resources in your PC. If the application has a non-negligible CPU component, you would also need to consider CPU performance.

As for the general question of performance tuning for Xavier, that is a question for which you will likely receive better / faster answers in the sub-forum dedicated to Xavier: https://devtalk.nvidia.com/default/board/326/jetson-agx-xavier/

Thanks for your quick reply.
I did some profiling using nvprof. The most significant impact on my application’s performance comes from the first call to ‘cudaLaunchKernel()’: the max time is much greater than the average.

API calls: 62.69% 1.16303s 407 2.8576ms 54.048us 834.64ms cudaLaunchKernel
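To see how much a single slow call skews that line, one can subtract the maximum from the total and average the rest. This is a small sketch using only the numbers in the nvprof output above (total 1.16303 s, 407 calls, max 834.64 ms); the function names are hypothetical, and it assumes the maximum corresponds to the first launch, which nvprof's summary does not state by itself.

```cpp
#include <cassert>

// Mean duration of the remaining calls after removing the single slowest one.
// total_ms: summed time of all calls, calls: number of calls,
// max_ms: duration of the slowest call.
inline double mean_without_max(double total_ms, int calls, double max_ms) {
    assert(calls > 1 && total_ms >= max_ms);
    return (total_ms - max_ms) / (calls - 1);
}

// Fraction of the total time spent in the single slowest call.
inline double max_share(double total_ms, double max_ms) {
    assert(total_ms > 0.0);
    return max_ms / total_ms;
}
```

With the values above, mean_without_max(1163.03, 407, 834.64) comes out at roughly 0.81 ms per call, and max_share(1163.03, 834.64) at roughly 0.72, so about 72% of the time attributed to cudaLaunchKernel comes from one call.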

I found that running a ‘dummy first round’ of my application flow (upload the matrices to the GPU, call cuda::ORB detectAndCompute()) greatly reduces the performance penalty on subsequent detectAndCompute() runs.
695980[us] <- 1st iteration takes the longest
188094[us]
162093[us]
154616[us]
138943[us]
140696[us]
137826[us]
134849[us]
135664[us]
135217[us]

It’s also visible that the runtime of the following iterations continues to improve, going down from 188 ms to 135 ms.
The Nvidia forums show this ‘1st time’ penalty is well known (https://devtalk.nvidia.com/default/topic/517828/speed-up-initialization-of-cuda-about-how-to-set-the-device-code-translation-cache/), but the solutions offered there are not adequate.
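The ‘dummy first round’ idea can be folded into a small timing harness that discards warm-up runs before measuring. The following is a minimal sketch in plain C++; time_after_warmup and mean_us are hypothetical names, and the callable passed in is assumed to block until the GPU work has finished (otherwise only the launch is timed).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Runs `work` `warmup` times untimed, then `iters` times timed.
// Returns the per-iteration wall-clock durations in microseconds.
// Assumption: `work` returns only after its results are ready.
template <typename Work>
std::vector<long long> time_after_warmup(Work&& work, std::size_t warmup,
                                         std::size_t iters) {
    using clock = std::chrono::steady_clock;
    for (std::size_t i = 0; i < warmup; ++i) {
        work();  // discarded "dummy rounds"
    }
    std::vector<long long> us;
    us.reserve(iters);
    for (std::size_t i = 0; i < iters; ++i) {
        const auto t0 = clock::now();
        work();
        const auto t1 = clock::now();
        us.push_back(
            std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0)
                .count());
    }
    return us;
}

// Arithmetic mean of the samples; requires a non-empty vector.
inline double mean_us(const std::vector<long long>& us) {
    assert(!us.empty());
    long long sum = 0;
    for (long long v : us) sum += v;
    return static_cast<double>(sum) / static_cast<double>(us.size());
}
```

In the real application the callable would wrap the upload plus detectAndCompute() step. Applied to the timings listed above with the first iteration discarded, the mean of the remaining nine is about 147.6 ms.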

Is there an elegant CUDA API call I can make from my OpenCV code to reduce the 1st iteration penalty?

Initialization overhead in the CUDA runtime is not there merely to annoy programmers; it is there because necessary context setup work has to be done. Programmers can trigger the initialization outside the timed portion of their code. One canonical way is to call cudaFree(0) to trigger CUDA context initialization.
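A minimal sketch of that pattern follows. The USE_CUDA macro is a hypothetical build flag introduced here so the snippet also compiles on a machine without the CUDA toolkit; when it is defined, the code must be linked against the CUDA runtime. The function names are likewise made up for illustration. Note that this only covers CUDA context creation; any initialization done by libraries layered on top would still happen on their first use.

```cpp
#include <chrono>

#ifdef USE_CUDA  // hypothetical flag: define it when building against CUDA
#include <cuda_runtime.h>
#endif

// Forces CUDA context creation before any timed work.
// Returns true on success, or when built without CUDA support.
inline bool init_cuda_context() {
#ifdef USE_CUDA
    return cudaFree(0) == cudaSuccess;  // no-op free that triggers init
#else
    return true;  // stand-in so the sketch builds without the toolkit
#endif
}

// Times `work` in microseconds. Call init_cuda_context() beforehand so
// that context setup is not charged to the measured region.
template <typename Work>
long long time_us(Work&& work) {
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    work();
    const auto t1 = clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0)
        .count();
}
```

A typical use would be to call init_cuda_context() once at program start, then wrap the ORB step in time_us().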

Note 1: Various libraries and middleware upstream of the CUDA runtime may also have their own context initialization overhead.

Note 2: Make sure all device code is compiled for the correct target architecture, otherwise the application will incur JIT-compilation overhead. Use cuobjdump --dump-sass to double check that machine code for the correct architecture (sm_72 for Xavier, I think) is present in the executable.

The mapping of all memory (system and GPUs) into a unified virtual memory map can represent a significant part of CUDA context creation overhead. Not sure how that plays out on a Xavier platform though. In systems with multiple discrete GPUs one can shorten the time needed for this mapping by hiding GPUs that are not needed with CUDA_VISIBLE_DEVICES. That is unlikely to apply here.

The further speed-ups you are observing after the first iteration may be due to standard warm-up effects, mostly in the memory subsystem, and/or dynamic clock adjustments from power-management mechanisms (“boost clocks”).