I’ve spent a few months adapting the Jetson Nano code base to buildroot and designing hardware from the available documentation so that I could launch a product (1k+ units) as soon as the SoM became available.
Unfortunately, I’ve been unable to reach the promised 472 GFLOPS, or even the theoretical GPU figure of 246 GFLOPS (2 x cores x freq), in most of my test scenarios.
Since I’m working on time-critical applications, I repeat the upload -> process -> download sequence for every single frame (no batch processing allowed).
Still, I can run many kernels per frame, so the upload/download cost should eventually be amortized.
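For reference, this is roughly the per-frame pattern I mean; a minimal sketch with pinned buffers and a single stream, where `processFrame` and the image size are placeholders rather than my actual kernels:

```cpp
#include <cuda_runtime.h>
#include <cstdint>

__global__ void processFrame(const uint8_t* in, uint8_t* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];            // placeholder for the real per-pixel work
}

int main()
{
    const int n = 640 * 533;       // one grayscale frame
    uint8_t *hIn, *hOut, *dIn, *dOut;
    cudaHostAlloc((void**)&hIn,  n, cudaHostAllocDefault);   // pinned: ~2 GB/s vs ~1 GB/s pageable
    cudaHostAlloc((void**)&hOut, n, cudaHostAllocDefault);
    cudaMalloc((void**)&dIn,  n);
    cudaMalloc((void**)&dOut, n);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int frame = 0; frame < 100; ++frame) {
        cudaMemcpyAsync(dIn, hIn, n, cudaMemcpyHostToDevice, stream);    // upload
        processFrame<<<(n + 255) / 256, 256, 0, stream>>>(dIn, dOut, n); // process (many kernels in practice)
        cudaMemcpyAsync(hOut, dOut, n, cudaMemcpyDeviceToHost, stream);  // download
        cudaStreamSynchronize(stream); // per-frame deadline: no batching across frames
    }

    cudaStreamDestroy(stream);
    cudaFree(dIn); cudaFree(dOut);
    cudaFreeHost(hIn); cudaFreeHost(hOut);
    return 0;
}
```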
As an example, computing the sum of absolute differences with a 5x5 window on a 640x533 image for 16 disparities should take about 554 us (counting FLOPs alone).
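For transparency, the back-of-envelope arithmetic behind that number looks like this; the operation count (one absolute difference plus one accumulate per window element) and the 472 GFLOPS peak are my assumptions, so it only lands in the same few-hundred-microsecond ballpark:

```cpp
#include <cstdio>

int main()
{
    const double pixels      = 640.0 * 533.0;
    const double disparities = 16.0;
    const double window      = 5.0 * 5.0;                            // 5x5 SAD window
    const double flop        = pixels * disparities * window * 2.0;  // ~2.7e8 FLOP
    const double peak_flops  = 472e9;                                 // advertised peak
    printf("estimated kernel time: %.0f us\n", flop / peak_flops * 1e6); // ~580 us
    return 0;
}
```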
In practice, a naive implementation takes about 80 ms for 30 disparities, while the sample provided by Nvidia takes about 20 ms for 16 disparities.
Even if memory bandwidth were to blame for some of this reduced performance (1 GB/s for pageable memory, 2 GB/s for pinned memory), image transfer should take at most 3 ms.
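That 3 ms bound comes from arithmetic along these lines, assuming as a worst case a stereo pair of 640x533 single-channel float images (8-bit inputs would be 4x smaller):

```cpp
#include <cstdio>

int main()
{
    const double bytes    = 2.0 * 640.0 * 533.0 * sizeof(float); // ~2.7 MB per frame pair
    const double pageable = 1e9;   // ~1 GB/s, pageable memory
    const double pinned   = 2e9;   // ~2 GB/s, pinned memory
    printf("pageable: %.1f ms, pinned: %.1f ms\n",
           bytes / pageable * 1e3, bytes / pinned * 1e3); // ~2.7 ms and ~1.4 ms
    return 0;
}
```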
Profiling the sample program confirms that transfer times are on the order of 3 ms, while kernel times are around 100 ms for the warm-up run and 20 ms for the subsequent run.
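For reference, the warm-up vs. steady-state split can be reproduced with plain CUDA event timing along these lines (`myKernel` and its launch configuration are placeholders, not the sample's actual code):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel() {}   // placeholder for the kernel under test

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int run = 0; run < 5; ++run) {
        cudaEventRecord(start);
        myKernel<<<1, 1>>>();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("run %d: %.2f ms\n", run, ms);   // run 0 includes warm-up overhead
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```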
In summary, the provided sample kernel runs about 40x slower than the FLOP-based estimate, and total execution time is, at best, 5x slower than expected.
The performance gap seems to get worse for LBPH, HOG and especially SVM.
Apart from writing my own CUDA kernels, I’ve also run tests with OpenCV, Dlib and NPP, hitting similar performance limitations.
Is there any way to get closer to the advertised FLOPS, or should I consistently expect the same performance penalty for most kernels?