Cannot reach advertised FLOPs on Jetson Nano

I’ve spent a few months adapting the Jetson Nano code base to buildroot and designing hardware with the available documentation in order to be able to launch a product (1k+ units) as soon as the SoM became available.

Unfortunately I’ve been unable to reach the promised 472 GFLOPs, or even the theoretical (GPU) 246 GFLOPs (2 × cores × frequency), in most of my test scenarios.
Since I’m working on time-critical applications, I have to repeat the upload → process → download sequence for every single frame (no batch processing allowed).
Still, I am allowed to run many kernels for each frame, so the upload / download costs should be somewhat amortized (eventually).
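For reference, the theoretical figure can be recomputed from the Nano’s specs. This is a quick sketch assuming 128 Maxwell CUDA cores and the 921.6 MHz maximum GPU clock; at that clock it works out to roughly 236 GFLOPs FP32, with the advertised 472 GFLOPs being the doubled FP16 rate:

```python
# Theoretical GPU throughput for the Jetson Nano (assumed specs:
# 128 Maxwell CUDA cores, 921.6 MHz max GPU clock).
cores = 128
freq_hz = 921.6e6
fp32_gflops = 2 * cores * freq_hz / 1e9   # 2 ops/cycle per core via FMA
fp16_gflops = 2 * fp32_gflops             # half2 FMA doubles the FP16 rate

print(round(fp32_gflops, 1))  # 235.9
print(round(fp16_gflops, 1))  # 471.9 -> the advertised ~472 GFLOPs figure
```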

As an example, computing the sum of absolute differences with a 5x5 kernel size on a 640x533 image for 16 disparities should take about 554 us (counting FLOPs alone).
Running a program using a naive approach takes about 80ms for 30 disparities, while using the sample provided by Nvidia takes about 20ms for 16 disparities.
Even if memory bandwidth were to blame for some of this reduced performance (1 GB/s for pageable memory, 2 GB/s for pinned memory), image transfer should take at most 3 ms.
Profiling the sample program confirms that transfer times are on the order of 3 ms, while kernel times are about 100 ms for the warmup run and 20 ms for the second run.
In summary, the provided sample kernel runs 40x slower than the expected value, and even the best-case total execution time would be 5x slower.
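The 554 us estimate above can be reproduced with a simple op count. This is only a sketch of one plausible accounting: it assumes a single operation per 5x5 window tap (25 ops per output, per disparity) against the 246 GFLOPs theoretical figure quoted earlier:

```python
# Back-of-the-envelope SAD cost (assumed accounting: 1 op per window tap).
width, height = 640, 533
disparities = 16
taps = 5 * 5                      # one absolute difference per tap

ops = width * height * disparities * taps
peak_flops = 246e9                # theoretical GPU figure quoted above
t_us = ops / peak_flops * 1e6
print(round(t_us, 1))             # ~554.7 us
```

Counting the window additions as well would roughly double the op count, but the conclusion (sub-millisecond compute versus tens of milliseconds measured) stays the same.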

The performance gap seems to get worse for LBPH, HOG and especially SVM.

Apart from writing my own CUDA kernels, I’ve also run tests with OpenCV, Dlib and NPP, hitting similar performance limitations.

Is there any way to get closer to the advertised FLOPs or should I consistently expect the same performance penalty for most kernels?

Best Regards,

The upload/download time will not “amortize” unless you pipeline your implementation.
The synchronous blocking you say you do is what’s taking the time. The floating point units are likely spending most of their time idle.

Hi Snarky,
Thanks for your response.

By “amortize” I didn’t mean that the synchronous transfer time itself would shrink (those transfers will always take the same time). Rather, executing multiple kernels that operate on device memory makes the host transfers a smaller fraction of total execution time. Even then, some time is still spent on data movement within device memory.

Pipelining requires asynchronous transfers and pinned memory. While asynchronous transfers should be faster, they ran slower in some of my tests.
Splitting data across streams should (in theory) lower transfer times by overlapping data transfer with kernel execution.
Unfortunately the Nano has only one copy engine, so I would still have to wait on some of the memory transfers.
I might reconsider this idea depending on what I see when using nvprof in the near future.
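To make the amortization argument concrete, here is a toy timing model (illustrative numbers only, not Nano measurements, apart from reusing the ~20 ms kernel time from earlier). It compares fully serial per-frame processing against a pipeline where the single copy engine overlaps with kernel execution:

```python
# Toy pipeline model: per-frame upload U, kernel K, download D, N frames.
# U and D are assumed/illustrative; K reuses the ~20 ms figure from above.
U, K, D, N = 1.0, 20.0, 1.0, 100   # milliseconds, frame count

serial = N * (U + K + D)

# With one copy engine, upload and download share the engine, but copies
# can still overlap kernels in other streams; in steady state each frame
# costs max(kernel time, total copy time), plus pipeline fill/drain.
pipelined = U + N * max(K, U + D) + D

print(serial, pipelined)           # 2200.0 2002.0
```

With the kernel this much longer than the copies, overlap only hides the ~10% spent on transfers, which matches the observation that the kernel time itself is the real gap.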

Moving on to the kernel execution times, even if I were able to reduce transfer times to zero, there would still be a performance gap between my calculations and what actually happens.
In some cases this could be accounted for by other types of data movements and specific functions (sqrt, sin, cos, etc), yet it doesn’t seem to be the case for SAD and LBP.

One thing I also find interesting is that memory bandwidth seems to change with transfer size (bigger transfers achieve more bandwidth), and that the matrix multiply example follows a similar pattern (more FLOPs for bigger matrices) yet gets stuck around 32 GFLOPs.
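The size-dependent bandwidth is consistent with a fixed per-transfer overhead. A minimal cost model, with an assumed (illustrative) setup latency and the 2 GB/s pinned-memory peak from earlier:

```python
# Cost model: each transfer pays a fixed setup latency plus size/peak_bw,
# so effective bandwidth rises toward the peak as transfers get bigger.
latency_s = 100e-6          # assumed per-transfer overhead (illustrative)
peak_gb_s = 2.0             # pinned-memory peak quoted above

def effective_bw(size_mb):
    t = latency_s + (size_mb / 1e3) / peak_gb_s
    return (size_mb / 1e3) / t          # GB/s

for size in (0.1, 1.0, 10.0, 100.0):
    print(size, round(effective_bw(size), 2))  # climbs toward 2.0 GB/s
```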

So, I still wonder if someone has been able to get results in line with the peak GFLOPs.

Best Regards,

Just to clarify a little bit further on my previous comments, I’ve been using the program on “/usr/local/cuda/samples/0_Simple/matrixMul” for measuring FLOPs and the one on “/usr/local/cuda/samples/1_Utilities/bandwidthTest/” for measuring bandwidth (on Jetson Nano, with Jetpack 4.2 SD Card Image).

For matrixMul, a 32 GFLOPs limit seems to be reached with the following (or similar) parameters: “matrixMul -wA=4096 -hA=2048 -wB=2048 -hB=4096”.
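For what it’s worth, the FLOP count behind the sample’s GFLOPs figure can be recomputed. If I recall the sample’s accounting correctly, it counts one multiply and one add per inner-product element (2 × hA × wA × wB):

```python
# FLOPs per multiply for matrixMul -wA=4096 -hA=2048 -wB=2048 -hB=4096,
# assuming the sample's 2 * hA * wA * wB accounting.
wA, hA, wB = 4096, 2048, 2048
flops = 2.0 * wA * hA * wB
gflops_measured = 32.0              # the plateau I observe
t_s = flops / (gflops_measured * 1e9)
print(round(flops / 1e9, 1), round(t_s, 2))  # 34.4 1.07
```

So at 32 GFLOPs each multiply takes about a second, against an ideal of well under 200 ms at the theoretical rate.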

For bandwidthTest I check the maximum bandwidth for different transfer sizes with “bandwidthTest --memory=pageable --mode=shmoo” for pageable memory and “bandwidthTest --memory=pinned --mode=shmoo” for pinned memory. For a full HD (1920x1080) image this value is around 1 GB/s for pageable memory and 2 GB/s for pinned memory (on host-to-device transfers).
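Those bandwidths translate directly into the per-frame transfer times mentioned earlier. A quick check, assuming an 8-bit grayscale full HD frame:

```python
# Host-to-device transfer time for a full HD frame (assumed 1 byte/pixel)
# at the measured pageable (1 GB/s) and pinned (2 GB/s) bandwidths.
size_gb = 1920 * 1080 * 1 / 1e9
for bw_gb_s in (1.0, 2.0):
    print(bw_gb_s, round(size_gb / bw_gb_s * 1e3, 2))  # 2.07 ms / 1.04 ms
```

Both are comfortably under the 3 ms upper bound quoted in the first post.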

Best Regards,

Latency and bandwidth are two of the three costs – the third being “synchronization.”
Unfortunately, that’s also the hardest to quantify without testing various possible implementations, because it’s an implementation detail of the runtime and may sometimes pop up in quite surprising places.

I’ll go back to my original question: “is there any way to get closer to the advertised FLOPs or should I consistently expect the same performance penalty for most kernels?”
I expected either a yes (“try this code that runs at theoretical FLOPs”) or a no (“it is not possible to achieve maximum FLOPs, because…”).
A series of steps to consistently optimize GPU code would have also been greatly appreciated.

As far as I’ve seen, carefully selecting block/thread sizes, taking advantage of shared memory, coalescing memory accesses and thinking in terms of warps should help (though it doesn’t always).
Also, the algorithms I’ve mentioned so far are a variation of a stencil and / or parallel reduction problem (convolution, lbp, hog, svm, stereo bm) so they should benefit the most from a GPU.
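One framing I’ve seen for the “can this kernel ever hit peak FLOPs” question is a roofline-style estimate: a kernel is compute-bound only if its arithmetic intensity (FLOPs per byte of DRAM traffic) exceeds the ratio of peak compute to peak bandwidth. A sketch with assumed numbers (~236 GFLOPs FP32, and 25.6 GB/s for the Nano’s LPDDR4 if I have the spec right):

```python
# Roofline sketch (assumed peak figures, not measurements).
peak_gflops = 236.0      # assumed FP32 peak
peak_bw_gb_s = 25.6      # assumed LPDDR4 bandwidth

def attainable(ai_flops_per_byte):
    # Attainable GFLOPs = min(compute roof, bandwidth roof * intensity)
    return min(peak_gflops, ai_flops_per_byte * peak_bw_gb_s)

ridge = peak_gflops / peak_bw_gb_s   # intensity needed to be compute-bound
print(round(ridge, 1))               # ~9.2 flops/byte
print(attainable(1.0))               # bandwidth-bound: 25.6
print(attainable(50.0))              # compute-bound: 236.0
```

If the stencil kernels end up below that ridge point after accounting for actual DRAM traffic, the 32 GFLOPs plateau would be a bandwidth limit rather than a compute one.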

I find the answers so far unsatisfying, since they don’t address:

  • Whether, under specific conditions, it is possible to achieve theoretical FLOPs (with code demonstrating so).
  • Whether it is impossible to ever achieve theoretical FLOPs (given some clearly explained practical limitation).
  • What series of steps can be followed to methodically improve under-performing code.

I think the above questions should have definite answers, as is the case for other platforms (CPU, DSP and FPGA).