Jetson Nano convolution operation as fast as possible

Hello. I have a Jetson Nano development board (JetPack 4.3) on which I'm developing a simple image processing application. It has to apply convolutions with a 7x7 kernel matrix to a 640x512 image as fast as possible. At the moment I'm using the OpenCV 4 library with CUDA support in a Qt application, and I call the convolve method of the cv::cuda::Convolution class. The problem is that while a cycle of convolutions is running, the GPU is not at 100% load (it sits around 50-60%), which is not good for performance. I've also tried the VisionWorks classes and functions for convolution (again in the Qt application), but then the computation is executed by the CPU and not by the GPU. What do you think is the best way to build a Qt application that performs convolutions (7x7 kernel, 640x512 image) on the GPU with the best performance? Is it better to use OpenCV or VisionWorks? Please explain the solutions in detail, because I'm quite new to Jetson Nano development. Thank you all very much indeed! LuigiDo

I'm not sure exactly how the GPU reports usage, but you've got quite a memory bottleneck there. The Jetson Nano can do 472 GFLOPS (NVIDIA's quoted peak, which is the FP16 figure). At 4 bytes per float operand that corresponds to 472 * 4 = 1,888 GB/s of operand traffic, in contrast to the 25.6 GB/s of memory bandwidth. That is a ratio of 1888 / 25.6 = 73.75, meaning you'd need to do about 73 operations on each 4-byte value read from memory (roughly 18 per byte) to not be memory bottlenecked. You can try feeding in a lot of images and timing it to see how close you're getting to the 25.6 GB/s you'd get if the convolution parameters were cached and you got perfect cache usage of the overlapping reads into the 7x7 tiles on the image.
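To make the arithmetic above concrete, here is a minimal C++ sketch of the same estimate. The function names are hypothetical, and the traffic model is an assumption: each pixel is read once and written once (perfect reuse of the overlapping 7x7 windows), and each kernel tap costs one multiply and one add. The peak figures are the ones quoted in this thread.

```cpp
// Back-of-the-envelope roofline helpers. All names are illustrative.

// Machine balance: FLOPs the GPU can execute per byte it can move from DRAM.
double machine_balance(double peak_flops, double peak_bytes_per_s) {
    return peak_flops / peak_bytes_per_s;
}

// Assumed DRAM traffic for one image: every pixel read once, written once.
double bytes_per_image(int width, int height, double bytes_per_pixel) {
    return 2.0 * width * height * bytes_per_pixel;
}

// FLOPs per byte for a k x k convolution under the traffic model above:
// k*k taps per output pixel, one multiply and one add per tap.
double conv_intensity(int k, double bytes_per_pixel) {
    const double flops_per_pixel = 2.0 * k * k;
    return flops_per_pixel / (2.0 * bytes_per_pixel);
}

// Best-case time per image if DRAM bandwidth were the only limit.
double min_seconds_per_image(int width, int height, double bytes_per_pixel,
                             double peak_bytes_per_s) {
    return bytes_per_image(width, height, bytes_per_pixel) / peak_bytes_per_s;
}

// Bandwidth actually achieved, given a measured time for a batch of images.
double effective_bandwidth(int width, int height, double bytes_per_pixel,
                           int images, double seconds) {
    return images * bytes_per_image(width, height, bytes_per_pixel) / seconds;
}
```

With the thread's numbers, machine_balance(472e9, 25.6e9) is about 18.4 FLOPs per byte, while conv_intensity(7, 4.0) is 12.25, so under these assumptions a single 7x7 float convolution stays on the memory-bound side. min_seconds_per_image(640, 512, 4.0, 25.6e9) comes out at roughly 102 microseconds, which gives a target to compare measured timings against via effective_bandwidth.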

You could try writing your own kernel with the 7x7 kernel preloaded into shared memory and see if you can do it faster.

As @ratzes points out, your code may be memory bandwidth limited on the GPU. If the CUDA profiler works with the Jetson Nano, I would suggest using that to find out what the limiting factors are.

Side remark: Questions pertaining to the various Jetson platforms typically receive faster and more in-depth answers in the sub-forums dedicated to those embedded platforms.

Thank you for your answer. Actually, in our final application a new 640x512 image arrives from a sensor every 200 ms, and to each image we have to apply 3 different convolutions with 3 different 7x7 matrices. Is it possible to transfer the image once and apply the three convolutions more or less at the same time on the GPU? Is it possible to do this with the cv::cuda::Convolution class, or is it necessary to write my own routines at the CUDA level? Thank you! LuigiDo
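One way to picture "transfer once, apply three kernels" is a fused pass in which every neighbourhood pixel is read a single time and feeds three accumulators. The following is only a plain C++ CPU reference sketch of that idea, not OpenCV or CUDA code; the loop body is roughly what a hand-written GPU kernel would do per output pixel. The name convolve3_fused, the row-major float layout and the clamp-to-edge border handling are all assumptions made for illustration, and the kernels are applied without flipping (strictly a correlation).

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

constexpr int K = 7;      // kernel side length, as in the thread
constexpr int R = K / 2;  // kernel radius

using Kernel = std::array<float, K * K>;  // row-major 7x7 coefficients
using Image = std::vector<float>;         // row-major width*height pixels

// Apply three 7x7 kernels to src in one pass and return the three results.
// Each source pixel in the window is loaded once and reused for all kernels.
// Out-of-range coordinates are clamped to the nearest edge (an assumption).
std::array<Image, 3> convolve3_fused(const Image& src, int width, int height,
                                     const std::array<Kernel, 3>& kernels) {
    std::array<Image, 3> out;
    for (auto& o : out) o.assign(static_cast<std::size_t>(width) * height, 0.0f);

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc[3] = {0.0f, 0.0f, 0.0f};
            for (int ky = 0; ky < K; ++ky) {
                const int sy = std::min(std::max(y + ky - R, 0), height - 1);
                for (int kx = 0; kx < K; ++kx) {
                    const int sx = std::min(std::max(x + kx - R, 0), width - 1);
                    const float p = src[static_cast<std::size_t>(sy) * width + sx];
                    const int ki = ky * K + kx;
                    // One load of p serves all three kernels.
                    for (int k = 0; k < 3; ++k) acc[k] += kernels[k][ki] * p;
                }
            }
            const std::size_t i = static_cast<std::size_t>(y) * width + x;
            for (int k = 0; k < 3; ++k) out[k][i] = acc[k];
        }
    }
    return out;
}
```

The point of the fused loop is that the image traffic is paid once instead of three times; only the three output images add extra writes. Whether this beats three separate library calls on the Nano would have to be measured.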