I am relatively new to CUDA development, so please excuse the question.
I am using OpenCV, which calls NPP. In my code I am trying to use a basic box filter (cv::blur, which becomes cv::cuda::createBoxFilter). However, I have very large kernel sizes (typically above 100, and often larger than 500).
What I noticed is that when the box filter is smaller than roughly 32 the GPU is faster, but with large kernel sizes the CPU is much, much faster. At a kernel size of 500 the GPU box filter takes almost a second. This is not just on the first pass (initialization); it happens on every subsequent call as well.
At first I thought it was my code, but then I reproduced it with OpenCV's filter performance tests by adding large kernels.
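To make clear what kind of scaling I mean, here is a minimal CPU-only sketch in plain C++ (no OpenCV or NPP). It compares a naive 1D box filter, whose cost grows with the kernel size, against a running-sum version whose cost per element does not depend on the kernel size. The function names are my own and purely illustrative. I am only guessing that the GPU path behaves like the naive version; I have not looked at how NPP actually implements it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Clamp an index into [0, n-1], i.e. replicate the border samples.
static std::size_t clampIndex(long i, std::size_t n) {
    if (i < 0) return 0;
    if (i >= static_cast<long>(n)) return n - 1;
    return static_cast<std::size_t>(i);
}

// Naive 1D box filter: every output re-sums its whole window, O(n * k).
std::vector<double> boxNaive(const std::vector<double>& src, int k) {
    assert(k > 0 && k % 2 == 1);  // odd kernel so the window is centered
    const long r = k / 2;
    const std::size_t n = src.size();
    std::vector<double> dst(n);
    for (std::size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        for (long j = -r; j <= r; ++j)
            sum += src[clampIndex(static_cast<long>(i) + j, n)];
        dst[i] = sum / k;
    }
    return dst;
}

// Running-sum 1D box filter: slide the window by adding the sample that
// enters and subtracting the one that leaves, O(n) regardless of k.
std::vector<double> boxRunningSum(const std::vector<double>& src, int k) {
    assert(k > 0 && k % 2 == 1);
    const long r = k / 2;
    const std::size_t n = src.size();
    std::vector<double> dst(n);
    if (n == 0) return dst;

    // Sum of the first window, centered on index 0.
    double sum = 0.0;
    for (long j = -r; j <= r; ++j)
        sum += src[clampIndex(j, n)];
    dst[0] = sum / k;

    for (std::size_t i = 1; i < n; ++i) {
        sum += src[clampIndex(static_cast<long>(i) + r, n)];      // entering
        sum -= src[clampIndex(static_cast<long>(i) - r - 1, n)];  // leaving
        dst[i] = sum / k;
    }
    return dst;
}
```

A 2D box filter is separable, so the same idea can be applied along rows and then along columns. That is why I expected the run time to stay roughly flat as the kernel grows, rather than blow up at sizes like 500.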
My home PC has an RTX 2060 SUPER and an AMD Ryzen 7 CPU. Eventually this code will run on an RTX4000 or RTX5000.
Any thoughts? I am trying not to reinvent the wheel.