NPP boxfilter performance

I am relatively new to CUDA development, so please excuse the question.

I am using OpenCV, which calls NPP. In my code I am trying to use a basic box filter (cv::blur, which becomes cv::cuda::createBoxFilter). However, I have very large kernel sizes (typically above 100, and often larger than 500).

What I noticed is that when the box filter kernel is smaller than about 32, the GPU is faster, but with large kernel sizes the CPU is much, much faster. At a kernel size of 500 the box filter on the GPU takes almost a second. This is not just on the first pass (initialization); it happens on every subsequent call as well.
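
For reference, what I expected was a cost per pixel that stays roughly constant as the kernel grows, as with a prefix-sum (running-sum) box filter. Below is a minimal 1D CPU sketch of that idea next to the naive version. The function names are hypothetical, the window is simply truncated at the borders (this is an assumption and not OpenCV's default border mode), and I do not know what NPP or OpenCV actually do internally.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Naive box mean: the inner loop runs over the whole window, so the cost
// per sample grows with the window size (2 * radius + 1).
std::vector<double> boxMeanNaive(const std::vector<double>& src, int radius) {
    assert(radius >= 0);
    const int n = static_cast<int>(src.size());
    std::vector<double> dst(src.size());
    for (int i = 0; i < n; ++i) {
        const int lo = std::max(0, i - radius);      // window truncated at the borders
        const int hi = std::min(n - 1, i + radius);  // (assumption, not OpenCV's default)
        double sum = 0.0;
        for (int j = lo; j <= hi; ++j) sum += src[j];
        dst[i] = sum / (hi - lo + 1);
    }
    return dst;
}

// Prefix-sum box mean: one pass to build the prefix sums, then two lookups
// per sample, so the cost per sample does not depend on the radius.
std::vector<double> boxMeanPrefix(const std::vector<double>& src, int radius) {
    assert(radius >= 0);
    const int n = static_cast<int>(src.size());
    std::vector<double> prefix(src.size() + 1, 0.0);  // prefix[k] = sum of src[0..k-1]
    for (int i = 0; i < n; ++i) prefix[i + 1] = prefix[i] + src[i];
    std::vector<double> dst(src.size());
    for (int i = 0; i < n; ++i) {
        const int lo = std::max(0, i - radius);
        const int hi = std::min(n - 1, i + radius);
        dst[i] = (prefix[hi + 1] - prefix[lo]) / (hi - lo + 1);
    }
    return dst;
}
```

A 2D box filter could be built from this by filtering along rows and then along columns. I would rather use an existing GPU implementation that scales like the second function than write my own.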

At first I thought it was my code, but then I verified it with OpenCV’s filter performance tests by adding large kernels.

My home PC has an RTX 2060 SUPER and an AMD Ryzen 7 CPU. Eventually this code will run on an RTX 4000 or RTX 5000.

Any thoughts? I am trying not to reinvent the wheel.