NPP boxfilter performance

I am relatively new to CUDA development, so please excuse the question.

I am using OpenCV, which calls NPP. In my code I am trying to use a basic box filter (cv::blur, which becomes cv::cuda::createBoxFilter). However, I have very large kernel sizes (typically above 100, and often larger than 500).

What I noticed is that when the box filter kernel is smaller than about 32, the GPU is faster, but with large kernel sizes the CPU is much, much faster. At a kernel size of 500 the GPU box filter takes almost a second. This is not just on the first pass (initialization); it happens on every subsequent call as well.
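To illustrate why a CPU box filter does not have to slow down as the kernel grows, here is a minimal 1D sketch in plain C++. It is only an illustration under assumptions: I do not know how NPP or OpenCV implement their filters internally, the function names `boxSumNaive` and `boxSumRunning` are made up for this post, the kernel size `k` is assumed to be odd, and samples outside the row are treated as zero. The naive version re-adds all `k` samples for every output, while the running-sum version adds one entering sample and subtracts one leaving sample per output.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: naive 1D box sum.
// Every output re-adds all k inputs, so the work per sample grows with k.
std::vector<uint32_t> boxSumNaive(const std::vector<uint8_t>& src, int k) {
    const int n = static_cast<int>(src.size());
    const int r = k / 2;  // window radius, k assumed odd
    std::vector<uint32_t> dst(n, 0);
    for (int i = 0; i < n; ++i) {
        for (int j = i - r; j <= i + r; ++j) {
            if (j >= 0 && j < n) dst[i] += src[j];  // outside the row counts as zero
        }
    }
    return dst;
}

// Hypothetical helper: sliding-window 1D box sum.
// After priming the first window, each output costs one add and one subtract,
// so the work per sample does not depend on k.
std::vector<uint32_t> boxSumRunning(const std::vector<uint8_t>& src, int k) {
    const int n = static_cast<int>(src.size());
    const int r = k / 2;  // window radius, k assumed odd
    std::vector<uint32_t> dst(n, 0);
    uint32_t sum = 0;
    // Prime the window for output 0, which covers indices [0, r].
    for (int j = 0; j <= r && j < n; ++j) sum += src[j];
    for (int i = 0; i < n; ++i) {
        dst[i] = sum;
        const int enter = i + 1 + r;  // sample entering the window for output i + 1
        const int leave = i - r;      // sample leaving the window for output i + 1
        if (enter < n) sum += src[enter];
        if (leave >= 0) sum -= src[leave];
    }
    return dst;
}
```

Both functions return the same sums for the same input, so comparing `boxSumRunning(row, k)` against `boxSumNaive(row, k)` for a few values of `k` is a quick sanity check. Applying the running sum along rows and then along columns gives a 2D box sum, and dividing by the window area gives the mean. If the GPU path does work proportional to the kernel size for every pixel, that could explain the gap I am seeing, but that is a guess on my part.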

At first I thought it was my code, but then I verified it with OpenCV’s filter performance tests by adding large kernels.

My home PC has an RTX2060 SUPER and an AMD Ryzen 7 CPU. Eventually this code will run on an RTX4000 or RTX5000.

Any thoughts? I am trying not to reinvent the wheel.

I call nppiFilterBoxBorder_8u_C1R directly from plain C++ and see the same problem with large kernels, which confuses me. My image size is 240M. With a kernel size of 13 it takes almost 1 sec. With a kernel size of 301 it takes 124.xxxxx sec. My graphics card is a GeForce 1650 Ti with Max-Q design.