Why does NPP median get very slow with kernel width 9 or more?

I’m doing image processing on RGB float images with NPP, specifically nppiFilterMedian_32f_C3R (with the scratch buffer sized via nppiFilterMedianGetBufferSize_32f_C3R). With a 7x7 kernel on a 4096x3432 image, performance is quite good: 127 ms on our A6000 GPU. But with a 9x9 kernel, performance suddenly becomes more than 10x worse: 1722 ms. I understand the work scales roughly as k^2, but this is far worse than that would predict. At kernel size 9, the GPU is actually slower than our high-end CPU running a multithreaded median based on std::nth_element. I don’t suppose it’s possible to see the NPP source? Should I look into coding my own median using shared memory? I was expecting NPP to be well optimized.
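
For reference, this is roughly how I’m calling and timing it (a trimmed-down sketch, not my exact production code; error checking and border handling are omitted):

```cpp
#include <npp.h>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const int width = 4096, height = 3432;
    const NppiSize roi = { width, height };

    // Allocate padded device images (3-channel float).
    int srcStep = 0, dstStep = 0;
    Npp32f *d_src = nppiMalloc_32f_C3(width, height, &srcStep);
    Npp32f *d_dst = nppiMalloc_32f_C3(width, height, &dstStep);
    // ... upload the source image into d_src here ...
    // NB: real code must offset/shrink the ROI (or pad the image) so the
    // mask never reads outside the allocation.

    for (int k = 7; k <= 9; k += 2)   // compare 7x7 vs 9x9
    {
        const NppiSize  mask   = { k, k };
        const NppiPoint anchor = { k / 2, k / 2 };

        // Scratch buffer required by the median filter.
        Npp32u bufSize = 0;
        nppiFilterMedianGetBufferSize_32f_C3R(roi, mask, &bufSize);
        Npp8u *d_buf = nullptr;
        cudaMalloc((void **)&d_buf, bufSize);

        // Time just the filter call with CUDA events.
        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);
        cudaEventRecord(start);
        nppiFilterMedian_32f_C3R(d_src, srcStep, d_dst, dstStep,
                                 roi, mask, anchor, d_buf);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("mask %dx%d: %.1f ms\n", k, k, ms);

        cudaEventDestroy(start); cudaEventDestroy(stop);
        cudaFree(d_buf);
    }

    nppiFree(d_src);
    nppiFree(d_dst);
    return 0;
}
```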

My guess would be that there is an issue related to shared memory utilization. You might be able to confirm it by profiling each case (e.g. with Nsight Compute), since the profiler can report both static and dynamic shared memory use for each kernel launch, and it would also show whether the 7x7 and 9x9 cases end up launching different kernels.

Each thread may need to do an independent sort over the elements covered by the mask. A 7x7 mask is 49 elements per thread, which for 32-bit float data is about 196 bytes; at 256 threads per block that is already slightly over the default 48KB of shared memory per threadblock, and a 9x9 mask (81 elements, ~324 bytes per thread) needs roughly two-thirds more. So it’s possible that above some mask size, shared memory usage for the particular algorithm is no longer an option, and some alternate, slower method is used.
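
To put rough numbers on that guess (treating both the per-thread workspace and the block size as unknowns, since neither is documented for NPP, and whether a thread holds one channel’s footprint or all three at once is another unknown):

```cpp
#include <cstdio>

int main()
{
    // Speculative model: each thread keeps one mask footprint of floats in
    // shared memory while it selects the median. This is a guess about the
    // NPP implementation, not documented behaviour.
    const size_t kDefaultLimit = 48 * 1024;   // default shared memory per block
    const size_t kOptInLimit   = 99 * 1024;   // approx. opt-in max on an A6000 (cc 8.6)

    for (int k = 7; k <= 9; k += 2)
        for (int channels = 1; channels <= 3; channels += 2)
            for (int threads = 128; threads <= 256; threads *= 2)
            {
                size_t perThread = (size_t)k * k * channels * sizeof(float);
                size_t perBlock  = perThread * threads;
                printf("%dx%d mask, %d ch, %3d threads: %6zu bytes/block  %s\n",
                       k, k, channels, threads, perBlock,
                       perBlock <= kDefaultLimit ? "(fits 48KB)" :
                       perBlock <= kOptInLimit   ? "(needs opt-in)" : "(too big)");
            }
    return 0;
}
```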

It’s just a guess. And yes, I acknowledge that if you restrict yourself to e.g. 128 threads per block, you can have a lot more shared memory per thread. Plus an A6000 (compute capability 8.6) can opt in to more than the default 48KB per threadblock, up to roughly 99KB of dynamic shared memory. So the theory may not be correct. But simply choosing smaller threadblock sizes can have performance implications of its own, due to reduced occupancy.
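
For completeness, if you do end up writing your own shared-memory median with a large per-thread workspace, the >48KB opt-in looks like this (a sketch only; myMedianKernel is just a placeholder for whatever kernel you would write, not anything from NPP):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel: keeps a per-thread sorting workspace in dynamically
// allocated shared memory. The body is omitted; only the launch
// configuration matters for this example.
__global__ void myMedianKernel(const float *src, float *dst,
                               int width, int height, int maskSize)
{
    extern __shared__ float workspace[];   // maskSize*maskSize floats per thread
    // ... per-thread median selection would go here ...
}

int main()
{
    const int maskSize = 9, threadsPerBlock = 256;
    size_t shmemBytes = (size_t)maskSize * maskSize * threadsPerBlock * sizeof(float); // ~81KB

    // Opt in to more than the default 48KB of dynamic shared memory per block.
    // A cc 8.6 device (A6000) allows up to ~99KB per block this way.
    cudaFuncSetAttribute(myMedianKernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)shmemBytes);

    // The launch would pass the dynamic size as the third <<<>>> parameter:
    // myMedianKernel<<<numBlocks, threadsPerBlock, shmemBytes>>>(...);

    // This shows the occupancy cost: with ~81KB per block, at most one block
    // can be resident per SM on this part.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myMedianKernel,
                                                  threadsPerBlock, shmemBytes);
    printf("%zu bytes of shared memory per block -> %d resident block(s) per SM\n",
           shmemBytes, blocksPerSM);
    return 0;
}
```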