Implementation details about nppiFilterMedian

Hi,

I am trying to embed nppiFilterMedian_16u_C1R into my real-time processing pipeline, but it runs too slowly for my needs. I am using CUDA 10.0 on a Jetson Nano.

Where can I find more information about the implementation of nppiFilterMedian_16u_C1R? In particular, I have two questions:

  1. What algorithm is used to find median values? Is it the most efficient algorithm possible on CUDA?
  2. How is the border case handled? Is the input matrix padded with zeros?
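
To make the second question concrete, here is a minimal CPU reference I could compare the NPP output against. It is only a sketch: the function name `medianFilterRef`, the `Border` enum, and the two border modes (zero padding and edge replication) are my own assumptions for illustration, and I do not know which of them, if either, matches what NPP does. It also assumes an odd, square mask with a centered anchor.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical border modes for comparison; not taken from NPP.
enum class Border { Zero, Replicate };

// Returns pixel (x, y); coordinates outside the image follow the border mode.
static uint16_t pixelAt(const std::vector<uint16_t>& img, int w, int h,
                        int x, int y, Border mode) {
    if (x >= 0 && x < w && y >= 0 && y < h) return img[y * w + x];
    if (mode == Border::Zero) return 0;
    // Replicate: clamp the coordinates to the nearest valid pixel.
    x = std::max(0, std::min(x, w - 1));
    y = std::max(0, std::min(y, h - 1));
    return img[y * w + x];
}

// Brute-force k x k median filter (k odd) on a row-major 16-bit image.
std::vector<uint16_t> medianFilterRef(const std::vector<uint16_t>& img,
                                      int w, int h, int k, Border mode) {
    std::vector<uint16_t> out(img.size());
    std::vector<uint16_t> window;
    window.reserve(static_cast<size_t>(k) * k);
    const int r = k / 2;
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            window.clear();
            for (int dy = -r; dy <= r; ++dy)
                for (int dx = -r; dx <= r; ++dx)
                    window.push_back(pixelAt(img, w, h, x + dx, y + dy, mode));
            // Partial sort: places the median at the middle position.
            auto mid = window.begin() + window.size() / 2;
            std::nth_element(window.begin(), mid, window.end());
            out[y * w + x] = *mid;
        }
    }
    return out;
}
```

On a constant image the two modes differ only near the edges (a zero-padded 3x3 window at a corner contains five zeros, so the corner median becomes 0), which is the kind of difference I would look for in the NPP result.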

Cheers