npp and parallelism


Is it possible to use NPP functions and CUDA kernels in parallel, or will the NPP functions always take over the whole GPU?

Are NPP functions expected to be faster than custom CUDA kernels?

I reimplemented several NPP functions as kernels and got much better performance (but I don’t know if everything is the same under the hood). Do NPP functions use fast math by default?



NPP is an NVIDIA GPU-accelerated version of IPP.
It can be up to 30x faster than IPP, but we don’t benchmark it against other CUDA implementations.

This should be case-dependent.
Since both implementations use the GPU, the performance difference comes down to the implementation details.
In general, our library should perform better than third-party implementations.

NPP doesn’t take the current GPU load into account and by default uses the maximum available GPU resources.
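That said, NPP work can be issued on a non-default CUDA stream, so an NPP call and a custom kernel can at least be *submitted* concurrently; whether they actually overlap depends on whether either one saturates the GPU. A minimal sketch, assuming the CUDA toolkit with NPP installed (the `NppStreamContext` / `_Ctx` API; `nppsAdd_32f_Ctx` is just an arbitrary example call, and error checking is omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <npps.h>

// Trivial custom kernel to issue alongside the NPP call.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c, *d;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t sNpp, sKernel;
    cudaStreamCreate(&sNpp);
    cudaStreamCreate(&sKernel);

    // Build an NPP stream context bound to a non-default stream.
    NppStreamContext ctx;
    nppGetStreamContext(&ctx);  // fills in device properties for the current device
    ctx.hStream = sNpp;         // redirect NPP work to our stream

    // The NPP call and the custom kernel go to different streams,
    // so the hardware may overlap them if resources are available.
    nppsAdd_32f_Ctx(a, b, c, n, ctx);
    scale<<<(n + 255) / 256, 256, 0, sKernel>>>(d, 2.0f, n);

    cudaStreamSynchronize(sNpp);
    cudaStreamSynchronize(sKernel);

    cudaFree(a); cudaFree(b); cudaFree(c); cudaFree(d);
    cudaStreamDestroy(sNpp);
    cudaStreamDestroy(sKernel);
    return 0;
}
```

Note that a large NPP call will still launch enough blocks to fill the device, in which case the second stream gets little or no overlap; streams enable concurrency but don’t partition the GPU.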