I think NPP stands nowhere when compared to IPP when you look at the functionality.
I have come across very few posts talking about NPP in the forums as well…
So I would guess that NPP development is not on par with CUDA…
Maybe even within NVIDIA it gets the lowest priority…
We’re looking into NPP because we currently use IPP already so converting functions has been relatively easy and the performance gains substantial. I like it because it is abstracting away deep knowledge of GPU programming and providing primitives to build more complex algorithms. I do wish NPP supported more IPP functions, but I guess I just have to hope that NVIDIA will continue development on NPP and wait for new releases. I asked one of the developers on NPP (Frank Jargstorff) about the Hough transform since we use that too and this is what he said on 2/16/11:
I actually have a question and a comment - I am also porting some functions from IPP to NPP, but I have found that I am not getting the performance gains that I expect. For example, I am trying to do a VERY SIMPLE vector addition. In other words, vectorA + vectorB = vectorC. In IPP, I use “ippsAdd_32f”. In NPP, I am forced to use the “nppiAdd_32f_C1R” command that is reserved for image processing to simply add two vectors together.
The first thing is that I am surprised there isn’t a plain vector/vector addition function in NPP, so that I wouldn’t have to resort to an image processing command…
The second thing I noticed is that if my vectors are longer than 65535*8 = 524280 samples (floats), I get the NppStatus error “NPP_CUDA_KERNEL_EXECUTION_ERROR” (also known as -3). I know that 65535 is the max number of blocks, and that 8 is the number of cores per processor, but it’s sort of weird that I can never use NPP with vectors longer than 524280 samples!
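As a sanity check on that arithmetic, here is a minimal sketch. It assumes the limit really is "maximum number of blocks times elements covered per block", which is only my reading of the observed number and not something the NPP documentation confirms. The function name `max_elements` is made up for illustration.

```cpp
#include <cstdint>

// Largest element count one launch could cover if every block handles a
// fixed number of elements. This launch model is an assumption used to
// explain the observed limit; it is not NPP's documented behavior.
constexpr std::int64_t max_elements(std::int64_t max_blocks,
                                    std::int64_t elems_per_block) {
    return max_blocks * elems_per_block;
}

// 65535 blocks, 8 elements per block: the vector length where the error starts.
static_assert(max_elements(65535, 8) == 524280,
              "matches the limit observed with nppiAdd_32f_C1R");
```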
Finally, the third thing is that I timed the two additions, one using ippsAdd_32f and the other using nppiAdd_32f_C1R, and the IPP version is faster.
I will be curious to hear about your thoughts and any insights into how you have used NPP.
I have good news for you. With release 4.0 we are providing a complete set of signal-arithmetic primitives.
Your workaround of using an image-processing primitive for vector operations is problematic, not only because of the size restrictions, but also because our strategy of tiling CTAs over the image region being processed results in sub-par performance for images of height 1.
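To make the tiling point concrete, here is a rough sketch of how much of a launch does useful work when a 2D tile of threads is laid over an image. The 32x8 tile shape in the usage note and the one-thread-per-pixel mapping are assumptions chosen for illustration; the actual CTA dimensions NPP uses are not stated in this thread, and `thread_utilization` is a hypothetical helper.

```cpp
#include <cassert>

// Fraction of launched threads that map to a real pixel when an image of
// width x height pixels is covered by tiles of tile_w x tile_h threads,
// one thread per pixel (assumed mapping, not NPP's documented one).
double thread_utilization(int width, int height, int tile_w, int tile_h) {
    assert(width > 0 && height > 0 && tile_w > 0 && tile_h > 0);
    // Number of tiles needed in each direction (ceiling division).
    long long tiles_x = (width + tile_w - 1) / tile_w;
    long long tiles_y = (height + tile_h - 1) / tile_h;
    long long launched = tiles_x * tile_w * tiles_y * tile_h;
    long long useful = static_cast<long long>(width) * height;
    return static_cast<double>(useful) / static_cast<double>(launched);
}
```

With a hypothetical 32x8 tile, `thread_utilization(1024, 1, 32, 8)` gives 0.125: in a height-1 "image", seven of every eight threads in a tile have no pixel to process, whereas a 1024x8 image fills the tiles completely.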
Thanks for your msg - as it turns out, I installed CUDA 4.0 just last night, and got it up and running with my code, and used “nppsAdd_32f” for my very simple vector-vector addition experiment.
That being said, I noticed a couple things that perplex me so far:
First thing is, I again noticed a size restriction on my addition. Basically, I cannot add two vectors if their lengths are greater than 65535*512 = 33553920 samples (floats). (I have enough device RAM to store a lot more than that, btw.)
-Is there really a size restriction?
-If there is, how would I go about using NPP to add vectors of lengths greater than that?
-NPP is still slower than the IPP equivalent. (By a factor of 6x).
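If the length limit turns out to be real, one possible workaround is to split the vectors into chunks that each stay under it and call the add once per chunk. The sketch below is only that, a sketch: `add_in_chunks` and `cpu_add` are hypothetical names, and `cpu_add` is a CPU stand-in so the logic can be checked without a GPU. On the device, the callable would wrap nppsAdd_32f on device pointers, converting the chunk length to whatever length type your NPP version expects.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Adds a + b into c by calling add_chunk(a, b, c, n) repeatedly, so that n
// never exceeds max_chunk. Returns the number of calls made.
// add_chunk is a placeholder for the real per-chunk add (for example a thin
// wrapper around nppsAdd_32f operating on device pointers).
template <typename AddFn>
std::size_t add_in_chunks(const float* a, const float* b, float* c,
                          std::size_t length, std::size_t max_chunk,
                          AddFn add_chunk) {
    assert(max_chunk > 0);
    std::size_t calls = 0;
    for (std::size_t offset = 0; offset < length; offset += max_chunk) {
        // The last chunk may be shorter than max_chunk.
        std::size_t n = std::min(max_chunk, length - offset);
        add_chunk(a + offset, b + offset, c + offset, n);
        ++calls;
    }
    return calls;
}

// CPU stand-in for the per-chunk add, used only to exercise the chunking.
inline void cpu_add(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

For the limit observed above, `max_chunk` would be 33553920. Each extra call adds launch overhead, so this addresses the size restriction only, not the speed difference.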
The second issue (and this is one that really has me wondering) - I went ahead and timed two separate functions, using the GPU timers described in the best-practices documents. One is a function I wrote myself that adds two vectors together (it’s very, very simple and really just a copy/paste from the ‘learn cuda by example’ book), and the other is nppsAdd_32f. I call my add function using 128 blocks and 128 threads/block. Guess what - MY function is 3x faster than NPP! :-/
-What gives here??
-(Related question - thinking out loud here for possible reasons: Is there a way to set the number of threads or blocks used by NPP, similar to how you can set max threads in IPP? Is that maybe why its performance is still bad?)
Thanks in advance Frank and I really do appreciate your feedback since we are really interested in using NPP/CUDA for our apps!