NPP, average, median, std dev, and more

Could the release version of NPP 4.0 have average, median, and standard deviation programmed in for all the data types (there is one 8 bit version in there)?

It saves a lot of programming time.

I realize this might work out more easily in Thrust; it just seems to involve learning another library and converting to C++.

Are you requesting those functions for image processing or for signal (vector) processing? You’re mentioning Thrust, which made me think you are talking about signal processing.

In any case, I’m glad to hear that NPP saves you programming time and we’re definitely adding more functionality to NPP with every release. The more specific you can be in your requests for new functions, the better your chances of us adding what you need.



Well, yeah, I’m using its signal processing capabilities, although I may use the image processing ones too.

I’d really be looking for (this is changed from nppsMinMax_32f):

NppStatus nppsAvgStdDev_32f (const Npp32f *pSrc, int nLength, Npp32f *pAvg, Npp32f *pStdDev, Npp8u *pDeviceBuffer)

However, NPP seems to be crashing on me at the sizes I need, which are vectors of more than 125 million floats. It seems to work fine with vectors of around 10 million floats, though.

Is there some size limit?

I did look at the implementations of the functions in question. There is nothing obvious that would explain the size limitation you’re experiencing.

Some of our oldest implementations used 1D textures for data input to the kernels, which have a size limit, but not the ones you’re having trouble with.

In order for me to even attempt to reproduce this, I would need to know what HW and OS you’re running this on, and how the crash manifests itself, e.g. are there error return codes? Internal exceptions? Can you step through this in a debugger and nail down which function invocation causes the problem?

Even then, it may be difficult for me to reproduce if it is some kind of interaction between your kernels, e.g. if you don’t allocate memory of sufficient size, kernels may overwrite memory and corrupt general state. Obviously, if you could post the code for a simple stand-alone repro of the issue, that improves the chance of us getting this figured out and fixed.


PS: I forgot to properly answer your original question, which was whether we could add a StdDev primitive to the 4.0 final release. Sadly, no. Our 4.0 release branch has been in feature freeze for a while and I cannot add new functionality at this point. I did put your request onto our 4.1 list and there’s a decent chance that we’ll get to implementing it for that release.

I’m using CUDA 4.0 RC2 on CentOS 5.4 (Linux).

Here’s a sample program that fails:

#include <cuda.h>
#include <npp.h>
#include <stdio.h>

int main(void) {
    const int N = 125000000;   /* the failing size: 125 million floats */
    float *x;
    cudaMalloc((void **) &x, N * sizeof(float));

    /* The NPP calls were stripped from the original post; nppsMinMax_32f
       (mentioned above) is shown here as a representative reduction
       primitive, with the same error check after each call. */
    Npp32f *pMin, *pMax;
    Npp8u *pBuffer;
    int bufferSize = 0;
    cudaMalloc((void **) &pMin, sizeof(Npp32f));
    cudaMalloc((void **) &pMax, sizeof(Npp32f));

    NppStatus S = nppsMinMaxGetBufferSize_32f(N, &bufferSize);
    if (S != NPP_SUCCESS) {
        printf("NPP Error line %d Npp Err no: %d\n", __LINE__, S);
    }
    cudaMalloc((void **) &pBuffer, bufferSize);

    S = nppsMinMax_32f(x, N, pMin, pMax, pBuffer);
    if (S != NPP_SUCCESS) {
        printf("NPP Error line %d Npp Err no: %d\n", __LINE__, S);
    }

    cudaFree(pBuffer);
    cudaFree(pMax);
    cudaFree(pMin);
    cudaFree(x);
    return 0;
}
To compile (assuming the file is saved as test.cu):

nvcc -O2 -lnpp -o test test.cu

Here’s the output:

NPP Error line 23 Npp Err no: -3

Thanks for the reproducer. I filed a bug for this internally. We’ll post an update here. The schedule might be very tight to get a fix for the 4.0 final release. I would assume this to be fixed in 4.1.

Have you tried to work around this issue by simply calling the primitive multiple times on sections of the array?

Well, it defeats the purpose to have to call the primitive so many times… The speed and legibility of the code would both suffer by too large an amount.

I’m already migrating to Thrust anyways.

I just kind of preferred using straight C for the legibility versus Thrust.

Well, sad to hear you’re moving “to the competition” ;-).

Anyways, we have implemented a fix for this issue which will ship with the upcoming 4.1 CUDA Toolkit.