Issues with nppiMean_StdDev_32f from the NPP library

Hello.
I am currently using the function nppiMean_StdDev_32f_C1MR() from the NPP library to compute the mean and the standard deviation of all elements of a matrix of floats, with a mask.
I noticed though, that results are NOT correct when using matrices of more than 50 rows (for example). Basically I get strange results for mean and NaN for std.dev.
Given the fact it works for smaller matrix sizes and it doesn’t for bigger matrices I really think it is a bug of the nppiMean_StdDev_32f() function.
Note however that nppiMean_32f_C1MR() provides the correct results for any size of the matrices, but I can only get the mean and NOT the standard deviation.

Did anybody experience such issue with the nppiMean_StdDev_32f*() function(s)?
I can provide the source code I used so far.

Thank you,
Alex

Yes, source code (hopefully short) would help.

Note that the same behavior is reproduced on a GeForce with CUDA 5.0 and a Tesla with CUDA 6.0.

BTW, Is the NPP library source code available for the public?

Below you can find the function code, as requested, (from a bigger project using OpenCV) that uses the NPP library functions (the code below won’t compile per se, but it’s easy to follow).
Basically, everything works OK when I use smaller matrix sizes (e.g., sz.height = 50;) and it does NOT for bigger matrices (sz.height = src.rows;). Since this rules out any matrix memory misalignment OR using wrong pointers (host instead of device pointers), etc, I really think it is a bug of the nppiMean_StdDev_32f() function.
When nppiMean_StdDev_32f() fails I get strange results for mean (e.g., 8192.0 instead of 4.0) and NaN for std deviation.
Again, nppiMean_32f() below works well (for ALL matrix sizes), it’s just the nppiMean_StdDev_32f() that gets screwed.

I'd be happy to provide any further info.

void GpuMatMeanStdDevWrapper(const cv::gpu::GpuMat& src,
Scalar &mean, Scalar &stddev,
const cv::gpu::GpuMat& mask) {

// Note: cv::gpu::GpuMat is documented at http://docs.opencv.org/modules/gpu/doc/data_structures.html#gpu-gpumat

CV_Assert(src.type() == CV_32F);
CV_Assert(mask.type() == CV_8U);

CV_Assert(mask.rows == src.rows);
CV_Assert(mask.cols == src.cols);


NppiSize sz; // 2D size from the NPP library
sz.width  = src.cols;
sz.height = src.rows; // 50 - for this meanStd() works ok with the 1024 x 768 matrix (Note: sum() always works ok so far)

void GpuMatMeanStdDevWrapper(const cv::gpu::GpuMat& src,
Scalar &mean, Scalar &stddev,
const cv::gpu::GpuMat& mask) {

// Note: cv::gpu::GpuMat is documented at http://docs.opencv.org/modules/gpu/doc/data_structures.html#gpu-gpumat

CV_Assert(src.type() == CV_32F);
CV_Assert(mask.type() == CV_8U);

CV_Assert(mask.rows == src.rows);
CV_Assert(mask.cols == src.cols);


NppiSize sz; // 2D size from the NPP library
sz.width  = src.cols;
sz.height = src.rows; // 50 - for this meanStd() works ok with the 1024 x 768 matrix (Note: sum() always works ok so far)

int bufSize;

assert(nppiMeanStdDevGetBufferHostSize_32f_C1MR(sz, &bufSize) == 0);

Npp8u *pDeviceBuffer;
assert(cudaMalloc((void **)&pDeviceBuffer, 4 * bufSize * sizeof(Npp8u)) == 0);

cudaDeviceSynchronize();
cudaThreadSynchronize();

Npp64f h_res[16];
h_res[0] = -1;
h_res[1] = -1;
h_res[8] = -1;

Npp64f *dbuf; // Note: Npp64f is double (see NPP_library.pdf)
assert( cudaMalloc((void **)&dbuf, 16 * sizeof(Npp64f)) == 0 );

cudaDeviceSynchronize();
cudaThreadSynchronize();

NppStatus status;
status = nppiMean_32f_C1MR((Npp32f *)src.ptr<Npp32f>(), // src2
                                    static_cast<int>(src.step), // pitch
                                    (Npp8u *)mask.ptr<Npp8u>(),
                                    static_cast<int>(mask.step),
                                    sz,
                                    pDeviceBuffer,
                                    dbuf);

cudaDeviceSynchronize();
cudaThreadSynchronize();

// Copy values to host
double h_sum;
cudaMemcpy(&h_sum , dbuf , sizeof(Npp64f), cudaMemcpyDeviceToHost);

printf("h_sum (actually mean) = %.7f\n", h_sum); //h_mean);

sz.height = 50;
//sz.height = src.rows; // 50 - for this meanStd() works ok with the 1024 x 768 matrix (Note: sum() always works ok so far)

// From NPP_Library.pdf (https://developer.nvidia.com/sites/default/files/akamai/cuda/files/CUDADownloads/NPP_Library.pdf):
//    page 990
//7.81.1.5
//  1-channel 32-bit floating-point image mean and standard deviation, Masked Operation.
//
// Parameters:
//  pSrc      - Source-Image Pointer.
//  nSrcStep  - Source-Image Line Step. "The source image line step is the number of bytes between successive rows in the image."
//  pMask     - Mask-Image Pointer.
//  nMaskStep - Mask-Image Line Step.
//  oSizeROI  - Region-of-Interest (ROI). "This raises the obvious question how the primitive knows where in the image this rectangle of (width, height) is located. The "start pixel" of the ROI is implicitly given by the image-data pointer."
//  pDeviceBuffer - Pointer to the required device memory allocation, Scratch Buffer and Host Pointer
//        Use nppiMeanStdDevGetBufferHostSize_8u_C1MR to determine the minium number of bytes required.
//  pMean     - Contains computed mean.
//  pStdDev   - Contains computed standard deviation.

NppStatus res = nppiMean_StdDev_32f_C1MR( (Npp32f *)src.ptr<Npp32f>(), //src2
                                            static_cast<int>(src.step), // pitch
                                            (Npp8u *)src.ptr<Npp8u>(),
                                            static_cast<int>(mask.step),
                                            sz,
                                            pDeviceBuffer,
                                            dbuf, dbuf + 8);

cudaDeviceSynchronize();
cudaThreadSynchronize();

printf("GpuMatMeanStdDevWrapper(): res of nppiMean_StdDev_32f_C1MR() = %d\n", res);

assert(cudaMemcpy(&h_res[0], dbuf, 16 * sizeof(Npp64f), cudaMemcpyDeviceToHost) == 0);

printf("mean (new) = %.7lf\n", h_res[0]); //h_mean);
printf("mean(int) = %ld\n", *(long *)&h_res[0]); //h_mean);
printf("stddev = %.7lf\n", h_res[8]); //h_dev);
printf("stddev(int) = %ld\n", *(long *)&h_res[8]); //&h_dev);

}

OK your original question didn’t mention anything about OpenCV. If you want to provide a simple test case that is self contained, complete, and compilable, and demonstrates a numerical error, and depends only on CUDA and NPP, I’ll take a look. Otherwise maybe someone else will know the answer. Perhaps you should just file a bug on the developer portal. But they will want a complete, compilable test case, and possibly one that doesn’t depend on OpenCV.

Hi.
The fact I use OpenCV should not influence much the results - basically GpuMat is a pitched 2D-array and nothing more.
Note that the error happens when NOT using the OpenCV library - if you want I can post the code without OpenCV dependencies.

BUT I really think I found an error of the nppiMeanStdDev…() NPP library function since for small matrix sizes (50 x 50) the result is correct and otherwise not. Also I am able to use successfully nppiMean() and nppiSum() functions - results obtained are correct.

If anybody has experienced similar behavior please share a thought.
Thank you.

If you want to post a complete code that I can copy, paste, compile, and run, that doesn’t depend on OpenCV, and see the error you are talking about, (it should include expected and actual results) I’ll take a look.

This code compiles, executes, but does not give the expected results. It should be std_f = 0 and mean_f = 50

Windows 10 64 bits
Visual Studio 2012 ultimate
CUDA toolkit 7.0
GPU Quadro K4200 (two of them)

It does not work either with sizes od 1920 x 1080 (which is what I need).

The display dirver seems to reboot (or somehting like that) after execution.

Any help?

Oscar

#include <cuda.h>
#include <cuda_runtime.h>
#include
#include

#include <npp.h>

int main(int argc, char* argv)
{
NppiSize total_npp;

int scratchBuffSize;

Npp8u *d_scratch;

Npp64f mean_f = 13.0, std_f = 2.0;

float * d_input;
unsigned char * d_mask;

NppStatus err; 

total_npp.width = 32;
total_npp.height = 32;

scratchBuffSize = 0;

nppiMeanStdDevGetBufferHostSize_32f_C1MR(total_npp, &scratchBuffSize);

cudaMalloc((void **)&d_scratch, scratchBuffSize * sizeof(Npp8u));

std::vector<float> h_input2(32*32, 50.0);
std::vector<unsigned char> h_mask2(32*32, 1);

cudaMalloc((void **)&d_input, sizeof(float)*(32*32));
cudaMalloc((void **)&d_mask, sizeof(unsigned char)*(32*32));

cudaMemcpy(d_input, h_input2.data(), (32*32)*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_mask, h_mask2.data(), (32*32)*sizeof(unsigned char), cudaMemcpyHostToDevice);

err = nppiMean_StdDev_32f_C1MR(d_input, 32*sizeof(float), d_mask, 32*sizeof(unsigned char), total_npp, d_scratch, &mean_f, &std_f);

cudaDeviceSynchronize();

std::cout << "err =" << int(err) << std::endl;

std::cout << "mean_f = " << mean_f << std::endl;
std::cout << "std_f = " << std_f << std::endl;

cudaFree(d_scratch);
cudaFree(d_input);
cudaFree(d_mask);

return 0;

}

I suggest that you do proper CUDA error checking in your code, before asking others for help. If you don’t know what “proper CUDA error checking” is, then google “proper CUDA error checking”, and take the first hit.

Anyway the “reboot” of the display driver probably means you are hitting a windows TDR which is discussed in this forum thread:

https://devtalk.nvidia.com/default/topic/459869/cuda-programming-and-performance/-quot-display-driver-stopped-responding-and-has-recovered-quot-wddm-timeout-detection-and-recovery-/

If you google “cuda tdr” and take the first hit, you will likely get additional useful information about how to address it.

Well, that was helpfull. Thanks!

Ok

So, this is the output with (I hope) “proper CUDA error checking”:

Size of scratch buffer = 392
nppiMean_StdDev_32f_C1MR error output = 0
GPUassert: an illegal memory access was encountered NppBug_Mean_StdDev_32f_C1MR.cpp 82
Error number 77

So, it seem’s an out-of-bounds problem, that I can’t locate. I reviewed all the code and, according to the NPP documentation page 1828, section 7.101.2.5, all the parameters I’m passing to “nppiMean_StdDev_32f_C1MR” are correct.

Now it is easy to change the size of the input parameters with the defines I added to the code. The results are always the same.

I hope I’m making an embarrasing mistake, and this can be solved fast. Otherwise, if there is a bug in “nppiMean_StdDev_32f_C1MR” I would like to know.

Along with the code, I add a textual review of what I understad is expecting the npp function as parameters.

Thankyou very much.

Oscar

#include <npp.h>

#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>
#include <vector>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code,
                      const char *file,
                      int line,
                      bool abort = true) {
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
            cudaGetErrorString(code), file, line);
        fprintf(stderr, "Error number %d\n", code);
        if (abort) exit(code);
    }
}

#define X_SIZE 128
#define Y_SIZE 16
#define TOTAL_SIZE X_SIZE * Y_SIZE

int main(int argc, char* argv[]) {
    NppiSize total_npp;

    int scratchBuffSize;

    Npp8u *d_scratch;

    Npp64f mean_f = 13.0, std_f = 2.0;

    Npp32f * d_input;
    Npp8u * d_mask;

    NppStatus err;

    total_npp.width = X_SIZE;
    total_npp.height = Y_SIZE;

    scratchBuffSize = 0;

    nppiMeanStdDevGetBufferHostSize_32f_C1MR(total_npp, &scratchBuffSize);

    std::cout << "Size of scratch buffer = " << scratchBuffSize << std::endl;

    gpuErrchk(cudaMalloc(reinterpret_cast<void **>(&d_scratch),
        scratchBuffSize * sizeof(Npp8u)));

    Npp32f * h_input2;
    Npp8u* h_mask2;

    h_input2 = reinterpret_cast<Npp32f*>(malloc(sizeof(Npp32f) * TOTAL_SIZE));
    h_mask2 =
        reinterpret_cast<Npp8u*>(malloc(sizeof(Npp8u) * TOTAL_SIZE));

    for (int i = 0; i < TOTAL_SIZE; i++) {
        h_input2[i] = 50.0;
        h_mask2[i] = 1;
    }

    gpuErrchk(cudaMalloc(reinterpret_cast<void **>(&d_input),
        sizeof(Npp32f) * TOTAL_SIZE));
    gpuErrchk(cudaMalloc(reinterpret_cast<void **>(&d_mask),
        sizeof(Npp8u) * TOTAL_SIZE));

    gpuErrchk(cudaMemcpy(d_input, h_input2,
        TOTAL_SIZE * sizeof(Npp32f), cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_mask, h_mask2,
        TOTAL_SIZE * sizeof(Npp8u), cudaMemcpyHostToDevice));

    err = nppiMean_StdDev_32f_C1MR(d_input, X_SIZE * sizeof(Npp32f),
                                    d_mask, X_SIZE * sizeof(Npp8u),
                                    total_npp, d_scratch,
                                    &mean_f, &std_f);

    std::cout << "nppiMean_StdDev_32f_C1MR error output = "
        << int(err) << std::endl;

    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    std::cout << "mean_f = " << mean_f << std::endl;
    std::cout << "std_f = " << std_f << std::endl;

    gpuErrchk(cudaFree(d_scratch));
    gpuErrchk(cudaFree(d_input));
    gpuErrchk(cudaFree(d_mask));

    return 0;
}

nppiMean_StdDev_32f_C1MR parameter analysis:

d_input:
source image pointer.
It is a GPU 1D array in memory of size X_SIZE * Y_SIZE and type Npp32f.
Indexed as a 2D image by the kernel.

X_SIZE * sizeof(Npp32f):
source image line step.
As we don’t add any padding, the line step equals to X_SIZE * sizeof(Npp32f)

d_mask:
mask image pointer.
It is a GPU 1D array in memory of size X_SIZE * Y_SIZE and type Npp8u.
Indexed as a 2D image by the kernel.

X_SIZE * sizeof(Npp8u):
mask line step.
As we don’t add any padding, the line step equals to X_SIZE * sizeof(Npp8u)

total_npp:
region of interest.
It is an struct with integer (int) components “widht” and “height”.
The region of interest in our case is all the image. Therefore “width = X_SIZE” and “height = Y_SIZE”.

d_scratch:
pointer to GPU scratch memory.
The size of this pointer is calculated with total_npp, using the function “nppiMeanStdDevGetBufferHostSize_32f_C1MR”, as told in the documentation.

&mean_f:
pointer to the computed mean.
Type Npp64f.
Initialized in the Host code so the variable is assigned a valid address in memory.
It should be “50.0” after execution.

&std_f:
pointer to the computed standard deviation.
Type Npp64f.
Initialized in the Host code so the variable is assigned a valid address in memory.
It should be “0” after execution.

This problem is due to the fact that you are passing host-based pointers for the mean and standard dev. result values:

err = nppiMean_StdDev_32f_C1MR(d_input, X_SIZE * sizeof(Npp32f),
                                    d_mask, X_SIZE * sizeof(Npp8u),
                                    total_npp, d_scratch,
                                    &mean_f, &std_f);
                                     ^^^        ^^^

Those should be device pointers. The following code runs without runtime error for me:

$ cat t1124.cu
#include <npp.h>

#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>
#include <vector>
#include <stdio.h>
#include <iostream>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code,
                      const char *file,
                      int line,
                      bool abort = true) {
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
            cudaGetErrorString(code), file, line);
        fprintf(stderr, "Error number %d\n", code);
        if (abort) exit(code);
    }
}

#define X_SIZE 128
#define Y_SIZE 16
#define TOTAL_SIZE X_SIZE * Y_SIZE

int main(int argc, char* argv[]) {
    NppiSize total_npp;

    int scratchBuffSize;

    Npp8u *d_scratch;

    Npp64f mean_f = 13.0, std_f = 2.0, *d_mean_f, *d_std_f;
    cudaMalloc(&d_mean_f, sizeof(Npp64f));
    cudaMalloc(&d_std_f, sizeof(Npp64f));

    Npp32f * d_input;
    Npp8u * d_mask;

    NppStatus err;

    total_npp.width = X_SIZE;
    total_npp.height = Y_SIZE;

    scratchBuffSize = 0;

    nppiMeanStdDevGetBufferHostSize_32f_C1MR(total_npp, &scratchBuffSize);

    std::cout << "Size of scratch buffer = " << scratchBuffSize << std::endl;

    gpuErrchk(cudaMalloc(reinterpret_cast<void **>(&d_scratch),
        scratchBuffSize * sizeof(Npp8u)));

    Npp32f * h_input2;
    Npp8u* h_mask2;

    h_input2 = reinterpret_cast<Npp32f*>(malloc(sizeof(Npp32f) * TOTAL_SIZE));
    h_mask2 =
        reinterpret_cast<Npp8u*>(malloc(sizeof(Npp8u) * TOTAL_SIZE));

    for (int i = 0; i < TOTAL_SIZE; i++) {
        h_input2[i] = 50.0;
        h_mask2[i] = 1;
    }

    gpuErrchk(cudaMalloc(reinterpret_cast<void **>(&d_input),
        sizeof(Npp32f) * TOTAL_SIZE));
    gpuErrchk(cudaMalloc(reinterpret_cast<void **>(&d_mask),
        sizeof(Npp8u) * TOTAL_SIZE));

    gpuErrchk(cudaMemcpy(d_input, h_input2,
        TOTAL_SIZE * sizeof(Npp32f), cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_mask, h_mask2,
        TOTAL_SIZE * sizeof(Npp8u), cudaMemcpyHostToDevice));

    err = nppiMean_StdDev_32f_C1MR(d_input, X_SIZE * sizeof(Npp32f),
                                    d_mask, X_SIZE * sizeof(Npp8u),
                                    total_npp, d_scratch,
                                    d_mean_f, d_std_f);

std::cout << "nppiMean_StdDev_32f_C1MR error output = "
        << int(err) << std::endl;

    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    cudaMemcpy(&mean_f, d_mean_f, sizeof(Npp64f), cudaMemcpyDeviceToHost);
    cudaMemcpy(&std_f, d_std_f, sizeof(Npp64f), cudaMemcpyDeviceToHost);
    std::cout << "mean_f = " << mean_f << std::endl;
    std::cout << "std_f = " << std_f << std::endl;

    gpuErrchk(cudaFree(d_scratch));
    gpuErrchk(cudaFree(d_input));
    gpuErrchk(cudaFree(d_mask));

    return 0;
}
$ nvcc -o t1124 t1124.cu -lnppi
$ cuda-memcheck ./t1124
========= CUDA-MEMCHECK
Size of scratch buffer = 392
nppiMean_StdDev_32f_C1MR error output = 0
mean_f = 50
std_f = 0
========= ERROR SUMMARY: 0 errors
$

It seems like the computed values are probably correct for your provided input data, but I haven’t studied it carefully.

If you run your original code with cuda-memcheck, you will get some additional error information. In particular it points out that the underlying illegal access was an invalid global write of size 8 (bytes). If you study all the input parameters you provided to the nppiMean_StdDev_32f_C1MR function, and consider which of those would likely be involved in the function writing an 8-byte value, you’ll pretty quickly focus your attention on the last 2 parameters.

Thankyou very much! It was very usefull.

Unfortunatelly, now I’m having problems with the ROI sizes. When I use images of 1920x1080 with ROI the same size (1920x1080) the npp function returns an step error.

Instead, when I change the ROI size to 1024x1024 or 1040x1024, it works with correct results (for the values used according to the ROI).

I attach the code that fails. To see it working just change the #define X_ROI and #define Y_ROI to 1024 and 1024.

I understand it should work with 1920x1080. The scatch buffer size is almost 26KB, it shouldn’t be the problem.

#include <npp.h>

#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>
#include <vector>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code,
                      const char *file,
                      int line,
                      bool abort = true) {
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
            cudaGetErrorString(code), file, line);
        fprintf(stderr, "Error number %d\n", code);
        if (abort) exit(code);
    }
}

#define X_SIZE 1920
#define Y_SIZE 1080
#define X_ROI 1920
#define Y_ROI 1080
#define TOTAL_SIZE X_SIZE * Y_SIZE

int main(int argc, char* argv[]) {
    NppiSize total_npp;

    int scratchBuffSize;

    Npp8u *d_scratch;

    Npp64f mean_f = 13.0, std_f = 2.0, *d_mean_f, *d_std_f;
    cudaMalloc(&d_mean_f, sizeof(Npp64f));
    cudaMalloc(&d_std_f, sizeof(Npp64f));

    Npp32f * d_input;
    Npp8u * d_mask;

    NppStatus err;

    total_npp.width = X_ROI;
    total_npp.height = Y_ROI;

    scratchBuffSize = 0;

    nppiMeanStdDevGetBufferHostSize_32f_C1MR(total_npp, &scratchBuffSize);

    std::cout << "Size of scratch buffer = " << scratchBuffSize << std::endl;

    gpuErrchk(cudaMalloc(reinterpret_cast<void **>(&d_scratch),
        scratchBuffSize * sizeof(Npp8u)));

    Npp32f * h_input2;
    Npp8u* h_mask2;

    h_input2 = reinterpret_cast<Npp32f*>(malloc(sizeof(Npp32f) * TOTAL_SIZE));
    h_mask2 =
        reinterpret_cast<Npp8u*>(malloc(sizeof(Npp8u) * TOTAL_SIZE));

    for (int i = 0; i < TOTAL_SIZE; i++) {
        h_input2[i] = 50.0;
        h_mask2[i] = 1;
    }

    gpuErrchk(cudaMalloc(reinterpret_cast<void **>(&d_input),
        sizeof(Npp32f) * TOTAL_SIZE));
    gpuErrchk(cudaMalloc(reinterpret_cast<void **>(&d_mask),
        sizeof(Npp8u) * TOTAL_SIZE));

    gpuErrchk(cudaMemcpy(d_input, h_input2,
        TOTAL_SIZE * sizeof(Npp32f), cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_mask, h_mask2,
        TOTAL_SIZE * sizeof(Npp8u), cudaMemcpyHostToDevice));

    err = nppiMean_StdDev_32f_C1MR(d_input, X_SIZE * sizeof(Npp32f),
                                    d_mask, Y_SIZE * sizeof(Npp8u),
                                    total_npp, d_scratch,
                                    d_mean_f, d_std_f);

    std::cout << "nppiMean_StdDev_32f_C1MR error output = "
        << int(err) << std::endl;

    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    gpuErrchk(cudaMemcpy(&mean_f, d_mean_f, sizeof(Npp64f), cudaMemcpyDeviceToHost));
    gpuErrchk(cudaMemcpy(&std_f, d_std_f, sizeof(Npp64f), cudaMemcpyDeviceToHost));
    std::cout << "mean_f = " << mean_f << std::endl;
    std::cout << "std_f = " << std_f << std::endl;

    gpuErrchk(cudaFree(d_scratch));
    gpuErrchk(cudaFree(d_input));
    gpuErrchk(cudaFree(d_mask));

    return 0;
}

Thankyou again!

The error in your code is here:

err = nppiMean_StdDev_32f_C1MR(d_input, X_SIZE * sizeof(Npp32f),
                                d_mask, Y_SIZE * sizeof(Npp8u),
                                         ^
                                         |
                                         ?

I’m not sure why you are using Y_SIZE for calculation of the mask line step. The linestep should be calculated as the X-dimension times the size of the pixel in bytes, just as you have calculated for the source line step.

So it’s not surprising to me that the npp function returns a step error.

When I change Y_SIZE in the above code to X_SIZE, the error goes away for me.

You are right sorry. I was trying to reproduce a problem I have in a code I can’t show, with this example, and the Y_SIZE was clearly a typo.

The other code uses OpenCV, and other npp functions work perfectly with it, passing the step that OpenCV provides with this syntax “your_opencv_data.step”. I’m checking all the sizes and they are correct, but this npp function only works with small roi sizes.

Additionally, even with small roi sizes, it runs slower than a sequence of OpenCV gpu calls, for calculating mean and variance, so I’m dismissing this npp function for now.

Thankyou very much for all the help anyway.

I would just like to say that I’m seeing the same behavior. With a 30 megapixel ROI, the standard deviation returned by nppiMean_StdDev_16u_C1R is -nan(ind). The mean seems fine.

Hard-coding even a 1x1 ROI returns a nan std. Mean is identical to the pixel value.