CUDA 4.0 NPP giving wrong answers: possibly an NPP bug?

I was trying out NPP v4.0 bundled with the CUDA 4.0 RC.

In the code shown below, I do a simple cudaMalloc for 5x5 Npp8u matrix data and use nppiAdd_8u_C1RSfs to do an ROI addition between two images of size 5x5, with ROI size { 2, 2 }. The result is not as expected.

The NPP documentation says that NPP API functions will work even if the image is not well-aligned. The documentation also specifies Image Data Alignment Requirements, which I think this code snippet satisfies.

4.2.1 Line Step

The line step (also called “line stride” or “row step”) allows lines of oddly sized images to start on well-aligned addresses by adding a number of unused bytes at the ends of the lines. This type of line padding has been common practice in digital image processing for a long time and is not particular to GPU image processing.

The line step is the number of bytes in a line including the padding. Another way to interpret this number is to say that it is the number of bytes between the first pixels of successive rows in the image, or generally the number of bytes between two neighboring pixels in any column of pixels. The general reason for the existence of the line step is that uniformly aligned rows of pixels enable optimizations of memory-access patterns.

Even though all functions in NPP will work with arbitrarily aligned images, best performance can only be achieved with well-aligned image data. Any image data allocated with the NPP image allocators or the 2D memory allocators in the CUDA runtime is well aligned. Particularly on older CUDA-capable GPUs it is likely that the performance decrease for misaligned data is substantial (orders of magnitude).

All image data passed to NPPI primitives requires a line step to be provided. It is important to keep in mind that this line step is always specified in terms of bytes, not pixels.

4.2.3 Image Data Alignment Requirements

NPP requires pixel data to adhere to certain alignment constraints: For 2 and 4 channel images the following alignment requirement holds: data_pointer % (#channels * sizeof(channel type)) == 0. E.g. a 4 channel image with underlying type Npp8u (8-bit unsigned) would require all pixels to fall on addresses that are multiples of 4 (4 channels * 1 byte size). As a logical consequence of all pixels being aligned to their natural size the image line steps of 2 and 4 channel images also need to be multiples of the pixel size.

1 and 3 channel images only require that pixel pointers are aligned to the underlying data type, i.e. pData % sizeof(data type) == 0. Consequently, line steps are also held to this requirement.

#include <npp.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>

#define CUDA( call )  {                                                      \
     cudaError err = call;                                                   \
     if( cudaSuccess != err) {                                               \
         printf("Cuda error in file '%s' in line %i : %s.\n",                \
                 __FILE__, __LINE__, cudaGetErrorString( err) );             \
         exit(EXIT_FAILURE);                                                 \
     }                                                                       \
}

void initImage(Npp8u *data, int width, int height)
{
	for (int i = 0; i < (width * height); i++) {
		data[i] = (Npp8u)(((float)rand() / RAND_MAX) * 10.0f);
	}
}

void dispImage(const char *str, Npp8u *data, int width, int height)
{
	printf("\n%s\n", str);
	for (int i = 0; i < height; i++) {
		for (int j = 0; j < width; j++) {
			printf("%3d ", data[i * width + j]);
		}
		printf("\n");
	}
}

int main(int argc, char *argv[])
{
	Npp8u *i1_d, *i2_d, *res_d;
	int width = 8;
	int height = 8;
	size_t sizeInBytes = sizeof(Npp8u) * width * height;

	CUDA(cudaMalloc((void **)&i1_d, sizeInBytes));
	CUDA(cudaMalloc((void **)&i2_d, sizeInBytes));
	CUDA(cudaMalloc((void **)&res_d, sizeInBytes));

	Npp8u *ih = (Npp8u *)malloc(sizeInBytes);
	initImage(ih, width, height);
	dispImage("i1_d", ih, width, height);
	CUDA(cudaMemcpy(i1_d, ih, sizeInBytes, cudaMemcpyHostToDevice));

	initImage(ih, width, height);
	dispImage("i2_d", ih, width, height);
	CUDA(cudaMemcpy(i2_d, ih, sizeInBytes, cudaMemcpyHostToDevice));

	memset(ih, 0, sizeInBytes);
	CUDA(cudaMemcpy(res_d, ih, sizeInBytes, cudaMemcpyHostToDevice));

	NppiSize oSizeROI = { 2, 2 };
	NppStatus err = nppiAdd_8u_C1RSfs(i1_d, width * sizeof(Npp8u),
	                                  i2_d, width * sizeof(Npp8u),
	                                  res_d, width * sizeof(Npp8u),
	                                  oSizeROI, 0);
	if (err != NPP_SUCCESS)
		printf("NPP ERROR\n");

	CUDA(cudaMemcpy(ih, res_d, sizeInBytes, cudaMemcpyDeviceToHost));
	dispImage("res_d", ih, width, height);

	free(ih);
	cudaFree(i1_d);
	cudaFree(i2_d);
	cudaFree(res_d);
	return 0;
}


Result obtained:

i1_d
  0   5   1   8   5
  4   3   8   8   7
  1   8   7   5   3
  0   0   3   1   1
  9   4   1   0   0

i2_d
  3   5   5   6   6
  1   6   4   3   0
  6   7   8   5   3
  8   7   9   9   5
  1   4   2   8   2

res_d
  3  10   0   0   5
  9   0   0   0   0
  0   0   0   0   0
  0   0   0   0   0
  0   0   0   0   0

Please help!

Thanks in advance :)

Hi Crankie,

I’ve started to look into this. You’re likely hitting a bug. We do test the arbitrary-alignment cases, but our testing is light on very small image sizes.

I have a question about your reproducer and its output. The reproducer is using images of size 8x8, but the output shows “images” of only 5x5. Was the output produced with the 8x8 setting or a 5x5 setting?


I reproduced the problem in our 4.0 release branch which confirms that this is a bug.

We are completely rewriting the image arithmetic code for our next release (4.1), mostly because NPP 4.1 will support the complete set of data-types and channel variants for an extensive set of image arithmetic operations. I added tests that will make sure that the bug you’ve discovered will not make it into the new 4.1 code.


Thanks, Frank, for confirming the bug. I used a 5x5 image in the sample; sorry for not updating the code before posting.

I hope NPP 4.1 will allow users to pass data allocated with cudaMalloc, as in many cases the user may already have data in global memory and will want to apply NPP functions to it next. It should be possible to pass data to NPP functions that isn’t pitched.