I was trying out NPP v4.0 bundled with the CUDA 4.0 RC.
In the code shown below, I do a simple cudaMalloc for a 5x5 Npp8u image and use nppiAdd_8u_C1RSfs to perform a ROI addition between two 5x5 images with a ROI size of { 2, 2 }. The result is not what I expected.
The NPP documentation says that NPP API functions will work even if the image is not well aligned. It also specifies Image Data Alignment Requirements, which I think this code snippet satisfies. The relevant sections are quoted below:
4.2.1 Line Step
The line step (also called "line stride" or "row step") allows lines of oddly sized images to start on well-aligned addresses by adding a number of unused bytes at the ends of the lines. This type of line padding has been common practice in digital image processing for a long time and is not particular to GPU image processing.
The line step is the number of bytes in a line including the padding. Another way to interpret this number is to say that it is the number of bytes between the first pixels of successive rows in the image, or generally the number of bytes between two neighboring pixels in any column of pixels. The general reason for the existence of the line step is that uniformly aligned rows of pixels enable optimizations of memory-access patterns.
Even though all functions in NPP will work with arbitrarily aligned images, best performance can only be achieved with well-aligned image data. Any image data allocated with the NPP image allocators or the 2D memory allocators in the CUDA runtime is well aligned. Particularly on older CUDA-capable GPUs, the performance decrease for misaligned data is likely to be substantial (orders of magnitude).
All image data passed to NPPI primitives requires a line step to be provided. It is important to keep in mind that this line step is always specified in terms of bytes, not pixels.
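If I am reading section 4.2.1 right, a padded, well-aligned allocation would look something like the sketch below (my own illustration, not code from the docs). Both nppiMalloc_8u_C1 and cudaMallocPitch report the line step/pitch they chose, in bytes:

#include <npp.h>
#include <cuda_runtime.h>
#include <stdio.h>

// Sketch: two ways to get a well-aligned single-channel 8-bit image.
// Both allocators pad each row and report the resulting line step in bytes,
// which is what would then be passed as the nSrcStep/nDstStep arguments.
void allocAlignedSketch(int width, int height)
{
    int nppStepBytes = 0;
    Npp8u *pNpp = nppiMalloc_8u_C1(width, height, &nppStepBytes);

    size_t cudaPitchBytes = 0;
    Npp8u *pCuda = NULL;
    cudaMallocPitch((void **)&pCuda, &cudaPitchBytes,
                    width * sizeof(Npp8u), height);

    printf("NPP step = %d bytes, CUDA pitch = %d bytes\n",
           nppStepBytes, (int)cudaPitchBytes);

    nppiFree(pNpp);
    cudaFree(pCuda);
}

Section 4.2.3 then spells out the alignment requirements: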
4.2.3 Image Data Alignment Requirements
NPP requires pixel data to adhere to certain alignment constraints: For 2 and 4 channel images the following alignment requirement holds: data_pointer % (#channels * sizeof(channel type)) == 0. E.g. a 4 channel image with underlying type Npp8u (8-bit unsigned) would require all pixels to fall on addresses that are multiples of 4 (4 channels * 1 byte size). As a logical consequence of all pixels being aligned to their natural size the image line steps of 2 and 4 channel images also need to be multiples of the pixel size.
1 and 3 channel images only require that pixel pointers are aligned to the underlying data type, i.e. pData % sizeof(data type) == 0. Consequently, line steps are also held to this requirement.
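To double-check my reading of these rules, this is how I would express them in code (just my interpretation of the quoted text, not an NPP API):

#include <stdint.h>
#include <stddef.h>

// My reading of the quoted rules: 2- and 4-channel images need the pixel
// pointer and the line step to be multiples of (#channels * sizeof(channel type));
// 1- and 3-channel images only need multiples of sizeof(channel type).
static int nppAlignmentOk(const void *pData, int stepBytes,
                          int nChannels, size_t channelSize)
{
    size_t req = (nChannels == 2 || nChannels == 4)
                     ? (size_t)nChannels * channelSize
                     : channelSize;
    return ((uintptr_t)pData % req == 0) && ((size_t)stepBytes % req == 0);
}

For my 1-channel Npp8u images this is trivially satisfied, since sizeof(Npp8u) is 1. Here is the full test program: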
#include <npp.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA(call) { \
    cudaError_t err = call; \
    if (cudaSuccess != err) { \
        printf("Cuda error in file '%s' in line %i : %s.\n", \
               __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(EXIT_FAILURE); \
    } \
}
// Fill a host image with small random values in [0, 10).
void initImage(Npp8u *data, int width, int height)
{
    for (int i = 0; i < (width * height); i++) {
        data[i] = (Npp8u)(((float)rand() / RAND_MAX) * 10.0f);
    }
}
// Print a host image as a width x height grid.
void dispImage(const char *str, Npp8u *data, int width, int height)
{
    printf("\n%s\n", str);
    for (int i = 0; i < height; i++) {
        for (int j = 0; j < width; j++) {
            printf("%3d ", data[i * width + j]);
        }
        printf("\n");
    }
}
int main(int argc, char *argv[])
{
    Npp8u *i1_d, *i2_d, *res_d;
    int width = 5;
    int height = 5;
    size_t sizeInBytes = sizeof(Npp8u) * width * height;
    // Plain (unpadded) device allocations: the line step is width bytes.
    CUDA(cudaMalloc((void **)&i1_d, sizeInBytes));
    CUDA(cudaMalloc((void **)&i2_d, sizeInBytes));
    CUDA(cudaMalloc((void **)&res_d, sizeInBytes));

    Npp8u *ih = (Npp8u *)malloc(sizeInBytes);

    initImage(ih, width, height);
    dispImage("i1_d", ih, width, height);
    CUDA(cudaMemcpy(i1_d, ih, sizeInBytes, cudaMemcpyHostToDevice));

    initImage(ih, width, height);
    dispImage("i2_d", ih, width, height);
    CUDA(cudaMemcpy(i2_d, ih, sizeInBytes, cudaMemcpyHostToDevice));

    // Zero the result image on the device.
    memset(ih, 0, sizeInBytes);
    CUDA(cudaMemcpy(res_d, ih, sizeInBytes, cudaMemcpyHostToDevice));

    // Add a 2x2 ROI; the line steps are given in bytes and the scale factor is 0.
    NppiSize oSizeROI = { 2, 2 };
    NppStatus err = nppiAdd_8u_C1RSfs(i1_d, width * sizeof(Npp8u),
                                      i2_d, width * sizeof(Npp8u),
                                      res_d, width * sizeof(Npp8u),
                                      oSizeROI, 0);
    if (err != NPP_SUCCESS)
        printf("NPP ERROR\n");

    CUDA(cudaMemcpy(ih, res_d, sizeInBytes, cudaMemcpyDeviceToHost));
    dispImage("res_d", ih, width, height);

    return 0;
}
Result obtained:
i1_d
0 5 1 8 5
4 3 8 8 7
1 8 7 5 3
0 0 3 1 1
9 4 1 0 0
i2_d
3 5 5 6 6
1 6 4 3 0
6 7 8 5 3
8 7 9 9 5
1 4 2 8 2
res_d
3 10 0 0 5
9 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
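For reference, this is the host-side computation I expected the call to reproduce for the 2x2 ROI (my own sketch, compiled alongside the code above so npp.h provides Npp8u and NppiSize; with nScaleFactor = 0 the scaled sum is just the saturated sum):

// Sketch: host reference for the ROI add. The steps are in bytes, and Npp8u is
// one byte per pixel, so a byte step can be used directly as an index stride.
void refAdd8uC1RSfs(const Npp8u *pSrc1, int nSrc1Step,
                    const Npp8u *pSrc2, int nSrc2Step,
                    Npp8u *pDst, int nDstStep,
                    NppiSize oSizeROI, int nScaleFactor)
{
    for (int y = 0; y < oSizeROI.height; y++) {
        for (int x = 0; x < oSizeROI.width; x++) {
            int sum = pSrc1[y * nSrc1Step + x] + pSrc2[y * nSrc2Step + x];
            sum >>= nScaleFactor;                      // 0 in my case
            pDst[y * nDstStep + x] = (Npp8u)(sum > 255 ? 255 : sum);
        }
    }
}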
Please help!
Thanks in advance :)