I was trying out NPP v4.0 bundled with the CUDA 4.0 RC.
In the code shown below, I do a simple cudaMalloc for a 5x5 Npp8u image and use nppiAdd_8u_C1RSfs to perform a ROI addition between two 5x5 images with a ROI size of { 2, 2 }. The result is not what I expected.
The NPP documentation says that NPP API functions will work even if the image is not well aligned. It also specifies Image Data Alignment Requirements, which I think this code snippet satisfies. The relevant sections are quoted below:
4.2.1 Line Step
The line step (also called "line stride" or "row step") allows lines of oddly sized images to start on well-aligned addresses by adding a number of unused bytes at the ends of the lines. This type of line padding has been common practice in digital image processing for a long time and is not particular to GPU image processing.
The line step is the number of bytes in a line including the padding. Another way to interpret this number is to say that it is the number of bytes between the first pixels of successive rows in the image, or generally the number of bytes between two neighboring pixels in any column of pixels. The general reason for the existence of the line step is that uniformly aligned rows of pixels enable optimizations of memory-access patterns.
Even though all functions in NPP will work with arbitrarily aligned images, best performance can only be achieved with well-aligned image data. Any image data allocated with the NPP image allocators or the 2D memory allocators in the CUDA runtime is well aligned. Particularly on older CUDA-capable GPUs, the performance decrease for misaligned data is likely to be substantial (orders of magnitude).
All image data passed to NPPI primitives requires a line step to be provided. It is important to keep in mind that this line step is always specified in terms of bytes, not pixels.
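If I am reading section 4.2.1 right, a padded, well-aligned allocation would look something like the sketch below (my own illustration, not code from the docs). Both nppiMalloc_8u_C1 and cudaMallocPitch report the line step/pitch they chose, in bytes:

#include <npp.h>
#include <cuda_runtime.h>
#include <stdio.h>

// Sketch: two ways to get a well-aligned single-channel 8-bit image.
// Both allocators pad each row and report the resulting line step in bytes,
// which is what would then be passed as the nSrcStep/nDstStep arguments.
void allocAlignedSketch(int width, int height)
{
    int nppStepBytes = 0;
    Npp8u *pNpp = nppiMalloc_8u_C1(width, height, &nppStepBytes);

    size_t cudaPitchBytes = 0;
    Npp8u *pCuda = NULL;
    cudaMallocPitch((void **)&pCuda, &cudaPitchBytes,
                    width * sizeof(Npp8u), height);

    printf("NPP step = %d bytes, CUDA pitch = %d bytes\n",
           nppStepBytes, (int)cudaPitchBytes);

    nppiFree(pNpp);
    cudaFree(pCuda);
}

Section 4.2.3 then spells out the alignment requirements: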
4.2.3 Image Data Alignment Requirements
NPP requires pixel data to adhere to certain alignment constraints: For 2 and 4 channel images the following alignment requirement holds: data_pointer % (#channels * sizeof(channel type)) == 0. E.g. a 4 channel image with underlying type Npp8u (8-bit unsigned) would require all pixels to fall on addresses that are multiples of 4 (4 channels * 1 byte size). As a logical consequence of all pixels being aligned to their natural size the image line steps of 2 and 4 channel images also need to be multiples of the pixel size.
1 and 3 channel images only require that pixel pointers are aligned to the underlying data type, i.e. pData % sizeof(data type) == 0. Consequently, line steps are also held to this requirement.
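To double-check my reading of these rules, this is how I would express them in code (just my interpretation of the quoted text, not an NPP API):

#include <stdint.h>
#include <stddef.h>

// My reading of the quoted rules: 2- and 4-channel images need the pixel
// pointer and the line step to be multiples of (#channels * sizeof(channel type));
// 1- and 3-channel images only need multiples of sizeof(channel type).
static int nppAlignmentOk(const void *pData, int stepBytes,
                          int nChannels, size_t channelSize)
{
    size_t req = (nChannels == 2 || nChannels == 4)
                     ? (size_t)nChannels * channelSize
                     : channelSize;
    return ((uintptr_t)pData % req == 0) && ((size_t)stepBytes % req == 0);
}

For my 1-channel Npp8u images this is trivially satisfied, since sizeof(Npp8u) is 1. Here is the full test program: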
#include <npp.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA(call) { \
    cudaError_t err = call; \
    if (cudaSuccess != err) { \
        printf("Cuda error in file '%s' in line %i : %s.\n", \
               __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(EXIT_FAILURE); \
    } \
}
// Fill a host image with small random values in [0, 10).
void initImage(Npp8u *data, int width, int height)
{
    for (int i = 0; i < (width * height); i++) {
        data[i] = (Npp8u)(((float)rand() / RAND_MAX) * 10.0f);
    }
}
// Print a host image as a width x height grid.
void dispImage(const char *str, Npp8u *data, int width, int height)
{
    printf("\n%s\n", str);
    for (int i = 0; i < height; i++) {
        for (int j = 0; j < width; j++) {
            printf("%3d ", data[i * width + j]);
        }
        printf("\n");
    }
}
int main(int argc, char *argv[])
{
    Npp8u *i1_d, *i2_d, *res_d;
    int width = 5;
    int height = 5;
    size_t sizeInBytes = sizeof(Npp8u) * width * height;
    // Plain (unpadded) device allocations: the line step is width bytes.
    CUDA(cudaMalloc((void **)&i1_d, sizeInBytes));
    CUDA(cudaMalloc((void **)&i2_d, sizeInBytes));
    CUDA(cudaMalloc((void **)&res_d, sizeInBytes));

    Npp8u *ih = (Npp8u *)malloc(sizeInBytes);

    initImage(ih, width, height);
    dispImage("i1_d", ih, width, height);
    CUDA(cudaMemcpy(i1_d, ih, sizeInBytes, cudaMemcpyHostToDevice));

    initImage(ih, width, height);
    dispImage("i2_d", ih, width, height);
    CUDA(cudaMemcpy(i2_d, ih, sizeInBytes, cudaMemcpyHostToDevice));

    // Zero the result image on the device.
    memset(ih, 0, sizeInBytes);
    CUDA(cudaMemcpy(res_d, ih, sizeInBytes, cudaMemcpyHostToDevice));

    // Add a 2x2 ROI; the line steps are given in bytes and the scale factor is 0.
    NppiSize oSizeROI = { 2, 2 };
    NppStatus err = nppiAdd_8u_C1RSfs(i1_d, width * sizeof(Npp8u),
                                      i2_d, width * sizeof(Npp8u),
                                      res_d, width * sizeof(Npp8u),
                                      oSizeROI, 0);
    if (err != NPP_SUCCESS)
        printf("NPP ERROR\n");

    CUDA(cudaMemcpy(ih, res_d, sizeInBytes, cudaMemcpyDeviceToHost));
    dispImage("res_d", ih, width, height);

    return 0;
}
Result obtained:
i1_d
0 5 1 8 5
4 3 8 8 7
1 8 7 5 3
0 0 3 1 1
9 4 1 0 0
i2_d
3 5 5 6 6
1 6 4 3 0
6 7 8 5 3
8 7 9 9 5
1 4 2 8 2
res_d
3 10 0 0 5
9 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
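For reference, this is the host-side computation I expected the call to reproduce for the 2x2 ROI (my own sketch, compiled alongside the code above so npp.h provides Npp8u and NppiSize; with nScaleFactor = 0 the scaled sum is just the saturated sum):

// Sketch: host reference for the ROI add. The steps are in bytes, and Npp8u is
// one byte per pixel, so a byte step can be used directly as an index stride.
void refAdd8uC1RSfs(const Npp8u *pSrc1, int nSrc1Step,
                    const Npp8u *pSrc2, int nSrc2Step,
                    Npp8u *pDst, int nDstStep,
                    NppiSize oSizeROI, int nScaleFactor)
{
    for (int y = 0; y < oSizeROI.height; y++) {
        for (int x = 0; x < oSizeROI.width; x++) {
            int sum = pSrc1[y * nSrc1Step + x] + pSrc2[y * nSrc2Step + x];
            sum >>= nScaleFactor;                      // 0 in my case
            pDst[y * nDstStep + x] = (Npp8u)(sum > 255 ? 255 : sum);
        }
    }
}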
Please help!
Thanks in advance :)