I am using nppiFilterRow_32f_C1R to perform a convolution, but I get an incorrect result at the border. For simplicity, I have written an example for a single row of data. The row contains 10 elements (1.f) plus zero padding (0.f), two elements on each side. The kernel consists of 5 elements (1.f). All functions return NPP_NO_ERROR.
const int input_size=14;
const int output_size=10;
const int kernel_size=5;
int input_size_in_bytes=input_size*sizeof(float);
int output_size_in_bytes=output_size*sizeof(float);
int kernel_size_in_bytes=kernel_size*sizeof(float);
float host_input[] ={0.f, 0.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 0.f, 0.f}; //with zero padding
float host_kernel[] ={1.f, 1.f, 1.f, 1.f, 1.f};
float host_output[output_size];
float *dev_input,*dev_output,*dev_kernel;
cudaMalloc(&dev_input,input_size_in_bytes);
cudaMalloc(&dev_output,output_size_in_bytes);
cudaMalloc(&dev_kernel,kernel_size_in_bytes);
//Copy data to device
cudaMemcpy2D(dev_input, input_size_in_bytes, host_input, input_size_in_bytes,
input_size_in_bytes,1,cudaMemcpyHostToDevice);
//Copy kernel to device
cudaMemcpy2D(dev_kernel, kernel_size_in_bytes, host_kernel, kernel_size_in_bytes,
kernel_size_in_bytes,1,cudaMemcpyHostToDevice);
//Filter
int xanchor=kernel_size-1;
NppiSize roi;
roi.width=output_size;
roi.height=1;
nppiFilterRow_32f_C1R(dev_input,input_size_in_bytes,dev_output,output_size_in_bytes,roi,dev_kernel,kernel_size,xanchor);
//Copy result to host
cudaMemcpy2D(host_output, output_size_in_bytes, dev_output, output_size_in_bytes,
output_size_in_bytes,1,cudaMemcpyDeviceToHost);
Thus, I expect the output {3,4,5,5,5,5,5,5,4,3}, but I get {3,4,5,5,5,5,5,5,5,5}.
The function was tested with CUDA Toolkit versions 10.1, 10.2, and 11.2, on Windows 10 and Ubuntu 20.04.
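For reference, here is a minimal CPU sketch of how the expected values could be derived. It assumes each output element is the kernel-weighted sum of the 5 consecutive padded input elements starting at the same index, which is how the anchor of kernel_size-1 is being interpreted here. The helper name reference_row_filter is hypothetical and not part of NPP; it only illustrates the expectation and does not claim to reproduce NPP's internal behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not an NPP function).
// Assumption: out[x] = sum_k kernel[k] * in[x + (ksize - 1) - k],
// i.e. a convolution whose window covers in[x] .. in[x + ksize - 1].
std::vector<float> reference_row_filter(const std::vector<float>& padded_input,
                                        const std::vector<float>& kernel,
                                        int output_size)
{
    const int ksize = static_cast<int>(kernel.size());

    // The padded input must be long enough for the last window.
    assert(padded_input.size() >=
           static_cast<std::size_t>(output_size + ksize - 1));

    std::vector<float> out(output_size, 0.f);
    for (int x = 0; x < output_size; ++x) {
        for (int k = 0; k < ksize; ++k) {
            out[x] += kernel[k] * padded_input[x + ksize - 1 - k];
        }
    }
    return out;
}
```

Called with the 14-element padded input and the 5-element kernel from the example and output_size = 10, this sketch yields 3,4,5,5,5,5,5,5,4,3, which is the result expected from nppiFilterRow_32f_C1R.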