I need coalesced memory accesses, so each thread has to transfer 32 bits at a time. The best way seems to be the uchar4 struct provided by NVIDIA.

My original code used floats throughout and worked perfectly for my RowConvolutionFilter, but because I exceeded the available shared memory in my ColumnConvolutionKernel I had to switch to unsigned chars…

The thing is that after changing all this I get slightly different output, so my implementation must be wrong somewhere…

Original code:

float* pixelOutput = (float *)(surfaceOutput + y*pitchOutput) + 4*x;
		pixelOutput[0] = s_data_R;
		pixelOutput[1] = s_data_G;
		pixelOutput[2] = s_data_B;
		pixelOutput[3] = 1.0;

Current code:

unsigned char *pixelOutput = (surfaceOutput+y*pitchOutput)+4*x;

for (int i=threadIdx.x; i<BLOCK_DIM+2*KERNEL_RADIUS; i+=BLOCK_DIM) 
		int gx = blockIdx.x * blockDim.x - KERNEL_RADIUS + i;
		pixelOutput[0]=s_data_R; //R
		pixelOutput[1]=s_data_G; //G
		pixelOutput[2]=s_data_B; //B
		pixelOutput[3]=1.0; //A



Can anybody tell me what could be wrong? I would also like to write the original four values as a single uchar4 instead of 4 separate unsigned chars…
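
For reference, here is a minimal sketch of what a single uchar4 store per thread could look like. It assumes the channel values are floats normalized to [0, 1] (the post does not say; if they are already in 0..255, the scaling should be dropped) and that surfaceOutput is a pitched byte buffer whose rows are 4-byte aligned, as cudaMallocPitch provides. The names toByte, packPixel and writePixels are hypothetical helpers, and the constant channel values are placeholders for s_data_R, s_data_G and s_data_B. Two things worth noting when moving from float to unsigned char: assigning 1.0 to an unsigned char stores 1, not 255, and a float-to-integer conversion truncates rather than rounds, so either could explain a small difference in the output.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: convert one float channel to an 8-bit value.
// ASSUMPTION: v is normalized to [0, 1]; remove the scaling if it is already 0..255.
__host__ __device__ inline unsigned char toByte(float v)
{
    v = fminf(fmaxf(v, 0.0f), 1.0f);            // clamp to [0, 1]
    return (unsigned char)(v * 255.0f + 0.5f);  // scale, then round to nearest
}

// Hypothetical helper: pack three channels plus an opaque alpha into one uchar4.
__host__ __device__ inline uchar4 packPixel(float r, float g, float b)
{
    return make_uchar4(toByte(r), toByte(g), toByte(b), 255);  // alpha is 255, not 1
}

__global__ void writePixels(unsigned char* surfaceOutput, size_t pitchOutput,
                            int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Placeholder values standing in for s_data_R, s_data_G, s_data_B.
    float r = 0.0f, g = 0.5f, b = 1.0f;

    // pitchOutput is in bytes, so step to the row first, then index by pixel.
    uchar4* row = (uchar4*)(surfaceOutput + y * pitchOutput);
    row[x] = packPixel(r, g, b);  // one 32-bit store per thread
}

int main()
{
    const int width = 64, height = 4;
    unsigned char* d_surface = nullptr;
    size_t pitch = 0;
    if (cudaMallocPitch((void**)&d_surface, &pitch,
                        width * sizeof(uchar4), height) != cudaSuccess) {
        printf("no usable CUDA device\n");
        return 0;
    }

    dim3 block(16, 4);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    writePixels<<<grid, block>>>(d_surface, pitch, width, height);

    // Copy the first pixel back and print its four channels.
    uchar4 first;
    cudaMemcpy(&first, d_surface, sizeof(uchar4), cudaMemcpyDeviceToHost);
    printf("%d %d %d %d\n", first.x, first.y, first.z, first.w);

    cudaFree(d_surface);
    return 0;
}
```

Because consecutive threads in x write consecutive uchar4 elements of the same row, each thread issues one 32-bit store to adjacent addresses, which is the access pattern coalescing needs.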

Thanks in advance!