Hello folks!
I am doing some image processing work and am seeing some strange behavior (at least strange to me):
__shared__ unsigned int filterOut[BLOCK_SIZE];
// ...
// outputColor is calculated from shared memory above
// ...
// TColor is just unsigned int as in CUDA SDK Image Denoising example
TColor unsgndIntColor = make_color(outputColor.x,outputColor.y,outputColor.z,0);
// No problem with the frame rate until here.
// When I do the following assignment, the frame rate halves.
filterOut[threadIdx.x + __mul24(BLOCKDIM_X,threadIdx.y)] = unsgndIntColor;
When I omit the assignment, or assign some other arbitrary value such as “threadIdx.x”, the code runs at the normal frame rate. So this single assignment is what degrades the performance. I am sure I am missing something, but I cannot figure out what. Can anyone guess what is going on?
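To make this easier to reproduce, here is a minimal, self-contained sketch of the pattern described above. It is not the real kernel: the block dimensions, the body of make_color, the placeholder loop standing in for the filter, and the STORE_RESULT switch are all assumptions made up for illustration. Only the shape of the final shared-memory store matches the snippet.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical values; the post does not show the real ones.
#define BLOCKDIM_X 16
#define BLOCKDIM_Y 16
#define BLOCK_SIZE (BLOCKDIM_X * BLOCKDIM_Y)

typedef unsigned int TColor;

// Assumed packing helper: channels in [0,1] packed into 8 bits each.
__device__ TColor make_color(float r, float g, float b, float a)
{
    return ((unsigned int)(a * 255.0f) << 24) |
           ((unsigned int)(b * 255.0f) << 16) |
           ((unsigned int)(g * 255.0f) << 8)  |
            (unsigned int)(r * 255.0f);
}

__global__ void filterKernel(const float3 *in, TColor *out, int width)
{
    __shared__ TColor filterOut[BLOCK_SIZE];

    int x   = blockIdx.x * blockDim.x + threadIdx.x;
    int y   = blockIdx.y * blockDim.y + threadIdx.y;
    int tid = threadIdx.x + __mul24(BLOCKDIM_X, threadIdx.y);

    // Placeholder for the real filter: some arithmetic that costs time.
    float3 c = in[y * width + x];
    float3 outputColor = make_float3(0.0f, 0.0f, 0.0f);
    for (int i = 0; i < 64; ++i) {
        float w = __expf(-0.01f * i) / 64.0f;
        outputColor.x += c.x * w;
        outputColor.y += c.y * w;
        outputColor.z += c.z * w;
    }

    TColor unsgndIntColor = make_color(outputColor.x, outputColor.y, outputColor.z, 0);

#ifdef STORE_RESULT
    filterOut[tid] = unsgndIntColor;   // the assignment in question
#else
    filterOut[tid] = threadIdx.x;      // unrelated value; unsgndIntColor is never used
#endif
    __syncthreads();

    out[y * width + x] = filterOut[tid];
}

int main()
{
    const int width = 1024, height = 1024;   // multiples of the block size, so no bounds checks
    const size_t n = (size_t)width * height;

    float3 *dIn  = nullptr;
    TColor *dOut = nullptr;
    cudaMalloc(&dIn,  n * sizeof(float3));
    cudaMalloc(&dOut, n * sizeof(TColor));
    cudaMemset(dIn, 0, n * sizeof(float3));

    dim3 block(BLOCKDIM_X, BLOCKDIM_Y);
    dim3 grid(width / BLOCKDIM_X, height / BLOCKDIM_Y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time a single launch with CUDA events.
    cudaEventRecord(start);
    filterKernel<<<grid, block>>>(dIn, dOut, width);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms (%s)\n", ms, cudaGetErrorString(cudaGetLastError()));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

Building it once with "nvcc -DSTORE_RESULT" and once without, then comparing the timings and the generated PTX (nvcc -ptx), should show whether the compiler keeps the color computation at all when its result is never stored.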