Weird performance decrease

Hello folks!

I am doing some image processing work and am seeing some strange behavior (at least, it is strange to me).

__shared__ unsigned int filterOut[BLOCK_SIZE];

// ...
// outputColor is calculated from shared memory above
// ...

// TColor is just unsigned int, as in the CUDA SDK Image Denoising example
TColor unsgndIntColor = make_color(outputColor.x, outputColor.y, outputColor.z, 0);

// No problem with the frame rate until here.
// When I do the following assignment, the frame rate halves.
filterOut[threadIdx.x + __mul24(BLOCKDIM_X, threadIdx.y)] = unsgndIntColor;

When I omit the assignment, or assign some other value such as “threadIdx.x”, the code runs at full speed. So this single assignment is what degrades the performance. I am sure I am missing something, but I cannot figure out what. Can anyone take a guess?

The compiler is optimising out make_color() when you don’t use the returned value for anything.
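To illustrate, here is a standalone sketch (not your kernel: the withStore/withThreadIdx variants, the stand-in src[] load, and the BLOCKDIM constants are my own assumptions). The only thing that forces nvcc to keep the packing arithmetic, and the reads feeding it, is a store whose result can eventually be observed; take that away and the kernel looks faster simply because it no longer does the work.

// Minimal sketch of the dead-code-elimination effect (hypothetical kernels).

#include <cuda_runtime.h>

#define BLOCKDIM_X 16
#define BLOCKDIM_Y 16
#define BLOCK_SIZE (BLOCKDIM_X * BLOCKDIM_Y)

typedef unsigned int TColor;   // as in the SDK Image Denoising sample

// Same packing idea as the SDK sample's make_color().
__device__ TColor make_color(float r, float g, float b, float a)
{
    return ((unsigned int)(a * 255.0f) << 24) |
           ((unsigned int)(b * 255.0f) << 16) |
           ((unsigned int)(g * 255.0f) <<  8) |
            (unsigned int)(r * 255.0f);
}

// Variant A: the packed colour is stored and later written out, so the
// load and make_color() really have to be executed.
__global__ void withStore(TColor *dst, const float4 *src)
{
    __shared__ unsigned int filterOut[BLOCK_SIZE];
    const int tid = threadIdx.x + __mul24(BLOCKDIM_X, threadIdx.y);
    const int gid = blockIdx.x * BLOCK_SIZE + tid;

    float4 c = src[gid];                        // stands in for the filtering work
    TColor packed = make_color(c.x, c.y, c.z, 0.0f);

    filterOut[tid] = packed;                    // the assignment in question
    __syncthreads();
    dst[gid] = filterOut[tid];                  // makes the result observable
}

// Variant B: filterOut gets threadIdx.x instead, so nothing observable
// depends on `packed`. The compiler removes make_color() and the load of
// src[gid] as dead code, and the kernel runs much faster, because it is
// doing almost none of the work.
__global__ void withThreadIdx(TColor *dst, const float4 *src)
{
    __shared__ unsigned int filterOut[BLOCK_SIZE];
    const int tid = threadIdx.x + __mul24(BLOCKDIM_X, threadIdx.y);
    const int gid = blockIdx.x * BLOCK_SIZE + tid;

    float4 c = src[gid];
    TColor packed = make_color(c.x, c.y, c.z, 0.0f);
    (void)packed;                               // unused: eliminated by nvcc

    filterOut[tid] = threadIdx.x;
    __syncthreads();
    dst[gid] = filterOut[tid];
}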

I see, that was a stupid question. Thanks anyway!