Possible compiler bug in CUDA 2.3

I’ve written a transpose function similar to the one in the SDK, but with configurable source and destination pitches. The version below appears to work as intended. However, if I substitute the commented-out version of the second loop, only the results from the first iteration of that loop are correct. The strange thing is that the first loop uses an identical construction and shows no such problem there.

I see the same behaviour on a GTX250 under Windows 7 64-bit and on a GTX260 under Windows XP Pro 64-bit. In both cases I’m compiling as 32-bit.

[codebox]__global__ void kernelFwdTranspose(int sizex, int sizey, unsigned int src, unsigned int srcpitch, unsigned int dst, unsigned int dstpitch)
{
// This kernel calculates the transpose for a block of 32 x 32 output pixels using a block of 32 x 8 threads
// Notes:
//    src must be of size (sizex, sizey) pixels with a pitch of srcpitch bytes
//    dst must be of size (sizey, sizex) pixels with a pitch of dstpitch bytes
//    dstpitch should be selected carefully to avoid performance problems due to "partition camping"

// Row stride of 33 floats (not 32) pads the tile to avoid shared memory bank conflicts
__shared__ float shared[33 * 32];

// First loop: each thread reads up to 4 source pixels (rows y, y + 8, y + 16, y + 24) into shared memory
{
	unsigned int x = blockIdx.x * 32 + threadIdx.x;
	if(x < sizex)
	{
		unsigned int index1 = threadIdx.y * 33 + threadIdx.x;
		unsigned int y = blockIdx.y * 32 + threadIdx.y;
		unsigned int maxy = min(blockIdx.y * 32 + 32, sizey);
		while(y < maxy)
		{
			unsigned int index3 = __umul24(y, srcpitch) + x * 4;
			shared[index1] = *(float *)((char *)src + index3);
			index1 += 8 * 33;
			y += 8;
		}
	}
}

__syncthreads();

// Second loop: each thread writes up to 4 transposed pixels from shared memory to dst
{
	unsigned int x = blockIdx.y * 32 + threadIdx.x;
	if(x < sizey)
	{
		unsigned int index1 = threadIdx.x * 33 + threadIdx.y;

		// Failing version: same construction as the first loop, but only the first iteration gives correct results
		/*unsigned int y = blockIdx.x * 32 + threadIdx.y;
		unsigned int maxy = min(blockIdx.x * 32 + 32, sizex);
		while(y < maxy)
		{
			unsigned int index3 = __umul24(y, dstpitch) + x * 4;
			*(float *)((char *)dst + index3) = shared[index1];
			index1 += 8;
			y += 8;
		}*/

		// Working version: the bound is tested with two comparisons instead of min()
		unsigned int y = blockIdx.x * 32 + threadIdx.y;
		unsigned int maxy = blockIdx.x * 32 + 32;
		while(y < sizex && y < maxy)
		{
			unsigned int index3 = __umul24(y, dstpitch) + x * 4;
			*(float *)((char *)dst + index3) = shared[index1];
			index1 += 8;
			y += 8;
		}
	}
}
}[/codebox]
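On paper the two forms of the second loop should visit exactly the same destination rows, which is what makes the difference in behaviour suspicious. The following is a minimal host-side sketch of that check, not part of the original kernel. The helper names (`rowsWithMin`, `rowsWithTwoTests`, `loopFormsAgree`) are hypothetical, and it assumes that `sizex` is non-negative and that the device `min` behaves like `std::min` on unsigned values.

```cpp
#include <algorithm>
#include <vector>

// Rows visited by the commented-out form: the bound is folded into one min().
// Assumption: sizex is non-negative, so the cast to unsigned is safe.
std::vector<unsigned int> rowsWithMin(unsigned int blockX, unsigned int threadY, int sizex)
{
    std::vector<unsigned int> rows;
    unsigned int y = blockX * 32 + threadY;
    unsigned int maxy = std::min(blockX * 32 + 32, static_cast<unsigned int>(sizex));
    while (y < maxy) {
        rows.push_back(y);
        y += 8;
    }
    return rows;
}

// Rows visited by the working form: two separate comparisons.
std::vector<unsigned int> rowsWithTwoTests(unsigned int blockX, unsigned int threadY, int sizex)
{
    std::vector<unsigned int> rows;
    unsigned int y = blockX * 32 + threadY;
    unsigned int maxy = blockX * 32 + 32;
    while (y < static_cast<unsigned int>(sizex) && y < maxy) {
        rows.push_back(y);
        y += 8;
    }
    return rows;
}

// Returns true when both forms visit the same rows for every block column
// and every threadIdx.y value (0..7) of a 32 x 8 thread block.
bool loopFormsAgree(int sizex)
{
    unsigned int blocks = (static_cast<unsigned int>(sizex) + 31) / 32;
    for (unsigned int b = 0; b < blocks; ++b)
        for (unsigned int t = 0; t < 8; ++t)
            if (rowsWithMin(b, t, sizex) != rowsWithTwoTests(b, t, sizex))
                return false;
    return true;
}
```

If `loopFormsAgree` holds for the sizes being tested, the source-level logic of the two loops is the same, and any difference on the GPU would have to come from how the code is compiled or executed.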

I’ve edited the above post because for some reason not all of the code was being displayed.
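To pin down which iterations of the second loop produce wrong output, the kernel result can be compared against a host reference. The sketch below is one way to do that under stated assumptions: `referenceTranspose` and `mismatchesPerIteration` are hypothetical helpers, the buffers are host copies of the device data, and pitches are in bytes as in the kernel.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Host reference transpose. Pitches are in bytes, as in the kernel.
// src holds sizey rows of sizex floats; dst holds sizex rows of sizey floats.
void referenceTranspose(int sizex, int sizey,
                        const unsigned char *src, std::size_t srcpitch,
                        unsigned char *dst, std::size_t dstpitch)
{
    for (int y = 0; y < sizey; ++y)
        for (int x = 0; x < sizex; ++x) {
            float value;
            std::memcpy(&value, src + y * srcpitch + x * sizeof(float), sizeof(float));
            std::memcpy(dst + x * dstpitch + y * sizeof(float), &value, sizeof(float));
        }
}

// Counts mismatching elements per iteration of the kernel's second loop.
// Destination row r is written on iteration (r % 32) / 8 of that loop,
// because each thread starts at threadIdx.y (0..7) and steps by 8.
std::vector<int> mismatchesPerIteration(int sizex, int sizey,
                                        const unsigned char *expected,
                                        const unsigned char *actual,
                                        std::size_t dstpitch)
{
    std::vector<int> counts(4, 0);
    for (int row = 0; row < sizex; ++row)
        for (int col = 0; col < sizey; ++col) {
            std::size_t offset = row * dstpitch + col * sizeof(float);
            if (std::memcmp(expected + offset, actual + offset, sizeof(float)) != 0)
                ++counts[(row % 32) / 8];
        }
    return counts;
}
```

With the failing loop, the behaviour described in this post would be expected to show up as zero mismatches in `counts[0]` and nonzero counts in the remaining entries.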