warp serialize problem

I have a small kernel that shows a warp-serialize issue on a GTX 280. The kernel is as follows (an excerpt of the real code, kept only for performance tuning):

__global__ void
cuda_my_kernel (float *bV2, float *T, float *bC1, float *bC2, int IB, int NB, int BB, int k)
{
	int i, j = 0, kk;
	__shared__ float V2[16][16];
	__shared__ float C2[16][16];
	__shared__ float C1[16][16];

	// Thread index
	int tx = threadIdx.x;
	int ty = threadIdx.y;

	float *mbC2 = bC2;
	float *mbV2 = bV2;

	// make a copy of C1
	float _c1 = C1[ty][tx];

	float value = 0;
	#pragma unroll
	for (kk = 0; kk < NB; kk++)
		value += V2[kk][ty] * C2[tx][kk];

	C1[ty][tx] = value;
	__syncthreads();
}

According to the Visual Profiler, the above code reports 4032 warp serializes when launched with

cuda_my_kernel<<<dimGrid,dimBlock>>>(d_V2, d_T, d_C1, d_C2, IB, NB, BB, k);

where dimGrid is (1,29), dimBlock is (16,16), NB=IB=BB=16, and k=1.

Now, if

C1[ty][tx] = value;

is replaced with

C1[ty][tx] = 1;

then no warp serializes are reported. Does anyone have an idea why the variable 'value' causes this much trouble?

Thanks!

Oh, another question: for a shared-memory array like C1[16][16], why does

C1[ty][tx] = 1;

cause no warp serializes, while

C1[tx][ty] = 1;

causes 241 warp serializes?

Thanks again!

Okay, I think I can answer the second question myself:
C1[16][16] is stored row-major, so a half-warp doing C1[ty][tx] writes the 16 consecutive elements of row ty, and those map one-to-one onto the 16 shared-memory banks of the GTX 280, so there is no conflict. A half-warp doing C1[tx][ty] instead writes column ty; consecutive elements of a column are 16 floats apart, so they all fall into the same bank, giving a 16-way bank conflict. Each conflicted access is serialized into 16 transactions, i.e. 15 replays, and with 16 half-warps per block that gives 16 × 15 = 240 serialized warps, very close to the 241 observed by the profiler. Correct me if I'm wrong.
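For reference, a standard fix for this kind of column-wise conflict is to pad the inner dimension of the shared array by one element, so that successive rows of a column land in different banks. A minimal sketch (the kernel name is made up for illustration; only the padded declaration matters):

__global__ void
pad_demo (void)
{
	// Padding the inner dimension from 16 to 17 makes consecutive
	// elements of a column 17 words apart, so the bank of C1[tx][ty]
	// is (tx + ty) mod 16 and the 16 threads of a half-warp hit 16
	// different banks.
	__shared__ float C1[16][17];	// 16x16 data plus one padding column

	int tx = threadIdx.x;
	int ty = threadIdx.y;

	C1[tx][ty] = 1;		// column access, now conflict-free
	__syncthreads();
}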

thanks.

Alright, got the answer for the variable 'value': same reason. Within a half-warp, V2[kk][ty] is a broadcast (all 16 threads read the same word), but C2[tx][kk] walks down a column with a stride of 16 floats, so all 16 threads hit the same bank on every iteration of the loop: a 16-way bank conflict. When the store is changed to C1[ty][tx] = 1, 'value' is never used, so the compiler removes the whole loop as dead code, and the serialization disappears with it.
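For completeness, here is what the same padding trick looks like applied to the loop above; a minimal sketch, with the kernel name and the NB parameter chosen just for illustration (V2 and C2 are left uninitialized, as in the original timing excerpt):

__global__ void
padded_loop_demo (int NB)
{
	__shared__ float V2[16][16];
	__shared__ float C2[16][17];	// inner dimension padded from 16 to 17
	__shared__ float C1[16][16];

	int tx = threadIdx.x;
	int ty = threadIdx.y;

	float value = 0;
	for (int kk = 0; kk < NB; kk++)
		// The address of C2[tx][kk] is 17*tx + kk, so its bank is
		// (tx + kk) mod 16: all 16 threads of a half-warp touch
		// different banks on every iteration.
		value += V2[kk][ty] * C2[tx][kk];

	C1[ty][tx] = value;
	__syncthreads();
}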