I have a small kernel that is having some warp serialize issue. The kernel is as follow (the excerpt of the real code, only for performance tuning purpose):
GTX 280
__global__ void
cuda_my_kernel (float *bV2, float *T, float *bC1, float *bC2, int IB, int NB, int BB, int k)
{
int i, j=0, kk;
__shared__ float V2[16][16];
__shared__ float C2[16][16];
__shared__ float C1[16][16];
// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;
float *mbC2 = bC2;
float *mbV2 = bV2;
float _c1;
// make a copy of C1
_c1 = C1[ty][tx];
float value = 0;
#pragma unroll
for (kk=0; kk<NB; kk++)
value += V2[kk][ty]*C2[tx][kk];
C1[ty][tx] = value;
__syncthreads();
}
According to the visual profiler, the above code has 4032 warp serialize when fired up using
cuda_my_kernel<<<dimGrid,dimBlock>>>(d_V2, d_T, d_C1, d_C2, IB, NB, BB, k);
where dimGrid is (1,29) and dimBlock is (16, 16), NB=IB=BB=16, k=1;
now if
C1[ty][tx] = value;
is replaced with
C1[ty][tx] = 1;
then no warp serialize is reported. Anyone any idea why the variable ‘value’ is causing this much problem?
Thanks!
oh, another question is that for a shared memory like C1[16][16], why does
C1[ty][tx]=1;
cause no warp serialize but
C1[tx][ty]=1;
cause 241 warp serialize.
Thanks again!