warp serialize problem

I have a small kernel that shows a warp-serialize issue on a GTX 280. The kernel is as follows (an excerpt of the real code, kept only for performance tuning):

__global__ void
cuda_my_kernel (float *bV2, float *T, float *bC1, float *bC2, int IB, int NB, int BB, int k)
{
	int i, j = 0, kk;
	__shared__ float V2[16][16];
	__shared__ float C2[16][16];
	__shared__ float C1[16][16];

	// Thread index
	int tx = threadIdx.x;
	int ty = threadIdx.y;

	float *mbC2 = bC2;
	float *mbV2 = bV2;

	// make a copy of C1
	float _c1 = C1[ty][tx];

	float value = 0;
	#pragma unroll
	for (kk = 0; kk < NB; kk++)
		value += V2[kk][ty] * C2[tx][kk];

	C1[ty][tx] = value;
	__syncthreads();
}

According to the Visual Profiler, the above code reports 4032 warp serializes when launched with

cuda_my_kernel<<<dimGrid,dimBlock>>>(d_V2, d_T, d_C1, d_C2, IB, NB, BB, k);

where dimGrid is (1,29), dimBlock is (16,16), NB=IB=BB=16, and k=1.

Now, if

C1[ty][tx] = value;

is replaced with

C1[ty][tx] = 1;

then no warp serializes are reported. Does anyone have an idea why the variable 'value' causes this much trouble?

Thanks!

Oh, another question: for a shared-memory array like C1[16][16], why does

C1[ty][tx] = 1;

cause no warp serializes, while

C1[tx][ty] = 1;

causes 241 warp serializes?

Thanks again!

Okay, I think I can answer the second question myself:
C1[16][16] is stored row-major, so a half-warp doing C1[ty][tx] writes the 16 consecutive elements of row ty, and those map one-to-one onto the 16 shared-memory banks of the GTX 280, so there is no conflict. A half-warp doing C1[tx][ty] instead writes column ty; consecutive elements of a column are 16 floats apart, so they all fall into the same bank, giving a 16-way bank conflict. Each conflicted access is serialized into 16 transactions, i.e. 15 replays, and with 16 half-warps per block that gives 16 × 15 = 240 serialized warps, very close to the 241 observed by the profiler. Correct me if I'm wrong.
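For reference, a standard fix for this kind of column-wise conflict is to pad the inner dimension of the shared array by one element, so that successive rows of a column land in different banks. A minimal sketch (the kernel name is made up for illustration; only the padded declaration matters):

__global__ void
pad_demo (void)
{
	// Padding the inner dimension from 16 to 17 makes consecutive
	// elements of a column 17 words apart, so the bank of C1[tx][ty]
	// is (tx + ty) mod 16 and the 16 threads of a half-warp hit 16
	// different banks.
	__shared__ float C1[16][17];	// 16x16 data plus one padding column

	int tx = threadIdx.x;
	int ty = threadIdx.y;

	C1[tx][ty] = 1;		// column access, now conflict-free
	__syncthreads();
}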

thanks.

Alright, got the answer for the variable 'value': same reason. Within a half-warp, V2[kk][ty] is a broadcast (all 16 threads read the same word), but C2[tx][kk] walks down a column with a stride of 16 floats, so all 16 threads hit the same bank on every iteration of the loop: a 16-way bank conflict. When the store is changed to C1[ty][tx] = 1, 'value' is never used, so the compiler removes the whole loop as dead code, and the serialization disappears with it.
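For completeness, here is what the same padding trick looks like applied to the loop above; a minimal sketch, with the kernel name and the NB parameter chosen just for illustration (V2 and C2 are left uninitialized, as in the original timing excerpt):

__global__ void
padded_loop_demo (int NB)
{
	__shared__ float V2[16][16];
	__shared__ float C2[16][17];	// inner dimension padded from 16 to 17
	__shared__ float C1[16][16];

	int tx = threadIdx.x;
	int ty = threadIdx.y;

	float value = 0;
	for (int kk = 0; kk < NB; kk++)
		// The address of C2[tx][kk] is 17*tx + kk, so its bank is
		// (tx + kk) mod 16: all 16 threads of a half-warp touch
		// different banks on every iteration.
		value += V2[kk][ty] * C2[tx][kk];

	C1[ty][tx] = value;
	__syncthreads();
}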