High warp serialize when not using smem at all...

ONeill · February 5, 2010, 2:38pm

Hi all!

I have a little problem with warp serialization here. The strange thing is, I don't use shared memory in this kernel at all. I also can't run out of registers (16k on my GTX285) with my block dims.

At the moment I'm only using 256 threads per block (gridDim.x = 8, gridDim.y = 8, blockDim.x = 256), which is much less than I should be able to use without running into such problems.

But the CUDA profiler spits out the following values:

static smem per block: 24
registers per thread: 25
warp serialize: 12253
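
The register claim above can be checked with a little arithmetic. This is a rough sketch using only the figures quoted in the post. It assumes the 16k figure is the register file of one multiprocessor, and it ignores register allocation granularity. The function names are made up for illustration.

```cpp
// Figures quoted in the post (GTX285, profiler output).
constexpr int kRegsPerMultiprocessor = 16 * 1024; // "16k" registers (assumed per multiprocessor)
constexpr int kRegsPerThread = 25;                // "registers per thread: 25"
constexpr int kThreadsPerBlock = 256;             // blockDim.x = 256

// Registers one block needs, ignoring allocation granularity (an assumption).
constexpr int regsPerBlock(int regsPerThread, int threadsPerBlock)
{
    return regsPerThread * threadsPerBlock;
}

// Blocks that fit on one multiprocessor if registers were the only limit.
constexpr int blocksByRegisters(int regsPerMultiprocessor, int regsPerBlockValue)
{
    return regsPerMultiprocessor / regsPerBlockValue;
}

// 25 * 256 = 6400 registers per block, well under the 16384 available.
static_assert(regsPerBlock(kRegsPerThread, kThreadsPerBlock) == 6400,
              "register demand of one block");
```

So a single block fits comfortably, which is consistent with the post: registers alone do not look like the limit here.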

Any ideas on what the reason could be? Or better… the solution? :)

My kernel is doing an insertion sort (about 2k independent sets to sort, 24 elements each) and looks somewhat like this:

float setReg[24];
float temp;

// copy data from global memory into regs
for (i = 0; i < 24; i++)
{
	setReg[i] = setsGmem[startElem + i];
}

// insertion sort
for (i = 1; i < 24; i++)
{
	j = i;
	temp = setReg[i];
	while (j > 0 && setReg[j - 1] > temp)
	{
		setReg[j] = setReg[j - 1];
		j--;
	}
	setReg[j] = temp;
}

// copy data from regs into global memory
for (i = 0; i < 24; i++)
{
	setsGmem[startElem + i] = setReg[i];
}
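
One variation that may be worth profiling is sketched below. It is not a confirmed explanation of the warp serialize count. The idea comes from the CUDA programming guide, which says that a per-thread array is likely to be placed in local memory when the compiler cannot determine that it is indexed with constant quantities. In the insertion sort, `j` depends on the data, so `setReg` may not actually live in registers. The sketch swaps in an odd-even transposition sort, a different algorithm from the one in the post, whose indices are all known at compile time once the loops are unrolled. The names `sortSet`, `compareExchange`, `sortSetsKernel`, `kSetSize` and `UNROLL` are hypothetical, as are the 1D thread indexing and the `set * 24` layout.

```cpp
// Compiles as plain C++ for host testing and as device code under nvcc.
#if defined(__CUDACC__)
#define HD __host__ __device__
#define UNROLL _Pragma("unroll")
#else
#define HD
#define UNROLL
#endif

constexpr int kSetSize = 24; // elements per set, as in the post

// Order a pair so that a <= b afterwards; no data-dependent indexing.
HD inline void compareExchange(float &a, float &b)
{
    const float lo = a < b ? a : b;
    const float hi = a < b ? b : a;
    a = lo;
    b = hi;
}

// Sort one set of kSetSize floats starting at sets[startElem].
HD inline void sortSet(float *sets, int startElem)
{
    float r[kSetSize];

    // copy data from global memory into the per-thread array
    UNROLL
    for (int i = 0; i < kSetSize; ++i)
        r[i] = sets[startElem + i];

    // odd-even transposition sort: kSetSize passes over adjacent pairs,
    // alternating between pairs starting at index 0 and index 1
    UNROLL
    for (int pass = 0; pass < kSetSize; ++pass) {
        UNROLL
        for (int k = pass & 1; k + 1 < kSetSize; k += 2)
            compareExchange(r[k], r[k + 1]);
    }

    // copy data back into global memory
    UNROLL
    for (int i = 0; i < kSetSize; ++i)
        sets[startElem + i] = r[i];
}

#if defined(__CUDACC__)
// Hypothetical launch wrapper: one thread per set, 1D indexing assumed.
__global__ void sortSetsKernel(float *setsGmem, int numSets)
{
    const int set = blockIdx.x * blockDim.x + threadIdx.x;
    if (set < numSets)
        sortSet(setsGmem, set * kSetSize);
}
#endif
```

This does more comparisons than insertion sort (roughly 24 passes of 11 or 12 compare-exchanges), but every thread in a warp executes the same instruction sequence. Whether it actually lowers the warp serialize counter would have to be measured with the profiler.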
