High warp serialize count when not using shared memory at all

Hi all!

I have a little problem with warp serialization here. The strange thing is, I don't use shared memory in this kernel at all. I also can't be running out of registers (16K per multiprocessor on my GTX 285) with my block dims.

At the moment I'm only using 256 threads per block (gridDim.x = 8, gridDim.y = 8, blockDim.x = 256), which is far fewer than I should be able to use without running into such problems.

But the CUDA profiler reports the following values:

  • static smem per block: 24

  • registers per thread: 25

  • warp serialize: 12253

Any ideas on what the reason could be? Or better… the solution? :)

My kernel is doing an insertion sort (about 2k independent sets to sort, 24 elements each) and looks somewhat like this:

float setReg[24];
float temp;

// copy data from global memory into regs
for (i = 0; i < 24; i++)
{
	setReg[i] = setsGmem[startElem + i];
}

// insertion sort
for (i = 1; i < 24; i++)
{
	j = i;
	temp = setReg[i];
	while (j > 0 && setReg[j - 1] > temp)
	{
		setReg[j] = setReg[j - 1];
		j--;
	}
	setReg[j] = temp;
}

// copy data from regs into global memory
for (i = 0; i < 24; i++)
{
	setsGmem[startElem + i] = setReg[i];
}
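
In case it helps anyone reproduce this or check the kernel's output, here is a rough host-side reference of the same sort in plain C. It is only a sketch: the names SET_SIZE, sortSet and sortSetsHost are made up for this post, and it assumes the sets are stored back to back, so that set s starts at element s * 24 (in the kernel that would correspond to startElem).

```c
#include <stddef.h>

#define SET_SIZE 24  /* elements per set, as in the kernel above */

/* Insertion sort of one set, in place (same loop as in the kernel). */
static void sortSet(float *set)
{
    int i, j;
    for (i = 1; i < SET_SIZE; i++) {
        float temp = set[i];
        j = i;
        /* shift larger elements one slot to the right */
        while (j > 0 && set[j - 1] > temp) {
            set[j] = set[j - 1];
            j--;
        }
        set[j] = temp;
    }
}

/* Sort numSets consecutive sets. Each loop iteration does the work
   of one GPU thread. Assumes set s starts at sets[s * SET_SIZE]. */
static void sortSetsHost(float *sets, size_t numSets)
{
    size_t s;
    for (s = 0; s < numSets; s++)
        sortSet(sets + s * SET_SIZE);
}
```

Running sortSetsHost on a copy of the input and comparing it element by element with what the kernel writes back to setsGmem should at least confirm that the GPU results are correct, independent of the profiler counters.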