Hi all!
I have a little problem with warp serialization here. The strange thing is, I don't use shared memory in this kernel at all. I also can't be running out of registers (16k on my GTX285) with my block dims.
At the moment I'm only using 256 threads per block (gridDim.x = 8, gridDim.y = 8, blockDim.x = 256), which is much less than I should be able to use without running into such problems.
But CUDA profiler spits out the following values:
- static smem per block: 24
- registers per thread: 25
- warp serialize: 12253
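To back up the register claim, here is the arithmetic as a small host-side C sketch. The function names are made up for illustration, 16384 is the "16k" register file mentioned above taken per multiprocessor, and the calculation ignores the hardware's register allocation granularity, so treat it as a rough bound rather than an exact occupancy figure.

```c
#include <assert.h>

/* Registers needed by one block, ignoring allocation granularity (assumption). */
static int regs_per_block(int regs_per_thread, int threads_per_block)
{
    assert(regs_per_thread > 0 && threads_per_block > 0);
    return regs_per_thread * threads_per_block;
}

/* Number of blocks that fit on one multiprocessor if registers were the only limit. */
static int blocks_by_registers(int regs_per_sm, int regs_per_thread,
                               int threads_per_block)
{
    return regs_per_sm / regs_per_block(regs_per_thread, threads_per_block);
}
```

With 25 registers per thread and 256 threads per block, regs_per_block(25, 256) comes to 6400, so by registers alone two such blocks would fit into 16384.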
Any ideas on what the reason could be? Or better… the solution? :)
My kernel is doing an insertion sort (about 2k independent sets to sort, 24 elements each) and looks somewhat like this:
float setReg[24];
float temp;

// copy data from global memory into regs
for (i = 0; i < 24; i++)
{
    setReg[i] = setsGmem[startElem + i];
}

// insertion sort
for (i = 1; i < 24; i++)
{
    j = i;
    temp = setReg[i];
    while (j > 0 && setReg[j - 1] > temp)
    {
        setReg[j] = setReg[j - 1];
        j--;
    }
    setReg[j] = temp;
}

// copy data from regs into global memory
for (i = 0; i < 24; i++)
{
    setsGmem[startElem + i] = setReg[i];
}
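For checking the kernel's output, the same insertion sort can be run on the host. This is only a sketch in plain C: the names sort_one_set, sort_sets_host and sets_are_sorted are hypothetical, and it assumes the sets are stored back to back, so that set s starts at element s * 24. The post does not show how startElem is computed, so that layout is a guess.

```c
#include <assert.h>
#include <stddef.h>

#define SET_SIZE 24  /* elements per set, as in the kernel fragment */

/* Insertion sort of one set in place; same loop structure as the kernel. */
static void sort_one_set(float *set)
{
    for (int i = 1; i < SET_SIZE; i++) {
        float temp = set[i];
        int j = i;
        while (j > 0 && set[j - 1] > temp) {
            set[j] = set[j - 1];
            j--;
        }
        set[j] = temp;
    }
}

/* Sort numSets consecutive sets (assumed layout: set s starts at s * SET_SIZE). */
static void sort_sets_host(float *sets, size_t numSets)
{
    assert(sets != NULL);
    for (size_t s = 0; s < numSets; s++)
        sort_one_set(sets + s * SET_SIZE);
}

/* Returns 1 if every set is in non-decreasing order, 0 otherwise. */
static int sets_are_sorted(const float *sets, size_t numSets)
{
    for (size_t s = 0; s < numSets; s++)
        for (int i = 1; i < SET_SIZE; i++)
            if (sets[s * SET_SIZE + i - 1] > sets[s * SET_SIZE + i])
                return 0;
    return 1;
}
```

Running sort_sets_host on a copy of the input and comparing it element by element with what the kernel writes back to setsGmem would at least confirm that the GPU result is correct, independent of the serialization counter.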