Hi, I have been struggling to get a very simple even/odd routine working.
The routine is embedded in a more complex sort algorithm, so I thought something was wrong in the algorithm.
After weeks of debugging and tweaking, all I get is random hangs and crashes, sometimes so severe that I have no choice but to reboot Windows.
So I decided to isolate the routine, and I believe I managed to reproduce the problem. But I can't debug it because, for some reason I do not understand, adding any debug code to the routine makes it work.
Also, the larger sorting routine works perfectly when using a bitonic sort. It also works in C++, using pseudo CUDA emulation.
Below is an extremely simplified version, extracted from NVIDIA pseudocode found on some GitHub site.
__global__ void ndCudaEvenOddFail()
{
    __shared__ int passes[1024];
    __shared__ int sortedRadix[1024];

    int threadId = threadIdx.x;
    int blockStride = blockDim.x;

    // fill the block with keys in descending order
    sortedRadix[threadId] = blockStride - threadId;
    __syncthreads();

    int xxx = 0;
    do
    {
        // clear the per-pair swap flags
        if (threadId < blockStride / 2)
        {
            passes[threadId] = 0;
        }
        __syncthreads();

        // even phase: compare pairs (0,1), (2,3), ...
        if (threadId < blockStride / 2)
        {
            int id0 = threadId * 2 + 0;
            int id1 = threadId * 2 + 1;
            int key0 = sortedRadix[id0];
            int key1 = sortedRadix[id1];
            if (key1 < key0)
            {
                sortedRadix[id0] = key1;
                sortedRadix[id1] = key0;
                passes[threadId] = 1;
            }
        }
        __syncthreads();

        // odd phase: compare pairs (1,2), (3,4), ...
        if (threadId < (blockStride / 2 - 1))
        {
            int id0 = threadId * 2 + 1;
            int id1 = threadId * 2 + 2;
            int key0 = sortedRadix[id0];
            int key1 = sortedRadix[id1];
            if (key1 < key0)
            {
                sortedRadix[id0] = key1;
                sortedRadix[id1] = key0;
                passes[threadId] = 1;
            }
        }
        __syncthreads();

        // reduce the swap flags into passes[0], stopping early once it is non-zero
        for (int block = blockStride / 4; !passes[0] && block; block >>= 1)
        {
            if (threadId < block)
            {
                passes[threadId] += passes[block + threadId];
            }
            __syncthreads();
        }

        // adding any debug code (printf, running a debug build, ...) makes the function work.
        //if (threadId == 0 && xxx == 0)
        //{
        //    printf("xxxx %d\n", blockIdx.x);
        //}
        //__syncthreads();
        xxx = 1;

        // also, if I uncomment this, CUDA freezes in cudaLaunchKernel without any error indication.
        //__syncthreads();
    } while (passes[0]);
}
I am using a GeForce 1660 Ultra. Calling it with these parameters
ndCudaEvenOddFail<<<22 * 2, 256, 0>>>();
the routine runs for a few seconds, but if I call it with
ndCudaEvenOddFail<<<22 * 3, 256, 0>>>();
then it runs for fewer iterations. Calling it with
ndCudaEvenOddFail<<<22 * 4, 256, 0>>>();
does not even complete a single iteration.
Another interesting piece of information is that if I just uncomment the last sync before the counter
__syncthreads();
xxx = 1;
the routine freezes in the first iteration.
I have no idea how to debug this, because it is not reporting any error;
the debugger just shows cudaLaunchKernel and stays there in an infinite loop.
Again, the same routine using bitonic sort works perfectly.
This is not a Windows timeout, since the code takes less than 10 microseconds for a million elements, and even when setting the timeout to a very large value, these hangs still happen.
The behavior is different on two GPUs, the 1060 and the 1660, but it is wrong on both.
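
One thing that may be worth testing, offered as a guess and not a confirmed diagnosis: in the kernel above, `while (passes[0])` reads a shared value that thread 0 resets with `passes[threadId] = 0` at the top of the next iteration, and no barrier separates that read from that write (the `for` condition `!passes[0]` follows the same pattern). If threads of one block disagree about leaving a loop, some of them wait at a `__syncthreads()` that the others never reach, which could explain hangs that change with timing, debug code, or GPU model. The sketch below keeps the even and odd phases but takes the exit decision from `__syncthreads_or`, a barrier that also returns the same value to every thread in the block. The names `ndCudaEvenOddSketch`, `runEvenOddSketch`, `CUDA_CHECK` and the `out` buffer are assumptions added for illustration, and the sketch assumes an even block size of at most 1024.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper, not from the post: stop with a message on any CUDA error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Variant of the even/odd routine; the name and the out buffer are assumptions.
__global__ void ndCudaEvenOddSketch(int* out)
{
    __shared__ int sortedRadix[1024];
    int threadId = threadIdx.x;
    int blockStride = blockDim.x;  // assumed even and <= 1024

    sortedRadix[threadId] = blockStride - threadId;
    __syncthreads();

    int anySwap;
    do
    {
        int swapped = 0;  // per-thread flag kept in a register, not in shared memory

        // even phase: compare pairs (0,1), (2,3), ...
        if (threadId < blockStride / 2)
        {
            int id0 = threadId * 2 + 0;
            int id1 = threadId * 2 + 1;
            int key0 = sortedRadix[id0];
            int key1 = sortedRadix[id1];
            if (key1 < key0)
            {
                sortedRadix[id0] = key1;
                sortedRadix[id1] = key0;
                swapped = 1;
            }
        }
        __syncthreads();

        // odd phase: compare pairs (1,2), (3,4), ...
        if (threadId < (blockStride / 2 - 1))
        {
            int id0 = threadId * 2 + 1;
            int id1 = threadId * 2 + 2;
            int key0 = sortedRadix[id0];
            int key1 = sortedRadix[id1];
            if (key1 < key0)
            {
                sortedRadix[id0] = key1;
                sortedRadix[id1] = key0;
                swapped = 1;
            }
        }

        // barrier plus block-wide OR: all threads receive the same result,
        // so they all leave the loop in the same iteration
        anySwap = __syncthreads_or(swapped);
    } while (anySwap);

    out[blockIdx.x * blockStride + threadId] = sortedRadix[threadId];
}

// Launches the sketch and returns how many output elements are out of place.
int runEvenOddSketch(int blocks, int threads)
{
    const size_t count = size_t(blocks) * threads;
    int* deviceOut = nullptr;
    CUDA_CHECK(cudaMalloc(&deviceOut, count * sizeof(int)));

    ndCudaEvenOddSketch<<<blocks, threads>>>(deviceOut);
    CUDA_CHECK(cudaGetLastError());       // errors detected at launch
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    std::vector<int> hostOut(count);
    CUDA_CHECK(cudaMemcpy(hostOut.data(), deviceOut, count * sizeof(int),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(deviceOut));

    int mismatches = 0;
    for (size_t i = 0; i < count; i++)
    {
        // each block should end up holding 1, 2, ..., threads
        if (hostOut[i] != int(i % threads) + 1) mismatches++;
    }
    return mismatches;
}

int main()
{
    int mismatches = runEvenOddSketch(22 * 4, 256);  // largest launch from the post
    printf("mismatches: %d\n", mismatches);
    return mismatches ? 1 : 0;
}
```

The `cudaGetLastError` and `cudaDeviceSynchronize` checks after the launch are there so that a kernel that fails is reported on the host. Running the original kernel under Compute Sanitizer's racecheck tool may also help confirm or rule out the shared-memory race suspected here.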