CUDA malfunctions

Hi, I have been struggling to get a very simple even/odd routine working.
The routine is embedded in a more complex sort algorithm, so I thought something was wrong in the algorithm.
After weeks of debugging and tweaking, all I got were random hangs and crashes, some of them so severe that I had no choice but to reboot Windows.
So I decided to isolate the routine, and I believe I managed to reproduce the problem, but I can’t debug it because, for some reason I do not understand, adding any debug code to the routine makes it work.

Also, the larger sorting routine works perfectly when using a bitonic sort. It also works in C++, using pseudo-CUDA emulation.
Below is an extremely simplified version, adapted from NVIDIA pseudocode found on a GitHub site.

__global__ void ndCudaEvenOddFail()
{
	__shared__  int passes[1024];
	__shared__  int sortedRadix[1024];

	int threadId = threadIdx.x;
	int blockStride = blockDim.x;

	sortedRadix[threadId] = blockStride - threadId;
	__syncthreads();

	int xxx = 0;
	do
	{
		if (threadId < blockStride / 2)
		{
			passes[threadId] = 0;
		}
		__syncthreads();

		if (threadId < blockStride / 2)
		{
			int id0 = threadId * 2 + 0;
			int id1 = threadId * 2 + 1;
			int key0 = sortedRadix[id0];
			int key1 = sortedRadix[id1];
			if (key1 < key0)
			{
				sortedRadix[id0] = key1;
				sortedRadix[id1] = key0;
				passes[threadId] = 1;
			}
		}
		__syncthreads();

		if (threadId < (blockStride / 2 - 1))
		{
			int id0 = threadId * 2 + 1;
			int id1 = threadId * 2 + 2;
			int key0 = sortedRadix[id0];
			int key1 = sortedRadix[id1];
			if (key1 < key0)
			{
				sortedRadix[id0] = key1;
				sortedRadix[id1] = key0;
				passes[threadId] = 1;
			}
		}
		__syncthreads();

		for (int block = blockStride / 4; !passes[0] && block; block >>= 1)
		{
			if (threadId < block)
			{
				passes[threadId] += passes[block + threadId];
			}
			__syncthreads();
		}

		// Any debug code (printf, running in a debug build, ...) makes the function work.
		//if (threadId == 0 && xxx == 0)
		//{
		//	printf("xxxx %d\n", blockIdx.x);
		//}
		//__syncthreads();
		xxx = 1;
		
		// Also, if I uncomment this, CUDA freezes in cudaLaunchKernel without any error indication.
		//__syncthreads();

	} while (passes[0]);
}

I am using a GeForce 1660 Ultra, calling the kernel with these parameters:
ndCudaEvenOddFail << <22 * 2, 256, 0 >> > ();

The routine runs for a few seconds, but if I call it with
ndCudaEvenOddFail << <22 * 3, 256, 0 >> > ();

then it runs for fewer iterations. Calling it with
ndCudaEvenOddFail << <22 * 4, 256, 0 >> > ();

does not even complete a single iteration.
Another interesting piece of information is that if I just uncomment the last sync before the counter

__syncthreads();
xxx = 1;

the routine freezes in the first iteration.

I have no idea how to debug this, because it is not reporting any error;
the debugger just shows cudaLaunchKernel and stays there in an infinite loop.
Again, the same routine using bitonic sort works perfectly.
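
Since no error is reported, one thing that may help is checking the status of both the launch and the execution explicitly. The following is only a minimal sketch: `cudaOk`, `placeholderKernel` and `launchAndCheck` are hypothetical names that do not appear in the original code, and the placeholder kernel stands in for the real one. Note that this cannot diagnose a kernel that truly never returns, because `cudaDeviceSynchronize` would then block as well; it only separates launch-time failures from errors raised while the kernel ran.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): prints the CUDA error
// string for a failed call and returns whether the call succeeded.
static bool cudaOk(cudaError_t status, const char* what)
{
    if (status != cudaSuccess)
    {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(status));
        return false;
    }
    return true;
}

// Trivial placeholder kernel so that the sketch is self-contained.
__global__ void placeholderKernel(int* out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = threadIdx.x;
}

// Launches the placeholder kernel and checks launch and execution status.
bool launchAndCheck(int blocks, int threads)
{
    int* out = nullptr;
    if (!cudaOk(cudaMalloc(&out, sizeof(int) * blocks * threads), "cudaMalloc"))
    {
        return false;
    }

    placeholderKernel<<<blocks, threads>>>(out);

    // Reports launch-time problems such as an invalid launch configuration.
    bool ok = cudaOk(cudaGetLastError(), "kernel launch");

    // Waits for the kernel to finish and reports errors raised while it ran.
    ok = cudaOk(cudaDeviceSynchronize(), "kernel execution") && ok;

    cudaFree(out);
    return ok;
}
```

Calling `launchAndCheck(22 * 4, 256)` from `main` mirrors the launch configuration used above; a configuration with more threads per block than the device supports should be reported by the `cudaGetLastError` check.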

This is not a Windows timeout, since the code takes less than 10 microseconds for a million elements, and even when setting the timeout to a very large value, these hangs still happen.
The behavior is different on two GPUs, a 1060 and a 1660, but it is wrong on both.

In case anyone wonders about the code, this is a straight translation of an even/odd bubble sort, code that can be found in places like Wikipedia:
Odd–even sort - Wikipedia.

The particular implementation I listed above is a deliberately naive, almost literal translation of the Wikipedia pseudocode, written to expose the malfunction.
No matter what I do, I can’t get it to work in release mode without adding debug code like printf.
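
For comparison, a sequential host-side version of the odd–even sort can serve as a reference when validating GPU output. This is only a sketch written from the general algorithm (repeat an odd pass and an even pass until no swap occurs); the function name `oddEvenSortReference` is an assumption and does not come from the original project.

```cuda
#include <cstddef>
#include <utility>
#include <vector>

// Host-only reference: repeats an odd pass and an even pass over the array
// until a full round completes without swapping any pair.
void oddEvenSortReference(std::vector<int>& keys)
{
    bool sorted = false;
    while (!sorted)
    {
        sorted = true;

        // Odd pass: compare pairs (1,2), (3,4), ...
        for (std::size_t i = 1; i + 1 < keys.size(); i += 2)
        {
            if (keys[i + 1] < keys[i])
            {
                std::swap(keys[i], keys[i + 1]);
                sorted = false;
            }
        }

        // Even pass: compare pairs (0,1), (2,3), ...
        for (std::size_t i = 0; i + 1 < keys.size(); i += 2)
        {
            if (keys[i + 1] < keys[i])
            {
                std::swap(keys[i], keys[i + 1]);
                sorted = false;
            }
        }
    }
}
```

The kernel above performs the same two passes per loop iteration, with one thread per pair, so its final shared-memory contents should match what this function produces for the same input.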

It also works in debug, so I figure I must have some race condition,
but even adding __syncthreads() causes the GPU to hang in release mode.
Thanks
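
On the race-condition suspicion: in the kernel as posted, both the `for` condition and the `while` condition read `passes[0]` with no barrier between that read and thread 0's next write to `passes[0]`, so different warps can see different values and take different paths around `__syncthreads()`. Whether that is what caused the hangs described here is not confirmed anywhere in this thread. The sketch below shows one way to avoid that pattern, replacing the `passes` reduction with `__syncthreads_or`, which is a barrier that hands every thread the same result. The names `evenOddSortSketch`, `runEvenOddSortSketch` and the `sortedOut` buffer are assumptions added for illustration, and the kernel assumes an even `blockDim.x` of at most 1024.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Sketch only: same even/odd passes as the posted kernel, but the loop exit
// is decided by __syncthreads_or so all threads agree on it.
// sortedOut is a hypothetical buffer added so the result can be checked.
__global__ void evenOddSortSketch(int* sortedOut)
{
    __shared__ int sortedRadix[1024];

    const int threadId = threadIdx.x;
    const int blockStride = blockDim.x;

    // Same descending initial data as the posted kernel.
    sortedRadix[threadId] = blockStride - threadId;
    __syncthreads();

    int anySwap;
    do
    {
        int swapped = 0;  // per-thread flag, kept in a register

        // Even pass: pairs (0,1), (2,3), ...
        if (threadId < blockStride / 2)
        {
            const int id0 = threadId * 2;
            const int id1 = id0 + 1;
            const int key0 = sortedRadix[id0];
            const int key1 = sortedRadix[id1];
            if (key1 < key0)
            {
                sortedRadix[id0] = key1;
                sortedRadix[id1] = key0;
                swapped = 1;
            }
        }
        __syncthreads();

        // Odd pass: pairs (1,2), (3,4), ...
        if (threadId < blockStride / 2 - 1)
        {
            const int id0 = threadId * 2 + 1;
            const int id1 = id0 + 1;
            const int key0 = sortedRadix[id0];
            const int key1 = sortedRadix[id1];
            if (key1 < key0)
            {
                sortedRadix[id0] = key1;
                sortedRadix[id1] = key0;
                swapped = 1;
            }
        }

        // Barrier plus block-wide OR of the per-thread flags; every thread
        // receives the same value and leaves the loop in the same iteration.
        anySwap = __syncthreads_or(swapped);
    } while (anySwap);

    sortedOut[blockIdx.x * blockStride + threadId] = sortedRadix[threadId];
}

// Hypothetical host helper: runs the kernel and verifies that every block
// produced the ascending sequence 1..threads.
bool runEvenOddSortSketch(int blocks, int threads)
{
    const std::size_t count = std::size_t(blocks) * threads;
    int* deviceOut = nullptr;
    if (cudaMalloc(&deviceOut, count * sizeof(int)) != cudaSuccess)
    {
        return false;
    }

    evenOddSortSketch<<<blocks, threads>>>(deviceOut);

    std::vector<int> host(count);
    bool ok = cudaDeviceSynchronize() == cudaSuccess &&
              cudaMemcpy(host.data(), deviceOut, count * sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(deviceOut);

    for (std::size_t i = 0; ok && i < count; ++i)
    {
        ok = (host[i] == int(i % threads) + 1);
    }
    return ok;
}
```

Calling `runEvenOddSortSketch(22 * 4, 256)` mirrors the largest launch configuration mentioned above. Independently of this sketch, running the original binary under `compute-sanitizer --tool racecheck` checks for shared-memory hazards, which the default memcheck tool does not look for.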

I don’t have any trouble running your posted code on a GTX 1660Ti on CUDA 12.0:

$ cat t31.cu
#include <cstdio>

__global__ void ndCudaEvenOddFail()
{
        __shared__  int passes[1024];
        __shared__  int sortedRadix[1024];

        int threadId = threadIdx.x;
        int blockStride = blockDim.x;

        sortedRadix[threadId] = blockStride - threadId;
        __syncthreads();

int xxx = 0;
        do
        {
                if (threadId < blockStride / 2)
                {
                        passes[threadId] = 0;
                }
                __syncthreads();

                if (threadId < blockStride / 2)
                {
                        int id0 = threadId * 2 + 0;
                        int id1 = threadId * 2 + 1;
                        int key0 = sortedRadix[id0];
                        int key1 = sortedRadix[id1];
                        if (key1 < key0)
                        {
                                sortedRadix[id0] = key1;
                                sortedRadix[id1] = key0;
                                passes[threadId] = 1;
                        }
                }
                __syncthreads();

                if (threadId < (blockStride / 2 - 1))
                {
                        int id0 = threadId * 2 + 1;
                        int id1 = threadId * 2 + 2;
                        int key0 = sortedRadix[id0];
                        int key1 = sortedRadix[id1];
                        if (key1 < key0)
                        {
                                sortedRadix[id0] = key1;
                                sortedRadix[id1] = key0;
                                passes[threadId] = 1;
                        }
                }
                __syncthreads();

                for (int block = blockStride / 4; !passes[0] && block; block >>= 1)
                {
                        if (threadId < block)
                        {
                                passes[threadId] += passes[block + threadId];
                        }
                        __syncthreads();
                }

                // Any debug code (printf, running in a debug build, ...) makes the function work.
                //if (threadId == 0 && xxx == 0)
                //{
                //      printf("xxxx %d\n", blockIdx.x);
                //}
                //__syncthreads();
                xxx = 1;

                // Also, if I uncomment this, CUDA freezes in cudaLaunchKernel without any error indication.
                //__syncthreads();

        } while (passes[0]);
}

int main(){

  ndCudaEvenOddFail << <22 * 4, 256, 0 >> > ();
  cudaDeviceSynchronize();
}
$ nvcc -o t31 t31.cu -arch=sm_75
t31.cu(14): warning #550-D: variable "xxx" was set but never used

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

$ compute-sanitizer ./t31
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$

Like you said, it runs in less than a second. (For those who may be wondering, curiously, the kernel is not an “empty” kernel.)

With the code posted by @Robert_Crovella (I tried both with and without the __syncthreads() call after xxx=1):

Windows 7, CUDA 9.5, Quadro K420, sm_30, kernel run time 1.2921ms
Windows 10, CUDA 11.8, Quadro P2000, sm_61, kernel run time 224.51us
Windows 10, CUDA 11.8, Quadro RTX4000, sm_75, kernel run time 64.288us

So far, OP’s issue is not reproducible. It is always possible that issues only occur with a particular CUDA / driver combination, or that a CUDA installation has become corrupted. I would suggest trying with the latest available driver and toolchain.

Oh, thanks for the quick reply.
So far, it seems you were right. I had about 5 or 6 versions installed, about 6 or 7 years' worth of SDKs.
So I went and removed all NVIDIA software, including driver 516.04.
Then I installed the latest driver, 531.18,
and the latest CUDA SDK, 12.1.

Now the project build says:

The CUDA compiler identification is NVIDIA 12.1.66
Detecting CUDA compiler ABI info
Detecting CUDA compiler ABI info - done
Check for working CUDA compiler: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.1/bin/nvcc.exe - skipped
Detecting CUDA compile features
Detecting CUDA compile features - done

I just did a quick compile and build. Then I ran it for a while, and it did not hang.

This still does not explain why the version that uses bitonic sort instead of even/odd works, but I will not lose sleep over that.

So I would say that so far this is working.

I will have to resume working on this next weekend, but I think you can close this question.

Thanks for the advice.
Julio
