Randomly occurring cudaError_enum cudaErrorUnknown After several iterations, a very simple kernel fa

Hi all!

I’m suffering from a bug I can’t pin down.

I have been working on an algorithm that involves a (very large) complete binary tree (with 2^20 leaf nodes in some of my test cases) . For every node in this hierarchy, a sort must be performed on a subset of the leaf nodes in the hierarchy. This subset can range in size from 2 to the size of all of the leaf nodes. I use the RadixSort class from the SDK to do this. The sort never crashes. However, after the sort, I must update arrays based on its results. (Effectively, the nodes are sorted by one of the coordinate axes of the position associated with them, and after sorting, the order of the other two axis arrays must be updated, since the sort does not affect them).

I use this kernel to do so:

__global__ void _ReAlign(float* x, float* y, float* z, size_t* ref,float* pos, size_t count)

{

	int i = blockDim.x * blockIdx.x + threadIdx.x;

	if(i<count)

	{

		x[i] = pos[3*ref[i]];

		y[i] = pos[3*ref[i] + 1];

		z[i] = pos[3*ref[i] + 2];

	}

}

void ReAlign(float* x, float* y, float* z, size_t* ref,float* pos, size_t count)

{

	int threadsPerBlock = 256;

	int blocksPerGrid = (count + threadsPerBlock - 1) / threadsPerBlock;

	_ReAlign<<<blocksPerGrid, threadsPerBlock>>>(x,y,z,ref,pos,count);

	cudaThreadSynchronize();

}

I then copy the sorted ref back to the host:

cudaMemcpy(_ref+start,d_ref+start,count * sizeof(size_t),cudaMemcpyDeviceToHost);

(all pointers in the parameters are iterators at the same location in each array).

ref stores the original index of an element.

RadixSort sorts pairs of a component (x, y or z) and an entry in ref. ref is then used to move the other components to their new positions. pos is an array of the original order of the positions.

For some reason, this kernel crashes after only processing a few hundred times. It crashes after a different number of executions each time the program executes, however sometimes the number taken to crash is the same for several attempts in a row. It crashes with 4 errors of the form “cudaError_enum at memory location …”. The first one is cudaErrorUnknown. The others are not readable, as they make the program quit before I can catch them (They occur in the Memcpy). Each node in the hierarchy is processed in the same order every time the program runs. I output the iterator locations and the size of count for every call to the kernel, to check that unallocated memory is never called - that is not the cause of the error. The RadixSort runs the same number of times as this simple code, but doesn’t crash, so it is unlikely to be a hardware fault. Any ideas? External Image

Many thanks in advance - this will probably make or break my project!

Maybe you have a bad value in [font=“Courier New”]ref[/font]? I’d try to protect against that:

#define THREADS_PER_BLOCK 256

__device__ size_t bad[THREADS_PER_BLOCK][2];

__global__ void _ReAlign(float* x, float* y, float* z, size_t* ref,float* pos, size_t count)

{

	size_t i = blockDim.x * blockIdx.x + threadIdx.x;

	if (i < count)

	{

		if (ref[i] < count)

		{

			x[i] = pos[3*ref[i]];

			y[i] = pos[3*ref[i] + 1];

			z[i] = pos[3*ref[i] + 2];

		}

		else

		{

			bad[threadIdx.x][0] = i;

			bad[threadIdx.x][1] = ref[i];

		}

	}

}

void ReAlign(float* x, float* y, float* z, size_t* ref,float* pos, size_t count)

{

	int blocksPerGrid = (count + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

	cudaMemset(bad, 0, sizeof(bad));

	_ReAlign<<<blocksPerGrid, THREADS_PER_BLOCK>>>(x,y,z,ref,pos,count);

	cudaThreadSynchronize();

	printf(_ReAlign returned %s\n", cudaGetErrorString(cudaGetLastError()));

	size_t bad_h[THREADS_PER_BLOCK][2];

	cudaMemcpy(bad_h, bad, sizeof(bad_h), cudaMemcpyDeviceToHost);

	for (int i=0; i < THREADS_PER_BLOCK; i++)

	{

		if (bad[i][0] > 0)

		{

			printf("bad reference ref[%lu] = %lu.\n",

				   (unsigned long) bad_h[i][0],

				   (unsigned long) bad_h[i][1]);

		}

	}

}

I thought this might be appropriate:

Using your code, no errors at all are returned for 1048576 (2^20) leaf nodes or above, apparently magically solving the issue without changing the actions performed. However, any fewer than that (524288 or fewer), and the code crashes the first time cudaMemset(bad, 0, sizeof(bad)) is called, reporting “invalid argument”, despite that line not being affected by the number of leaf nodes at all. So maybe I should just restrict my data to very large sets? :rolleyes:

If I’m not mistaken, that is freaky behaviour. (So it doesn’t seem like ref is the problem). Is this some sort of weird memory issue? How could my code affect that operation?

Yes, that seems freaky. Unless the problem is caused by something before the code we are looking at. Try inserting
[font=“Courier New”] cudaThreadSynchronize();
printf(“before memset: %s\n”, cudaGetErrorString(cudaGetLastError()));[/font]
before the cudaMemset to make sure this is not a delayed error from some earlier kernel invocation.

Unfortunately:

It was worth a try. :(

Edit: I think I will see if there are any driver updates I can get.

P.S. Windows 7 wouldn’t affect this, would it?

Should I stick with the developer drivers?

:">
Well, that’s embarassing. Although I’m not sure what the memset error is about, I know my original error was an access violation. It wasn’t an invalid ref, it was that I had added an offset to pos before giving it to the kernel, when in fact it should always be referred to from the start of the array, as ref stores absolute indexes. I can’t believe I didn’t spot it sooner. The only way I spotted it was when I rewrote the code to run on the CPU, which through up the access violation more clearly than CUDA does. Sorry for wasting your time.