Hi all!
I’m suffering from a bug I can’t pin down.
I have been working on an algorithm that involves a very large complete binary tree (with 2^20 leaf nodes in some of my test cases). For every node in this hierarchy, a sort must be performed on a subset of the leaf nodes. This subset can range in size from 2 up to the full set of leaf nodes. I use the RadixSort class from the SDK to do this. The sort never crashes. However, after the sort, I must update other arrays based on its results: the nodes are sorted along one coordinate axis of their associated position, and since the sort does not touch the arrays for the other two axes, those must be reordered afterwards to match.
I use this kernel to do so:
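__global__ void _ReAlign(float* x, float* y, float* z, size_t* ref, float* pos, size_t count)
{
    // One thread per element; size_t matches the type of count in the comparison below.
    size_t i = (size_t)blockDim.x * blockIdx.x + threadIdx.x;
    if (i < count)
    {
        // ref[i] is the element's original index; pos stores x, y, z interleaved.
        size_t src = ref[i];
        x[i] = pos[3 * src];
        y[i] = pos[3 * src + 1];
        z[i] = pos[3 * src + 2];
    }
}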
void ReAlign(float* x, float* y, float* z, size_t* ref, float* pos, size_t count)
{
    int threadsPerBlock = 256;
    // Round up so that every element is covered by a thread.
    int blocksPerGrid = (int)((count + threadsPerBlock - 1) / threadsPerBlock);
    _ReAlign<<<blocksPerGrid, threadsPerBlock>>>(x, y, z, ref, pos, count);
    // Wait for the kernel to finish before its results are used.
    cudaThreadSynchronize();
}
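Since the failure only shows up as cudaErrorUnknown, it may help to check the error status right after the launch and again after the synchronize. The following is a minimal sketch under some assumptions: `CheckCuda` and `ReAlignChecked` are hypothetical names that are not in my code, and `cudaDeviceSynchronize` is used as the current name for `cudaThreadSynchronize`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the original code): prints a readable
// message for a failed CUDA call and reports whether the call succeeded.
static bool CheckCuda(cudaError_t err, const char* what)
{
    if (err != cudaSuccess)
    {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Same gather as the kernel above, repeated so the sketch is self-contained.
__global__ void _ReAlign(float* x, float* y, float* z, size_t* ref, float* pos, size_t count)
{
    size_t i = (size_t)blockDim.x * blockIdx.x + threadIdx.x;
    if (i < count)
    {
        size_t src = ref[i];
        x[i] = pos[3 * src];
        y[i] = pos[3 * src + 1];
        z[i] = pos[3 * src + 2];
    }
}

// Hypothetical checked variant of ReAlign: returns false on the first error.
bool ReAlignChecked(float* x, float* y, float* z, size_t* ref, float* pos, size_t count)
{
    if (count == 0)
        return true; // nothing to do; a launch with zero blocks would be invalid

    int threadsPerBlock = 256;
    int blocksPerGrid = (int)((count + threadsPerBlock - 1) / threadsPerBlock);
    _ReAlign<<<blocksPerGrid, threadsPerBlock>>>(x, y, z, ref, pos, count);

    // Launch problems (e.g. a bad configuration) are reported here.
    if (!CheckCuda(cudaGetLastError(), "_ReAlign launch"))
        return false;

    // Problems raised while the kernel runs are reported by the synchronize.
    return CheckCuda(cudaDeviceSynchronize(), "_ReAlign execution");
}
```

Checking both places separates a failed launch from a failure during execution, and it means the error is seen here rather than in the later cudaMemcpy.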
I then copy the sorted ref back to the host:
cudaMemcpy(_ref+start,d_ref+start,count * sizeof(size_t),cudaMemcpyDeviceToHost);
(All pointer parameters are offset to the same location in their respective arrays.)
ref stores the original index of an element.
RadixSort sorts pairs of a component (x, y or z) and an entry in ref. ref is then used to move the other components to their new positions. pos is an array of the original order of the positions.
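Because the kernel reads pos through ref, any ref entry outside the range of pos would make it read out of bounds on the device. A minimal host-side sketch for validating ref follows. `FirstBadRef` and `numPositions` are hypothetical names, and `numPositions` is assumed to be the number of (x, y, z) triples reachable from the pos pointer that is passed to the kernel.

```cpp
#include <cstddef>

// Hypothetical helper: returns the index of the first entry of ref that is
// not a valid index into a pos array holding numPositions (x, y, z) triples,
// or count if every entry is in range.
size_t FirstBadRef(const size_t* ref, size_t count, size_t numPositions)
{
    for (size_t i = 0; i < count; ++i)
    {
        if (ref[i] >= numPositions)
            return i;
    }
    return count;
}
```

One way to use it would be to copy the sorted ref range to the host before launching the kernel and to treat any return value other than count as an invalid index.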
For some reason, this kernel crashes after only a few hundred invocations. The number of invocations before the crash differs between runs of the program, although sometimes it is the same for several runs in a row. It crashes with 4 errors of the form “cudaError_enum at memory location …”. The first one is cudaErrorUnknown. The others are not readable, as they make the program quit before I can catch them (they occur in the Memcpy). Each node in the hierarchy is processed in the same order every time the program runs. I output the iterator locations and the value of count for every call to the kernel to check that unallocated memory is never accessed, so that is not the cause of the error. The RadixSort runs the same number of times as this simple code but doesn’t crash, so it is unlikely to be a hardware fault. Any ideas?
Many thanks in advance - this will probably make or break my project!