Mystery BUG

I’m getting a very weird CUDA bug…

Here’s my code:
__device__ uint hashVal ( float4 pos )
{
    uint gz = (pos.z - simData.min.z) * simData.delta.z;
    uint gy = (pos.y - simData.min.y) * simData.delta.y;
    uint gx = (pos.x - simData.min.x) * simData.delta.x;
    return max(0, min(gz, (uint) simData.resmax));
}
__global__ void hashParticles ( char* bufPnts, uint* bufHash, int numPnt )
{
    uint ndx = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    if ( ndx >= 0 && ndx < numPnt ) {
        float4* pos = (float4*) (bufPnts + ndx * simData.stride);
        uint* clr = (uint*) ((char*) pos + 3*sizeof(float) );
        uchar r = ((30+pos->x)/60)*255.0f;
        uchar g = ((20+pos->y)/40)*255.0f;
        uchar b = 0;
        *clr = uint(r<<24) | uint(g<<16) | uint(b<<8) | 255;
        // if ( ndx <= 100000000 )   // error disable
        bufHash[ndx] = hashVal ( *pos );
    }
}

It takes a particle, modifies its color and outputs a hash value.

The bug shows up as bad values written to bufHash and also to *clr. It occurs with the code exactly as above, i.e. with the “error disable” line commented out. This is what the bad *clr values look like:
3064791295 255 2309816575 255 1437401343 255 3282895103 255 3182231807 255 2307916031 255 2846687487 255

When the comment on the “error disable” line is removed (i.e. the if is active), the values are all good:
3282895103 2964127999 2846687487 2410479871 3064791295 2192376063 1655505151 2527920383 2091712767 1219297535 3182231807 2963472639 1873608959 3383558399

I noticed the bad values (255) are interleaved with good ones, and they correspond to what would happen if pos->x and pos->y were 0.0, yet every pos is fine on kernel entry.

I think I’ve ruled out any obvious error. The only thing I can think of is some issue with doing two global memory writes in a row?
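If consecutive writes really were the problem, I’d expect even a stripped-down kernel that does nothing but two back-to-back global writes to misbehave. Something like this (twoWrites, bufA and bufB are made-up names, just to illustrate the pattern I mean):

__global__ void twoWrites ( uint* bufA, uint* bufB, int numPnt )
{
    uint ndx = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    if ( ndx < (uint) numPnt ) {
        bufA[ndx] = 255;        // first global write, like *clr
        bufB[ndx] = ndx;        // second global write, like bufHash[ndx]
    }
}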

Any help is appreciated! Otherwise I’ll have to just leave the “if (ndx<=10000000)” line in.

What are the dimensions of the grid and block? How large is the array bufHash? Is its size numPnt, and what is its value?

I’m suspicious about indexing out of bounds…
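For reference, the usual pattern is a guard that makes every thread past the end of the data exit before touching memory, since the launch is rounded up to whole blocks. A minimal sketch (guardedWrite and buf are illustrative names):

__global__ void guardedWrite ( uint* buf, int numPnt )
{
    uint ndx = blockIdx.x * blockDim.x + threadIdx.x;
    if ( ndx >= (uint) numPnt ) return;   // excess threads in the last block do nothing
    buf[ndx] = ndx;
}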

Additionally, which version of CUDA are you using? Which GPU(s)?

If you believe this to be a CUDA bug, then please attach a complete test app which reproduces the problem, along with build instructions.

Hi… Brief answers for now; I’ll try to isolate the bug as a stand-alone app later.

void computeNumBlocks (int numPnts, int minThreads, int &numBlocks, int &numThreads)
{
    numThreads = min( minThreads, numPnts );
    numBlocks = iDivUp ( numPnts, numThreads );
}
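(iDivUp here is the standard round-up integer division helper from the SDK samples:)

// Round-up integer division, e.g. iDivUp(10, 4) == 3.
int iDivUp ( int a, int b )
{
    return (a % b != 0) ? (a / b + 1) : (a / b);
}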
void SetupCUDA ( int num, int stride, float3 min, float3 max, float3 res, float3 size, int chk )
{
    fcuda.pnts = num;
    fcuda.cells = res.x * res.y * res.z;

    computeNumBlocks ( fcuda.pnts, 256, fcuda.numBlocks, fcuda.numThreads );
    computeNumBlocks ( fcuda.cells, 256, fcuda.gridBlocks, fcuda.gridThreads );
    fcuda.szPnts = (fcuda.numBlocks * fcuda.numThreads) * stride;
    fcuda.szHash = (fcuda.numBlocks * fcuda.numThreads) * sizeof(int);
    fcuda.szGrid = (fcuda.gridBlocks * fcuda.gridThreads) * sizeof(uint);
    fcuda.stride = stride;
    CUDA_SAFE_CALL ( cudaMalloc ( (void**) &bufPnts, fcuda.szPnts ) );
    CUDA_SAFE_CALL ( cudaMalloc ( (void**) &bufHash, fcuda.szHash ) );
    CUDA_SAFE_CALL ( cudaMalloc ( (void**) &bufGrid, fcuda.szGrid ) );
}

Size of bufHash is as above: basically numPnts (= fcuda.pnts = num), with some extra padding to round out the last block. This allows me to avoid the ndx < numPnts check in the kernel, although if I keep the check in, as I have in the example, I still see the bug. bufPnts is an array of 88-byte structs (stride = simData.stride = 88). The first 12 bytes are the pos floats, the next 4 are the color. I’ve observed the bug with numPnts as small as 10 and as large as 4000 (haven’t checked outside this range).
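To make the layout concrete, each particle record looks roughly like this (the field after clr is just a placeholder for the remaining 72 bytes, whose contents don’t matter here):

// Approximate particle layout, stride = 88 bytes.
struct Particle {
    float x, y, z;      // bytes  0..11: position (read through a float4* in the kernel)
    uint  clr;          // bytes 12..15: packed RGBA color
    char  rest[72];     // bytes 16..87: remaining per-particle data (placeholder)
};                      // sizeof(Particle) == 88 == simData.stride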

Hardware is a GeForce 8800 Ultra on an AMD Athlon 64 X2 (Brisbane) 4200+ with an Asus M2N32 board under Windows XP. CUDA version 2.0. (All the latest versions of the toolkit, CUDA, and SDK, downloaded last week.)

I will try and build an isolated test app later.

Ok… I’ve isolated the bug in a console app:

http://www.rchoetzlein.com/mystery.zip

I’ve simplified the code and made it deterministic, so I can compute what the output values should be. As it’s set up, the correct (expected) output is fifteen 17s:
17 17 17 17 17 17 17 17 17 17 17 17 17 17 17

Here is the output with the bug enabled:
17 10 4 3 17 10 4 3 17 10 4 3 17 10 4

Here is the output with the “hack line” (if ndx < 10000000) enabled:
17 17 17 17 17 17 17 17 17 17 17 17 17 17 17

To run, download the zip and build the VS2005 solution.