Random Corruption?


I’m getting some random garbled bits in a reasonably simple 3-step computation using CUDA. Each run gives me different garbage. Sometimes things are 100% OK, but the errors seem to be CPU-load dependent.

First, I pass an array of structures in and calculate a temporary floating-point output array. There are some accuracy-related differences between the reference version and the new CUDA version, but everything is OK in general.
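To make "OK in general" concrete, a float comparison of this kind could be sketched as below. This is only an illustration: the helper name `count_float_mismatches` and both tolerance values are assumptions, not the actual check used for the numbers in this post.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: counts elements where the GPU result differs from the
// reference by more than abs_tol + rel_tol * max(|ref|, |gpu|).
// The default tolerances are illustrative assumptions.
std::size_t count_float_mismatches(const std::vector<float>& ref,
                                   const std::vector<float>& gpu,
                                   float rel_tol = 1e-5f,
                                   float abs_tol = 1e-6f) {
    if (ref.size() != gpu.size()) {
        // Treat a length mismatch as every element being wrong.
        return std::max(ref.size(), gpu.size());
    }
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        if (ref[i] == gpu[i]) {
            continue;  // exact match, including equal infinities
        }
        const float diff = std::fabs(ref[i] - gpu[i]);
        const float scale = std::max(std::fabs(ref[i]), std::fabs(gpu[i]));
        // The negated comparison also counts NaN as a mismatch.
        if (!(diff <= abs_tol + rel_tol * scale)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

A return value of 0 would mean the two arrays agree within the chosen tolerance; the integer stages later on need an exact comparison instead.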

So, step 1: ~100 ints to ~300 floats → OK.

Then, the float output is taken, more calculations are performed, and this is bit-perfect.

Step 2: ~300 floats to ~300 ints → Perfect.

Now comes the interesting part. The results from step 2 are used in a kind of hashing operation, so I use some basic 64-bit integer math.

Step 3: ~300 ints → ~30 ints → BAD.

This works perfectly in a C++ reference implementation, and it works 100% in emulation mode, but running it on the CUDA device gives me super weird output data. Sometimes the output is good compared to the reference data, sometimes it is slightly garbled, and sometimes it’s 100% bad. Sometimes, 100 runs in a row will give me perfect output, and other times, not even two runs in a row will be OK. Exactly the same input data, different output, but only in step 3. Also, the values from step 3 sometimes seem to be broken in the same way - the same bits seem to be flipped or garbled.

Does anyone have any idea what is going on?

I can’t see any race conditions.

The card is not overclocked.

I’m running 64 bit Ubuntu Linux, and I’ve tried on both CUDA 1.1 and CUDA 2.0, using the correct drivers for each, on two separate machines with two different cards.

The card is not running too hot - I’ve got the fan on 100%, and the chip is at less than 50 deg. C.

Thanks for your help!

Without any code to look at, it’s hard to even guess.

I’d vote for a race condition, just because that fits the symptom of unpredictable output, but really there’s no way to tell. Post the smallest simplified version of your program that still fails, and a lot more people will probably look at it as a puzzle.
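Until the code can be posted, a small host-side harness might help characterize the failure: run the failing step many times on the same input and record which output words, and which bits in them, ever differ from the reference. The sketch below is only that, a sketch. `RunReport`, `compare_runs`, and the `run_step3` callback are hypothetical names, and 32-bit unsigned output words are an assumption about the data.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Summary of how repeated runs differ from the reference output.
struct RunReport {
    std::size_t bad_runs = 0;                 // runs with at least one wrong word
    std::vector<std::uint32_t> flipped_bits;  // per word: OR of (output XOR reference) over all runs
};

// Hypothetical harness: `run_step3` stands in for whatever launches the kernel
// and copies the result back to the host. It is called num_runs times.
RunReport compare_runs(const std::vector<std::uint32_t>& reference,
                       const std::function<std::vector<std::uint32_t>()>& run_step3,
                       std::size_t num_runs) {
    RunReport report;
    report.flipped_bits.assign(reference.size(), 0u);
    for (std::size_t run = 0; run < num_runs; ++run) {
        const std::vector<std::uint32_t> out = run_step3();
        bool bad = out.size() != reference.size();  // wrong length counts as a bad run
        for (std::size_t i = 0; i < reference.size() && i < out.size(); ++i) {
            const std::uint32_t diff = out[i] ^ reference[i];  // set bits mark differences
            report.flipped_bits[i] |= diff;
            bad = bad || (diff != 0u);
        }
        if (bad) {
            ++report.bad_runs;
        }
    }
    return report;
}
```

The report shows how often the step fails and which words and bit positions are affected, which is a more precise thing to post than "sometimes garbled". It does not by itself say whether the cause is a race or the hardware.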

Are you running a GeForce or Tesla?

Very true - thanks for the advice. Thing is, I have to ask the guys who gave me the reference code for permission. I wanted to hear what the gentle folks here had to say first.

GeForce 8800 GTS. Is that better or worse?

It’s possible you have a bad card (Teslas are qualified to much higher standards than GeForces), but it’s more likely to be a race condition. If you want to send me the code, I’ll run it on my workstation and see what the results are.