Problem with nonreproducible results on EVGA GT 220

I used CUDA to implement backprojection and reprojection algorithms for tomography. Once I got my program working, I started testing it on different machines. For one card, an EVGA GT 220, the program failed to give the same result when run repeatedly. This happened first under 64-bit Vista; I moved the card to another machine and it had the same problem under 32-bit Vista and 64-bit Linux. I exchanged the card with Newegg, and was dismayed to find that the new GT 220 had the same problem, although not on every run.
By now my program has run properly on 12 other GPUs, so it seems unlikely that there is a problem with the program. But when I run various sample programs in the CUDA SDK repeatedly, they all give the same numerical output on each run (such as L-norm values), so I’m not sure what to think.

I’ve attached a test program (source and compilation on Redhat EL 5) that shows the problem. It makes up some data, computes a backprojection on the CPU, then computes it on the GPU with three different strategies.
Each time it measures and reports the maximum difference between GPU and CPU results. On a good GPU, it gives a result like:
Max difference 36.500
Max difference 33.125
Max difference 41.750
every time it is run. On the GT 220, it will give a big difference in one of the GPU computations most of the time.
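
For anyone who wants to run a similar check on their own kernels, here is a minimal host-side sketch of the two comparisons involved: the maximum difference between a CPU reference and a GPU result, and a bit-for-bit comparison between two runs of the same GPU computation. This is not the code from the attached test program; the names maxAbsDiff and bitIdentical are made up for illustration, and it assumes the GPU output has already been copied back into a host array.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstring>
#include <vector>

// Largest absolute element-wise difference between a reference (CPU)
// result and a test (GPU) result. Returns NaN if any difference is NaN,
// so a corrupted value cannot slip through unnoticed.
float maxAbsDiff(const std::vector<float>& ref, const std::vector<float>& test)
{
    assert(ref.size() == test.size());
    float maxDiff = 0.0f;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        float d = std::fabs(ref[i] - test[i]);
        if (std::isnan(d)) return d;
        if (d > maxDiff) maxDiff = d;
    }
    return maxDiff;
}

// True when two runs of the same computation produced bit-identical
// output. A deterministic kernel on healthy hardware should pass this
// on every repetition.
bool bitIdentical(const std::vector<float>& run1, const std::vector<float>& run2)
{
    if (run1.size() != run2.size()) return false;
    if (run1.empty()) return true;
    return std::memcmp(run1.data(), run2.data(),
                       run1.size() * sizeof(float)) == 0;
}
```

The first function gives the kind of "Max difference" number shown above; the second is the stricter test, since a card can give a plausible-looking difference from the CPU and still not repeat itself from one run to the next.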

I hope some others can run the program to see if it fails anywhere else, and I’m interested in knowing whether others have seen variable results with particular cards.
tomotest.zip (15.7 KB)

Max difference 278034.750
Max difference 278035.000
Max difference 2084778.500

On the Ocelot emulator.