Serious hardware bug?

It seems to me that I have found a serious hardware bug. See the sources attached.
I use the CUDA 2.3 compiler on 64-bit Windows XP (though this should not matter, because the PTX code seems to be correct).
I run this program on a Tesla C1060.

The code contains many senseless assignments; however, without them the bug disappears.

As you can see:

All blocks perform the same calculations (almost all of them by thread #0).
The member rThresholdInhibition of the shared structure S.nhc is set to 1 by thread #0 at the beginning of the kernel.
From the code we see that only two things can happen to it: it may be either increased by 0.0001 or decreased by 0.0001 (in fact it is decreased, as can easily be seen). Nevertheless, when we check its value returned by block #0 via the pDnhc pointer, it appears to be almost equal to 0. Almost any modification of this code destroys this effect, and the program begins to work correctly.

Would be glad to see any comments.

I haven’t looked at your code in detail, but does it run correctly in emulation mode? Are you sure your unexpected result isn’t a floating point rounding error?

BTW, a blocksize of 176 threads isn’t optimal (should be a multiple of 32).

It runs correctly in emulation mode. It runs correctly even if I delete one of the senseless assignments in this kernel.

It cannot be a rounding error. The only operation performed on this variable is the subtraction of 0.0001 from 1. The result cannot be negative, as it is in this program.
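To put a rough number on the rounding argument, here is a minimal host-side sketch in plain C++ (not the attached kernel). The helper names `subtract_repeatedly` and `rounding_error` are hypothetical, and the number of subtraction steps is an assumption, since the thread does not say how many times the kernel applies the step.

```cpp
#include <cmath>

// Hypothetical helper: subtract `step` from `start` `n` times in single
// precision, mimicking a float variable that is only ever decremented.
float subtract_repeatedly(float start, float step, int n) {
    float value = start;
    for (int i = 0; i < n; ++i) {
        value -= step;
    }
    return value;
}

// Absolute difference between the single-precision result and the same
// computation carried out in double precision (using the float-rounded step).
double rounding_error(float start, float step, int n) {
    const double exact = static_cast<double>(start) -
                         static_cast<double>(n) * static_cast<double>(step);
    const double actual = static_cast<double>(subtract_repeatedly(start, step, n));
    return std::fabs(actual - exact);
}
```

For a single step, or even a hundred steps, starting from 1.0f with a step of 0.0001f, the result stays close to 1 and the accumulated rounding error is many orders of magnitude too small to drive the value to 0 or below.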

176??? Block size in my case is 88+8=96 - a multiple of 32.
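The block-size question can be checked with simple arithmetic. The following is a small sketch in plain C++, assuming a warp size of 32 threads; `kWarpSize`, `is_warp_multiple` and `round_up_to_warp` are hypothetical names, not part of the CUDA API.

```cpp
// Assumption: 32 threads per warp.
constexpr int kWarpSize = 32;

// True when the block consists of whole warps only.
constexpr bool is_warp_multiple(int block_size) {
    return block_size % kWarpSize == 0;
}

// Smallest multiple of the warp size that is >= block_size.
constexpr int round_up_to_warp(int block_size) {
    return (block_size + kWarpSize - 1) / kWarpSize * kWarpSize;
}
```

With these helpers, 88 + 8 = 96 is three full warps, while 176 would leave a half-empty sixth warp and would round up to 192.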