It seems to me that I have found a serious hardware bug. See the sources attached.
I use the CUDA 2.3 compiler on 64-bit Windows XP (this should not matter, because the generated PTX code seems to be correct).
I run this program on a Tesla C1060.
The code contains many senseless assignments; however, without them the bug disappears.
As you can see:
All blocks perform the same calculations (almost all of them are done by thread #0).
The member rThresholdInhibition of the shared structure S.nhc is set to 1 by thread #0 at the beginning of the kernel.
From the code we can see that only two things can happen to it: it may be either increased by 0.0001 or decreased by 0.0001 (in fact it is only decreased, as can easily be seen). Nevertheless, when we check the value returned by block #0 via the pDnhc pointer, it appears to be almost equal to 0. Almost any modification of this code destroys this effect, and the program begins to work correctly.
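
For readers without the attachment, the pattern described above can be sketched roughly as follows. This is not the attached source, and since almost any change makes the effect vanish, it is not expected to reproduce the fault. It only shows the shape of the logic and the range check on the returned value. Only the names S, nhc, rThresholdInhibition and pDnhc come from the description; the struct layouts, the kernel name thresholdKernel, the helper checkThresholds, the launch sizes and the step count are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the attached structures; only the member
// names nhc and rThresholdInhibition come from the post.
struct Nhc    { float rThresholdInhibition; };
struct Shared { Nhc nhc; };

const int   kBlocks  = 4;        // assumed grid size
const int   kThreads = 64;       // assumed block size
const int   kSteps   = 100;      // assumed number of updates
const float kDelta   = 0.0001f;  // step size given in the post

__global__ void thresholdKernel(Nhc *pDnhc, int steps)
{
    __shared__ Shared S;

    // Thread #0 sets the member to 1 at the beginning of the kernel.
    if (threadIdx.x == 0) S.nhc.rThresholdInhibition = 1.0f;
    __syncthreads();

    for (int i = 0; i < steps; ++i) {
        // Only thread #0 touches the value; here it is always decreased.
        if (threadIdx.x == 0) S.nhc.rThresholdInhibition -= kDelta;
        __syncthreads();
    }

    // Each block returns its copy through pDnhc (assumed: one slot per block).
    if (threadIdx.x == 0) pDnhc[blockIdx.x] = S.nhc;
}

// Returns the number of blocks whose value lies outside
// [1 - kSteps*kDelta, 1 + kSteps*kDelta], or -1 on a CUDA error.
int checkThresholds()
{
    Nhc hNhc[kBlocks];
    Nhc *dNhc = NULL;
    if (cudaMalloc((void **)&dNhc, sizeof(hNhc)) != cudaSuccess) return -1;

    thresholdKernel<<<kBlocks, kThreads>>>(dNhc, kSteps);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(hNhc, dNhc, sizeof(hNhc), cudaMemcpyDeviceToHost);
    cudaFree(dNhc);
    if (err != cudaSuccess) return -1;

    const float tol = 1e-3f;  // slack for float rounding
    const float lo  = 1.0f - kSteps * kDelta - tol;
    const float hi  = 1.0f + kSteps * kDelta + tol;

    int bad = 0;
    for (int b = 0; b < kBlocks; ++b) {
        float v = hNhc[b].rThresholdInhibition;
        printf("block %d: rThresholdInhibition = %f\n", b, v);
        if (v < lo || v > hi) ++bad;
    }
    return bad;
}

int main()
{
    int bad = checkThresholds();
    printf("blocks out of range: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

The point of the range check is that after n updates of ±0.0001 the value can differ from 1 by at most n × 0.0001, so with a small number of updates a result near 0 cannot come from the arithmetic itself.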