It seems to me that I have found a serious hardware bug. See the sources attached.
I am using the CUDA 2.3 compiler on 64-bit Windows XP (this should not matter, because the generated PTX code appears to be correct).
I run the program on a Tesla C1060.
The code contains many pointless assignments; however, without them the bug disappears.
As you can see:
All blocks perform the same calculations (almost all of them in thread #0).
The member rThresholdInhibition of the shared structure S.nhc is set to 1 by thread #0 at the beginning of the kernel.
From the code we can see that only two things can happen to it: it can be either increased by 0.0001 or decreased by 0.0001 (in fact it is decreased, as can easily be seen). Nevertheless, when we check the value returned by block #0 via the pDnhc pointer, it appears to be almost equal to 0. Almost any modification of this code destroys this effect, and the program begins to work correctly.
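For readers without the attachments, the pattern described above can be sketched roughly as follows. This is not the attached code: the struct layouts, the kernel name sketchKernel, the helper withinBounds, the step count and the launch configuration are all hypothetical, and only the names rThresholdInhibition, S.nhc and pDnhc come from the description. It is a minimal sketch, assuming the value only ever moves in steps of 0.0001 from an initial 1.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the structures in the attached sources.
struct Nhc    { float rThresholdInhibition; };
struct Shared { Nhc nhc; };

// Host-side sanity check (hypothetical helper): after nSteps changes of
// +/-0.0001 starting from 1, the value must stay inside this interval.
// A small tolerance covers float rounding.
bool withinBounds(float value, int nSteps)
{
    const float span = nSteps * 0.0001f + 0.001f;
    return value >= 1.0f - span && value <= 1.0f + span;
}

__global__ void sketchKernel(Nhc *pDnhc, int nSteps)
{
    __shared__ Shared S;

    // Thread #0 initializes the shared member.
    if (threadIdx.x == 0)
        S.nhc.rThresholdInhibition = 1.0f;
    __syncthreads();

    // Thread #0 decreases the value by 0.0001 per step.
    if (threadIdx.x == 0)
        for (int i = 0; i < nSteps; ++i)
            S.nhc.rThresholdInhibition -= 0.0001f;
    __syncthreads();

    // Each block returns its copy through pDnhc.
    if (threadIdx.x == 0)
        pDnhc[blockIdx.x] = S.nhc;
}

int main()
{
    const int nBlocks = 4, nThreads = 64, nSteps = 100;  // assumed values
    Nhc hOut[nBlocks];
    Nhc *dOut = 0;

    if (cudaMalloc((void **)&dOut, sizeof(hOut)) != cudaSuccess)
        return 1;
    sketchKernel<<<nBlocks, nThreads>>>(dOut, nSteps);
    // cudaMemcpy blocks until the kernel has finished.
    cudaError_t err = cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Report each block's value and whether it lies in the expected interval.
    for (int b = 0; b < nBlocks; ++b)
        printf("block %d: %f (%s)\n", b, hOut[b].rThresholdInhibition,
               withinBounds(hOut[b].rThresholdInhibition, nSteps) ? "ok" : "out of range");
    return 0;
}
```

In a sketch like this every block should report a value inside the interval; the attached program is the one where block #0 returns a value near 0 instead.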
I would be glad to see any comments.
HORATIO.kernel.cu (1.42 KB)
HORATIO.cu (7.43 KB)