Strange crash on compute 1.0 hardware - works on 1.3: warp-wide shared memory read broadcast crashes

I discovered a rather nasty issue while writing a per-warp random number generator.

__shared__ unsigned int randStates[32];

w = __int_as_float((randStates[0]&0x007FFFFF)|0x3F800000)-1.0f;

^
hangs the kernel

if ((threadIdx.x&31)!=0)
w = __int_as_float((randStates[0]&0x007FFFFF)|0x3F800000)-1.0f;

^
works as long as at least one thread isn’t reading the address

Both cases work on my GTX295, but only the second one works on my 8800GTX. It appears that there’s some sort of problem with shared memory read broadcast, but only if all threads in the warp are reading the same address. If any of the threads is inactive, or reading a different address, things appear to work correctly.

Needless to say, this is a rather annoying bug, but at least it has a workaround.

w=randFloatWarp();
if ((threadIdx.x&31)!=0)
w = __int_as_float((randStates[0]&0x007FFFFF)|0x3F800000)-1.0f;

^
randFloatWarp returns __int_as_float((randStates[threadIdx.x&31]&0x007FFFFF)|0x3F800000)-1.0f;
so every thread first reads its own slot, which gives thread 0 of the warp the value from randStates[0]. The other 31 threads then overwrite w with a broadcast read of randStates[0], which works because thread 0 is not taking part in that read.
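
For anyone who wants to try this, here is a rough, self-contained sketch of the workaround as a one-warp test. The names bitsToUnitFloat, broadcastWorkaround and runBroadcastWorkaround, as well as the seeding line, are made up for illustration and are not from my generator. I have only described the behaviour on the two cards above, so treat this as a starting point rather than a verified repro.

```cuda
#include <cuda_runtime.h>

// Map the low 23 bits of a state word to a float in [0, 1):
// OR-ing in 0x3F800000 gives a float in [1, 2), then subtract 1.
__device__ float bitsToUnitFloat(unsigned int bits)
{
    return __int_as_float((bits & 0x007FFFFFu) | 0x3F800000u) - 1.0f;
}

// Hypothetical test kernel, meant to be launched with a single warp.
__global__ void broadcastWorkaround(float *out, unsigned int seed)
{
    __shared__ unsigned int randStates[32];
    const unsigned int lane = threadIdx.x & 31;

    // Placeholder seeding (an assumption, not the real generator state).
    randStates[lane] = seed * 1664525u + lane * 1013904223u;
    __syncthreads();

    // Step 1: every lane reads its own slot, so there is no broadcast
    // and lane 0 already holds the value derived from randStates[0].
    float w = bitsToUnitFloat(randStates[lane]);

    // Step 2: lanes 1..31 read slot 0. Lane 0 is inactive here, so this
    // is never an "all 32 threads read one address" access.
    if (lane != 0)
        w = bitsToUnitFloat(randStates[0]);

    out[threadIdx.x] = w;
}

// Host helper: returns true if all 32 lanes hold the same value in [0, 1).
bool runBroadcastWorkaround(unsigned int seed)
{
    float *dOut = 0;
    float hOut[32];

    if (cudaMalloc(&dOut, sizeof(hOut)) != cudaSuccess)
        return false;

    broadcastWorkaround<<<1, 32>>>(dOut, seed);

    // The blocking copy also surfaces any error from the kernel launch.
    cudaError_t err = cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess)
        return false;

    for (int i = 0; i < 32; ++i)
        if (hOut[i] != hOut[0] || hOut[i] < 0.0f || hOut[i] >= 1.0f)
            return false;
    return true;
}
```

Calling runBroadcastWorkaround(12345u) from main should return true on a card where the workaround holds. Removing the if (lane != 0) guard in step 2 turns it back into the all-threads broadcast read that hangs on my 8800GTX.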