It took me forever to track down a bug in my code. It seems to come down to these two macros:
#define SHL(x, s) ((unsigned int) ((x) << (s) ))
#define SHR(x, s) ((unsigned int) (( (unsigned int)(x)) >> (32 - (s)) ))
This code yields different results in the emu and on the G80 hardware. Specifically, the results in the emu are as expected, while the results from the G80 do not match.
Changing the shift count in the macros from s to (s&31) fixed the problem.
I was wondering whether there is a reason the emu and the hardware would behave differently in this case. Are shifts treated differently on each? IIRC, the emu just runs CPU threads and not any PTX/NV-generated code. My guess is that out-of-range shift values are handled inconsistently between the two.
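For reference, here is a minimal sketch of the masked version. It assumes the two macros are meant to be combined into a 32-bit rotate, which the code above does not actually show; the helper name rotl32 is made up for illustration. In C, shifting a 32-bit value by 32 or more is undefined behavior, so the CPU and the GPU are both free to produce whatever they like for out-of-range counts, and masking the count keeps every shift inside 0..31. Note that this sketch masks the whole (32 - s) expression in SHR, which goes slightly beyond the plain s to (s&31) change described above, because 32 - (s&31) is still 32 when s is 0.

```c
#include <stdint.h>

/* Masked variants: the shift count is always reduced to the range 0..31. */
#define SHL(x, s) ((uint32_t)(x) << ((s) & 31))
/* Masking the whole (32 - s) maps s == 0 to a shift of 0 instead of 32.
   This is an assumption of the sketch, not the exact change from the post. */
#define SHR(x, s) ((uint32_t)(x) >> ((32 - (s)) & 31))

/* rotl32 is a hypothetical helper: rotate x left by s bits (s taken mod 32). */
static inline uint32_t rotl32(uint32_t x, unsigned int s)
{
    return SHL(x, s) | SHR(x, s);
}
```

With these definitions, rotl32(0x80000001u, 1) evaluates to 0x00000003, and a count of 0 or 32 returns the input unchanged, with no shift ever reaching 32 bits.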