Functional Difference Emu vs. G80 (shift macro)

It took me forever to track down a bug in my code, and it seems to come down to these macros:

#define SHL(x, s) ((unsigned int) ((x) << (s) ))
#define SHR(x, s) ((unsigned int) (( (unsigned int)x) >> (32 - (s)) ))

This code yields different results on the emu and on the G80 hardware. Specifically, the results in the emu are as expected, while the results from the G80 do not match.

I changed the shift count in the macros from s to (s&31), and this fixed the problem.

I was wondering if there is a reason the emu and the hardware would behave differently in this case. Are the shifts treated differently on each? IIRC, the emu just runs CPU threads and not any PTX/NV-generated code. I am guessing that the handling of out-of-range shift values is somehow inconsistent.

Indeed, shifting more to the left or right than the width of the type can result in different behaviour on different architectures.

My point is more that they result in different behavior on the G80 and under emulation, making kernel results inconsistent. Is there a reason why it is not compiled to perform the same operation on the G80 as on the CPU (since we know very little of the actual G80 hw at that level)? If it's just an arbitrary decision, it seems the G80 code should be made to match the emulation results.

These macros are invoking undefined behavior under both the C and C++ standards (and thus also in CUDA) if the effective shift count is greater than or equal to 32. This is the case when s >= 32 for SHL or when s = 0 for SHR. The difference between x86 and G80 is that 32-bit x86 truncates the shift count to the lower 5 bits, whereas G80 does not.
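
To make the difference concrete, here is a minimal C sketch of two ways hardware might handle an out-of-range count. The helper names shl_masked, shl_saturating, and check_shift_models are made up for illustration. The saturating model is only an assumption about what non-truncating hardware could do; this thread does not say what the G80 actually does. Both helpers keep the count in range before shifting, so the C code itself stays well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Model A: the count is masked to its low 5 bits, as the thread says
   32-bit x86 does. A count of 32 behaves like 0, 33 like 1, and so on. */
static uint32_t shl_masked(uint32_t x, unsigned s)
{
    return x << (s & 31u);
}

/* Model B (an assumption, not confirmed in this thread): counts of 32 or
   more shift every bit out, so the result is 0. */
static uint32_t shl_saturating(uint32_t x, unsigned s)
{
    return (s >= 32u) ? 0u : (x << s);
}

/* The two models agree for in-range counts and diverge at 32 and above. */
static void check_shift_models(void)
{
    assert(shl_masked(1u, 4u) == shl_saturating(1u, 4u));
    assert(shl_masked(1u, 32u) == 1u);      /* 32 & 31 == 0, so no shift */
    assert(shl_saturating(1u, 32u) == 0u);  /* everything shifted out */
}
```

Calling check_shift_models() from main exercises both models. Code that never passes a count outside 0..31 cannot observe the difference.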

C standard (1999), section 6.5.7 “Bitwise shift operators”

[…]

The type of the result is that of the promoted left operand. If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.

If you are using these to construct a rotate from SHR and SHL, the following should work better within the restrictions of the standards:

#define ROTL(x,s) ((((unsigned int)(x))<<((s)&31)) | \
                   (((unsigned int)(x))>>((-(int)(s))&31)))
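
As a quick sanity check, the sketch below runs the macro on the edge cases that broke the original SHL/SHR pair (a count of 0 and a count of 32). The check_rotl helper is a name made up for this example, and the code assumes a 32-bit unsigned int and two's-complement integers.

```c
#include <assert.h>

/* The ROTL macro from the post above, unchanged apart from layout.
   Both shift counts are masked to 0..31, so neither shift is undefined. */
#define ROTL(x,s) ((((unsigned int)(x))<<((s)&31)) | \
                   (((unsigned int)(x))>>((-(int)(s))&31)))

/* Hypothetical self-check; assumes unsigned int is 32 bits wide. */
static void check_rotl(void)
{
    assert(ROTL(0x80000001u, 0)  == 0x80000001u); /* count 0: unchanged */
    assert(ROTL(0x80000001u, 1)  == 0x00000003u); /* top bit wraps to bit 0 */
    assert(ROTL(0x80000001u, 32) == 0x80000001u); /* count 32 acts like 0 */
    assert(ROTL(0x12345678u, 8)  == 0x34567812u); /* rotate by one byte */
}
```

Because both counts are masked, the left and right shifts always add up to a full rotation modulo 32, and a count of 0 simply ORs the value with itself.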

It’s not really an arbitrary decision; << is compiled to a shift left instruction, >> to a shift right instruction. No clamping or other preprocessing on the values is done (imagine how inefficient this would be), so what you get is the result on your architecture.

As you can see, the ‘emulator’ doesn’t emulate the G80 instruction set at all, just the grid/blocks/threads architecture.

Dan, you would have a beef if they called it a simulator, but it is not one. A simulator would be at least an order of magnitude slower and should produce bit-identical results to the hardware, including all floating point operations and also trap operations with indeterminate output (concurrent writes from different threads).
Cheers, Eric

Thanks for all the replies :)

I can see why I get different results: the G80 hw implements shifts differently than x86 does. I still have the same question, only now about the hardware: why do it differently on the G80? (I suspect the answer is unrelated to software, but maybe there is a gfx-specific reason to provide different behavior?)

I’m not aware of any hw manual, so what does the G80 do? (I am sure I could write code to find out, but I’ll be lazy and just ask…)

As wumpus says, clamping is inefficient, and it is what I am now doing in software (ok, well, masking really … which would be free in HW) because it is not done in hardware.

Although, I suppose this is really a side issue now and not CUDA-specific anymore…