Incorrect x86 instruction emitted by nvcc in emulation mode: right shift is broken!

When presented with code such as:

int a = 128;
short i = 32;
int result = a >> i;

In emulation mode, nvcc assigns to ‘result’ the value 128 instead of 0. If i = 33, then result == 64 instead of 0. In short, only the lowest five bits of i are considered. This is because nvcc ends up using the x86 SAR instruction (shift-arithmetic-right) directly, which DOES NOT HAVE C-MANDATED BEHAVIOR: it masks the shift count to its low five bits. That’s all besides the fact that SAR can only take an 8-bit count operand, even though the variable in my code is clearly 16-bit.

For comparison, VC++ emits a call to _allshr.

CUDA hardware seems to work correctly.

Here’s a repro program. My system: CUDA 2.0b2, Vista x32, 8600GT.

Debug and Release display ‘0’ (correct result)
EmuDebug displays ‘128’ (incorrect)
Interestingly, EmuRelease displays ‘0’. I can’t breakpoint it to see which asm instructions are being used.
Simple_CUDA_app.rar (974 KB)

x86 reference explaining the behavior of SAR: …/nasmmanual.pdf

You’re not alone, I’ve faced the exact same issue :P Everything worked just fine in emulation mode, but not on the device :P Later, it turned out that I was shifting by the wrong variable, that just happened to work in emulation mode. Still, the emulator should be made consistent with the behavior on the device, I hope someone is working on that.

C99 standard, section 6.5.7 “Bitwise shift operators” says, among other things:

“If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.”

So for a shift count of 32 or 33 applied to a 32-bit integer, all bets are off.
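If portable behavior is wanted, the count has to be range-checked before the shift operator is applied. A hypothetical helper along these lines (the name `safe_sar` and the choice to saturate are my own, not anything from CUDA or C99):

```c
#include <limits.h>

/* Portable arithmetic right shift: instead of invoking undefined
   behavior once the count reaches the operand width, saturate to the
   sign: 0 for non-negative values, -1 for negative ones (assuming the
   usual two's-complement, sign-extending >> on negative ints). */
int safe_sar(int value, unsigned count)
{
    if (count >= CHAR_BIT * sizeof value)   /* width is 32 on common platforms */
        return value < 0 ? -1 : 0;
    return value >> count;
}
```

With this, safe_sar(128, 32) and safe_sar(128, 33) both yield 0, which is what the original poster expected from the device.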

Just for the record, gcc produces the same results you are claiming incorrect:

#include <stdio.h>

int main()
{
    int a = 128;
    short i = 32;
    int result = a >> i;

    i = 33;
    int result2 = a >> i;

    printf("%d %d\n", result, result2);
    return 0;
}

$ a.out
128 64

Ok mfatica, the customer is always wrong, it’s rational that 128>>33 == 64, it’ll take a legion of programmers to change one instruction, etc., etc.

Tell me, why does the device give one result and the emulator another? Just to keep us on our toes?

We gave you an explanation: you are doing an operation that has undefined behavior.
We are following the C99 standard; should we follow Dubinsky’s standard instead?

gcc is giving the same result (128 >> 33 == 64). Are they wrong too?

Regarding the difference between the device and emulation, if you add a -v flag, you will see that the emulation code is actually compiled by the host compiler.

nvcc -v -deviceemu bug.c
# _SPACE_=
# _MODE_=EMULATE
# _HERE_=/usr/local/cuda/bin
# _THERE_=/usr/local/cuda/bin
# _TOP_=/usr/local/cuda/bin/…
# PATH=/usr/local/cuda/bin/../open64/bin:/usr/local/cuda/bin/../bin:/usr/local/cuda/bin:/usr/local/visit/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/texbin:/usr/X11/bin
# INCLUDES="-I/usr/local/cuda/bin/…/include"
# LIBRARIES= "-L/usr/local/cuda/bin/../lib" -lcudart
# CUDAFE_FLAGS=
# gcc -D__CUDA_ARCH__=100 -c -x c -DCUDA_FLOAT_MATH_FUNCTIONS -DCUDA_NO_SM_12_ATOMIC_INTRINSICS -DCUDA_NO_SM_11_ATOMIC_INTRINSICS -DCUDA_NO_SM_13_DOUBLE_INTRINSICS "-I/usr/local/cuda/bin/../include" -I. -m32 -malign-double -o "/tmp/tmpxft_00003ded_00000000-1_bug.o" "bug.c"
# g++ -D__CUDA_ARCH__=100 -m32 -malign-double -o "a.out" "/tmp/tmpxft_00003ded_00000000-1_bug.o" "-L/usr/local/cuda/bin/…/lib" -lcudart

lol mfatica, I know “why” it’s giving a different result in hardware than in emulation. You don’t see anything wrong with that?

P.S. Why are you using gcc in the backend if you’re asking for the VC++ binary in the call to nvcc? Wouldn’t it be trivial to change this and give Windows users the compiler behavior they expect? In any case, chapter-and-versing the Standard (I know I brought it up, sorry) is a horrible way to respond to customers. It’s such ivory-tower BS that it boils my blood. There are things besides “standards”, like reasonable behavior that won’t create frustrating bugs in your customers’ code. If you have to say something, say “it’s impossible for us to change the compiler, we’re really sorry.”

Though I agree with mfatica that undefined is undefined, this is a good point IMO.

It would be very useful to have information on how hardware code differs from standard x86 CPU compiler code, or a small toolkit to help configure the GCC/Microsoft compilers to produce emulation binaries that are as similar as possible to the CUDA binary code.

CUDA users have discovered this bit by bit, and some of it is specified implicitly in the manual, but NVIDIA is in the best position to know exactly where the differences are, and it would be excellent to have a compendium. Hopefully it would be worth the CUDA developers’ time to create one…

If you are doing an undefined operation (which means your code is unsafe and will probably give you wrong results if you change compilers or platforms), how can you expect to get the same result from different compilers?

I don’t use Windows (so I cannot replicate your VC++ claim), but these are the results of your code on 4 different compilers/platforms:

  1. gcc 4 on MacOS X: 128 and 64
  2. gcc 3.4 on Linux: 128 and 64
  3. Intel C (icc) on Linux: 128 and 64
  4. Intel C++ (icpc) on Linux: 128 and 64

According to you, all these compilers are wrong.
Nvcc could certainly be improved and we are grateful when users report bugs, I just don’t agree with you that we should change its behavior in this case.

Having different behavior between the device and emulation in undefined cases is, IMHO, a good thing if it helps detect flawed code.

What would you use this for? The only thing I can think of would be trying to track down compiler bugs, but it seems to me that any bug you’d find would be far more likely to be from the CPU-side toolkit than the compiler itself.

Simply to be more sure that differences between device and emulation code are my own fault - errors in CUDA code - and not toolchain-induced :)

I can understand that you may see it differently (and I understand your point that it makes debugging even more complex), but IMHO in this case I would consider it “your fault”: if you use such shifts in your current CUDA code, I think you risk your code breaking with any new GPU generation.

I’m not the original poster, and I’m not using shifts like these. I’m speaking in general terms: the NVIDIA devices behave a bit differently than standard x86 CPUs. Some compiler flags may generate x86 code that behaves more like CUDA; it’d be nice, though not essential, to have a collection of these flags in one place.