Problem with left shift

I’ve a problem with the “<<” operator (left shift).

Somewhere in my code there is the following define:

#define ROL64(a, offset) ((offset != 0) ? ((((UINT64)a) << offset) ^ (((UINT64)a) >> (64-offset))) : a)

When I try to compile my program I obtain a very strange error:

### Assertion failure at line 1923 of ../../be/cg/cgemit.cxx:

### Compiler Error in file /tmp/tmpxft_000022e5_00000000-7_kernel.cpp3.i during Assembly phase:

### incorrect register class for operand 1

nvopencc INTERNAL ERROR: /opt/cuda/open64/lib//be returned non-zero status 1

This error disappears if, instead of using the “<<”, I use the “>>” (right shift). I do not understand this behavior. By now I’m using the following “trick”

#define ROL64(a, offset) ((offset != 0) ? ((((UINT64)a) >> (0-offset)) ^ (((UINT64)a) >> (64-offset))) : a)

but I’m not sure this has the expected behavior.

Any Ideas?

I submitted a similar bug to NVIDIA a few weeks ago which is clearly a compiler bug. This is probably the same one. It had to do with a shifts of 64 bit long values when the shift had integer math in it.

It only failed for sm_20 and not sm_13.

As a workaround, just put the shift amount into a temporary variable.

// compile with nvcc -arch sm_20

__device__ unsigned long circularShift(unsigned long v, int shift)


    return (v<<(64-shift));


int main() {}

I recall a compiler bug being reported regarding 64-bit rotates recently. If memory serves, one workaround is to assign the left-shift and right-shift portions to temporary variables, then combine them (with | or + for example) in a separate step. Unfortunately that won’t work for a macro due to the need for temp variables. Could you use a forceinline function instead?

BTW, negative shift counts cause undefined behavior in C/C++, so assuming that offset is >= 0, ((a) >> (0-offset)) will not in general give the desired result.

Just as an aside, since shifts aren’t a common topic usually worth posting for: GPU shifts behave differently than CPU shifts when the shift amount exceeds the word size in bits.

This is undefined behavior in the C standard, and is one of the few cases where NVIDIA GPUs differ in their integer computation from Intel and AMD CPUs.

On the CPUs, a shift amount >= word size will end up using the low order bits of the shift, so a integer shift of 33 is the same as a shift of 1, for example.
On the GPU, the shift value is saturated, so a shift of 32 or more acts like 32.
This is true for right or left shifts.

Again this is not a bug, it’s documented as undefined behavior in C itself. It’s interesting there’s a difference though.

Thank you very much for the detailed informations :)

I modified my code in this way:

// NVCC Bug

//#define ROL64(a, offset) ((offset != 0) ? ((((UINT64)a) << offset) ^ (((UINT64)a) >> (64-offset))) : a)

__device__ inline UINT64 ROL64(UINT64 a, unsigned int offset)


        const int _offset = offset;

        return ((offset != 0) ? ((a << _offset) ^ (a >> (64-offset))) : a);


By now I hope it will be ok, however I have to check the range of the variable “offset” in order to avoid misalignment between the CPU and GPU code.