>=32bit shift on 32bit integer

Hi,

Shifting 32bit integers using a number > 31 can lead to undefined behaviour in CPUs. I read in old forum posts that this does not seem to be the case in GPUs. The current PTX documentation seems to indicate:

"Signed shifts fill with the sign bit, unsigned and untyped shifts fill with 0. "

and

“Shift amounts greater than the register width N are clamped to N .”

Is it safe to assume that the results of left/right shifting a 32bit unsigned integer by more than >=32bits (when using nvcc) is going to return 0 as a result in current and previous nVidia GPUs?

Is there any expectation of keeping this behaviour in future architectures?

Many thanks for the help.

The behavior of shifts with shift count greater than or equal to the bit-width of the integer data type is undefined at the language level, in particular in C++, which is the language family to which CUDA belongs.

In decades of experience with all kind of processors, what happens with large shift counts is well-defined at the hardware level, i.e. the relevant machine instructions. The two common models are to either wrap the shift count (modulo the bit-width) or saturate it to a limit. x86 chose the former and NVIDIA GPUs generally chose the latter (although I seem to recall that for some NVIDIA GPUs wrap/clamp mode on shifts was selectable). In general, the machine instruction set of NVIDIA GPUs has seen large changes historically, and specifically there has been no notion of binary compatibility at the machine code level. So there are no guarantees regarding future GPUs.

For NVIDIA GPUs in particular, hardware instructions are publicly documented in a superficial manner only, and there is not supported way to program at the machine code level. However, you can rely on the behavior defined in the PTX specification when programming at that level. You cannot rely on anything that the PTX specification does not guarantee. In HLL CUDA code, you would want to stick to C++ specifications (or CUDA specifications where deviating from C++). What is __device__ code today may become __host__ __device__ code tomorrow.

1 Like

Hi njuffa,

I was hoping for a confirmation that I may be able to use shifts larger than 32bits, but perhaps it would be better to find alternative approaches.

Many thanks for your answer and taking the time to write it.

Best,

Depending on your exact usage requirement, you could possibly achieve the desired result through use of the “shf” instruction. Using 0x00000000 as the second operand appropriately for shf.l or shf.r would guarantee a zero result:

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#logic-and-shift-instructions-shf

Recent, (limited), exposure to Cuda 8.6 SASS makes me wonder how much longer the “shl” and shr" instructions will be around in future hardware, as all shift operations were being done using only the shf instruction.

1 Like

oh, many thanks. That is a good solution.

Thanks for taking the time to post it.

Best,