The behavior of shifts with a shift count greater than or equal to the bit width of the integer type is undefined at the language level, in particular in C++, the language on which CUDA is based.
In decades of experience with all kinds of processors, what happens with oversized shift counts has always been well defined at the hardware level, i.e. by the relevant machine instructions. The two common models are to either wrap the shift count (modulo the bit width) or to saturate it to a limit. x86 chose the former, and NVIDIA GPUs generally chose the latter (although I seem to recall that on some NVIDIA GPUs wrap/clamp mode for shifts was selectable). In general, the machine instruction set of NVIDIA GPUs has changed substantially over time, and specifically there has never been a notion of binary compatibility at the machine code level. So there are no guarantees regarding future GPUs.
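A minimal sketch of the difference (the kernel name and setup are mine, purely for illustration): since an oversized shift is undefined behavior in C++, the compiler is free to do anything, but with current toolchains the printed results typically reflect the hardware behaviors described above.

```
#include <cstdio>

// Shifting a 32-bit value by 32 bits is undefined behavior in C++.
// The count is a runtime kernel parameter so the compiler cannot
// fold the shift away at compile time.
__global__ void shift_demo(unsigned int x, unsigned int n, unsigned int *out)
{
    *out = x << n;
}

int main()
{
    unsigned int *d_out, h_out;
    cudaMalloc(&d_out, sizeof(*d_out));
    shift_demo<<<1, 1>>>(1u, 32u, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("GPU: 1u << 32 = %u\n", h_out);    // typically 0: count saturates

    volatile unsigned int n = 32;             // volatile defeats constant folding
    printf("CPU: 1u << 32 = %u\n", 1u << n);  // on x86 typically 1: count mod 32

    cudaFree(d_out);
    return 0;
}
```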
For NVIDIA GPUs in particular, the hardware instructions are only superficially documented, and there is no supported way to program at the machine code level. However, you can rely on the behavior defined in the PTX specification when programming at that level; you cannot rely on anything the PTX specification does not guarantee. In HLL CUDA code, you would want to stick to the C++ specification (or the CUDA specification where it deviates from C++). What is __device__ code today may become __host__ __device__ code tomorrow.
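For illustration, here is a sketch of both approaches (the function names are mine). The first relies on the PTX ISA's documented clamping of shift counts via inline PTX, so it is device-only; the second stays within well-defined C++ by guarding the count explicitly, and therefore remains correct if it is ever promoted to __host__ __device__.

```
// Relies on the PTX specification: shl.b32 clamps shift amounts
// greater than the register width to 32, so an oversized count
// yields 0. Device-only, since it embeds PTX.
__device__ unsigned int shl_ptx(unsigned int a, unsigned int n)
{
    unsigned int r;
    asm("shl.b32 %0, %1, %2;" : "=r"(r) : "r"(a), "r"(n));
    return r;
}

// Stays within well-defined C++: guard the count before shifting.
// Behaves identically as host or device code.
__host__ __device__ unsigned int shl_safe(unsigned int a, unsigned int n)
{
    return (n < 32u) ? (a << n) : 0u;
}
```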