Hi, CUDA newbie here. I understand that registers are always 32-bit, and most of the Integer Intrinsics operate on 32-bit integers. I’m curious what happens when I perform integer operations on 16-bit integers? The particular operations I have in mind are left/right shifts, but I’d love to understand this in general as well.
(1) Are the outputs 32-bit integers or 16-bit integers?
(2) More importantly, what are the performance implications? (For example, are they first converted into 32-bit integers before performing the operation, which would be very slow?)
Simplifying slightly, the rules for expression evaluation in C++ require that integer data of a type narrower than int is widened to int. C++ allows compiler optimizations as long as generated code behaves as if it were following the abstract execution rules exactly.
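To make the promotion rule concrete, here is a minimal host-side C++ sketch. It assumes int is 32 bits wide, as it is on the platforms CUDA targets. The helper names shift_left_wide and shift_left_narrow are made up for illustration, and the helpers use uint16_t so that the conversion back to 16 bits is well defined in every C++ version.

```cpp
#include <cstdint>
#include <type_traits>

// Both 16-bit operand types are promoted to int before the shift,
// so the type of the shift expression is int (assuming int is wider than 16 bits).
static_assert(std::is_same<decltype(std::int16_t{1} << 1), int>::value,
              "int16_t operand is promoted to int");
static_assert(std::is_same<decltype(std::uint16_t{1} << 1), int>::value,
              "uint16_t operand is promoted to int");

// Hypothetical helper: returns the shift result at its natural (promoted) width.
inline int shift_left_wide(std::uint16_t x, int n) {
    return x << n;  // computed as int; bits above bit 15 are kept
}

// Hypothetical helper: delivers the result to a 16-bit destination.
inline std::uint16_t shift_left_narrow(std::uint16_t x, int n) {
    // Computed as int, then truncated to the low 16 bits by the conversion.
    return static_cast<std::uint16_t>(x << n);
}
```

With these definitions, shift_left_wide(0x4000, 2) yields 65536, while shift_left_narrow(0x4000, 2) yields 0 because the set bit is shifted out of the low 16 bits. So the answer to (1) is: the expression itself has type int, and it only becomes 16-bit again when it is stored or cast back.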
This means that an expression consisting entirely of operations on int16_t data, with the result being delivered to an int16_t destination, may be evaluated using only 16-bit integer operations if a processor provides them. However, by and large GPUs do not provide such operations. As far as shifters in the GPU hardware are concerned, as best I know they are all 64->32 bit funnel shifters (SHF instruction) these days, and have always been at minimum 32-bit barrel shifters.
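For readers unfamiliar with the term, a 64->32 bit funnel shift can be modeled in portable C++ roughly as follows. This is only a sketch of the arithmetic, not a description of how the hardware is built. The function name funnel_shift_right is made up here, and the behavior (shift amount taken modulo 32, low 32 bits of the result returned) is modeled on what the CUDA intrinsic __funnelshift_r is documented to do.

```cpp
#include <cstdint>

// Hypothetical model of a right funnel shift: concatenate hi:lo into a
// 64-bit value, shift right by (n mod 32), and return the low 32 bits.
inline std::uint32_t funnel_shift_right(std::uint32_t lo, std::uint32_t hi,
                                        std::uint32_t n) {
    const std::uint64_t wide = (static_cast<std::uint64_t>(hi) << 32) | lo;
    return static_cast<std::uint32_t>(wide >> (n & 31u));
}
```

An ordinary 32-bit logical right shift is the special case hi = 0, and a 32-bit rotate is the special case hi = lo. There is no narrower variant in this model, which is consistent with 16-bit shifts being carried out at 32-bit width.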
Use of integer types narrower than int may lead to additional conversion instructions being emitted in generated code. Whether this presents a performance issue depends on the specific context in which it occurs.
A useful rule of thumb that I once learned from an experienced software engineer, who had 25 years of experience at the time, and that I have found to hold true in the 25 years since: In C and C++, every integer wants to be int, unless there is a good reason for it to be some other type. For example, in contexts involving bit manipulation it is usually advantageous to use unsigned int instead, due to complications with shift operations on signed integer types.
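As an illustration of the bit-manipulation case, here is a small sketch that sticks to unsigned int throughout. The helper names are invented for this example, and it assumes a 32-bit unsigned int.

```cpp
// Hypothetical helper: logical right shift. For unsigned operands this is
// fully defined by the language: vacated high bits are filled with zeros.
// Assumes n < 32.
inline unsigned int logical_shift_right(unsigned int x, unsigned int n) {
    return x >> n;
}

// Hypothetical helper: extract `width` bits of x starting at bit `pos`.
// Assumes 0 < width < 32 and pos + width <= 32.
inline unsigned int extract_bits(unsigned int x, unsigned int pos,
                                 unsigned int width) {
    const unsigned int mask = (1u << width) - 1u;  // `width` low bits set
    return (x >> pos) & mask;
}
```

With signed int, by contrast, right-shifting a negative value was implementation-defined before C++20 (an arithmetic shift in practice), and left-shifting a negative value was undefined behavior before C++20. Using unsigned int sidesteps both issues.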
Narrow integer types may offer performance advantages due to compactness of storage, for example where block copies of any kind occur. Of course, block copies of data should generally be minimized, as data movement not involving data processing tends to be wasteful in terms of time and energy expenditure. When in doubt, using the profiling tools available for CUDA can settle the question of whether use of 16-bit integers actually improves performance.