16-bit vs 32-bit Integer Arithmetic Performance

user112854 · April 7, 2024, 1:36pm

Hi, CUDA newbie here. I understand that registers are always 32-bit, and most of the Integer Intrinsics operate on 32-bit integers. I’m curious what happens when I perform integer operations on 16-bit integers? The particular operations I have in mind are left/right shifts, but I’d love to understand this in general as well.

(1) Are the outputs 32-bit integers or 16-bit integers?
(2) More importantly, what are the performance implications? (For example, are they first converted into 32-bit integers before performing the operation, which would be very slow?)

Thanks in advance for the help!

njuffa · April 7, 2024, 3:04pm

Simplifying slightly, the rules for expression evaluation in C++ require that integer data of a type narrower than int is widened to int. C++ allows compiler optimizations as long as generated code behaves as if it were following the abstract execution rules exactly.
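To make question (1) concrete, here is a minimal sketch of the promotion rule in plain host-side C++; CUDA device code should follow the same language rules. The helper name `shift_left_once` is made up for illustration, and the checks assume `int` is wider than 16 bits, as it is on the platforms CUDA targets.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical helper: shift a 16-bit value left by one bit.
// The operand is promoted to int before the shift, so the result is an int.
inline int shift_left_once(std::int16_t x) {
    return x << 1;
}

// Compile-time checks: operations on 16-bit operands yield int.
static_assert(std::is_same<decltype(std::int16_t{1} << 1), int>::value,
              "int16_t << int has type int");
static_assert(std::is_same<decltype(std::int16_t{1} + std::int16_t{1}), int>::value,
              "int16_t + int16_t has type int");
static_assert(std::is_same<decltype(std::uint16_t{1} >> 1), int>::value,
              "uint16_t >> int has type int (uint16_t promotes to signed int)");
```

Note that `shift_left_once(0x4000)` yields 32768, a value that does not fit in `int16_t`, because the shift itself is carried out in `int`.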

This means that an expression consisting entirely of operations on int16_t data, with the result being delivered to an int16_t destination, may be evaluated using only 16-bit integer operations, if the processor provides them. However, by and large GPUs do not provide such operations. As far as shifters in the GPU hardware are concerned, as best I know they are all 64->32 bit funnel shifters (SHF instruction) these days, and have always been at minimum 32-bit barrel shifters.
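A small sketch of why the as-if rule permits this, again in host-side C++ and with a made-up function name `add16`: when the result is narrowed back to 16 bits, only the low 16 bits of the wider computation survive, so a 16-bit adder and a 32-bit adder followed by truncation are indistinguishable. The unsigned type is used here because conversion to an unsigned type is defined as modular.

```cpp
#include <cstdint>

// Hypothetical example: both operands are promoted to int, the sum is
// computed in int (at most 131070, so no overflow), and the cast keeps
// only the low 16 bits.
inline std::uint16_t add16(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::uint16_t>(a + b);
}
```

For instance, `add16(0xFFFF, 1)` wraps around to 0, exactly as a native 16-bit addition would.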

Use of integer types narrower than int may lead to additional conversion instructions being emitted in generated code. Whether this presents a performance issue depends on the specific context in which it occurs.

A useful rule of thumb that I once learned from an experienced software engineer with 25 years of experience at the time and have found to hold true in the 25 years since: In C and C++, every integer wants to be int, unless there is a good reason for it to be some other type. For example, in contexts involving bit manipulation it is usually advantageous to use unsigned int instead, due to complications with shift operations on signed integer types.
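As an illustration of the bit-manipulation case, here is a minimal sketch using unsigned 32-bit operands; the helper names `logical_shr` and `extract_bits` are hypothetical. With unsigned types, right shifts always fill with zeros and left shifts are well defined for any shift count smaller than the width, whereas shifting negative signed values was implementation-defined or undefined before C++20.

```cpp
#include <cstdint>

// Hypothetical helper: logical right shift; zeros are shifted in from the left.
// Assumes n < 32.
inline std::uint32_t logical_shr(std::uint32_t v, unsigned n) {
    return v >> n;
}

// Hypothetical helper: extract `width` bits starting at bit `pos`.
// Assumes 0 < width < 32 and pos + width <= 32.
inline std::uint32_t extract_bits(std::uint32_t v, unsigned pos, unsigned width) {
    return (v >> pos) & ((1u << width) - 1u);
}
```

For example, `extract_bits(0xABCD1234u, 16, 8)` picks out the byte `0xCD`, with no sign bits leaking into the result.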

Narrow integer types may offer performance advantages due to compactness of storage, for example where block copies of any kind occur. Of course, block copies of data should generally be minimized, as data movement not involving data processing tends to be wasteful in terms of time and energy expenditure. When in doubt, using the profiling tools available for CUDA can settle the question whether use of 16-bit integers actually improves performance.
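To put a number on the storage argument, a trivial sketch (the helper name `bytes_for` is made up, and 8-bit bytes are assumed): the same element count needs half the bytes in 16-bit form, so any copy of that buffer moves half as much data.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of bytes occupied by n elements of type T.
template <typename T>
constexpr std::size_t bytes_for(std::size_t n) {
    return n * sizeof(T);
}

static_assert(bytes_for<std::int16_t>(1024) * 2 == bytes_for<std::int32_t>(1024),
              "16-bit storage is half the size of 32-bit storage");
```

Whether that saving outweighs any extra conversion instructions is exactly the kind of question a profiler run should answer.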

user112854 · April 7, 2024, 9:19pm

This is a very well-written explanation — thank you so much!

system · April 21, 2024, 9:19pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
