N x N register array of a 16-bit data type: how many registers does it actually occupy?

HannesF99 · September 19, 2014, 12:26pm

I have an N x N register array (an array that resides entirely in GPU registers) of a 16-bit data type (e.g. unsigned short). How many GPU registers (which are 32-bit, I suppose) does it actually occupy: N * N / 2 or N * N?

Robert_Crovella · September 22, 2014, 1:54pm

The CUDA device compiler won’t automatically pack two 16-bit quantities into a 32-bit register.

You can discover register utilization for any particular code by using -Xptxas -v when compiling the code.
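The two candidate counts from the question can be written down as a small host-side sketch. The function names are hypothetical, and the numbers cover only the array storage itself; the figure reported by -Xptxas -v also includes every other live value in the kernel.

```cpp
#include <cstddef>

// Hypothetical helper: one 32-bit register per 16-bit element,
// which is what happens when the compiler does not pack.
constexpr std::size_t regs_unpacked(std::size_t n) { return n * n; }

// Hypothetical helper: two 16-bit elements per 32-bit register,
// rounded up so an odd element count still gets a register for the last one.
constexpr std::size_t regs_packed(std::size_t n) { return (n * n + 1) / 2; }

// Illustrative N = 7: 49 registers unpacked versus 25 if packed by hand.
static_assert(regs_unpacked(7) == 49, "unpacked count");
static_assert(regs_packed(7) == 25, "packed count");
```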

HannesF99 · September 23, 2014, 9:35am

Thanks for the info on the compiler behaviour and on checking register utilization!

In terms of register usage, it would have been better for my use case (2-D convolution on ‘16-bit float’ images) if the compiler packed the register arrays (and generated the necessary instructions to ‘unpack’ a 16-bit value from the upper or lower half of a 32-bit register). But there might be valid reasons for the compiler not to do this, e.g. register bank conflicts.

allanmac · September 23, 2014, 3:26pm

@HannesF99, another option is to explicitly use the vector data type ‘ushort2’, which occupies a single 32-bit register.

There are also 16-bit SIMD opcodes documented in the PTX ISA guide:

vadd2
vsub2
vavrg2
vabsdiff2
vmin2
vmax2
vset2
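As a rough host-side model of the ushort2 suggestion above, two 16-bit values can share one 32-bit word as sketched below. The helper names are hypothetical, and the layout (first component in the low half, second in the high half) is an assumption based on a little-endian two-member struct; this is plain C++, not device code.

```cpp
#include <cstdint>

// Hypothetical helper: place x in bits 0..15 and y in bits 16..31.
constexpr std::uint32_t pack16x2(std::uint16_t x, std::uint16_t y) {
    return static_cast<std::uint32_t>(x) | (static_cast<std::uint32_t>(y) << 16);
}

// Hypothetical helper: recover the low 16-bit lane.
constexpr std::uint16_t unpack_lo(std::uint32_t v) {
    return static_cast<std::uint16_t>(v & 0xFFFFu);
}

// Hypothetical helper: recover the high 16-bit lane.
constexpr std::uint16_t unpack_hi(std::uint32_t v) {
    return static_cast<std::uint16_t>(v >> 16);
}
```

Every scalar operation on a lane costs at least one such unpack and one repack, which is the overhead discussed in the following posts.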

njuffa · September 23, 2014, 5:02pm

While the SIMD instructions are useful for integer operations (and thus exposed in CUDA as device functions / intrinsics), I don’t see how they would help with handling FP16 data in HannesF90’s use case.

Modern GPU architectures do not support addressing the high and low halves of a 32-bit register separately (something sm_1x GPUs supported, if I recall correctly), so there is no straightforward way to access 16-bit data in a 32-bit register. Using extra instructions to extract and insert data would be a lot of additional work, so I do not think this makes much sense from a performance perspective, and it would probably create a register allocation nightmare if implemented in a compiler.

You could give the packing / unpacking approach a try by doing this manually with the __byte_perm() intrinsic. As far as I know, current GPUs do not support operations on packed FP16 data, so a lot of packing and unpacking would be necessary, as operands need to be unpacked before being operated on as single-precision data.
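To make the selector values concrete, here is a host-side model of how such a byte permute could extract and insert 16-bit halves. It assumes the semantics I understand __byte_perm(x, y, s) to have: the eight bytes of x (indices 0 to 3) and y (indices 4 to 7) are selectable, and each of the four low nibbles of s picks one output byte using its low three bits. byte_perm_model and the three wrappers are hypothetical names, so check the selectors against the CUDA math API documentation before using them on the device.

```cpp
#include <cstdint>

// Byte idx (0..7) of the 8-byte pool: x supplies bytes 0..3, y supplies bytes 4..7.
constexpr std::uint32_t select_byte(std::uint32_t x, std::uint32_t y, std::uint32_t idx) {
    return ((idx & 4u) ? (y >> (8u * (idx & 3u))) : (x >> (8u * (idx & 3u)))) & 0xFFu;
}

// Assumed model of the permute: nibble k of s selects output byte k.
constexpr std::uint32_t byte_perm_model(std::uint32_t x, std::uint32_t y, std::uint32_t s) {
    return select_byte(x, y, s & 7u)
         | (select_byte(x, y, (s >> 4) & 7u) << 8)
         | (select_byte(x, y, (s >> 8) & 7u) << 16)
         | (select_byte(x, y, (s >> 12) & 7u) << 24);
}

// Zero-extended low half: bytes 0 and 1 of v, upper bytes taken from y = 0.
constexpr std::uint32_t extract_lo16(std::uint32_t v) { return byte_perm_model(v, 0u, 0x4410u); }

// Zero-extended high half: bytes 2 and 3 of v, upper bytes taken from y = 0.
constexpr std::uint32_t extract_hi16(std::uint32_t v) { return byte_perm_model(v, 0u, 0x4432u); }

// Repack: low half of lo becomes bits 0..15, low half of hi becomes bits 16..31.
constexpr std::uint32_t insert16x2(std::uint32_t lo, std::uint32_t hi) {
    return byte_perm_model(lo, hi, 0x5410u);
}
```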

How large is N in the use case? Modern GPUs (>= sm_35) provide quite a few registers so unless the value of N is quite large I do not see how packed storage would help a lot. Is register pressure really the most pressing performance limitation?

allanmac · September 23, 2014, 5:12pm

@HannesF99 didn’t ask about FP16 in the original question.

The question was a theoretical one: “How many registers?”, not “Is this a good idea?”

The ushort2 would be aliased/unioned as a u32 for the SIMD PTX instruction. As you note, the non-SIMD ops would have to be split into 32-bit 16v2>32>[operations]>32>16v2 sequences.

Additionally, SWAR-style operations could be used on u32/u16v2 registers.
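One common SWAR pattern, sketched here on the host as an illustration, is a lane-wise wrapping add of two packed 16-bit values using only ordinary 32-bit operations. The function name is hypothetical; the trick is to add the low 15 bits of each lane so that no carry can cross the lane boundary, then restore each lane's top bit with an XOR.

```cpp
#include <cstdint>

// Hypothetical helper: adds the two 16-bit lanes of a and b independently,
// wrapping modulo 2^16 within each lane.
constexpr std::uint32_t swar_add16x2(std::uint32_t a, std::uint32_t b) {
    // Masking off bit 15 of each lane keeps the low-lane carry out of the high lane;
    // the XOR term then adds the two top bits of each lane without generating a carry.
    return ((a & 0x7FFF7FFFu) + (b & 0x7FFF7FFFu)) ^ ((a ^ b) & 0x80008000u);
}

// Low lane 0xFFFF + 0x0001 wraps to 0x0000 and does not disturb the high lane (1 + 1 = 2).
static_assert(swar_add16x2(0x0001FFFFu, 0x00010001u) == 0x00020000u, "lane-wise wrap");
```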

Of course it would be slow if the code wasn’t a heavy user of the native SIMD ops… but it might be compact.

Additionally, many non-CUDA mobile/integrated GPUs now have robust FP16 SIMD support.

njuffa · September 23, 2014, 5:46pm

I noticed that I mangled HannesF99’s screen name to HannesF90. Sorry about that.

For integer data that naturally exists in packed-byte or packed-short form, use of CUDA SIMD intrinsics is highly recommended on Kepler-based GPUs, and I would even recommend trying this approach on other GPUs where they are emulated (either partially or completely). FWIW, as far as I am aware the inefficiencies of the SIMD emulation on sm_50 have been addressed.

I readily admit that I do not have any insights into non-NVIDIA GPUs, so please read “current GPUs” as “current NVIDIA GPUs”. The last time I looked at FP16 was on ARM and x86, and at the time I found only support for FP16 as a storage format, i.e. same as current NVIDIA GPUs. I would be interested to learn about use cases for packed FP16 and which platforms offer native support beyond load and store instructions.

allanmac · September 23, 2014, 5:58pm

The only use case for FP16s that I’ve personally dealt with is programmatically compositing pixels (vs. using fixed-function hardware). An FP16v4 is a really nice representation of a pixel and if you have FP16 FMAs then it’s performant too.

Semi-related: as mentioned in another thread, it looks like there is a new set of FP16 v2/v4 atomic operations available in OpenGL. I think it’s Maxwell only.

HannesF99 · September 23, 2014, 9:24pm

FP16 values are very useful in image processing tasks, such as computing a wavelet transform in my case. As most basic image processing operations have very low arithmetic intensity, the extra operations for packing and unpacking an FP16 value should not be a problem. I use the register arrays to prefetch a small image region into registers, as described in http://parlab.eecs.berkeley.edu/publication/899 . In practice, I suppose N might be up to 7 or so.
