I have an N x N register array (an array that resides entirely in GPU registers) of a 16-bit data type (e.g. unsigned short). How many GPU registers (which are 32-bit, I suppose) does it actually occupy? N * N / 2 or N * N?
The CUDA device compiler won’t automatically pack two 16-bit quantities into a 32-bit register.
You can discover register utilization for any particular code by using -Xptxas -v when compiling the code.
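As a rough illustration of the two points above, the sketch below keeps a small tile of unsigned short values in a local array with fully unrolled indexing, so that the compiler can place it in registers. The kernel name `tile_sum`, the constant `TILE_N`, and the file name `tile.cu` are made up for this example. The register count that ptxas reports will depend on the toolkit version and target architecture.

```cuda
// Hypothetical file tile.cu; build with:  nvcc -Xptxas -v -c tile.cu
// ptxas then prints a "Used <n> registers" line for each kernel.
#include <cuda_runtime.h>

#define TILE_N 4   // assumed tile size, for illustration only

// Each thread loads a TILE_N x TILE_N patch into a local array and sums it.
// The array can only stay in registers if every index is known at compile
// time, hence the fully unrolled loops.
__global__ void tile_sum(const unsigned short *in, unsigned int *out, int width)
{
    unsigned short tile[TILE_N][TILE_N];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int x0  = tid * TILE_N;   // first column of this thread's patch

    #pragma unroll
    for (int y = 0; y < TILE_N; ++y) {
        #pragma unroll
        for (int x = 0; x < TILE_N; ++x)
            tile[y][x] = in[y * width + x0 + x];
    }

    unsigned int sum = 0;
    #pragma unroll
    for (int y = 0; y < TILE_N; ++y) {
        #pragma unroll
        for (int x = 0; x < TILE_N; ++x)
            sum += tile[y][x];
    }
    out[tid] = sum;
}
```

Changing `TILE_N` and recompiling is a quick way to see how the reported register count scales with the tile size.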
Thanks for the info regarding the compiler behaviour and the register utilization!
In terms of register usage, it would have been better for my use case (2-D convolution on ‘16-bit float’ images) if the compiler packed the register arrays (and generated the necessary instructions to ‘unpack’ a 16-bit value from the upper or lower half of a 32-bit register). But there might be valid reasons for the compiler not to do it, e.g. register bank conflicts.
@HannesF99, another option is to explicitly use the vector data type ‘ushort2’, which will occupy one 32-bit register.
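A minimal sketch of this suggestion, assuming a 4 x 4 tile stored as 4 x 2 `ushort2` elements. The names `packed_tile_max`, `TILE_ROWS`, and `TILE_PAIRS` are hypothetical. Whether each `ushort2` really stays in a single 32-bit register is up to the compiler, so the count reported by -Xptxas -v is worth checking.

```cuda
#include <cuda_runtime.h>

#define TILE_ROWS  4   // assumed tile height
#define TILE_PAIRS 2   // assumed tile width in ushort2 units (4 values per row)

// Each thread holds a 4 x 4 tile of 16-bit values as 4 x 2 ushort2 elements
// and writes out the largest value in its tile.
__global__ void packed_tile_max(const ushort2 *in, unsigned short *out)
{
    ushort2 tile[TILE_ROWS][TILE_PAIRS];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const ushort2 *src = in + tid * TILE_ROWS * TILE_PAIRS;

    #pragma unroll
    for (int y = 0; y < TILE_ROWS; ++y) {
        #pragma unroll
        for (int p = 0; p < TILE_PAIRS; ++p)
            tile[y][p] = src[y * TILE_PAIRS + p];   // one 32-bit load per pair
    }

    unsigned short best = 0;
    #pragma unroll
    for (int y = 0; y < TILE_ROWS; ++y) {
        #pragma unroll
        for (int p = 0; p < TILE_PAIRS; ++p) {
            if (tile[y][p].x > best) best = tile[y][p].x;
            if (tile[y][p].y > best) best = tile[y][p].y;
        }
    }
    out[tid] = best;
}
```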
There are also 16-bit SIMD opcodes documented in the PTX ISA guide:
While the SIMD instructions are useful for integer operations (and thus exposed in CUDA as device functions / intrinsics), I don’t see how they would help with handling FP16 data in HannesF90’s use case.
Modern GPU architectures do not support addressing the high and low halves of a 32-bit register separately (something sm_1x GPUs supported, if I recall correctly), so there is no straightforward way to access 16-bit data in a 32-bit register. Using extra instructions to extract and insert data would be a lot of additional work, so I do not think this makes much sense from a performance perspective, and it would probably create a register allocation nightmare if implemented in a compiler.
You could give the packing / unpacking approach a try by doing this manually with the __byte_perm() intrinsic. Best I know, current GPUs do not have support for operations on packed FP16 data, so a lot of packing and unpacking would be necessary, as operands need to be unpacked before being operated on as single-precision data.
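The manual approach could look roughly like the sketch below. It assumes a toolkit that ships `cuda_fp16.h` with the `__ushort_as_half()` and `__half_as_ushort()` helpers; the function names `pack16`, `unpack_lo`, `unpack_hi`, and `scale_packed_fp16` are invented for this example. Two FP16 values are held in one 32-bit word, unpacked with `__byte_perm()`, processed in single precision, and packed again.

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>

// Combine the low 16 bits of lo and hi into one word: result = (hi << 16) | lo.
__device__ unsigned int pack16(unsigned int lo, unsigned int hi)
{
    return __byte_perm(lo, hi, 0x5410);
}

// Extract the lower 16 bits, zero-extended (selector 4 picks a zero byte).
__device__ unsigned int unpack_lo(unsigned int packed)
{
    return __byte_perm(packed, 0, 0x4410);
}

// Extract the upper 16 bits, zero-extended.
__device__ unsigned int unpack_hi(unsigned int packed)
{
    return __byte_perm(packed, 0, 0x4432);
}

// Multiply both FP16 values in each packed word by a single-precision factor.
__global__ void scale_packed_fp16(const unsigned int *in, unsigned int *out,
                                  float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned int p = in[i];
    // Unpack, then convert FP16 storage to float for the arithmetic.
    float a = __half2float(__ushort_as_half((unsigned short)unpack_lo(p)));
    float b = __half2float(__ushort_as_half((unsigned short)unpack_hi(p)));
    a *= factor;
    b *= factor;
    // Convert back to FP16 storage and repack.
    unsigned int lo = __half_as_ushort(__float2half_rn(a));
    unsigned int hi = __half_as_ushort(__float2half_rn(b));
    out[i] = pack16(lo, hi);
}
```

Every arithmetic step costs two unpack and one pack operation on top of the conversions, which is the overhead discussed above.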
How large is N in the use case? Modern GPUs (>= sm_35) provide quite a few registers so unless the value of N is quite large I do not see how packed storage would help a lot. Is register pressure really the most pressing performance limitation?
@HannesF99 didn’t ask about FP16 in the original question.
The question was a theoretical one, “How many registers?”, and not “Is this a good idea?”
The ushort2 would be aliased/unioned as a u32 for the SIMD PTX instruction. As you note, the non-SIMD ops would have to be split into 32-bit 16v2>32>[operations]>32>16v2 sequences.
Additionally, SWAR-style operations could be used on u32/u16v2 registers.
Of course it would be slow if the code wasn’t a heavy user of the native SIMD ops… but it might be compact.
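For readers unfamiliar with the term, SWAR means "SIMD within a register": ordinary 32-bit integer instructions are used so that the two 16-bit lanes do not interfere with each other. The sketch below shows one common way to add two packed pairs with per-lane wraparound; the names `swar_add16x2` and `swar_add_kernel` are hypothetical.

```cuda
#include <cuda_runtime.h>

// Add two pairs of packed 16-bit lanes with wraparound in each lane.
// The top bit of each lane is masked off before the add so that no carry
// can cross from the lower lane into the upper one; the top bits are then
// restored with an XOR.
__host__ __device__ inline unsigned int swar_add16x2(unsigned int a, unsigned int b)
{
    const unsigned int low15 = 0x7FFF7FFFu;   // lower 15 bits of each lane
    const unsigned int top1  = 0x80008000u;   // top bit of each lane
    unsigned int sum = (a & low15) + (b & low15);
    return sum ^ ((a ^ b) & top1);
}

// Apply the SWAR add element-wise to arrays of packed 16-bit pairs.
__global__ void swar_add_kernel(const unsigned int *a, const unsigned int *b,
                                unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = swar_add16x2(a[i], b[i]);
}
```

For example, `swar_add16x2(0x0001FFFF, 0x00010001)` yields `0x00020000`: the lower lane wraps from 0xFFFF to 0x0000 without disturbing the upper lane.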
Additionally, many non-CUDA mobile/integrated GPUs now have robust FP16 SIMD support.
I noticed that I mangled HannesF99’s screen name to HannesF90. Sorry about that.
For integer data that naturally exists in packed-byte or packed-short form, use of CUDA SIMD intrinsics is highly recommended on Kepler-based GPUs, and I would even recommend trying this approach on other GPUs where they are emulated (either partially or completely). FWIW, as far as I am aware the inefficiencies of the SIMD emulation on sm_50 have been addressed.
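A minimal sketch of the intrinsic route, assuming the data is already stored as two unsigned 16-bit values per 32-bit word. The kernel name `add_packed_u16` is made up; `__vadd2()` is the CUDA SIMD intrinsic for per-halfword addition with wraparound.

```cuda
#include <cuda_runtime.h>

// Adds two arrays of packed 16-bit pairs. Each unsigned int carries two
// unsigned short values, and __vadd2 adds the two halves independently
// (wraparound on overflow, no carry between the halves).
__global__ void add_packed_u16(const unsigned int *a, const unsigned int *b,
                               unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __vadd2(a[i], b[i]);
}
```

A buffer of `ushort2` elements can be passed to such a kernel by casting the device pointer to `unsigned int *`, since both types are 4 bytes wide with 4-byte alignment.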
I readily admit that I do not have any insights into non-NVIDIA GPUs, so please read “current GPUs” as “current NVIDIA GPUs”. The last time I looked at FP16 was on ARM and x86, and at the time I found only support for FP16 as a storage format, i.e. same as current NVIDIA GPUs. I would be interested to learn about use cases for packed FP16 and which platforms offer native support beyond load and store instructions.
The only use case for FP16s that I’ve personally dealt with is programmatically compositing pixels (vs. using fixed-function hardware). An FP16v4 is a really nice representation of a pixel and if you have FP16 FMAs then it’s performant too.
Semi-related: as mentioned in another thread, it looks like there is a new set of FP16 v2/v4 atomic operations available in OpenGL. I think it’s Maxwell only.
FP16 values are very useful in image processing tasks, as in my case for calculating a wavelet transform. As most basic image processing operations have very low arithmetic intensity, the extra operations for packing and unpacking FP16 values should not be a problem. I use the register arrays to prefetch a small image region into registers, as described in http://parlab.eecs.berkeley.edu/publication/899 . I suppose N might in practice be up to 7 or so.