Shorts in registers: any memory/performance benefit?

Shorts obviously take up less memory than ints in device memory and shared memory. My question is what happens with shorts in registers.

If I understand correctly, section 7.6.1 (page 36) of the PTX documentation:
http://www.nvidia.com/object/io_1195170102263.html
seems to say that shorts are promoted to ints, and therefore 1 short takes up just as much register space as 1 int, and so there is no memory benefit from using shorts instead of ints for register variables.

If shorts did save register space, there would be a performance benefit whenever the device code has too many register variables and some of them have to spill over into local memory, which resides in device memory.

Also, the compiler should translate multiplication of two shorts into use of the faster __mul24 operation, so using shorts would give cleaner code when going for max performance.

Can anyone confirm or contradict these comments?

Presumably the same comments apply to chars?

What about a short2? Is it stored in a single register, or in 2 registers?

If squeezed for register space, I know we can always manually code things to use bit shifting to concatenate 2 shorts into 1 int to go into 1 register. It’s a bit tedious though, and would involve a performance hit because of the cost of the bit shifting. But if the alternative is spillover into local memory then it might be worthwhile in some circumstances.
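For reference, a minimal sketch of the manual packing described above. The function names (pack_shorts, unpack_lo, unpack_hi) and the HD macro are made up for illustration and are not part of CUDA. The code is plain C++ that should also compile under nvcc, and it assumes two's-complement 16-bit integers.

```cpp
#include <cassert>
#include <cstdint>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Pack two 16-bit values into one 32-bit word:
// hi goes into bits 31..16, lo into bits 15..0.
HD inline uint32_t pack_shorts(int16_t lo, int16_t hi) {
    return (static_cast<uint32_t>(static_cast<uint16_t>(hi)) << 16) |
           static_cast<uint32_t>(static_cast<uint16_t>(lo));
}

// Recover the low half (mask, then reinterpret as signed 16-bit).
HD inline int16_t unpack_lo(uint32_t packed) {
    return static_cast<int16_t>(static_cast<uint16_t>(packed & 0xFFFFu));
}

// Recover the high half (shift, then reinterpret as signed 16-bit).
HD inline int16_t unpack_hi(uint32_t packed) {
    return static_cast<int16_t>(static_cast<uint16_t>(packed >> 16));
}

// Small round-trip check, including a negative value.
inline void pack_shorts_self_check() {
    uint32_t p = pack_shorts(-3, 1234);
    assert(unpack_lo(p) == -3);
    assert(unpack_hi(p) == 1234);
}
```

Every read costs a mask or a shift, and every write costs a shift and an OR, which is the performance hit mentioned above. Whether that beats a spill into local memory would have to be measured for the kernel in question.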

Mike

You understand correctly. Registers are 32-bit. Any type that is smaller in size is just stuffed into a 32-bit register.

The compiler treats a short2 as just 2 independent short variables. A short2 would use 2 registers.

You can always run “nvcc -ptx source.cu --opencc-options -LIST:source=on” and read the ptx code to see what the compiler is doing. This could answer your __mul24 question.

Thanks, it’s great to get the confirmation, and I will try looking at the ptx code – I’ve avoided doing that so far.

Yes, using shorts wisely can reduce your register count quite a lot. It can also increase it in some cases, because some microcode instructions require the input or output argument to be 32-bit, so an extra temporary is used.

So it is really a matter of trying…

This means the previous poster is wrong; even though the registers are 32-bit, the halves are addressable separately by a lot of instructions.

Trust wumpus, he knows much more about cubin/ptx than I do. I was just repeating something I had read somewhere else… though I don’t recall exactly where.

Now you have me pondering whether I can save a few registers in some of my code using shorts here and there… It doesn’t hurt to try.

Probably you can; in one of my kernels I almost halved the register usage by using shorts.

BTW, in case anyone was wondering, this doesn’t extend to chars. Chars and unsigned chars are promoted to shorts when in a register. In device memory and shared memory, they do take only one byte, of course.

Compiling with “nvcc --ptxas-options -v” I find that my code uses 11 registers, regardless of whether I declare my integers as ints or shorts. Presumably when it says 11 it means 11 32-bit registers?

When I declare them as shorts, looking at the ptx code I see lots of conversions from 16-bit to 32-bit, which I suspect will hit performance, but I haven’t yet profiled it. I’ve also tried a simple multiplication of two shorts and it does indeed automatically use the more efficient __mul24.
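To see why the compiler can safely pick the 24-bit multiply for shorts, here is a rough host-side model in plain C++. It assumes the multiply uses only the low 24 bits of each operand, treated as signed, and keeps the low 32 bits of the product. The names mul24_lo_model, mad24_lo_model and low24_signed are made up for illustration; this is a sketch of the arithmetic, not the hardware or the CUDA intrinsic itself.

```cpp
#include <cassert>
#include <cstdint>

// Sign-extend the low 24 bits of x to a 64-bit signed value.
inline int64_t low24_signed(int32_t x) {
    int64_t v = static_cast<int64_t>(static_cast<uint32_t>(x) & 0xFFFFFFu);
    return (v & 0x800000) ? v - 0x1000000 : v;
}

// Model of a signed 24-bit multiply: only the low 24 bits of each
// operand take part, and the low 32 bits of the product are returned.
inline int32_t mul24_lo_model(int32_t a, int32_t b) {
    int64_t product = low24_signed(a) * low24_signed(b);
    return static_cast<int32_t>(
        static_cast<uint32_t>(static_cast<uint64_t>(product)));
}

// Model of the combined multiply-add: 24-bit multiply, then 32-bit add
// with wraparound.
inline int32_t mad24_lo_model(int32_t a, int32_t b, int32_t c) {
    return static_cast<int32_t>(
        static_cast<uint32_t>(mul24_lo_model(a, b)) +
        static_cast<uint32_t>(c));
}

// Any two 16-bit operands fit in 24 bits and their product fits in
// 32 bits, so the model agrees with an ordinary multiply.
inline void mul24_self_check() {
    int16_t a = -1234, b = 4321;
    assert(mul24_lo_model(a, b) == static_cast<int32_t>(a) * b);
}
```

Under this model a short times a short is always exact, whereas an int operand with bits set above bit 23 gives a different result from a full 32-bit multiply, for example mul24_lo_model(1 << 24, 3) is 0. That would explain why the substitution happens automatically for shorts but has to be requested explicitly for ints.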

I’ve noticed two other oddities. One is that it uses 64-bit addresses for shared variables, though maybe it demotes these to 32-bit when the ptx code is in turn compiled into executable code?

The other is that it does a __mul24 followed by an add, when I would have expected it to do a single combined __mad24:

    mul24.lo.s32 $r101, $r9, $r11; //
    add.s32 $r73, $r73, $r101; //

wumpus, can you explain this behaviour? Or anyone from NVIDIA?