REGISTER USAGE

sepico · November 29, 2015, 3:47pm

Hi all.

Below I have attached the SASS assembly for an sm_10 GPU and the ptxas output for the same architecture.
As you can see, the assembly code uses three kinds of registers:
R-type: R0 to R10, R124
C-type: C0
A-type: A1 and A2

However, the ptxas output reports a register usage of only 11 registers:
ptxas info : Compiling entry function '_Z9dwtHaar1DPfS_S_jji' for 'sm_10'
ptxas info : Used 11 registers, 48+16 bytes smem, 4 bytes cmem[1]

My question is: why doesn’t ptxas take into account the C-type registers, the A-type registers, and R124?
When defining the register usage of a thread, shouldn’t these register types (C-type and A-type) be taken into account as well?

Thanks a lot.

            Function : _Z9dwtHaar1DPfS_S_jji
    /*0000*/     /*0xa0004c0d04200780*/     I2I.U32.U16 R3, g [0x6].U16;
    /*0008*/     /*0x1000d8050423c780*/     MOV R1, g [0xc];
    /*0010*/     /*0x4007040900000780*/     IMUL.U16.U16 R2, R1L, R3H;
    /*0018*/     /*0x6006060900008780*/     IMAD.U16 R2, R1H, R3L, R2;
    /*0020*/     /*0x30100409c4100780*/     SHL R2, R2, 0x10;
    /*0028*/     /*0x6006041100008780*/     IMAD.U16 R4, R1L, R3L, R2;
    /*0030*/     /*0xa000001504000780*/     I2I.U32.U16 R5, R0L;
    /*0038*/     /*0x30010801c4100780*/     SHL R0, R4, 0x1;
    /*0040*/     /*0x20008a00        */     IADD32 R0, R5, R0;
    /*0044*/     /*0x2100f804        */     IADD32 R1, g [0xc], R0;
    /*0048*/     /*0x30020001c4100780*/     SHL R0, R0, 0x2;
    /*0050*/     /*0x30020209c4100780*/     SHL R2, R1, 0x2;
    /*0058*/     /*0x2000c80104200780*/     IADD R0, g [0x4], R0;
    /*0060*/     /*0xd00e000580c00780*/     GLD.U32 R1, global14 [R0];
    /*0068*/     /*0x00020a05c0000780*/     R2A A1, R5, 0x2;
    /*0070*/     /*0x2000c80104208780*/     IADD R0, g [0x4], R2;
    /*0078*/     /*0xd00e000180c00780*/     GLD.U32 R0, global14 [R0];
    /*0080*/     /*0x2000d80904214780*/     IADD R2, g [0xc], R5;
    /*0088*/     /*0x04002001e4204780*/     R2G.U32.U32 g [A1+0x10], R1;
    /*0090*/     /*0x00020405c0000780*/     R2A A1, R2, 0x2;
    /*0098*/     /*0x04002001e4200780*/     R2G.U32.U32 g [A1+0x10], R0;
    /*00a0*/     /*0x861ffe0300000000*/     BAR.ARV.WAIT b0, 0xfff;
    /*00a8*/     /*0x00030a05c0000780*/     R2A A1, R5, 0x3;
    /*00b0*/     /*0x30010a01c4100780*/     SHL R0, R5, 0x1;
    /*00b8*/     /*0x1400e0090423c780*/     MOV R2, g [A1+0x10];
    /*00c0*/     /*0x1400e2050423c780*/     MOV R1, g [A1+0x11];
    /*00c8*/     /*0x861ffe0300000000*/     BAR.ARV.WAIT b0, 0xfff;
    /*00d0*/     /*0x2000081904014780*/     IADD R6, R4, R5;
    /*00d8*/     /*0x30040a11ec100780*/     SHR.S32 R4, R5, 0x4;
    /*00e0*/     /*0x2106f618        */     IADD32 R6, g [0xb], R6;
    /*00e4*/     /*0x20048a10        */     IADD32 R4, R5, R4;
    /*00e8*/     /*0xb000041d08004780*/     FADD R7, R2, -R1;
    /*00f0*/     /*0x30020c19c4100780*/     SHL R6, R6, 0x2;
    /*00f8*/     /*0xb000042100004780*/     FADD R8, R2, R1;
    /*0100*/     /*0x00020805c0000780*/     R2A A1, R4, 0x2;
    /*0108*/     /*0xc0330e0903f3504f*/     FMUL32I R2, R7, 0x3f3504f3;
    /*0110*/     /*0x2000cc0504218780*/     IADD R1, g [0x6], R6;
    /*0118*/     /*0xc033101103f3504f*/     FMUL32I R4, R8, 0x3f3504f3;
    /*0120*/     /*0xd00e0209a0c00780*/     GST.U32 global14 [R1], R2;
    /*0128*/     /*0x04002001e4210780*/     R2G.U32.U32 g [A1+0x10], R4;
    /*0130*/     /*0x861ffe0300000000*/     BAR.ARV.WAIT b0, 0xfff;
    /*0138*/     /*0x3080d5fd6460c7c8*/     ISET.C0 o [0x7f], g [0xa], c [0x1] [0x0], LE;
    /*0140*/     /*0x3000000300000280*/     RET C0.NE;
    /*0148*/     /*0x1001801900000003*/     MVI R6, 0x1;
    /*0150*/     /*0x1001801d00000003*/     MVI R7, 0x1;
    /*0158*/     /*0x3001d811ec300780*/     SHR.S32 R4, g [0xc], 0x1;
    /*0160*/     /*0x30040bfd640187c8*/     ISET.C0 o [0x7f], R5, R4, GE;
    /*0168*/     /*0xa004900300000000*/     SSY 0x248;
    /*0170*/     /*0x1004900300000280*/     BRA C0.NE, 0x248;
    /*0178*/     /*0xa000480504200780*/     I2I.U32.U16 R1, g [0x4].U16;
    /*0180*/     /*0x200002050400c780*/     IADD R1, R1, R3;
    /*0188*/     /*0x20078008        */     IADD32 R2, R0, R7;
    /*018c*/     /*0x40031020        */     IMUL32.U16.U16 R8, R4L, R1H;
    /*0190*/     /*0x30040429e4100780*/     SHR R10, R2, 0x4;
    /*0198*/     /*0x6002122500020780*/     IMAD.U16 R9, R4H, R1L, R8;
    /*01a0*/     /*0x30040021e4100780*/     SHR R8, R0, 0x4;
    /*01a8*/     /*0x2000042904028780*/     IADD R10, R2, R10;
    /*01b0*/     /*0x30101209c4100780*/     SHL R2, R9, 0x10;
    /*01b8*/     /*0x2000102104000780*/     IADD R8, R8, R0;
    /*01c0*/     /*0x00021405c0000780*/     R2A A1, R10, 0x2;
    /*01c8*/     /*0x6002100900008780*/     IMAD.U16 R2, R4L, R1L, R2;
    /*01d0*/     /*0x00021009c0000780*/     R2A A2, R8, 0x2;
    /*01d8*/     /*0x1400e0050423c780*/     MOV R1, g [A1+0x10];
    /*01e0*/     /*0x2000042104014780*/     IADD R8, R2, R5;
    /*01e8*/     /*0x1400e0090423c780*/     MOV R2, g [A1+0x10];
    /*01f0*/     /*0xb800e02508204780*/     FADD R9, g [A2+0x10], -R1;
    /*01f8*/     /*0x30021005c4100780*/     SHL R1, R8, 0x2;
    /*0200*/     /*0xb800e02100208780*/     FADD R8, g [A2+0x10], R2;
    /*0208*/     /*0xc033120903f3504f*/     FMUL32I R2, R9, 0x3f3504f3;
    /*0210*/     /*0x2000cc0504204780*/     IADD R1, g [0x6], R1;
    /*0218*/     /*0xc033102103f3504f*/     FMUL32I R8, R8, 0x3f3504f3;
    /*0220*/     /*0xd00e0209a0c00780*/     GST.U32 global14 [R1], R2;
    /*0228*/     /*0x08002001e4220780*/     R2G.U32.U32 g [A2+0x10], R8;
    /*0230*/     /*0x30010e1dc4100780*/     SHL R7, R7, 0x1;
    /*0238*/     /*0x30010001c4100780*/     SHL R0, R0, 0x1;
    /*0240*/     /*0x30010811e4100780*/     SHR R4, R4, 0x1;
    /*0248*/     /*0xf0000001e0000002*/     NOP.S;
    /*0250*/     /*0x861ffe0300000000*/     BAR.ARV.WAIT b0, 0xfff;
    /*0258*/     /*0x20018c1900000003*/     IADD32I R6, R6, 0x1;
    /*0260*/     /*0x3006d5fd642147c8*/     ISET.C0 o [0x7f], g [0xa], R6, NE;
    /*0268*/     /*0x1002c00300000280*/     BRA C0.NE, 0x160;
    /*0270*/     /*0x307c0bfd6c0147c8*/     ISET.S32.C0 o [0x7f], R5, R124, NE;
    /*0278*/     /*0x3000000300000280*/     RET C0.NE;
    /*0280*/     /*0x30020605c4100780*/     SHL R1, R3, 0x2;
    /*0288*/     /*0x1000e0010423c780*/     MOV R0, g [0x10];
    /*0290*/     /*0x2000d00504204780*/     IADD R1, g [0x8], R1;
    /*0298*/     /*0xd00e0201a0c00781*/     GST.U32 global14 [R1], R0;

njuffa · November 29, 2015, 5:02pm

Counter questions: Why should “special” registers be included in the register statistics produced by PTXAS? What purpose would this serve, and what significant advantages would it have for programmers?

Since sm_1x is an obsolete architecture no longer supported by CUDA, in practical terms the answers to these questions are only of historical interest.

For as long as PTXAS has reported register use, this has only covered use of general-purpose registers (GPRs). The main idea behind this was to give programmers some idea about register pressure with respect to GPRs, which was a significant problem in early GPU architectures. To my recollection, register pressure for condition codes and address registers was not a major issue even in the early days of CUDA.

sm_1x did not implement a dedicated zero register (RZ) as modern GPU architectures do. Instead, the compiler used a trick to utilize R124-R127 as zero registers, which were therefore never available to the programmer’s code. The multiple 4-bit (I think) condition code registers of sm_1x have disappeared in modern architectures, replaced by a cleaner design that uses multiple 1-bit predicate registers. Likewise, the address registers used in sm_1x to address shared memory were dropped, and regular GPRs are now used for this purpose.
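
As a rough illustration of why the GPR count is the figure that matters for register pressure, here is a minimal Python sketch of how registers per thread can cap the number of resident blocks. The per-multiprocessor limits for sm_10 (8192 registers, 8 resident blocks, 768 resident threads) and the block size of 256 threads are assumptions that do not come from this thread, and the sketch ignores shared memory and register allocation granularity, so treat the results as an upper bound only.

```python
# Assumed sm_10 per-multiprocessor limits (not stated in this thread).
REGS_PER_SM = 8192
MAX_BLOCKS_PER_SM = 8
MAX_THREADS_PER_SM = 768


def max_resident_blocks(regs_per_thread, threads_per_block):
    """Upper bound on resident blocks, ignoring shared memory and
    register allocation granularity (a deliberate simplification)."""
    by_regs = REGS_PER_SM // (regs_per_thread * threads_per_block)
    by_threads = MAX_THREADS_PER_SM // threads_per_block
    return min(by_regs, by_threads, MAX_BLOCKS_PER_SM)


# Hypothetical block size of 256 threads, with the two register counts
# that ptxas reports in this thread.
print(max_resident_blocks(11, 256))  # 2
print(max_resident_blocks(17, 256))  # 1
```

Under these assumptions, going from 11 to 17 GPRs per thread halves the number of blocks that fit, which is the kind of effect the PTXAS register statistic is meant to expose.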

sepico · November 29, 2015, 5:53pm

njuffa, thanks for your kind reply.

Do you have any reference stating clearly that only general-purpose registers are taken into account when determining the maximum number of blocks that can run concurrently on a Streaming Multiprocessor?

If the usage of special registers cannot be a limiting factor, that means special registers are always available, even though the architecture cannot have an infinitely large special register file.

Anyway, do you have any information about the size of the “special register file” in old architectures?

One more thing.
Below is another SASS listing.
In this case the general-purpose registers are R0 - R15.
But ptxas says that 17 registers are used.

ptxas info : Compiling entry function '_Z15BlackScholesGPUPfS_S_S_S_ffi' for 'sm_10'
ptxas info : Used 17 registers, 52+16 bytes smem, 32 bytes cmem[1]

If ptxas is counting R124 in this case, do you have any idea why R124 was not counted in the previous example?

Thank you

code for sm_10
Function : _Z15BlackScholesGPUPfS_S_S_S_ffi
    /*0000*/     /*0x100042050023c780*/     MOV.U16 R0H, g [0x1].U16;
    /*0008*/     /*0xa000000504000780*/     I2I.U32.U16 R1, R0L;
    /*0010*/     /*0x60014c0100204780*/     IMAD.U16 R0, g [0x6].U16, R0H, R1;
    /*0018*/     /*0x3000e1fd6c20c7c8*/     ISET.S32.C0 o [0x7f], g [0x10], R0, LE;
    /*0020*/     /*0x3000000300000280*/     RET C0.NE;
    /*0028*/     /*0xc1007e1103f00003*/     FMUL32I R4, g [0xf], 0x3f000000;
    /*0030*/     /*0x1100fc08        */     MOV32 R2, g [0xe];
    /*0034*/     /*0x11002208        */     MOV32.U16 R1L, g [0x1].U16;
    /*0038*/     /*0x3002000dc4100780*/     SHL R3, R0, 0x2;
    /*0040*/     /*0xe004de2100208780*/     FMAD R8, g [0xf], R4, R2;
    /*0048*/     /*0x41022804        */     IMUL32.U16.U16 R1, g [0x4].U16, R1L;
    /*004c*/     /*0x2103f014        */     IADD32 R5, g [0x8], R3;
    /*0050*/     /*0x2103f41c        */     IADD32 R7, g [0xa], R3;
    /*0054*/     /*0x2103f82c        */     IADD32 R11, g [0xc], R3;
    /*0058*/     /*0x2103e828        */     IADD32 R10, g [0x4], R3;
    /*005c*/     /*0x2103ec18        */     IADD32 R6, g [0x6], R3;
    /*0060*/     /*0x30020225c4100780*/     SHL R9, R1, 0x2;
    /*0068*/     /*0xd00e0a0d80c00780*/     GLD.U32 R3, global14 [R5];
    /*0070*/     /*0xd00e0e1180c00780*/     GLD.U32 R4, global14 [R7];
    /*0078*/     /*0xd00e163580c00780*/     GLD.U32 R13, global14 [R11];
    /*0080*/     /*0xb08009fd605107c8*/     FSET.C0 o [0x7f], |R4|, c [0x1] [0x0], GT;
    /*0088*/     /*0x10008604        */     MOV32 R1, R3;
    /*008c*/     /*0x10008808        */     MOV32 R2, R4;
    /*0090*/     /*0xc00ddc3100200780*/     FMUL R12, g [0xe], R13;
    /*0098*/     /*0xc087060d00400680*/     FMUL R3 (C0.NEU), R3, c [0x1] [0x7];
    /*00a0*/     /*0xc087081100400680*/     FMUL R4 (C0.NEU), R4, c [0x1] [0x7];
    /*00a8*/     /*0x90000810        */     RCP32 R4, R4;
    /*00ac*/     /*0xc004060c        */     FMUL32 R3, R3, R4;
    /*00b0*/     /*0xc03b981103fb8aa3*/     FMUL32I R4, -R12, 0x3fb8aa3b;
    /*00b8*/     /*0xb0000811c0004780*/     RRO R4, R4, EX2;
    /*00c0*/     /*0x90000811c0000780*/     EX2 R4, R4;
    /*00c8*/     /*0xc002083100000780*/     FMUL R12, R4, R2;
    /*00d0*/     /*0x9000060960000780*/     LG2 R2, R3;
    /*00d8*/     /*0xc018040903f31723*/     FMUL32I R2, R2, 0x3f317218;
    /*00e0*/     /*0xe0081a0d00008780*/     FMAD R3, R13, R8, R2;
    /*00e8*/     /*0x90001a0940000780*/     RSQ R2, R13;
    /*00f0*/     /*0x90000408        */     RCP32 R2, R2;
    /*00f4*/     /*0xc1027e10        */     FMUL32 R4, g [0xf], R2;
    /*00f8*/     /*0xb08009fd605107c8*/     FSET.C0 o [0x7f], |R4|, c [0x1] [0x0], GT;
    /*0100*/     /*0xc087060d00400680*/     FMUL R3 (C0.NEU), R3, c [0x1] [0x7];
    /*0108*/     /*0xc087081100400680*/     FMUL R4 (C0.NEU), R4, c [0x1] [0x7];
    /*0110*/     /*0x90000810        */     RCP32 R4, R4;
    /*0114*/     /*0xc0040610        */     FMUL32 R4, R3, R4;
    /*0118*/     /*0xc00008350bf00003*/     FMUL32I R13, R4, -0x41000000;
    /*0120*/     /*0x1000800d03f80003*/     MVI R3, 0x3f800000;
    /*0128*/     /*0xe002de0904210780*/     FMAD R2, -g [0xf], R2, R4;
    /*0130*/     /*0xb07c09fd600107c8*/     FSET.C0 o [0x7f], R4, R124, GT;
    /*0138*/     /*0xc00d083900000780*/     FMUL R14, R4, R13;
    /*0140*/     /*0xa0000811c4104780*/     F2F.F32.F32 R4, |R4|;
    /*0148*/     /*0xe08108110040c780*/     FMAD R4, R4, c [0x1] [0x1], R3;
    /*0150*/     /*0xc00004350bf00003*/     FMUL32I R13, R2, -0x41000000;
    /*0158*/     /*0xb07c05fd600107d8*/     FSET.C1 o [0x7f], R2, R124, GT;
    /*0160*/     /*0xc00d043500000780*/     FMUL R13, R2, R13;
    /*0168*/     /*0xa0000409c4104780*/     F2F.F32.F32 R2, |R2|;
    /*0170*/     /*0xe009040d03e6d33b*/     FMAD32I R3, R2, 0x3e6d3389, R3;
    /*0178*/     /*0x102a80090bfe91ef*/     MVI R2, -0x4016e116;
    /*0180*/     /*0x9000083d00000780*/     RCP R15, R4;
    /*0188*/     /*0xe0821e1100408780*/     FMAD R4, R15, c [0x1] [0x2], R2;
    /*0190*/     /*0xe1041e110040c780*/     FMAD R4, R15, R4, c [0x1] [0x3];
    /*0198*/     /*0xe1041e1100410780*/     FMAD R4, R15, R4, c [0x1] [0x4];
    /*01a0*/     /*0xe1041e1100414780*/     FMAD R4, R15, R4, c [0x1] [0x5];
    /*01a8*/     /*0xc0041e1100000780*/     FMUL R4, R15, R4;
    /*01b0*/     /*0xc03b1c3903fb8aa3*/     FMUL32I R14, R14, 0x3fb8aa3b;
    /*01b8*/     /*0xb0001c39c0004780*/     RRO R14, R14, EX2;
    /*01c0*/     /*0x90001c39c0000780*/     EX2 R14, R14;
    /*01c8*/     /*0xc02a1c3903ecc423*/     FMUL32I R14, R14, 0x3ecc422a;
    /*01d0*/     /*0xc0041c10        */     FMUL32 R4, R14, R4;
    /*01d4*/     /*0x9000060c        */     RCP32 R3, R3;
    /*01d8*/     /*0xe02f060903faa467*/     FMAD32I R2, R3, 0x3faa466f, R2;
    /*01e0*/     /*0xe10206090040c780*/     FMAD R2, R3, R2, c [0x1] [0x3];
    /*01e8*/     /*0xe102060900410780*/     FMAD R2, R3, R2, c [0x1] [0x4];
    /*01f0*/     /*0xe102060900414780*/     FMAD R2, R3, R2, c [0x1] [0x5];
    /*01f8*/     /*0xc002060d00000780*/     FMUL R3, R3, R2;
    /*0200*/     /*0xc03b1a0903fb8aa3*/     FMUL32I R2, R13, 0x3fb8aa3b;
    /*0208*/     /*0xb0000409c0004780*/     RRO R2, R2, EX2;
    /*0210*/     /*0x90000409c0000780*/     EX2 R2, R2;
    /*0218*/     /*0xc02a040903ecc423*/     FMUL32I R2, R2, 0x3ecc422a;
    /*0220*/     /*0xc003040900000780*/     FMUL R2, R2, R3;
    /*0228*/     /*0xb100081104418280*/     FADD R4 (C0.NE), -R4, c [0x1] [0x6];
    /*0230*/     /*0xb100040904419280*/     FADD R2 (C1.NE), -R2, c [0x1] [0x6];
    /*0238*/     /*0xb000883503f80003*/     FADD32I R13, -R4, 0x3f800000;
    /*0240*/     /*0xc002180c        */     FMUL32 R3, R12, R2;
    /*0244*/     /*0xc00d0234        */     FMUL32 R13, R1, R13;
    /*0248*/     /*0xe00402050800c780*/     FMAD R1, R1, R4, -R3;
    /*0250*/     /*0xb000840903f80003*/     FADD32I R2, -R2, 0x3f800000;
    /*0258*/     /*0xe002180908034780*/     FMAD R2, R12, R2, -R13;
    /*0260*/     /*0xd00e1405a0c00780*/     GST.U32 global14 [R10], R1;
    /*0268*/     /*0x11002208        */     MOV32.U16 R1L, g [0x1].U16;
    /*026c*/     /*0x41022804        */     IMUL32.U16.U16 R1, g [0x4].U16, R1L;
    /*0270*/     /*0x2000000104004780*/     IADD R0, R0, R1;
    /*0278*/     /*0xd00e0c09a0c00780*/     GST.U32 global14 [R6], R2;
    /*0280*/     /*0x200b922c        */     IADD32 R11, R9, R11;
    /*0284*/     /*0x2007921c        */     IADD32 R7, R9, R7;
    /*0288*/     /*0x3000e1fd6c2107c8*/     ISET.S32.C0 o [0x7f], g [0x10], R0, GT;
    /*0290*/     /*0x20098a14        */     IADD32 R5, R5, R9;
    /*0294*/     /*0x200a9228        */     IADD32 R10, R9, R10;
    /*0298*/     /*0x2000121904018780*/     IADD R6, R9, R6;
    /*02a0*/     /*0x1000d00300000280*/     BRA C0.NE, 0x68;
    /*02a8*/     /*0xf0000001e0000001*/     NOP;

tera · November 29, 2015, 6:07pm

My assumption has always been that there is a complete set of special registers for each warp. The number of special registers is so small (compared to the number of general purpose registers) that it simply doesn’t seem worth spending the die area to make their number configurable.

I might be wrong of course.

njuffa · November 29, 2015, 7:17pm

If I recall correctly (and it has been years since I last dealt with sm_1x architecture GPUs) there were four condition code registers (each with a zero, sign, carry, and overflow bit, I think), and four address registers (16-bit, I think). Address registers were spilled to GPRs if more were needed than were physically available. Registers R124-R127 were never allocated (enabling their use as de facto zero registers), so their use should not count towards the GPR register count reported by PTXAS.

I do not know why 17 registers are reported in your example code. One thing to keep in mind is that in disassembled code, some instructions may use more than one register, e.g. vector loads/stores, texture instructions, and double-precision operations. In these cases, only the lowest register of a register pair or register quad may be shown. But a cursory review of the posted disassembly doesn’t seem to show an instance of such instructions.
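
For cross-checking a disassembly against the PTXAS figure, a rough Python sketch along the following lines can list the GPRs that appear in a listing. It assumes, as described above, that R124-R127 are never allocated; the function name and the regular expression are illustrative only, and the sample lines are taken from the first listing in this thread. It counts only the registers that are literally printed, so it cannot see the extra registers of a pair or quad.

```python
import re

# Assumption taken from the discussion above: R124-R127 serve as zero
# registers on sm_1x and are never allocated, so they are excluded.
ZERO_REG_START = 124

# Matches R<n> with an optional 16-bit half suffix (L or H), e.g. R3, R1L, R3H.
# Mnemonics such as R2A or R2G do not match because no word boundary
# follows the digit.
GPR_PATTERN = re.compile(r"\bR(\d+)[LH]?\b")


def gprs_used(sass_text):
    """Return the sorted allocatable GPR indices that appear in the text."""
    indices = {int(m.group(1)) for m in GPR_PATTERN.finditer(sass_text)}
    return sorted(i for i in indices if i < ZERO_REG_START)


sample = """
IMUL.U16.U16 R2, R1L, R3H;
R2A A1, R5, 0x2;
R2G.U32.U32 g [A1+0x10], R1;
ISET.S32.C0 o [0x7f], R5, R124, NE;
"""

regs = gprs_used(sample)
print(regs, len(regs))  # [1, 2, 3, 5] 4
```

A count obtained this way is only a lower bound on what PTXAS reports; the second listing in this thread is an example where the two numbers differ.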

sepico · November 30, 2015, 10:22am

njuffa, if you have any official documentation or reference for the information you gave me about those sizes, please share it with me.

Thank you both, njuffa and tera.

njuffa · November 30, 2015, 4:16pm

I doubt you will find much official vendor documentation for an architecture that was discontinued many years ago. But Google is your friend. A quick internet search led to this unofficial documentation:

https://media.readthedocs.org/pdf/envytools/latest/envytools.pdf

See page 203 for register specifications of the Tesla architecture. Since this is unofficial documentation, there is no way of telling how accurate this is, but at least it does not seem wildly inaccurate to me. It is definitely more comprehensive and possibly more accurate than my mental recollection.

I would be interested to learn why you are looking for detailed specifications for an architecture that was discontinued years ago and is no longer supported by CUDA.