Output code uses far too many registers

I wrote a function almost entirely in PTX. It uses 70 registers: 64 for the 2048-bit number I’m multiplying, and 6 for other stuff.

However, when it gets compiled, it blows out to the maximum 255 registers, and even spills into local memory. Looking at the disassembly, the “optimizer” has done horrible things to my code, using way too many registers, and spilling out to the stack all over the place.

If the compiler simply allocated registers the way I wrote the PTX code, rather than attempting to optimize it, it would work great, and I'd be able to run twice as many threads per block. However, disabling the optimizer just makes things worse.

Is there anything I can do about this? Will I have to write my code in the “raw” assembly language (SASS) rather than PTX in order to prevent this?

Thanks,

Melissa

Both nvcc and ptxas have options (-maxrregcount) to force the compiler to limit register usage:

[url]http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#ptxas-options[/url]
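For example (illustrative command lines; pick the architecture and limit that fit your case):

nvcc -arch=sm_50 --maxrregcount=64 -Xptxas -v -o mykernel mykernel.cu

or, if you run ptxas on your PTX yourself:

ptxas -arch=sm_50 --maxrregcount=64 --verbose -o mykernel.cubin mykernel.ptx

The -Xptxas -v part makes ptxas print the actual register and spill counts per kernel. You can also set a per-kernel limit in the source with __launch_bounds__.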

I should have mentioned that I tried limiting to 128 registers, and all it did was make the compiler spill even more stuff to the stack.

Another annoying behavior is that statements that use constants as immediate values get compiled into code that loads all the immediate values into registers up front for the whole function, resulting in a huge increase in register count. If the compiler would just load each immediate value at the point where I use it, the register count would go down by a lot.
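For example (simplified, not my actual code), PTX along the lines of

    add.u32    %r5, %r5, 0x12345;
    mul.lo.u32 %r6, %r6, 0x10001;

ends up with all such constants materialized into registers at the top of the function, instead of being consumed at the point of use.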

Having been there myself a couple of times, I can understand your frustration about being unable to influence ptxas’ register allocation. Unfortunately I haven’t really found a way around it.

In my experience ptxas doesn’t seem to profit from having register allocation done for it by hand. It seems to work best with SSA form, as that is what cicc produces.
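To illustrate with a made-up fragment: rather than recycling registers by hand, like

    mul.lo.u32 %r1, %r2, %r3;
    add.u32    %r1, %r1, %r4;

give every intermediate value a fresh virtual register, the way cicc emits it:

    mul.lo.u32 %r5, %r2, %r3;
    add.u32    %r6, %r5, %r4;

PTX registers are virtual anyway, so the extra names cost nothing, and ptxas gets maximal freedom to compute the live ranges itself.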

I too am surprised about the loading of constants into registers at the very beginning, as some SASS instructions seem to be able to use immediate constants directly from constant space. Can you give an example of code where this happens?

Can you move some of the required ‘scratchpad’ memory to shared memory?
Both Maxwell and Pascal (other than the Tesla P100) allow for as much as 96 KB of shared memory per SM.

Not sure if that is possible for your application, but I have done this successfully in the past.
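Something along these lines (just a sketch with made-up names; assumes 64 threads per block, and the one-word padding per row avoids shared memory bank conflicts):

__global__ void bignum_kernel(const unsigned int *in, unsigned int *out)
{
    // 64 threads x 64 limbs of 32 bits = one 2048-bit number per thread,
    // held in shared memory (~16.6 KB per block) instead of 64 registers
    __shared__ unsigned int limb[64][64 + 1];
    unsigned int *a = limb[threadIdx.x];
    unsigned int  base = (blockIdx.x * blockDim.x + threadIdx.x) * 64;

    for (int i = 0; i < 64; i++)    // stage the operand in shared memory
        a[i] = in[base + i];

    // ... multi-precision arithmetic on a[0..63] goes here ...

    for (int i = 0; i < 64; i++)    // write back the result
        out[base + i] = a[i];
}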

Other than that, maybe break the computation down into separate sub-kernels which use fewer registers and are launched in the appropriate order.

Just out of curiosity (not necessarily directed at the OP) why would one want to use PTX anyway?

In my experience I am able to get very good control over the compilation using CUDA and the compilation flags.

Of course Scott Gray has been able to get performance superior to CUDA, but he uses native assembly.

His relevant quote:

“Ptxas badly manages register usage (especially when you’re using vector memory ops), does a poor job of hiding memory latency with interleaved computation (particularly if you’re trying to double buffer your memory loads) and handles certain predicated memory operations badly (even when warp uniform), among other things.”

That was my first thought as well.

Also, which CUDA version are you using? CUDA 8.0 seems to work a lot better than previous versions.

(1) What version of CUDA are you using?
(2) What architecture are you compiling for?

As tera stated, CUDA 8.0 seems to be generating much better machine code than CUDA 7.x, which seemed to contain a number of weirdly inefficient code generation issues. Caveat: I just installed CUDA 8.0 a week ago, but from what code I have inspected, I was positively surprised by the much tighter code produced by CUDA 8. I am not sure how much of Scott Gray’s criticism from four years ago still applies today.

Keep in mind that PTX is not an assembly language; it is a compiler intermediate representation and a virtual ISA, which is compiled by an optimizing compiler (PTXAS) into machine code. Thus control over machine-specific things like register usage is very limited at the PTX level.

You didn’t say how you structured your multiplication code. Note that various multiplication operations available at the PTX level are not supported natively in hardware, especially on Maxwell- and Pascal-based GPUs, where pretty much all of them are emulated. The emulation sequences emitted by PTXAS for some of these use up registers and instructions at a frightening rate, especially those one would expect to find in a multi-precision routine (.hi, .cc, madc). If you use only 32-bit multiplication instructions at the PTX level and compile for the Kepler architecture (compute capability 3.x), you should see much more reasonable results. I have only tried that up to 512 bits, though.

It is also important to arrange the 32x32-bit partial multiplications that make up the multi-precision multiply in a scheme that minimizes live registers. A good example of an efficient arrangement (albeit only for 128 bits) can be seen in an old StackOverflow post of mine (http://stackoverflow.com/a/6220499/780717). This kind of arrangement is regular, and the code can be generated by a script or program for any size that will fit the register file; unfortunately I don’t have such a script available.
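For reference, that code is essentially as follows (reconstructed here from memory, so double-check against the post before relying on it). It computes the low 128 bits of a 128x128-bit product with sixteen 32-bit multiplies, propagating carries between limbs through the .cc/madc chain:

__device__ uint4 mul_uint128 (uint4 a, uint4 b)
{
    uint4 res;
    asm ("{\n\t"
         "mul.lo.u32      %0, %4, %8;    \n\t"   // r0  = lo(a0*b0)
         "mul.hi.u32      %1, %4, %8;    \n\t"   // r1  = hi(a0*b0)
         "mad.lo.cc.u32   %1, %4, %9, %1;\n\t"   // r1 += lo(a0*b1), carry out
         "madc.hi.u32     %2, %4, %9,  0;\n\t"   // r2  = hi(a0*b1) + carry
         "mad.lo.cc.u32   %1, %5, %8, %1;\n\t"   // r1 += lo(a1*b0), carry out
         "madc.hi.cc.u32  %2, %5, %8, %2;\n\t"   // r2 += hi(a1*b0) + carry, carry out
         "madc.hi.u32     %3, %4,%10,  0;\n\t"   // r3  = hi(a0*b2) + carry
         "mad.lo.cc.u32   %2, %4,%10, %2;\n\t"   // r2 += lo(a0*b2), carry out
         "madc.hi.cc.u32  %3, %5, %9, %3;\n\t"   // r3 += hi(a1*b1) + carry
         "mad.lo.cc.u32   %2, %5, %9, %2;\n\t"   // r2 += lo(a1*b1), carry out
         "madc.hi.cc.u32  %3, %6, %8, %3;\n\t"   // r3 += hi(a2*b0) + carry
         "mad.lo.cc.u32   %2, %6, %8, %2;\n\t"   // r2 += lo(a2*b0), carry out
         "madc.lo.u32     %3, %4,%11, %3;\n\t"   // r3 += lo(a0*b3) + carry
         "mad.lo.u32      %3, %5,%10, %3;\n\t"   // r3 += lo(a1*b2)
         "mad.lo.u32      %3, %6, %9, %3;\n\t"   // r3 += lo(a2*b1)
         "mad.lo.u32      %3, %7, %8, %3;\n\t"   // r3 += lo(a3*b0)
         "}"
         : "=r"(res.x), "=r"(res.y), "=r"(res.z), "=r"(res.w)
         : "r"(a.x), "r"(a.y), "r"(a.z), "r"(a.w),
           "r"(b.x), "r"(b.y), "r"(b.z), "r"(b.w));
    return res;
}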

[Later:] For illustration purposes, I built the code from my StackOverflow posting with CUDA 8.0, compiling for sm_35 and sm_50. The code for sm_50 is about 5.5 times as long, which makes sense because emulating everything with 16x16-bit multiplies (XMAD) will require four times as many basic multiplies as when using 32x32-bit multiplies, plus overhead from additional additions. Despite the jumble of emulation code, register usage in the Maxwell code looks fine, just incrementally higher than for the Kepler code.

Really high performance multi-precision multiplication code should be written in Maxwell or Pascal machine language; however, NVIDIA does not make tools for this publicly available. You might want to study this presentation and paper by NVIDIA engineers and others about modular multiplication on Maxwell: http://arith23.gforge.inria.fr/slides/Emmart_Luitjens_Weems_Woolley.pptx and https://www.researchgate.net/profile/Niall_Emmart/publication/307941360_Optimizing_Modular_Multiplication_for_NVIDIA’s_Maxwell_GPUs/links/580f1e5f08aef766ef11c3a0.pdf
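(The disassembly below was produced along these lines; file names are just placeholders:

nvcc -cubin -arch=sm_35 -o mul128_sm35.cubin mul128.cu
cuobjdump --dump-sass mul128_sm35.cubin

and the same with -arch=sm_50 for the Maxwell version.)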

Function : _Z11mul_uint1285uint4S_
.headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"

/*0008*/                   IMUL.U32.U32.HI R0, R4, R8;
/*0010*/                   IMAD.U32.U32 R3.CC, R4, R9, R0;
/*0018*/                   IMAD.U32.U32.HI.X R12, R4, R9, RZ;
/*0020*/                   IMAD.U32.U32 R0.CC, R5, R8, R3;
/*0028*/                   IMAD.U32.U32.HI.X R3.CC, R5, R8, R12;
/*0030*/                   IMAD.U32.U32.HI.X R14, R4, R10, RZ;
/*0038*/                   IMAD.U32.U32 R12.CC, R4, R10, R3;

/*0048*/                   IMAD.U32.U32.HI.X R13, R5, R9, R14;
/*0050*/                   IMAD.U32.U32 R3.CC, R5, R9, R12;
/*0058*/                   IMAD.U32.U32.HI.X R13, R6, R8, R13;
/*0060*/                   IMAD.U32.U32 R3.CC, R6, R8, R3;
/*0068*/                   IMAD.U32.U32.X R11, R4, R11, R13;
/*0070*/                   IMAD.U32.U32 R5, R5, R10, R11;
/*0078*/                   IMAD.U32.U32 R9, R6, R9, R5;
/*0088*/                   MOV R5, R0;
/*0090*/                   MOV R6, R3;
/*0098*/                   IMUL.U32.U32 R4, R4, R8;
/*00a0*/                   IMAD.U32.U32 R7, R7, R8, R9;
/*00a8*/                   RET;
/*00b0*/                   NOP;
/*00b8*/                   NOP;
/*00c0*/                   BRA 0xc0;

Function : _Z11mul_uint1285uint4S_
.headerflags    @"EF_CUDA_SM50 EF_CUDA_PTX_SM(EF_CUDA_SM50)"

/*0008*/                   IADD32I R1, R1, -0x10;
/*0010*/                   STL [R1+0xc], R18;
/*0018*/                   STL [R1+0x8], R17;

/*0028*/                   STL [R1+0x4], R16;
/*0030*/         {         MOV R3, R9;
/*0038*/                   STL [R1], R2;        }

/*0048*/                   MOV R0, R8;
/*0050*/                   MOV R8, R6;
/*0058*/                   MOV R9, R5;

/*0068*/                   MOV R6, R4;
/*0070*/                   XMAD R5, R6.reuse, R0.reuse, RZ;
/*0078*/                   XMAD.MRG R4, R6.reuse, R0.H1.reuse, RZ;

/*0088*/                   XMAD R12, R6.reuse, R0.H1.reuse, RZ;
/*0090*/                   XMAD R13, R6.H1.reuse, R0.H1, RZ;
/*0098*/                   XMAD R15, R6, R3, RZ;

/*00a8*/                   XMAD.MRG R16, R6.reuse, R3.H1.reuse, RZ;
/*00b0*/                   XMAD.CHI R14, R6.H1.reuse, R0, R5.reuse;
/*00b8*/                   XMAD.PSL.CBCC R4, R6.H1.reuse, R4.H1, R5;

/*00c8*/                   XMAD R17, R6.reuse, R3.H1.reuse, RZ;
/*00d0*/                   XMAD R18, R6.H1.reuse, R3.H1, RZ;
/*00d8*/                   XMAD.PSL.CBCC R2, R6.H1, R16.H1, R15;

/*00e8*/                   IADD3.RS R5, R14, R12, R13;
/*00f0*/                   XMAD.CHI R15, R6.H1, R3, R15;
/*00f8*/                   XMAD R14, R9.reuse, R0.reuse, RZ;

/*0108*/                   XMAD.MRG R12, R9.reuse, R0.H1.reuse, RZ;
/*0110*/                   XMAD R13, R9, R0.H1, RZ;
/*0118*/                   IADD R5.CC, R5, R2;

/*0128*/         {         IADD3.RS R17, R15, R17, R18;
/*0130*/                   LDL R18, [R1+0xc];        }
/*0138*/                   XMAD.CHI R2, R9.H1.reuse, R0.reuse, R14.reuse;

/*0148*/                   XMAD R15, R9.H1.reuse, R0.H1, RZ;
/*0150*/                   XMAD.PSL.CBCC R14, R9.H1.reuse, R12.H1, R14;
/*0158*/                   XMAD R16, R9, R3, RZ;

/*0168*/                   IADD.X R12, RZ, R17;
/*0170*/                   IADD3.RS R13, R2, R13, R15;
/*0178*/                   IADD R5.CC, R5, R14;

/*0188*/                   XMAD R14, R9.reuse, R3.H1.reuse, RZ;
/*0190*/                   XMAD R15, R9.H1.reuse, R3.H1.reuse, RZ;
/*0198*/                   XMAD.CHI R17, R9.H1, R3, R16;

/*01a8*/                   IADD3.RS R2, R17, R14, R15;
/*01b0*/                   XMAD.MRG R14, R9.reuse, R3.H1, RZ;
/*01b8*/                   XMAD R15, R6.reuse, R10.reuse, RZ;

/*01c8*/                   IADD.X R12.CC, R12, R13;
/*01d0*/                   XMAD R13, R6.reuse, R10.H1.reuse, RZ;
/*01d8*/                   XMAD.PSL.CBCC R17, R9.H1, R14.H1, R16;

/*01e8*/                   XMAD R14, R6.H1, R10.H1, RZ;
/*01f0*/                   XMAD.CHI R16, R6.H1.reuse, R10.reuse, R15;
/*01f8*/                   IADD3.RS R16, R16, R13, R14;

/*0208*/                   XMAD.MRG R13, R6, R10.H1, RZ;
/*0210*/                   XMAD.PSL.CBCC R14, R6.H1, R13.H1, R15;
/*0218*/                   IADD.X R13, RZ, R16;

/*0228*/                   IADD R12.CC, R12, R14;
/*0230*/                   XMAD R14, R8, R0, RZ;
/*0238*/                   XMAD R15, R8.reuse, R0.H1.reuse, RZ;

/*0248*/                   IADD.X R13, R13, R2;
/*0250*/                   XMAD R2, R8.H1.reuse, R0.H1.reuse, RZ;
/*0258*/                   XMAD.CHI R16, R8.H1.reuse, R0.reuse, R14;

/*0268*/                   IADD R12.CC, R12, R17;
/*0270*/                   XMAD.MRG R17, R8, R0.H1, RZ;
/*0278*/                   IADD3.RS R15, R16, R15, R2;

/*0288*/                   XMAD R16, R6.reuse, R11.reuse, RZ;
/*0290*/                   XMAD.MRG R11, R6.reuse, R11.H1, RZ;
/*0298*/         {         XMAD.PSL.CBCC R14, R8.H1, R17.H1, R14;
/*02a8*/                   LDL R17, [R1+0x8];        }

/*02b0*/         {         XMAD.PSL.CBCC R2, R6.H1, R11.H1, R16;
/*02b8*/                   LDL R16, [R1+0x4];        }

/*02c8*/                   IADD.X R11, R13, R15;
/*02d0*/                   IADD R6.CC, R12, R14;
/*02d8*/                   XMAD.MRG R12, R9.reuse, R10.H1.reuse, RZ;

/*02e8*/         {         IADD.X R11, R11, R2;
/*02f0*/                   LDL R2, [R1];        }
/*02f8*/                   XMAD R10, R9.reuse, R10, R11;

/*0308*/                   XMAD.PSL.CBCC R9, R9.H1, R12.H1, R10;
/*0310*/                   XMAD.MRG R10, R8, R3.H1, RZ;
/*0318*/                   XMAD R3, R8.reuse, R3, R9;

/*0328*/                   XMAD.PSL.CBCC R9, R8.H1, R10.H1, R3;
/*0330*/                   XMAD.MRG R10, R7.reuse, R0.H1.reuse, RZ;
/*0338*/                   XMAD R0, R7, R0, R9;

/*0348*/                   IADD32I R1, R1, 0x10;
/*0350*/         {         XMAD.PSL.CBCC R7, R7.H1, R10.H1, R0;
/*0358*/                   RET;        }

/*0368*/                   BRA 0x360;