(1) What version of CUDA are you using?

(2) What architecture are you compiling for?

As tera stated, CUDA 8.0 seems to be generating much better machine code than CUDA 7.x, which suffered from a number of oddly inefficient code generation issues. Caveat: I installed CUDA 8.0 only a week ago, but from the code I have inspected so far, I was positively surprised by the much tighter code it produces. I am not sure how much of Scott Gray’s criticism from four years ago still applies today. Keep in mind that PTX is not an assembly language; it is a compiler intermediate representation and a *virtual* ISA, which is *compiled* by an optimizing compiler (PTXAS) into machine code. Thus control over machine-specific aspects such as register usage is very limited at the PTX level.

You didn’t say how you structured your multiplication code. Note that various multiplication operations available at the PTX level are *not* supported natively in hardware, *especially on Maxwell and Pascal* based GPUs, where pretty much all of them are *emulated*. The emulation sequences emitted by PTXAS for some of these use up registers and instructions at a frightening rate, *especially* those one would expect to find in a multi-precision routine (.hi, .cc, madc). If you use only 32-bit multiplication instructions at the PTX level and compile for the Kepler architecture (compute capability 3.x), you should see much more reasonable results. I have only tried that up to 512 bits, though.
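To give a concrete picture of what those building blocks compute, here is a plain-C sketch (my own illustration, not PTXAS output or the original CUDA code) of a 64x64-to-128-bit multiply built from four 32x32-to-64-bit partial products — the same work that mul.hi.u32 / mad.lo.cc.u32 / madc.hi.u32 perform natively on Kepler, and that must be pieced together from smaller multiplies on Maxwell/Pascal:

```c
#include <assert.h>
#include <stdint.h>

/* 64x64 -> 128-bit unsigned multiply from four 32x32 -> 64-bit
   partial products; *hi receives the upper 64 bits, the return
   value is the lower 64 bits. */
static uint64_t mul64wide(uint64_t a, uint64_t b, uint64_t *hi)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* contributes to bits   0..63  */
    uint64_t p1 = a_lo * b_hi;   /* contributes to bits  32..95  */
    uint64_t p2 = a_hi * b_lo;   /* contributes to bits  32..95  */
    uint64_t p3 = a_hi * b_hi;   /* contributes to bits  64..127 */

    /* accumulate the middle column; the sum fits in 64 bits because
       each addend is at most 2^32 - 1 */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return (mid << 32) | (uint32_t)p0;
}
```

On Kepler the carry propagation between columns is what the .cc/madc forms handle in hardware; on Maxwell/Pascal each of the four 32-bit partial products is itself further decomposed, which is where the instruction counts explode.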

It is also important to arrange the 32x32-bit partial multiplications that make up the multi-precision multiply in a scheme that minimizes live registers. A good example of an efficient arrangement (albeit only for 128 bits) can be seen in an old StackOverflow post of mine (http://stackoverflow.com/a/6220499/780717). This kind of arrangement is regular, and the code can be generated by a script or program for any size that will fit the register file; unfortunately I don’t have such a script available.
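As a rough host-side illustration (plain C, not the inline PTX from the linked post), the low 128 bits of a 128x128-bit product can be computed column by column over 32-bit limbs, so each partial product is folded into the running sum as soon as it is generated and only the accumulator and carry stay live:

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t x, y, z, w; } u128;  /* x = least significant limb */

/* Low 128 bits of a 128x128-bit unsigned product from 32x32 -> 64-bit
   partial products, processed column by column (least significant
   column first). */
static u128 mul_u128_lo(u128 a, u128 b)
{
    const uint32_t al[4] = { a.x, a.y, a.z, a.w };
    const uint32_t bl[4] = { b.x, b.y, b.z, b.w };
    uint32_t r[4];
    uint64_t acc = 0;                       /* running column sum */

    for (int col = 0; col < 4; col++) {
        uint64_t carry = 0;                 /* overflow out of acc */
        for (int i = 0; i <= col; i++) {
            uint64_t p = (uint64_t)al[i] * bl[col - i];
            acc += p;
            if (acc < p) carry++;           /* 64-bit wraparound */
        }
        r[col] = (uint32_t)acc;             /* emit one result limb */
        acc = (acc >> 32) + (carry << 32);  /* carry into next column */
    }
    return (u128){ r[0], r[1], r[2], r[3] };
}
```

The column-wise (product-scanning) order shown here is one way to keep the live set small; the arrangement in the StackOverflow post achieves the same goal with a fixed sequence of mad/madc instructions.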

[Later:] For illustration purposes, I built the code from my StackOverflow posting with CUDA 8.0, compiling for sm_35 and sm_50. The code for sm_50 is about 5.5 times as long, which makes sense because emulating everything with 16x16-bit multiplies (XMAD) requires four times as many basic multiplies as using 32x32-bit multiplies, plus overhead from additional additions. Despite the jumble of emulation code, register usage in the Maxwell code looks fine, just incrementally higher than for the Kepler code.
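To see where the factor of roughly four comes from: XMAD is essentially a 16x16-bit multiply with a 32-bit add, so each 32x32-bit multiply must be rebuilt from 16-bit halves. A hedged plain-C sketch of the decomposition (my illustration of the arithmetic, not the literal XMAD sequence PTXAS emits):

```c
#include <assert.h>
#include <stdint.h>

/* Low 32 bits of a 32x32-bit product using only 16x16 -> 32-bit
   multiplies: three multiplies suffice for the low word; the fourth
   (a_hi * b_hi) and the cross-term carries are needed only for the
   high word, which is why a full 32x32 -> 64 product costs four. */
static uint32_t mul32_lo_from16(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFF, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFF, b_hi = b >> 16;

    uint32_t lo  = a_lo * b_lo;                /* bits  0..31 */
    uint32_t mid = a_lo * b_hi + a_hi * b_lo;  /* bits 16..47 */
    return lo + (mid << 16);                   /* wraps mod 2^32 */
}
```

Multiply that instruction count across every 32x32-bit partial product of a multi-precision routine, plus the extra additions to stitch the halves together, and the 5.5x code growth for sm_50 is unsurprising.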

Really high-performance multi-precision multiplication code would have to be written in Maxwell or Pascal machine language; however, NVIDIA does not make tools for this available publicly. You might want to study this presentation and paper by NVIDIA engineers and others about modular multiplication on Maxwell: http://arith23.gforge.inria.fr/slides/Emmart_Luitjens_Weems_Woolley.pptx, https://www.researchgate.net/profile/Niall_Emmart/publication/307941360_Optimizing_Modular_Multiplication_for_NVIDIA’s_Maxwell_GPUs/links/580f1e5f08aef766ef11c3a0.pdf

```
Function : _Z11mul_uint1285uint4S_
.headerflags @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
/*0008*/ IMUL.U32.U32.HI R0, R4, R8;
/*0010*/ IMAD.U32.U32 R3.CC, R4, R9, R0;
/*0018*/ IMAD.U32.U32.HI.X R12, R4, R9, RZ;
/*0020*/ IMAD.U32.U32 R0.CC, R5, R8, R3;
/*0028*/ IMAD.U32.U32.HI.X R3.CC, R5, R8, R12;
/*0030*/ IMAD.U32.U32.HI.X R14, R4, R10, RZ;
/*0038*/ IMAD.U32.U32 R12.CC, R4, R10, R3;
/*0048*/ IMAD.U32.U32.HI.X R13, R5, R9, R14;
/*0050*/ IMAD.U32.U32 R3.CC, R5, R9, R12;
/*0058*/ IMAD.U32.U32.HI.X R13, R6, R8, R13;
/*0060*/ IMAD.U32.U32 R3.CC, R6, R8, R3;
/*0068*/ IMAD.U32.U32.X R11, R4, R11, R13;
/*0070*/ IMAD.U32.U32 R5, R5, R10, R11;
/*0078*/ IMAD.U32.U32 R9, R6, R9, R5;
/*0088*/ MOV R5, R0;
/*0090*/ MOV R6, R3;
/*0098*/ IMUL.U32.U32 R4, R4, R8;
/*00a0*/ IMAD.U32.U32 R7, R7, R8, R9;
/*00a8*/ RET;
/*00b0*/ NOP;
/*00b8*/ NOP;
/*00c0*/ BRA 0xc0;
Function : _Z11mul_uint1285uint4S_
.headerflags @"EF_CUDA_SM50 EF_CUDA_PTX_SM(EF_CUDA_SM50)"
/*0008*/ IADD32I R1, R1, -0x10;
/*0010*/ STL [R1+0xc], R18;
/*0018*/ STL [R1+0x8], R17;
/*0028*/ STL [R1+0x4], R16;
/*0030*/ { MOV R3, R9;
/*0038*/ STL [R1], R2; }
/*0048*/ MOV R0, R8;
/*0050*/ MOV R8, R6;
/*0058*/ MOV R9, R5;
/*0068*/ MOV R6, R4;
/*0070*/ XMAD R5, R6.reuse, R0.reuse, RZ;
/*0078*/ XMAD.MRG R4, R6.reuse, R0.H1.reuse, RZ;
/*0088*/ XMAD R12, R6.reuse, R0.H1.reuse, RZ;
/*0090*/ XMAD R13, R6.H1.reuse, R0.H1, RZ;
/*0098*/ XMAD R15, R6, R3, RZ;
/*00a8*/ XMAD.MRG R16, R6.reuse, R3.H1.reuse, RZ;
/*00b0*/ XMAD.CHI R14, R6.H1.reuse, R0, R5.reuse;
/*00b8*/ XMAD.PSL.CBCC R4, R6.H1.reuse, R4.H1, R5;
/*00c8*/ XMAD R17, R6.reuse, R3.H1.reuse, RZ;
/*00d0*/ XMAD R18, R6.H1.reuse, R3.H1, RZ;
/*00d8*/ XMAD.PSL.CBCC R2, R6.H1, R16.H1, R15;
/*00e8*/ IADD3.RS R5, R14, R12, R13;
/*00f0*/ XMAD.CHI R15, R6.H1, R3, R15;
/*00f8*/ XMAD R14, R9.reuse, R0.reuse, RZ;
/*0108*/ XMAD.MRG R12, R9.reuse, R0.H1.reuse, RZ;
/*0110*/ XMAD R13, R9, R0.H1, RZ;
/*0118*/ IADD R5.CC, R5, R2;
/*0128*/ { IADD3.RS R17, R15, R17, R18;
/*0130*/ LDL R18, [R1+0xc]; }
/*0138*/ XMAD.CHI R2, R9.H1.reuse, R0.reuse, R14.reuse;
/*0148*/ XMAD R15, R9.H1.reuse, R0.H1, RZ;
/*0150*/ XMAD.PSL.CBCC R14, R9.H1.reuse, R12.H1, R14;
/*0158*/ XMAD R16, R9, R3, RZ;
/*0168*/ IADD.X R12, RZ, R17;
/*0170*/ IADD3.RS R13, R2, R13, R15;
/*0178*/ IADD R5.CC, R5, R14;
/*0188*/ XMAD R14, R9.reuse, R3.H1.reuse, RZ;
/*0190*/ XMAD R15, R9.H1.reuse, R3.H1.reuse, RZ;
/*0198*/ XMAD.CHI R17, R9.H1, R3, R16;
/*01a8*/ IADD3.RS R2, R17, R14, R15;
/*01b0*/ XMAD.MRG R14, R9.reuse, R3.H1, RZ;
/*01b8*/ XMAD R15, R6.reuse, R10.reuse, RZ;
/*01c8*/ IADD.X R12.CC, R12, R13;
/*01d0*/ XMAD R13, R6.reuse, R10.H1.reuse, RZ;
/*01d8*/ XMAD.PSL.CBCC R17, R9.H1, R14.H1, R16;
/*01e8*/ XMAD R14, R6.H1, R10.H1, RZ;
/*01f0*/ XMAD.CHI R16, R6.H1.reuse, R10.reuse, R15;
/*01f8*/ IADD3.RS R16, R16, R13, R14;
/*0208*/ XMAD.MRG R13, R6, R10.H1, RZ;
/*0210*/ XMAD.PSL.CBCC R14, R6.H1, R13.H1, R15;
/*0218*/ IADD.X R13, RZ, R16;
/*0228*/ IADD R12.CC, R12, R14;
/*0230*/ XMAD R14, R8, R0, RZ;
/*0238*/ XMAD R15, R8.reuse, R0.H1.reuse, RZ;
/*0248*/ IADD.X R13, R13, R2;
/*0250*/ XMAD R2, R8.H1.reuse, R0.H1.reuse, RZ;
/*0258*/ XMAD.CHI R16, R8.H1.reuse, R0.reuse, R14;
/*0268*/ IADD R12.CC, R12, R17;
/*0270*/ XMAD.MRG R17, R8, R0.H1, RZ;
/*0278*/ IADD3.RS R15, R16, R15, R2;
/*0288*/ XMAD R16, R6.reuse, R11.reuse, RZ;
/*0290*/ XMAD.MRG R11, R6.reuse, R11.H1, RZ;
/*0298*/ { XMAD.PSL.CBCC R14, R8.H1, R17.H1, R14;
/*02a8*/ LDL R17, [R1+0x8]; }
/*02b0*/ { XMAD.PSL.CBCC R2, R6.H1, R11.H1, R16;
/*02b8*/ LDL R16, [R1+0x4]; }
/*02c8*/ IADD.X R11, R13, R15;
/*02d0*/ IADD R6.CC, R12, R14;
/*02d8*/ XMAD.MRG R12, R9.reuse, R10.H1.reuse, RZ;
/*02e8*/ { IADD.X R11, R11, R2;
/*02f0*/ LDL R2, [R1]; }
/*02f8*/ XMAD R10, R9.reuse, R10, R11;
/*0308*/ XMAD.PSL.CBCC R9, R9.H1, R12.H1, R10;
/*0310*/ XMAD.MRG R10, R8, R3.H1, RZ;
/*0318*/ XMAD R3, R8.reuse, R3, R9;
/*0328*/ XMAD.PSL.CBCC R9, R8.H1, R10.H1, R3;
/*0330*/ XMAD.MRG R10, R7.reuse, R0.H1.reuse, RZ;
/*0338*/ XMAD R0, R7, R0, R9;
/*0348*/ IADD32I R1, R1, 0x10;
/*0350*/ { XMAD.PSL.CBCC R7, R7.H1, R10.H1, R0;
/*0358*/ RET; }
/*0368*/ BRA 0x360;
```