CUDA low-level programming - strange ptxas behavior

Hello everybody.

I’ve recently come across something very strange when I disassembled, with cuobjdump, two different short *.ptx snippets that should, in theory, occupy the same number of registers. Let’s consider the following piece of code, responsible for adding two vectors of arbitrary length:

  • addKernel.ptx
  • .version 1.4
    .target sm_11
    
    .tex .u32 tex1;
    .tex .u32 tex2;
    
    .entry addKernel (
        .param .b32 size,
        .param .b32 devOutArr1
    ) {
        .reg .pred p0;
        .reg .b32 r1;
        .reg .b32 r2;
        .reg .b32 r3;
        .reg .b32 r4;
    $LDWbegin_addKernel:
        mov.b32 r1, %ctaid.x;
        mov.b32 r2, %ntid.x;
        mul24.lo.s32 r1, r1, r2;
        mov.b32 r2, %tid.x;
        add.s32 r1, r1, r2;
        mov.b32 r4, r1;
        ld.param.s32 r2, [size];
        setp.lt.s32 p0, r1, r2;
        @p0 tex.1d.v4.s32.s32 { r2, r3, r3, r3 }, [tex1, { r4 }];
        /*
            @p0 mov.f32 r3, r2;
            @p0 add.f32 r2, r2, r3;
            @p0 mov.f32 r3, r2;
            @p0 add.f32 r2, r2, r3;
            @p0 mov.f32 r3, r2;
            @p0 add.f32 r2, r2, r3;
        */
        @p0 tex.1d.v4.s32.s32 { r3, r4, r4, r4 }, [tex2, { r4 }];
        @p0 add.f32 r2, r2, r3;
        @p0 ld.param.b32 r3, [devOutArr1];
        @p0 shl.b32 r1, r1, 2;
        @p0 add.u32 r1, r1, r3;
        @p0 st.global.b32 [r1], r2;
        exit;
    $LDWend_addKernel:
    }
    
  • addKernel.sass
  • code for sm_11
    		Function : addKernel
    	.headerflags    @"EF_CUDA_SM10 EF_CUDA_PTX_SM(EF_CUDA_SM10)"
            /*0000*/        I2I.U32.U16 R1, g [0x6].U16;               /* 0x04200780a0004c05 */
            /*0008*/        I2I.U32.U16 R3, g [0x1].U16;               /* 0x04200780a000420d */
            /*0010*/        I2I.U32.U16 R0, R0L;                       /* 0x04000780a0000001 */
            /*0018*/        IMAD.S24 R4, R1, R3, R0;                   /* 0x8000078060030211 */
            /*0020*/        ISET.S32.C0 R3, g [0x4], R4, GT;           /* 0x6c2107c03004c80d */
            /*0028*/        MOV32 R1, R4;                              /* 0x10008804         */
            /*002c*/        MOV32 R0, R4;                              /* 0x10008800         */
            /*0030*/        I2I.S32.S32.C1 o[0x7f], R3;                /* 0x0c0147d8a00007fd */
            /*0038*/        TEX.UN.NODEP R1, 0x1, 0x1, 0x0, RXXX, 0x0; /* 0x00000784f3020205 */
            /*0040*/        TEX.UN.NODEP R0, 0x0, 0x0, 0x0, RXXX, 0x0; /* 0x00000784f3000001 */
            /*0048*/        MOV R1 (C1.EQU), R2;                       /* 0x0403d50010000405 */
            /*0050*/        G2R.U32 R3, g [0x4].U32;                   /* 0x4400c7801000080d */
            /*0058*/        G2R.U32 R0 (C1.EQU), g [0x4].U32;          /* 0x4400d50010000801 */
            /*0060*/        FADD R3 (C1.NE), R0, R1;                   /* 0x00005280b000000d */
            /*0068*/        MOV32 R2, R1;                              /* 0x10008208         */
            /*006c*/        MOV32 R0, R4;                              /* 0x10008800         */
            /*0070*/        G2R.U32 R2 (C1.NEU), g [0x5].U32;          /* 0x4400d68010000a09 */
            /*0078*/        MOV R1, R0;                                /* 0x0403c78010000005 */
            /*0080*/        SHL R1 (C0.NE), R0, 0x2;                   /* 0xc410028030020005 */
            /*0088*/        IADD R0 (C0.NE), R1, R2;                   /* 0x0400828020000201 */
            /*0090*/        RET C0.EQU;                                /* 0x0000050030000003 */
            /*0098*/        GST.U32 global14[R0], R3;                  /* 0xa0c00781d00e000d */
    		..........................
    

    As expected, the generated *.sass code uses only 5 registers. Now, for a change, let’s modify the addKernel.ptx snippet a bit by uncommenting the commented lines. Despite the fact that the modified version of the code should easily fit into 5 registers, ptxas during “optimization” generates something that uses exactly 7:

  • SASS code for modified addKernel.ptx
  • code for sm_11
    		Function : addKernel
    	.headerflags    @"EF_CUDA_SM10 EF_CUDA_PTX_SM(EF_CUDA_SM10)"
            /*0000*/        I2I.U32.U16 R1, g [0x6].U16;               /* 0x04200780a0004c05 */
            /*0008*/        I2I.U32.U16 R3, g [0x1].U16;               /* 0x04200780a000420d */
            /*0010*/        I2I.U32.U16 R0, R0L;                       /* 0x04000780a0000001 */
            /*0018*/        IMAD.S24 R6, R1, R3, R0;                   /* 0x8000078060030219 */
            /*0020*/        ISET.S32.C0 R0, g [0x4], R6, GT;           /* 0x6c2107c03006c801 */
            /*0028*/        MOV R1, R6;                                /* 0x0403c78010000c05 */
            /*0030*/        I2I.S32.S32.C1 o[0x7f], R0;                /* 0x0c0147d8a00001fd */
            /*0038*/        MOV32 R5, R2;                              /* 0x10008414         */
            /*003c*/        MOV32 R0, R6;                              /* 0x10008c00         */
            /*0040*/        TEX.UN.NODEP R1, 0x0, 0x0, 0x0, XXXA, 0x0; /* 0x00008784f1000005 */
            /*0048*/        TEX.UN.NODEP R0, 0x1, 0x1, 0x0, RXXX, 0x0; /* 0x00000784f3020201 */
            /*0050*/        G2R.U32 R4, g [0x4].U32;                   /* 0x4400c78010000811 */
            /*0058*/        MOV R5 (C1.NE), R1;                        /* 0x0403d28010000215 */
            /*0060*/        MOV R4 (C1.NE), R1;                        /* 0x0403d28010000211 */
            /*0068*/        G2R.U32 R3, g [0x4].U32;                   /* 0x4400c7801000080d */
            /*0070*/        FADD R3 (C1.NE), R4, R5;                   /* 0x00015280b000080d */
            /*0078*/        MOV R0 (C1.EQU), R2;                       /* 0x0403d50010000401 */
            /*0080*/        G2R.U32 R1, g [0x4].U32;                   /* 0x4400c78010000805 */
            /*0088*/        FADD R1 (C1.NE), R3, R0;                   /* 0x00001280b0000605 */
            /*0090*/        MOV32 R2, R0;                              /* 0x10008008         */
            /*0094*/        MOV32 R0, R6;                              /* 0x10008c00         */
            /*0098*/        G2R.U32 R2 (C1.NEU), g [0x5].U32;          /* 0x4400d68010000a09 */
            /*00a0*/        MOV R3, R0;                                /* 0x0403c7801000000d */
            /*00a8*/        SHL R3 (C0.NE), R0, 0x2;                   /* 0xc41002803002000d */
            /*00b0*/        IADD R0 (C0.NE), R3, R2;                   /* 0x0400828020000601 */
            /*00b8*/        RET C0.EQU;                                /* 0x0000050030000003 */
            /*00c0*/        GST.U32 global14[R0], R1;                  /* 0xa0c00781d00e0005 */
    		..........................
    

    I tested the above-mentioned pieces of code with the newest CUDA drivers and the NVIDIA Nsight version applicable to my 9600M video card. It looks like there is no point in optimizing code by rewriting the entire kernel in the CUDA PTX ISA, since ptxas can spoil even an optimal one (in the sense of the number of registers used).

    As far as I know, one could load the generated *.cubin module directly via the Driver API using the

    CUresult cuModuleLoadData (CUmodule *module, const void *image)
    

    function, but the question now is: are there any tools that can convert *.ptx code to *.cubin LITERALLY (i.e. without the redundant “optimization”), or is there maybe a way to prevent ptxas from optimizing the code? What about a SASS instruction set reference with a detailed description of instruction formats and encodings (something like the Intel Architecture Software Developer’s Manual)? As a last resort, creating the *.cubin with a hex editor may also be an acceptable solution.
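    For what it’s worth, the Driver API side of that plan is straightforward. The sketch below shows roughly what loading a hand-built *.cubin looks like; error checking is omitted for brevity, the file name is the one from this thread, and it obviously only runs on a machine with a CUDA-capable device and driver:

```c
#include <cuda.h>    /* CUDA Driver API */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* Slurp the cubin image into memory. */
    FILE *f = fopen("addKernel.cubin", "rb");
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);
    char *image = malloc(size + 1);
    fread(image, 1, size, f);
    image[size] = '\0';
    fclose(f);

    /* Load the raw image and look up the kernel by its .entry name. */
    cuModuleLoadData(&mod, image);
    cuModuleGetFunction(&fn, mod, "addKernel");

    /* ... set parameters and launch (cuParamSet* + cuLaunchGrid on
       toolkits of this generation), then clean up ... */
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    free(image);
    return 0;
}
```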

    Btw. I care so much about optimization issues because I’m developing a realtime CUDA ray tracer, and every saved cycle is worth its weight in gold.

    Thanks in advance.

    I’m pretty sure there is no pressure on ptxas to assign fewer registers than: number of registers per multiprocessor / max resident threads per multiprocessor. That number has gone from 16 to ~20 to 32 over the last three major compute revisions.

    If you were using “launch bounds” in your CUDA C code, or “performance-tuning directives” in your PTX, and there were a tradeoff between meeting your directive and performance, then ptxas would probably produce an allocation closer to what you were expecting.
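    In PTX those performance-tuning directives go on the entry itself, between the parameter list and the body. A sketch for the kernel in this thread, using the .maxntid and .maxnreg directives from the PTX ISA manual (the 256-thread cap is an assumed example value):

```
.entry addKernel (
    .param .b32 size,
    .param .b32 devOutArr1
)
.maxntid 256, 1, 1   // at most 256 threads per CTA
.maxnreg 5           // ask ptxas to keep the kernel within 5 registers
{
    // ... body as before ...
}
```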

    There are reasons for ptxas to use more registers than we would expect. I’m sure there are plenty of architectural details that we are not aware of. Some of them have come up in the CUDA forums over the last few years.

    Thanks for your answer. Maybe it’s not a general rule, but I noticed that when I compiled my modified version of addKernel.ptx with

    ptxas -arch=sm_11 -m32 -maxrregcount=5 "addKernel.ptx" -o "addKernel.cubin"
    

    and dumped the generated *.cubin to *.sass code, the code used exactly 5 registers, as I expected. Moreover, there weren’t any extra loads / stores from / to local memory. As I mentioned above, it’s obviously not a cure-all, but following these steps (i.e. using as few temporary registers in the *.ptx as possible and then compiling with -maxrregcount set to that number) may considerably reduce the register overhead.
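    The round trip I used can be scripted in two commands, assuming the toolkit’s ptxas and cuobjdump are on the PATH:

```
# Assemble the PTX with a hard register cap, then disassemble to verify.
ptxas -arch=sm_11 -m32 -maxrregcount=5 addKernel.ptx -o addKernel.cubin
cuobjdump --dump-sass addKernel.cubin
```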

    Btw. I’ve just found something that turned out to be useful to a certain extent: http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#gt200