Determining registers holding the data after executing LDG.E.128

giving a small kernel as below:

__global__ void test(uint4* input)
{
    uint4 x = input[threadIdx.x];
    x.x += 1;
    x.y += 2;
    x.z += 3;
    x.w += 4;
    input[threadIdx.x] = x;
}

the SASS code would be:

	code for sm_86
		Function : _Z4testP5uint4
	.headerflags    @"EF_CUDA_SM86 EF_CUDA_PTX_SM(EF_CUDA_SM86)"
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;    /* 0x00000a00ff017624 */
                                                                              /* 0x000fce00078e00ff */
        /*0010*/                   S2R R2, SR_TID.X ;                         /* 0x0000000000027919 */
                                                                              /* 0x000e220000002100 */
        /*0020*/                   IMAD.MOV.U32 R3, RZ, RZ, 0x10 ;            /* 0x00000010ff037424 */
                                                                              /* 0x000fe200078e00ff */
        /*0030*/                   ULDC.64 UR4, c[0x0][0x118] ;               /* 0x0000460000047ab9 */
                                                                              /* 0x000fc60000000a00 */
        /*0040*/                   IMAD.WIDE.U32 R2, R2, R3, c[0x0][0x160] ;  /* 0x0000580002027625 */
                                                                              /* 0x001fca00078e0003 */
        /*0050*/                   LDG.E.128 R4, [R2.64] ;                    /* 0x0000000402047981 */
                                                                              /* 0x000ea4000c1e1d00 */
        /*0060*/                   IADD3 R7, R7, 0x4, RZ ;                    /* 0x0000000407077810 */
                                                                              /* 0x004fe40007ffe0ff */
        /*0070*/                   IADD3 R6, R6, 0x3, RZ ;                    /* 0x0000000306067810 */
                                                                              /* 0x000fe40007ffe0ff */
        /*0080*/                   IADD3 R5, R5, 0x2, RZ ;                    /* 0x0000000205057810 */
                                                                              /* 0x000fe40007ffe0ff */
        /*0090*/                   IADD3 R4, R4, 0x1, RZ ;                    /* 0x0000000104047810 */
                                                                              /* 0x000fca0007ffe0ff */
        /*00a0*/                   STG.E.128 [R2.64], R4 ;                    /* 0x0000000402007986 */
                                                                              /* 0x000fe2000c101d04 */
        /*00b0*/                   EXIT ;                                     /* 0x000000000000794d */

base on this line LDG.E.128 R4, [R2.64] and the context, we can guess that the data are loaded into r4,r5,r6,r7.
Is it true in all scenarios that LDG.E.128 RX, [ADDRESS] will store data into RX, R(X+1), R(X+2), R(X+3)? if yes, is it the same for LDG.E.64 ? If “it depends” on sm_, then is there any documentation about it?
Thanks.

AFAIK, SASS is not documented to this level of description. However I believe your understanding is correct and AFAIK there are no architectural differences. A 128-bit load/store uses 4 registers beginning at the one indicated. A 64-bit load/store uses 2 registers beginning at the one indicated.

Thanks for the confirmation.
Apparently registers space is also “aligned”, with the instruction LDG.E.128 RX, X is always divisible by 4.

That has always been my observation, when I have studied SASS with 128-bit loads/stores.

Since register operands of double-precision operations must use an even-odd register pair, i.e. register number shown by SASS is divisible by 2, this seems entirely consistent with wide load operations. I have never observed any differences in this “natural register alignment” behavior across all existing GPU architectures.

1 Like

@njuffa @Robert_Crovella not relevant to the topic, but do you know any IDE or plugins that support writing a kernel in PTX (error highlights, code suggestion, register count, etc)?
I’m optimizing some big complex kernels that use more than 150 registers and lots of predicates. I just use sublime (a plain text editor) for now. Mistakes were made, and it seems hard to debug, takes more time than I expected.