Determining registers holding the data after executing LDG.E.128

lqhuy93 · January 4, 2021, 4:11pm

giving a small kernel as below:

__global__ void test(uint4* input)
{
    uint4 x = input[threadIdx.x];
    x.x += 1;
    x.y += 2;
    x.z += 3;
    x.w += 4;
    input[threadIdx.x] = x;
}

the SASS code would be:

	code for sm_86
		Function : _Z4testP5uint4
	.headerflags    @"EF_CUDA_SM86 EF_CUDA_PTX_SM(EF_CUDA_SM86)"
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;    /* 0x00000a00ff017624 */
                                                                              /* 0x000fce00078e00ff */
        /*0010*/                   S2R R2, SR_TID.X ;                         /* 0x0000000000027919 */
                                                                              /* 0x000e220000002100 */
        /*0020*/                   IMAD.MOV.U32 R3, RZ, RZ, 0x10 ;            /* 0x00000010ff037424 */
                                                                              /* 0x000fe200078e00ff */
        /*0030*/                   ULDC.64 UR4, c[0x0][0x118] ;               /* 0x0000460000047ab9 */
                                                                              /* 0x000fc60000000a00 */
        /*0040*/                   IMAD.WIDE.U32 R2, R2, R3, c[0x0][0x160] ;  /* 0x0000580002027625 */
                                                                              /* 0x001fca00078e0003 */
        /*0050*/                   LDG.E.128 R4, [R2.64] ;                    /* 0x0000000402047981 */
                                                                              /* 0x000ea4000c1e1d00 */
        /*0060*/                   IADD3 R7, R7, 0x4, RZ ;                    /* 0x0000000407077810 */
                                                                              /* 0x004fe40007ffe0ff */
        /*0070*/                   IADD3 R6, R6, 0x3, RZ ;                    /* 0x0000000306067810 */
                                                                              /* 0x000fe40007ffe0ff */
        /*0080*/                   IADD3 R5, R5, 0x2, RZ ;                    /* 0x0000000205057810 */
                                                                              /* 0x000fe40007ffe0ff */
        /*0090*/                   IADD3 R4, R4, 0x1, RZ ;                    /* 0x0000000104047810 */
                                                                              /* 0x000fca0007ffe0ff */
        /*00a0*/                   STG.E.128 [R2.64], R4 ;                    /* 0x0000000402007986 */
                                                                              /* 0x000fe2000c101d04 */
        /*00b0*/                   EXIT ;                                     /* 0x000000000000794d */

base on this line LDG.E.128 R4, [R2.64] and the context, we can guess that the data are loaded into r4,r5,r6,r7.
Is it true in all scenarios that LDG.E.128 RX, [ADDRESS] will store data into RX, R(X+1), R(X+2), R(X+3)? if yes, is it the same for LDG.E.64 ? If “it depends” on sm_, then is there any documentation about it?
Thanks.

Robert_Crovella · January 4, 2021, 5:07pm

AFAIK, SASS is not documented to this level of description. However I believe your understanding is correct and AFAIK there are no architectural differences. A 128-bit load/store uses 4 registers beginning at the one indicated. A 64-bit load/store uses 2 registers beginning at the one indicated.

lqhuy93 · January 4, 2021, 5:16pm

Thanks for the confirmation.
Apparently registers space is also “aligned”, with the instruction LDG.E.128 RX, X is always divisible by 4.

Robert_Crovella · January 4, 2021, 5:27pm

That has always been my observation, when I have studied SASS with 128-bit loads/stores.

njuffa · January 5, 2021, 1:11am

Since register operands of double-precision operations must use an even-odd register pair, i.e. register number shown by SASS is divisible by 2, this seems entirely consistent with wide load operations. I have never observed any differences in this “natural register alignment” behavior across all existing GPU architectures.

lqhuy93 · January 5, 2021, 10:34am

@njuffa @Robert_Crovella not relevant to the topic, but do you know any IDE or plugins that support writing a kernel in PTX (error highlights, code suggestion, register count, etc)?
I’m optimizing some big complex kernels that use more than 150 registers and lots of predicates. I just use sublime (a plain text editor) for now. Mistakes were made, and it seems hard to debug, takes more time than I expected.

Topic		Replies	Views
Understanding LDSM.16.M88.4 CUDA Programming and Performance	6	856	June 8, 2024
SASS, LDS.128, LD.128 and DRAM allocation CUDA Programming and Performance	7	4106	June 23, 2016
About LD instruction for wmma CUDA Programming and Performance	2	604	July 5, 2023
[Solved]SASS Code Analysis CUDA Programming and Performance	5	8660	November 30, 2017
How to use the LDS.128 CUDA Programming and Performance	1	1722	December 27, 2012
Unexpected LDG operations CUDA Programming and Performance	4	574	July 23, 2020
Maxwell (sm_50) instruction: LDG.E ? CUDA Programming and Performance	25	9188	August 15, 2015
How to write inline asm function sts128 (store 128 bits to shared memory)? CUDA Programming and Performance cuda	14	1134	November 10, 2023
LDS.128 loads from shared memory CUDA Programming and Performance	3	757	September 11, 2023
Improper use of __ldg() causes illegal memory access CUDA Programming and Performance	9	2250	January 31, 2015

Determining registers holding the data after executing LDG.E.128

Related topics