giving a small kernel as below:
__global__ void test(uint4* input) { uint4 x = input[threadIdx.x]; x.x += 1; x.y += 2; x.z += 3; x.w += 4; input[threadIdx.x] = x; }
the SASS code would be:
code for sm_86 Function : _Z4testP5uint4 .headerflags @"EF_CUDA_SM86 EF_CUDA_PTX_SM(EF_CUDA_SM86)" /*0000*/ IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ; /* 0x00000a00ff017624 */ /* 0x000fce00078e00ff */ /*0010*/ S2R R2, SR_TID.X ; /* 0x0000000000027919 */ /* 0x000e220000002100 */ /*0020*/ IMAD.MOV.U32 R3, RZ, RZ, 0x10 ; /* 0x00000010ff037424 */ /* 0x000fe200078e00ff */ /*0030*/ ULDC.64 UR4, c[0x0][0x118] ; /* 0x0000460000047ab9 */ /* 0x000fc60000000a00 */ /*0040*/ IMAD.WIDE.U32 R2, R2, R3, c[0x0][0x160] ; /* 0x0000580002027625 */ /* 0x001fca00078e0003 */ /*0050*/ LDG.E.128 R4, [R2.64] ; /* 0x0000000402047981 */ /* 0x000ea4000c1e1d00 */ /*0060*/ IADD3 R7, R7, 0x4, RZ ; /* 0x0000000407077810 */ /* 0x004fe40007ffe0ff */ /*0070*/ IADD3 R6, R6, 0x3, RZ ; /* 0x0000000306067810 */ /* 0x000fe40007ffe0ff */ /*0080*/ IADD3 R5, R5, 0x2, RZ ; /* 0x0000000205057810 */ /* 0x000fe40007ffe0ff */ /*0090*/ IADD3 R4, R4, 0x1, RZ ; /* 0x0000000104047810 */ /* 0x000fca0007ffe0ff */ /*00a0*/ STG.E.128 [R2.64], R4 ; /* 0x0000000402007986 */ /* 0x000fe2000c101d04 */ /*00b0*/ EXIT ; /* 0x000000000000794d */
base on this line LDG.E.128 R4, [R2.64]
and the context, we can guess that the data are loaded into r4,r5,r6,r7
.
Is it true in all scenarios that LDG.E.128 RX, [ADDRESS]
will store data into RX, R(X+1), R(X+2), R(X+3)
? if yes, is it the same for LDG.E.64
? If “it depends” on sm_, then is there any documentation about it?
Thanks.