How to understand the LEA assembly behind the cuda c++?

I got the assembly code with the help of nsight compute,
and no problem understanding instructions like MOV, S2R and IMAD, except for the LEA.

I tried to google the cuda docs for the LEA, but only x86 LEA was there.

See the code

// The mat is a flatten 1d array storing 2d data with row = 64 and col = 32.

__global__ void 
loadmatrix(float* mat)
{
    __shared__ float sharedmat[32*33];

    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    int idy = blockIdx.y*blockDim.y + threadIdx.y;

    sharedmat[threadIdx.y * 33 + threadIdx.x] = mat[idy*32 + idx];

    __syncthreads();

    // two matrixes (32 x 32) do their own transpose
    mat[idy*32 + idx] = sharedmat[threadIdx.x * 33 + threadIdx.y];
}

The assembly code is:

00007fbe c4f9b700     MOV R1 c[0x0][0x28]
00007fbe c4f9b710     S2R R0 SR_CTAID.X
00007fbe c4f9b720     MOV R2 0x4
00007fbe c4f9b730     S2R R7 SR_TID.X
00007fbe c4f9b740     S2R R3 SR_CTAID.Y
00007fbe c4f9b750     S2R R4 SR_TID.
00007fbe c4f9b760     IMAD R0 R0 c[0x0][0x0] R7
00007fbe c4f9b770     IMAD R3 R3 c[0x0][0x4] R4
00007fbe c4f9b780     LEA R3 R3 R0 0x5
00007fbe c4f9b790     IMAD.WIDE R2 R3 R2 c[0x0][0x160]
00007fbe c4f9b7a0     LDG.E.SYS R0 [R2]
00007fbe c4f9b7b0     IMAD R5 R4 0x21 R7
00007fbe c4f9b7c0     IMAD R4 R7 0x21 R4
00007fbe c4f9b7d0     STS [R5.X4] R0
00007fbe c4f9b7e0     BAR.SYNC 0x0
00007fbe c4f9b7f0     LDS.U R7 [R4.X4]
00007fbe c4f9b800     STG.E.SYS [R2] R7
00007fbe c4f9b810     EXIT 1
00007fbe c4f9b820     BRA 0x7fbec4f9b820

I understand that two-dimensional array c may save some blockDim and threadIdx.

The registers R0 and R3 map to the global index, while the R4 and R7 maintain the local index.

The question is, what is the point of:

LEA R3, R3, R0, 0x5

I guess this line calculates the offset in the global memory by the global x and y index.

R3 = (some calculations) * 5

The c[0x0][0x160] may be the global memory entry, updated R3 is an offset, and R3 x 0x4 means the bytes for float.

But the LEA differs from the one in x86, which copies the memory value to some registers.

Would you mind telling me how the R0, R3 and 0x5 co-work for an offset?

Thank you so much!

NVIDIA does not document the SASS (machine code) instructions in detail. The official documentation simply states:

LEA Compute Effective Address

By observation, it is a shift-then-add type of instruction, with a barrel shifter first acting on the input and the result from the barrel shifter being fed to an adder. It is most frequently used to compute 64-bit addresses into an aligned register pair for 64-bit addressing (keep in mind that GPUs use a 32-bit architecture with 64-bit addressing extension). But the GPU’s LEA, just like x86’s LEA, also has utility outside of address computations as a limited, but more efficient, alternative to IMAD, and the CUDA compiler “knows” how to use it as such.

The 0x5 in your example LEA shown above is the shift count for the left shift. You will need to look at more diverse instances of LEA to completely reverse engineer its functionality. I went through this exercise once, for Turing, but do not recollect the results in detail. I would not be surprised if there are small differences between the details of LEA between Turing / Ampere / Hopper, as NVIDIA does not maintain binary compatibility between GPU architectures.

c[x][y] refers to constant memory. x denotes the constant bank (there are about four of those), y the byte offset inside that bank. If memory serves, in recent GPU architectures, constant bank 0 is used (among other things) for passing kernel arguments. c[0x0][0x160] may well represent the first kernel argument.

That is not what it does. LEA on x86 is a (severely limited) left shift followed by a three-input add. E.g. lea eax, [ecx*4 + eax + 5]. The expression in brackets is not a reference to memory. A common idiom for multiplying a register by 5 would be lea eax, [4*eax + eax].

[Later:]

From a quick look at generated LEA instructions for thesm_75 (Turing) architecture, LEA in SASS looks like so:

LEA.{LO | HI} dst, pred, a.lo, b, a.hi, imm_shift

where all quantities except imm_shift comprise 32 bits, and .lo extracts the least signficant 32 bits while .hi extracts the most signficant 32 bits of a 64-bit quantity. The disassembler defaults to .LO which is therefore displayed as just LEA. The use of a default, with mode not shown, is common to all SASS instructions with modes. Maybe NVIDIA thought always displaying the mode clutters up the display too much.

The role of the predicate pred is not known to me. At first I thought it is used for conditional execution but would have expected to see PT for unconditional execution in that case; however that is not what I am seeing. In the following : denotes concatenation of two register which together hold a 64-bit quantity.

LEA.LO computes dest = ((a.hi : a.lo) << imm_shift).LO + b
LEA.HI computes dest = ((a.hi : a.lo) << imm_shift).HI + b

The above is literally based on five minutes of analysis of generated code, and I cannot guarantee its correctness. But it should provide a reasonable idea of what this instruction does.

Thank you so much, njuffa!

Your answer is clear and detailed.

After checking your answer, I googled again about the x86 LEA.

You are right, LEA does the simple shift and add (maybe that is why it is faster than MUL, only 1 cycle is needed).

Though NVIDIA has provided no more official documentation, your method is helpful for future reverse engineering to see the varying LEA on different cards.

No worries about the correctness. I thank you for the intelligent logic!