How to understand the LEA assembly behind the CUDA C++?

NVIDIA does not document the SASS (machine code) instructions in detail. The official documentation simply states:

LEA Compute Effective Address

By observation, it is a shift-then-add type of instruction, with a barrel shifter first acting on the input and the result from the barrel shifter being fed to an adder. It is most frequently used to compute 64-bit addresses into an aligned register pair for 64-bit addressing (keep in mind that GPUs use a 32-bit architecture with 64-bit addressing extension). But the GPU’s LEA, just like x86’s LEA, also has utility outside of address computations as a limited, but more efficient, alternative to IMAD, and the CUDA compiler “knows” how to use it as such.
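To make the shift-then-add idea concrete, here is a minimal host-side C++ sketch. It is not NVIDIA's definition of the instruction; `shift_add`, `times5`, and `scaled_index` are hypothetical helper names chosen for illustration, and the model assumes a 32-bit wrap-around add and a shift count below 32.

```cpp
#include <cstdint>

// Hypothetical model of a shift-then-add operation: (a << shift) + b,
// wrapping modulo 2^32. Assumes 0 <= shift < 32.
constexpr std::uint32_t shift_add(std::uint32_t a, std::uint32_t b, unsigned shift) {
    return static_cast<std::uint32_t>((a << shift) + b);
}

// Non-address use: multiply by 5 as (x << 2) + x, a pattern that could
// otherwise be expressed with an integer multiply-add.
constexpr std::uint32_t times5(std::uint32_t x) {
    return shift_add(x, x, 2);
}

// Address-style use: base + i * 32, written as base + (i << 5).
constexpr std::uint32_t scaled_index(std::uint32_t base, std::uint32_t i) {
    return shift_add(i, base, 5);
}

// Compile-time sanity checks of the model itself.
static_assert(times5(7) == 35, "7 * 5");
static_assert(scaled_index(0x1000, 3) == 0x1060, "0x1000 + 3 * 32");
```

A shift count of 5 in this model corresponds to scaling an index by 32 before adding it to a base.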

The 0x5 in your example LEA shown above is the shift count for the left shift. You will need to look at more diverse instances of LEA to completely reverse engineer its functionality. I went through this exercise once, for Turing, but do not recollect the results in detail. I would not be surprised if the details of LEA differ slightly across Turing / Ampere / Hopper, as NVIDIA does not maintain binary compatibility between GPU architectures.

c[x][y] refers to constant memory. x denotes the constant bank (there are about four of those), y the byte offset inside that bank. If memory serves, in recent GPU architectures, constant bank 0 is used (among other things) for passing kernel arguments. c[0x0][0x160] may well represent the first kernel argument.

That is not what it does. LEA on x86 is a (severely limited) left shift followed by a three-input add. E.g. lea eax, [ecx*4 + eax + 5]. The expression in brackets is not a reference to memory. A common idiom for multiplying a register by 5 would be lea eax, [4*eax + eax].
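The two x86 examples above can be modeled in plain C++ as follows. This is only an arithmetic sketch for 32-bit operands; `x86_lea` and its parameter names are made up for illustration, and the scale is passed as a shift count (0 to 3, i.e. a factor of 1, 2, 4, or 8).

```cpp
#include <cstdint>

// Hypothetical model of: lea dst, [index*scale + base + disp]
// scale_log2 is the shift count for the scale factor (0..3 for 1, 2, 4, 8).
// The result wraps modulo 2^32; no memory is accessed.
constexpr std::uint32_t x86_lea(std::uint32_t base, std::uint32_t index,
                                unsigned scale_log2, std::int32_t disp) {
    return static_cast<std::uint32_t>(
        (index << scale_log2) + base + static_cast<std::uint32_t>(disp));
}

// lea eax, [ecx*4 + eax + 5] with eax = 10, ecx = 3
static_assert(x86_lea(10, 3, 2, 5) == 27, "3*4 + 10 + 5");

// lea eax, [4*eax + eax] with eax = 7, i.e. multiply by 5
static_assert(x86_lea(7, 7, 2, 0) == 35, "7*4 + 7");
```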

[Later:]

From a quick look at generated LEA instructions for the sm_75 (Turing) architecture, LEA in SASS looks like so:

LEA.{LO | HI} dst, pred, a.lo, b, a.hi, imm_shift

where all quantities except imm_shift comprise 32 bits, and .lo extracts the least significant 32 bits while .hi extracts the most significant 32 bits of a 64-bit quantity. The disassembler defaults to .LO which is therefore displayed as just LEA. The use of a default, with mode not shown, is common to all SASS instructions with modes. Maybe NVIDIA thought always displaying the mode clutters up the display too much.

The role of the predicate pred is not known to me. At first I thought it was used for conditional execution, but in that case I would have expected to see PT for unconditional execution; however, that is not what I am seeing. In the following, : denotes concatenation of two registers which together hold a 64-bit quantity.

LEA.LO computes dst = ((a.hi : a.lo) << imm_shift).LO + b
LEA.HI computes dst = ((a.hi : a.lo) << imm_shift).HI + b

The above is literally based on five minutes of analysis of generated code, and I cannot guarantee its correctness. But it should provide a reasonable idea of what this instruction does.
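For experimenting with these two formulas, a small host-side C++ emulation may help. It carries the same caveat as the formulas themselves: it ignores the pred operand entirely (its role is unknown here), it assumes imm_shift is in the range 0 to 31, and `concat`, `lea_lo`, and `lea_hi` are hypothetical names, not anything from NVIDIA's tools.

```cpp
#include <cstdint>

// (hi : lo) concatenation of two 32-bit registers into one 64-bit value.
constexpr std::uint64_t concat(std::uint32_t hi, std::uint32_t lo) {
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Model of: dst = ((a.hi : a.lo) << imm_shift).LO + b
// Assumes 0 <= imm_shift < 32; the 32-bit add wraps modulo 2^32.
constexpr std::uint32_t lea_lo(std::uint32_t a_lo, std::uint32_t b,
                               std::uint32_t a_hi, unsigned imm_shift) {
    const std::uint64_t shifted = concat(a_hi, a_lo) << imm_shift;
    return static_cast<std::uint32_t>(static_cast<std::uint32_t>(shifted) + b);
}

// Model of: dst = ((a.hi : a.lo) << imm_shift).HI + b
constexpr std::uint32_t lea_hi(std::uint32_t a_lo, std::uint32_t b,
                               std::uint32_t a_hi, unsigned imm_shift) {
    const std::uint64_t shifted = concat(a_hi, a_lo) << imm_shift;
    return static_cast<std::uint32_t>(static_cast<std::uint32_t>(shifted >> 32) + b);
}

// (0x0 : 0x80000001) << 5 == 0x00000010'00000020
static_assert(lea_lo(0x80000001u, 0x10u, 0u, 5) == 0x30u, "low half plus b");
static_assert(lea_hi(0x80000001u, 0x10u, 0u, 5) == 0x20u, "high half plus b");
```

Note that in this model a.hi cannot influence the LEA.LO result, since the low 32 bits of a left-shifted 64-bit value depend only on a.lo. The model also contains no carry from the low add into the high add, because the formulas above do not describe one.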
