How to understand the LEA assembly behind the cuda c++?

njuffa · June 24, 2023, 6:56pm

NVIDIA does not document the SASS (machine code) instructions in detail. The official documentation simply states:

LEA Compute Effective Address

By observation, it is a shift-then-add type of instruction, with a barrel shifter first acting on the input and the result from the barrel shifter being fed to an adder. It is most frequently used to compute 64-bit addresses into an aligned register pair for 64-bit addressing (keep in mind that GPUs use a 32-bit architecture with 64-bit addressing extension). But the GPU’s LEA, just like x86’s LEA, also has utility outside of address computations as a limited, but more efficient, alternative to IMAD, and the CUDA compiler “knows” how to use it as such.
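As a minimal sketch of why a shift-then-add can stand in for a multiply-add: when the multiplier is a power of two, both give the same 32-bit result. The helper names mul_add and shift_add are hypothetical, made up for this illustration. They model plain C++ integer arithmetic, not the actual IMAD or LEA encodings.

```cpp
#include <cstdint>

// Hypothetical model of a generic 32-bit multiply-add (result modulo 2^32).
constexpr uint32_t mul_add(uint32_t a, uint32_t m, uint32_t c)
{
    return a * m + c;
}

// Hypothetical model of shift-then-add: the shifted input feeds the adder.
// Assumes shift < 32.
constexpr uint32_t shift_add(uint32_t a, unsigned shift, uint32_t c)
{
    return (a << shift) + c;
}

// Multiplying by 32 is the same as shifting left by 5, including wrap-around.
static_assert(shift_add(3u, 5, 7u) == mul_add(3u, 32u, 7u), "small operands");
static_assert(shift_add(0xFFFFFFFFu, 5, 1u) == mul_add(0xFFFFFFFFu, 32u, 1u), "wrap-around");
```

The equivalence only covers power-of-two multipliers, which is what makes the shift-then-add form the more limited of the two.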

The 0x5 in your example LEA shown above is the shift count for the left shift. You will need to look at more diverse instances of LEA to completely reverse engineer its functionality. I went through this exercise once, for Turing, but do not recollect the results in detail. I would not be surprised if the details of LEA differ slightly across Turing / Ampere / Hopper, as NVIDIA does not maintain binary compatibility between GPU architectures.

c[x][y] refers to constant memory. x denotes the constant bank (there are about four of those), y the byte offset inside that bank. If memory serves, in recent GPU architectures, constant bank 0 is used (among other things) for passing kernel arguments. c[0x0][0x160] may well represent the first kernel argument.

That is not what it does. LEA on x86 is a (severely limited) left shift followed by a three-input add. E.g. lea eax, [ecx*4 + eax + 5]. The expression in brackets is not a reference to memory. A common idiom for multiplying a register by 5 would be lea eax, [4*eax + eax].
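The two x86 examples above can be sketched in C++ as plain arithmetic. The function x86_lea and its parameter names are hypothetical, chosen for this illustration. It models only the value computed (base + index*scale + displacement, modulo 2^32) and assumes scale is one of 1, 2, 4, 8, as the x86 addressing mode requires.

```cpp
#include <cstdint>

// Hypothetical model of the value x86 LEA writes to its destination:
// base + index * scale + disp, with no memory access involved.
// Assumes scale is 1, 2, 4, or 8.
constexpr uint32_t x86_lea(uint32_t base, uint32_t index, uint32_t scale, int32_t disp)
{
    return base + index * scale + static_cast<uint32_t>(disp);
}

// lea eax, [ecx*4 + eax + 5] with eax = 10, ecx = 3
static_assert(x86_lea(10u, 3u, 4u, 5) == 27u, "3*4 + 10 + 5");

// lea eax, [4*eax + eax]: multiply a register by 5
constexpr uint32_t times5(uint32_t x)
{
    return x86_lea(x, x, 4u, 0);
}
static_assert(times5(7u) == 35u, "7*5");
```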

[Later:]

From a quick look at generated LEA instructions for the sm_75 (Turing) architecture, LEA in SASS looks like so:

LEA.{LO | HI} dst, pred, a.lo, b, a.hi, imm_shift

where all quantities except imm_shift comprise 32 bits, and .lo extracts the least significant 32 bits while .hi extracts the most significant 32 bits of a 64-bit quantity. The disassembler defaults to .LO, which is therefore displayed as just LEA. The use of a default, with the mode not shown, is common to all SASS instructions with modes. Maybe NVIDIA thought always displaying the mode clutters up the display too much.

The role of the predicate pred is not known to me. At first I thought it is used for conditional execution, but I would have expected to see PT for unconditional execution in that case; however, that is not what I am seeing. In the following, : denotes concatenation of two registers which together hold a 64-bit quantity.

LEA.LO computes dest = ((a.hi : a.lo) << imm_shift).LO + b
LEA.HI computes dest = ((a.hi : a.lo) << imm_shift).HI + b

The above is literally based on five minutes of analysis of generated code, and I cannot guarantee its correctness. But it should provide a reasonable idea of what this instruction does.
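The LEA.LO / LEA.HI formulas above can be modelled in C++ as follows. This is a minimal sketch of those tentative formulas only: the function names are hypothetical, the pred operand is left out because its role is unknown, imm_shift is assumed to be less than 32, and no carry between the two halves is modelled because the formulas do not describe one.

```cpp
#include <cstdint>

// (a_hi : a_lo) << imm_shift, as a 64-bit quantity. Assumes imm_shift < 32.
constexpr uint64_t concat_shift(uint32_t a_lo, uint32_t a_hi, unsigned imm_shift)
{
    return ((static_cast<uint64_t>(a_hi) << 32) | a_lo) << imm_shift;
}

// Hypothetical model of LEA.LO: low 32 bits of the shifted value, plus b.
constexpr uint32_t lea_lo(uint32_t a_lo, uint32_t b, uint32_t a_hi, unsigned imm_shift)
{
    return static_cast<uint32_t>(concat_shift(a_lo, a_hi, imm_shift)) + b;
}

// Hypothetical model of LEA.HI: high 32 bits of the shifted value, plus b.
constexpr uint32_t lea_hi(uint32_t a_lo, uint32_t b, uint32_t a_hi, unsigned imm_shift)
{
    return static_cast<uint32_t>(concat_shift(a_lo, a_hi, imm_shift) >> 32) + b;
}

// 0x00000000:0x80000001 << 5 == 0x00000010:0x00000020
static_assert(lea_lo(0x80000001u, 0x10u, 0u, 5) == 0x30u, "low half plus b");
static_assert(lea_hi(0x80000001u, 0x07u, 0u, 5) == 0x17u, "high half plus b");
```

With a shift count of 5, as in the example discussed earlier, the 64-bit input is scaled by 32 before b is added to the selected half.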
