How does the LSU (Load/Store Unit) execute Load/Store instructions in the Ampere architecture?

I couldn’t find information on how Load/Store instructions are executed in the LSU under the Ampere architecture. Here’s my speculation:

  1. When a warp needs to execute a load/store instruction, it’s asynchronously handed over to an LSU unit, allowing the warp to proceed with the next instruction that doesn’t depend on the current load/store instruction.
  2. Each LSU can only handle one load/store instruction from one warp at a time.
  3. The register used to pass addresses in load/store instructions cannot be written with new values until the load/store instruction completes.

Are these speculations accurate? Additionally, how does the LSU concurrently handle data access requests from the 32 threads within the same warp? For instance, when reading data, the 32 addresses are grouped, with addresses that fall in the same sector (32B) of global memory forming one group. The LSU sends successive requests to the L1 cache, one per group of addresses, and when the L1 cache responds, the LSU extracts the required data from each sector and writes it into the corresponding register of each thread. Finally, does it then update the long scoreboard?
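To make the grouping I have in mind concrete, here is a minimal CUDA sketch of the two access patterns I'm describing (the kernel and its parameters are purely illustrative):

```
// Illustrative only: how a warp's 32 addresses might group into 32B sectors.
__global__ void sector_grouping(const float* __restrict__ in,
                                float* __restrict__ out,
                                int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // stride == 1: threads 0..31 of a warp read 32 consecutive floats,
    // i.e. 128 contiguous bytes = 4 sectors of 32B, so the warp's addresses
    // would form 4 groups (4 requests to L1 in my speculation).
    //
    // stride == 8 (addresses 32B apart): every thread's 4-byte read lands in
    // a different sector, so the 32 addresses would form 32 groups.
    out[tid] = in[tid * stride];
}
```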

As for writing data, I speculate that it similarly involves grouping the destination addresses and sending the write requests to L1. The LSU would then wait for L1 to complete the requests before updating the long scoreboard, indicating to the threads that they may write new values into the registers that passed in the destination addresses.

Are these speculations accurate?

NVIDIA doesn’t typically tell us how their hardware really works. Even the information we do get is mostly “as-if”, i.e. it could be a model the hardware can be thought of as adhering to, rather than what the silicon actually does. So it could be that your speculation is valid on an abstract level but not at the level of the actual hardware.

  1. The register used to pass addresses … cannot be written … until the load/store instruction completes.

Not sure why you believe that should be the case. I would guess it isn’t.

Yes, because NVIDIA hasn’t provided details about how their hardware is designed, it’s difficult for us to understand precisely how the hardware executes each instruction. However, in optimization work, if we can’t accurately understand how the hardware functions, it’s often hard to extract the hardware’s full performance. Therefore, I can only keep proposing hypothetical hardware models and checking them against observed behavior to verify and refine them. So I wanted to ask whether anyone has noticed behavior that contradicts the working model I’ve speculated about.

The reason behind my belief is that in Nsight Compute, certain instructions can put a warp into a Long Scoreboard stall state even though the input registers of those instructions all come from the outputs of previous computational instructions. For example, in the diagram below:


The sampled data corresponds to the SASS instruction at line 42. Input register R34 also serves as an input to the instruction at line 40, and input register R5 is also used by the instruction at line 39. Hence, a long scoreboard stall here should not be caused by these two registers.
The only remaining candidates for the long scoreboard stall are the output registers R26 and R27, which also happen to serve as input registers for the STG.E instruction at line 32. Therefore, I speculate that before the store instruction completes its execution, or more precisely before the store instruction finishes using its input registers, subsequent instructions cannot write new values into those registers. It may take a considerable amount of time from when the store instruction is issued to the LSU until its two input registers are fully consumed, potentially lasting until the LSU executes the store instruction and updates the information recorded in the long scoreboard.
PS: My understanding of the long scoreboard is still quite basic; I only have a rough idea that it’s hardware used to record the execution status of long-latency operations.

This is probably a question for Greg if he is able to provide any info. I’m not sure if he will be able to.

However, I’m fairly confident your speculations 1 and 2 are correct. (I would modify your speculation 2 as follows: “Each LSU can only handle one load/store instruction from one thread per clock cycle.”) For speculation 1, this asynchronous hand-off also has a limited queue or depth. You won’t typically hit this with well-designed code, but it’s possible that the LSU cannot accept new instructions due to downstream activity. This also appears as a stall of some sort, and would be a counter-example to the statement “Each LSU can only handle one load/store instruction from one thread per clock cycle”.

For speculation 3, I’m fairly certain there is a period during which the registers cannot immediately be reused. However, I don’t know how long that state persists. A store operation (STG) is generally considered to be a “fire and forget” operation, so I’m fairly sure the register-use reservation exists for some time, but not for the entire duration of the store operation in flight.

This does not necessarily mean that a subsequent instruction in the stream will not target those registers. It just means that if the register reservation is not released, and a subsequent instruction does target the register, a stall will occur.

It’s a reasonable assumption that the compiler is aware of such dependencies, and may attempt to order usage in such a way as to allow the “latency” of the register usage reservation to elapse, subject to other objectives the compiler may have, of course.
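As a rough source-level sketch of what I mean (purely illustrative; the actual scheduling happens at the SASS level and is up to the compiler):

```
// Illustrative only: exposing independent work that the compiler could place
// between a store and any later instruction that would overwrite the store's
// source registers, allowing the register-use reservation to elapse.
__global__ void overlap_example(float* __restrict__ dst,
                                const float* __restrict__ src,
                                int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float a = src[tid];
    dst[tid] = a;            // STG: briefly reserves its address/data registers

    // Independent work the compiler is free to schedule after the STG, so that
    // by the time the STG's registers are reused, the reservation has likely
    // been released.
    float b = src[tid + n] * 2.0f;
    dst[tid + n] = b;
}
```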

Global access has variable latency, so it will typically be reported in long scoreboard accounting.

Additional comments on long scoreboard can be found here and here (then do a text search for “scoreboard” in that last document).

This is perhaps borne out by Scott Gray’s observation here, under the section “Read Dependency Barriers”:

“Read barriers typically have a latency of about 20 clocks (the time it takes to set the memory instruction in flight). The complete instruction itself can take many more clocks than this.”


Based on your description, my understanding is: each LSU has its own request queue, and when a warp executes load/store instructions, it generates up to 32 requests, which are then added to a specific LSU’s request queue. The LSU then executes one of these requests per cycle (if downstream hardware is sufficiently idle to not hinder the LSU from executing a particular request). Therefore, is the throughput of the LSU for load/store instructions from a warp one instruction every 32 clock cycles?

Like most other functional units in the SM (the tensorcore units being a notable exception), my mental model for a functional unit in the SM, including the LSU, is that each functional unit processes one instruction, per thread, per clock.

An FMA unit processes (i.e. can be dispatched) one floating-point instruction, per thread, per clock.

Therefore, if you want to process an FMUL instruction, warp-wide, for example, and you want to do it in a single clock cycle, it will require 32 of those functional units, i.e. 32 FMA units. I have covered this basic idea elsewhere.

I believe this is roughly supported, as far as it goes, by the marketing-oriented pictorial SM diagrams that you will find for example in the various architecture whitepapers.

Without further information, I would assume the same thing about an LSU unit. There are some number of LSU units in a GPU SM, and if the SM has sub-partitions, those LSU units are probably partitioned as well.

Just like an FMUL instruction, if a load or store instruction (e.g. LDG, STG) is issued warp-wide (as most instructions typically are), then it will require, in some form or fashion, 32 LSU units. If the SM sub-partition had only 16 LSU units (a made-up example), then I would assume that an LDG instruction would take two cycles to fully issue across the warp.

So based on that, I would say the basic or peak theoretical throughput of an LSU unit is one load/store instruction per thread per clock. If you have 32 of those units available, and we acknowledge that a warp has 32 threads, then for the collection of those 32 LSU units, the aggregate throughput would be one instruction, per warp, per clock cycle.
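To make that arithmetic concrete (the per-sub-partition unit count below is a made-up placeholder, as in the example above, not a published figure):

```
// Illustrative issue-rate arithmetic only; unit counts are assumptions.
constexpr int threads_per_warp   = 32;
constexpr int lsu_units_per_smsp = 16;   // made-up example value

// Cycles to fully issue one warp-wide LDG/STG across those LSU units:
constexpr int cycles_to_issue = threads_per_warp / lsu_units_per_smsp;  // = 2

// With 32 LSU units available to the warp, cycles_to_issue would be 1,
// i.e. an aggregate throughput of one instruction, per warp, per clock.
```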

I would not say it any other way. For example, I would not say that “the throughput of the LSU for load/store instructions from a warp is one instruction every 32 clock cycles.”

I’m not going to be able to describe that point further. I have indicated how I would say it, and I have now spelled out the reasons I would say it that way.

Regarding the “request queue” you mention, I would also be careful describing it that way. I don’t view the LSU as having a fixed-size request queue. My view is that as long as downstream activity is moving satisfactorily, the LSU can accept one load or store instruction, per thread, per clock, forever. There is no queue at the front end of the operation, and it would not make sense to discuss one, because there isn’t any functional unit anywhere on the GPU that I know of that you can feed more than one op/clk. So if the LSU can handle one op/clk (which I claim it can, in ideal situations), then there is no queue, certainly not at the front end.

However, the LSU feeds the memory pipe, which is a not-fully-publicly-specified chunk of hardware on the GPU which takes care of operations to main memory. For example the DRAM controller is part of the memory pipe, but it is not part of the LSU AFAIK. So the LSU is a “front end” to some complex set of hardware. And that complex set of hardware can be fed in an ideal fashion, or in a non-ideal fashion.

With respect to global memory, the most obvious description of ideal vs. non-ideal is coalesced vs. non-coalesced, which is like saying a low number of transactions per request vs. a high number of transactions per request, which is like saying high efficiency vs. low efficiency, where efficiency is roughly defined as (bytes actually needed)/(bytes actually retrieved).
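As a worked example of that efficiency definition (numbers illustrative):

```
// Warp-wide 4-byte loads, 32 threads: bytes needed = 32 * 4 = 128 B.
//
// Coalesced (consecutive addresses): 128 contiguous bytes = 4 sectors
//   bytes retrieved = 4 * 32 = 128 B   -> efficiency = 128/128  = 100%
//
// Strided by 32 B: each thread touches its own sector
//   bytes retrieved = 32 * 32 = 1024 B -> efficiency = 128/1024 = 12.5%
constexpr float coalesced_efficiency = 128.0f / 128.0f;   // 1.0
constexpr float strided_efficiency   = 128.0f / 1024.0f;  // 0.125
```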

When the LSU’s request stream has a large proportion of low-efficiency requests, I believe it is possible to saturate (fill up) one or more “downstream” queues somewhere in the memory pipe, and this downstream “pressure” will eventually back up and manifest itself at the input to the LSU (somehow). Eventually, the LSU will become aware of this downstream pressure, and it will then be unable to accept new requests, suddenly, immediately, in that clock cycle.

This mechanism is not well described by NVIDIA as far as I know. So you can think of it as a queue in the LSU if you wish, but I don’t think of it that way. It’s more of an on/off signal, that propagates backward out of the memory pipe, and “shuts off” the LSU.

When that happens, as I already mentioned, it introduces the possibility for a stall in that instruction stream, if further LSU instructions are waiting. This type of stall would not be a typical dependency stall, and perhaps not even a register reservation stall. I don’t know what kind of stall it would be exactly.

Each cycle the warp scheduler can issue an instruction to the LSU/MIO instruction queue.
Each cycle the MIO unit can issue 1 instruction to the LSU pipe. This limits the issue rate from 4 IPC per SM to 1 IPC per SM.
An instruction cannot be dispatched from MIOC until all registers have been read from the register file.
Once all registers have been read, the instruction is ready to be dispatched. The compiler may update a scoreboard to enable the registers to be re-used; a warp that hits an instruction needing those registers will report long scoreboard until the registers are available. The compiler may also choose to wait until the instruction has returned (e.g. a load). If a warp hits an instruction waiting on the register, the warp will be stalled on long scoreboard.
Load store instructions are dispatched to the shared memory pipe or the tagged pipe.
The load store pipeline calculates the tag for each thread. Threads are grouped together. On GV100+ the L1TEX tag stage can resolve 4 sets x 4 sectors per cycle. If not all threads can be resolved in one wavefront then the instruction will continue to generate new wavefronts in the tag stage until all threads are handled.

  1. When a warp needs to execute a load/store instruction, it’s asynchronously handed over to an LSU unit, allowing the warp to proceed with the next instruction that doesn’t depend on the current load/store instruction.

Yes.

  1. Each LSU can only handle one load/store instruction from one warp at a time.

The LSU pipeline accepts 1 instruction per cycle. The LSU pipeline can contain 100s of inflight instructions.
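A minimal CUDA sketch of what that in-flight capacity enables (illustrative only):

```
// Several independent LDGs can be issued back-to-back from one warp; the
// warp only waits (long scoreboard) at the first use of each result.
__global__ void batched_loads(const float* __restrict__ in,
                              float* __restrict__ out,
                              int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float r0 = in[tid];
    float r1 = in[tid + n];
    float r2 = in[tid + 2 * n];
    float r3 = in[tid + 3 * n];
    out[tid] = r0 + r1 + r2 + r3;   // first use: any remaining latency shows here
}
```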

  1. The register used to pass addresses in load/store instructions cannot be written with new values until the load/store instruction completes.

The compiler can choose to release a scoreboard after the MIO has dispatched the instruction to LSU and/or after the instruction has retired.

Based on your description, my understanding is: each LSU has its own request queue, and when a warp executes load/store instructions, it generates up to 32 requests, which are then added to a specific LSU’s request queue. The LSU then executes one of these requests per cycle (if downstream hardware is sufficiently idle to not hinder the LSU from executing a particular request).

The MIO instruction queue (shallow) sits before the LSU unit. The LSU pipe will continue to generate new wavefronts in the t-stage for set conflicts. Wavefronts are generated every cycle. A warp that does a store to 32 different sectors will generate 32 wavefronts. The shared memory and tag pipelines operate simultaneously. A warp-wide 32-bit load of consecutive 4-byte addresses can complete in 1 t-stage wavefront and 4 miss-stage wavefronts.
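Two illustrative kernels for those two cases (sketches only, assuming the warp’s 32 threads have consecutive tid values):

```
__global__ void few_wavefronts(const float* __restrict__ in,
                               float* __restrict__ out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Consecutive 4-byte addresses: 128 contiguous bytes = 4 sectors,
    // i.e. 1 t-stage wavefront plus 4 miss-stage wavefronts as described above.
    out[tid] = in[tid];
}

__global__ void many_wavefronts(const float* __restrict__ in,
                                float* __restrict__ out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // 32-byte stride between threads: each thread's store lands in a different
    // sector, so the warp's store generates on the order of 32 wavefronts.
    out[tid * 8] = in[tid];
}
```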

Thank you for your response; it has given me a clearer understanding of how the underlying hardware functions. I have another question: on the Ampere architecture, within each SMSP there are 4 LSUs. Do they operate independently (each LSU handling load/store instructions for a warp individually), or do they collaborate to execute an instruction for a warp? My understanding is that there is only one LSU/MIO instruction queue per SMSP, and all LSUs within an SMSP collaborate to execute instructions from that queue. Is that correct?

Each SM has 1 unified L1 TEX unit that covers global, local, shared, surface, and texture. The two interfaces are LSUIN (global, local, shared) and TEXIN (surface and texture). MIO can issue 1 instruction to LSUIN per cycle and 1 instruction to TEXIN per cycle. LSU and TEX both have separate write-back paths.

  • LSU
    • 100 class parts can process 32 threads per cycle.
    • 10x class parts can process 16 threads per cycle.
  • TEX
TEX generally processes 4 threads (a quad) per cycle; however, some surface and texture operations support higher rates.
