How does the LSU (Load/Store Unit) execute Load/Store instructions in the Ampere architecture?

Greg · November 27, 2023, 8:03pm

Each cycle the warp scheduler can issue an instruction to the LSU/MIO instruction queue.
Each cycle the MIO unit can issue 1 instruction to the LSU pipe. This limits the issue rate from 4 IPC per SM to 1 IPC per SM.
An instruction cannot be dispatched from MIOC until all registers have been read from the register file.
Once all registers have been read, the instruction is ready to be dispatched. The compiler may update a scoreboard to enable the registers to be re-used. The warp will report long scoreboard until the registers are available. The compiler may also choose to wait until the instruction has returned (e.g. a load). If a warp hits an instruction waiting on the register, the warp will be stalled on long scoreboard.
Load store instructions are dispatched to the shared memory pipe or the tagged pipe.
The load store pipeline calculates the tag for each thread. Threads are grouped together. On GV100+ the L1TEX tag stage can resolve 4 sets x 4 sectors per cycle. If not all threads can be resolved in one wavefront then the instruction will continue to generate new wavefronts in the tag stage until all threads are handled.

When a warp needs to execute a load/store instruction, it’s asynchronously handed over to an LSU unit, allowing the warp to proceed with the next instruction that doesn’t depend on the current load/store instruction.

Yes.

Each LSU can only handle one load/store instruction from one warp at a time.

The LSU pipeline accepts 1 instruction per cycle. The LSU pipeline can contain 100s of inflight instructions.
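As a rough back-of-the-envelope illustration of the issue-rate limit described above, the sketch below computes a lower bound on the cycles needed to hand a batch of load/store instructions to a pipe at a given rate. The function name and the batch size are hypothetical; only the 4 IPC and 1 IPC figures come from this thread. It ignores queue depth, register reads, and wavefront replays, so real cycle counts will be higher.

```python
import math

# Rates quoted in this thread (per SM).
SM_ISSUE_IPC = 4      # combined warp scheduler issue rate
LSU_DISPATCH_IPC = 1  # MIO -> LSU pipe dispatch rate


def min_dispatch_cycles(num_instructions, instructions_per_cycle):
    """Lower bound on cycles to dispatch a batch at a fixed rate.

    Hypothetical helper: models only the rate limit, nothing else.
    """
    return math.ceil(num_instructions / instructions_per_cycle)


batch = 1024  # arbitrary example batch of LSU instructions
cycles_at_issue_rate = min_dispatch_cycles(batch, SM_ISSUE_IPC)
cycles_at_lsu_rate = min_dispatch_cycles(batch, LSU_DISPATCH_IPC)
print(cycles_at_issue_rate, cycles_at_lsu_rate)
```

Under these assumptions, a kernel whose instruction mix is dominated by LSU instructions is bounded by the 1 IPC dispatch rate rather than by the schedulers.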

The register used to pass addresses in load/store instructions cannot be written with new values until the load/store instruction completes.

The compiler can choose to release a scoreboard after the MIO has dispatched the instruction to LSU and/or after the instruction has retired.

Based on your description, my understanding is: each LSU has its own request queue, and when a warp executes load/store instructions, it generates up to 32 requests, which are then added to a specific LSU’s request queue. The LSU then executes one of these requests per cycle (if downstream hardware is sufficiently idle to not hinder the LSU from executing a particular request).

The MIO instruction queue (shallow) is before the LSU unit. The LSU pipe will continue to generate new wavefronts in the t-stage for set conflicts. Wavefronts are generated every cycle. A warp that does a store to 32 different sectors will generate 32 wavefronts. The shared memory and tag pipeline operate simultaneously. A 32-bit load of consecutive 4-byte addresses can complete in 1 t-stage wavefront and 4 miss stage wavefronts.
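The two examples above (32 different sectors versus consecutive 4-byte addresses) can be sketched with a small address-footprint model. This is not a model of the hardware's wavefront generation: it only counts the distinct sectors and cache lines that one warp-wide access touches, assuming 32-byte sectors and 128-byte lines. The function name and the address patterns are illustrative, and set conflicts in the tag stage are not modeled.

```python
SECTOR_BYTES = 32   # assumed sector size
LINE_BYTES = 128    # assumed cache line size (4 sectors per line)
SECTORS_PER_LINE = LINE_BYTES // SECTOR_BYTES


def warp_footprint(addresses, access_bytes=4):
    """Return (distinct_lines, distinct_sectors) for one warp-wide access.

    addresses: one byte address per active thread.
    Hypothetical helper; counts footprint only, not wavefronts.
    """
    sectors = set()
    for addr in addresses:
        first = addr // SECTOR_BYTES
        last = (addr + access_bytes - 1) // SECTOR_BYTES
        sectors.update(range(first, last + 1))
    lines = {s // SECTORS_PER_LINE for s in sectors}
    return len(lines), len(sectors)


# Consecutive 4-byte addresses: thread t accesses byte offset 4 * t.
coalesced = [4 * t for t in range(32)]
# One sector per thread: thread t accesses byte offset 32 * t.
one_sector_per_thread = [32 * t for t in range(32)]

print(warp_footprint(coalesced))              # (1, 4)
print(warp_footprint(one_sector_per_thread))  # (8, 32)
```

The coalesced pattern touches 1 line and 4 sectors, which lines up with the "1 t-stage wavefront and 4 miss stage wavefronts" case, while the second pattern touches 32 distinct sectors, as in the 32-wavefront store example.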
