How does the LSU (Load/Store Unit) execute Load/Store instructions in the Ampere architecture?

Each cycle the warp scheduler can issue an instruction to the LSU/MIO instruction queue.
Each cycle the MIO unit can issue 1 instruction to the LSU pipe. This limits the issue rate from 4 IPC per SM to 1 IPC per SM.
An instruction cannot be dispatched from MIOC until all registers have been read from the register file.
Once all registers have been read, the instruction is ready to be dispatched. The compiler may update a scoreboard to enable the registers to be re-used; the warp will report long scoreboard until the registers are available. The compiler may also choose to wait until the instruction has returned (e.g. a load). If a warp hits an instruction waiting on the register, the warp will be stalled on long scoreboard.
Load store instructions are dispatched to the shared memory pipe or the tagged pipe.
The load store pipeline calculates the tag for each thread. Threads are grouped together. On GV100+ the L1TEX tag stage can resolve 4 sets x 4 sectors per cycle. If not all threads can be resolved in one wavefront then the instruction will continue to generate new wavefronts in the tag stage until all threads are handled.
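
The issue-rate limit described above (4 IPC per SM from the warp schedulers versus 1 IPC into the LSU pipe) can be illustrated with a small back-of-the-envelope model. This is only a sketch: the function name `min_issue_cycles` and the constants are hypothetical, it assumes 4 warp schedulers per SM (inferred from the "4 IPC per SM" figure), and it ignores wavefront replays, queue depth, and every other stall reason.

```python
# Simplified lower-bound model. NOT a description of the real hardware arbitration.
WARP_SCHEDULERS_PER_SM = 4  # assumption, inferred from "4 IPC per SM" in the text
LSU_ISSUE_PER_CYCLE = 1     # from the text: MIO issues 1 instruction per cycle to the LSU pipe


def min_issue_cycles(lsu_instructions_per_scheduler):
    """Lower bound on the cycles needed to issue all LSU instructions on one SM."""
    if len(lsu_instructions_per_scheduler) != WARP_SCHEDULERS_PER_SM:
        raise ValueError("expected one instruction count per warp scheduler")
    total = sum(lsu_instructions_per_scheduler)
    # Each warp scheduler can issue at most one instruction per cycle.
    scheduler_bound = max(lsu_instructions_per_scheduler)
    # The MIO -> LSU path accepts LSU_ISSUE_PER_CYCLE instructions per cycle (ceiling division).
    lsu_bound = -(-total // LSU_ISSUE_PER_CYCLE)
    return max(scheduler_bound, lsu_bound)


# Four schedulers with 8 load/store instructions each: the schedulers alone
# would need only 8 cycles, but the single LSU issue slot needs at least 32.
print(min_issue_cycles([8, 8, 8, 8]))
```

In this model the LSU bound dominates whenever more than one scheduler has load/store work, which is the "4 IPC to 1 IPC" reduction in practice.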

  1. When a warp needs to execute a load/store instruction, it’s asynchronously handed over to an LSU unit, allowing the warp to proceed with the next instruction that doesn’t depend on the current load/store instruction.

Yes.

  2. Each LSU can only handle one load/store instruction from one warp at a time.

The LSU pipeline accepts 1 instruction per cycle. The LSU pipeline can contain hundreds of in-flight instructions.

  3. The register used to pass addresses in load/store instructions cannot be written with new values until the load/store instruction completes.

The compiler can choose to release a scoreboard after the MIO has dispatched the instruction to LSU and/or after the instruction has retired.

Based on your description, my understanding is: each LSU has its own request queue, and when a warp executes load/store instructions, it generates up to 32 requests, which are then added to a specific LSU’s request queue. The LSU then executes one of these requests per cycle (if downstream hardware is sufficiently idle to not hinder the LSU from executing a particular request).

The (shallow) MIO instruction queue sits in front of the LSU. The LSU pipe will continue to generate new wavefronts in the t-stage for set conflicts. Wavefronts are generated every cycle. A warp that does a store to 32 different sectors will generate 32 wavefronts. The shared memory and tag pipelines operate simultaneously. A warp-wide 32-bit load from consecutive 4-byte addresses (128 bytes in total, i.e. four 32-byte sectors) can complete in 1 t-stage wavefront and 4 miss-stage wavefronts.
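
To make the sector arithmetic above concrete, here is a minimal sketch that counts how many 32-byte sectors and 128-byte cache lines a warp's access pattern touches. The helper `warp_footprint` is hypothetical, and the model is deliberately crude: it does not model set mapping, tag-stage grouping, or the difference between loads and stores, so it gives a footprint rather than an exact wavefront count.

```python
SECTOR_BYTES = 32   # L1TEX/L2 sector size
LINE_BYTES = 128    # cache line size (4 sectors)


def warp_footprint(addresses, access_bytes=4):
    """Return (unique sectors, unique cache lines) touched by one warp-wide access.

    addresses: one byte address per active thread.
    access_bytes: per-thread access size (assumed <= SECTOR_BYTES).
    """
    sectors = set()
    lines = set()
    for addr in addresses:
        # First and last byte are enough to find every sector touched,
        # because a single access is no larger than one sector.
        for byte in (addr, addr + access_bytes - 1):
            sectors.add(byte // SECTOR_BYTES)
            lines.add(byte // LINE_BYTES)
    return len(sectors), len(lines)


coalesced = [4 * t for t in range(32)]     # consecutive 4-byte addresses
stride_32 = [32 * t for t in range(32)]    # one sector per thread
stride_128 = [128 * t for t in range(32)]  # one cache line per thread

print(warp_footprint(coalesced))   # (4, 1)
print(warp_footprint(stride_32))   # (32, 8)
print(warp_footprint(stride_128))  # (32, 32)
```

The coalesced case matches the example in the reply: one cache line, four sectors. The strided cases correspond to the "32 different sectors" scenario, where many more wavefronts are needed.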