My mental model for most functional units in the SM (the tensorcore units being a notable exception), including the LSU, is that each functional unit processes one instruction, per thread, per clock.
An FMA unit can accept (i.e. be dispatched) one floating-point instruction, per thread, per clock.
Therefore, if you want to process an FMUL instruction warp-wide, for example, and you want to do it in a single clock cycle, it will require 32 of that type of functional unit, i.e. 32 FMA units. I have covered this basic idea elsewhere.
I believe this is roughly supported, as far as it goes, by the marketing-oriented pictorial SM diagrams that you will find for example in the various architecture whitepapers.
Without further information, I would assume the same thing about an LSU. There are some number of LSUs in a GPU SM, and if the SM has sub-partitions, those LSUs are probably partitioned as well.
Just like an FMUL instruction, if a load or store instruction (e.g. LDG, STG) is issued warp-wide (as most instructions typically are), then it will require, in some form or fashion, 32 LSUs. If the SM sub-partition had only 16 LSUs (a made-up example), then I would assume that an LDG instruction would take two cycles to fully issue across the warp.
So based on that, I would say the basic or peak theoretical throughput of an LSU is one load/store instruction per thread per clock. If you have 32 of those units available, and we acknowledge that a warp has 32 threads, then the aggregate throughput of those 32 LSUs would be one instruction, per warp, per clock cycle.
I would not say it any other way. For example, I would not say:
I’m not going to be able to describe that point further. I have indicated how I would say it, and I have now spelled out the reasons I would say it that way.
Regarding the “request queue” you mention, I would also be careful describing it that way. I don’t view the LSU as having a fixed-size request queue. My view is that as long as downstream activity is moving satisfactorily, the LSU can accept one load or store instruction, per thread, per clock, forever. There is no queue at the front end of the operation, and it would not make sense to speak of one: there is no functional unit anywhere on the GPU, that I know of, that you can feed more than one op/clk. So if the LSU can handle one op/clk (which I claim it can, in ideal situations), then a queue serves no purpose, and there is no queue, certainly not at the front end.
However, the LSU feeds the memory pipe, which is a not-fully-publicly-specified chunk of hardware on the GPU which takes care of operations to main memory. For example the DRAM controller is part of the memory pipe, but it is not part of the LSU AFAIK. So the LSU is a “front end” to some complex set of hardware. And that complex set of hardware can be fed in an ideal fashion, or in a non-ideal fashion.
With respect to global memory, the most obvious description of ideal vs. non-ideal is coalesced vs. non-coalesced. That is like saying a low number of transactions per request vs. a high number of transactions per request, which in turn is like saying high efficiency vs. low efficiency, where efficiency is roughly defined as (bytes actually needed)/(bytes actually retrieved).
When the LSU's request stream has a large proportion of low-efficiency requests, I believe it is possible to saturate (fill up) one or more “downstream” queues somewhere in the memory pipe, and this downstream “pressure” will eventually back up and manifest itself at the input to the LSU (somehow). Eventually, the LSU becomes aware of this downstream pressure, and it is then unable to accept new requests, suddenly, immediately, in that clock cycle.
This mechanism is not well described by NVIDIA as far as I know. So you can think of it as a queue in the LSU if you wish, but I don’t think of it that way. It’s more of an on/off signal, that propagates backward out of the memory pipe, and “shuts off” the LSU.
When that happens, as I already mentioned, it introduces the possibility of a stall in that instruction stream, if further LSU instructions are waiting. This type of stall would not be a typical dependency stall, and perhaps not even a register reservation stall. I don’t know what kind of stall it would be exactly.