What's the difference between the MIO and LSU instruction queues in the Volta architecture?

After seeing the Hot Chips presentation and the discussion here, it seems that these are two different kinds of instruction queues. Table 6 of the Nsight Compute documentation also seems to mention two different things, MIO Throttle and LG Throttle.

But I can’t understand the difference between the two. How are they different? Does an MIO instruction queue belong to each sub-core of an SM, while the LSU resides in the ‘MIO scheduler’, which is shared among the sub-cores?

Is the LSU ‘instruction queue’ the same as the LSU ‘execution unit’?


On Volta through Ampere, the MIO (Memory Input Output) block consists of the LSU (Load Store Unit), the Address Divergence Unit (ADU), the Texture Unit, and the slow math units (FP64 on non-compute parts, and Tensor on TU116/7). Instructions for these units are issued into instruction queues that sit between the sub-core (the profilers use the term sub-partition) and the MIO scheduler. The MIO scheduler then dispatches instructions from those queues to the shared execution units.

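The LSU instruction queue is a FIFO of instructions sent to the Load Store Unit. It is one of the queues described above, so it feeds the LSU execution unit rather than being the same thing.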

The LSU executes global, local, and shared memory loads, stores, and atomics, as well as several warp-level operations (e.g. VOTE, SHUFFLE).

The MIO has separate instruction queues for other execution units including TEX and MUFU (XU in most tools).

The instruction queue shared by global/local and shared memory operations has a set watermark for global/local operations. Once the watermark is reached, any warp whose next instruction is a global/local LSU instruction is stalled on LG throttle. The watermark ensures that shared memory operations can still issue when the L1 global/local (tagged access) pipeline is backed up, which can happen if the SM issues a lot of memory loads that miss in L2.

If the next instruction of a warp targets any MIO instruction queue other than TEX (e.g. XU/MUFU, ADU) and that queue is full, the warp reports stalled on MIO throttle. The TEX instruction queue has its own stall reason, TEX throttle. The other queues do not have separate stall reasons because the queues vary between chips.
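
To make the stall-reason rules above concrete, here is a rough toy model in Python. It is only a sketch of the decision logic, not a description of the real hardware: the queue depths, the watermark value, the queue names, and the returned strings are all made-up placeholders. It also assumes the watermark is compared against the total occupancy of the LSU queue, which is just one possible reading.

```python
# Toy model only: every number below is hypothetical, not a published value.
LSU_QUEUE_DEPTH = 8    # hypothetical capacity of the shared LSU queue
LG_WATERMARK = 6       # hypothetical occupancy at which global/local ops are held back
TEX_QUEUE_DEPTH = 4    # hypothetical
OTHER_QUEUE_DEPTH = 4  # hypothetical, used for XU/MUFU, ADU, ...

LG_OPS = {"global", "local"}   # tagged accesses through L1
LSU_OPS = LG_OPS | {"shared"}  # everything routed to the LSU queue


def stall_reason(op, occupancy):
    """Return the stall reason for a warp whose next instruction is `op`,
    or None if the instruction can be issued.

    `occupancy` maps a queue name ("lsu", "tex", "xu", "adu", ...) to the
    number of instructions currently waiting in that queue.
    """
    if op in LSU_OPS:
        used = occupancy.get("lsu", 0)
        # Global/local ops are held back at the watermark (assumption:
        # the watermark is checked against total queue occupancy).
        if op in LG_OPS and used >= LG_WATERMARK:
            return "lg_throttle"
        # Shared memory ops only stall once the queue is completely full.
        if used >= LSU_QUEUE_DEPTH:
            return "mio_throttle"
        return None
    if op == "tex":
        return "tex_throttle" if occupancy.get("tex", 0) >= TEX_QUEUE_DEPTH else None
    # Any other MIO queue (e.g. "xu", "adu") reports the generic reason.
    return "mio_throttle" if occupancy.get(op, 0) >= OTHER_QUEUE_DEPTH else None


backed_up = {"lsu": 6}  # watermark reached, but the queue is not full
print(stall_reason("global", backed_up))  # lg_throttle
print(stall_reason("shared", backed_up))  # None: shared ops can still issue
print(stall_reason("xu", {"xu": 4}))      # mio_throttle
```

The point of the `backed_up` case is the gap between the watermark and the full depth: that headroom is what lets shared memory instructions keep issuing while global/local instructions are being throttled.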