What's the difference between MIO and LSU instruction queue in Volta architecture?

yd11130055p1 · May 25, 2020, 11:11am

After seeing the Hot Chips presentation and the discussion here, it seems these are two different kinds of instruction queues. Also, Table 6 of the Nsight Compute documentation seems to mention two different things, MIO Throttle and LG Throttle.

But I can’t understand the difference between the two. How are they different? Does an MIO instruction queue belong to each sub-core of an SM, while the LSU resides in the ‘MIO scheduler’, which is shared among sub-cores?

Is the LSU ‘instruction queue’ the same as the LSU ‘execution unit’?

Thanks.

Greg · May 28, 2020, 9:56pm

On Volta through Ampere, the MIO (Memory Input Output) block consists of the LSU (Load Store Unit), Address Divergent Unit (ADU), Texture Unit, and slow math units (FP64 on non-compute parts, and Tensor on TU116/TU117). Instructions for these units are issued to instruction queues that sit between the Sub-Core (the profilers use the term sub-partition) and the MIO scheduler. The MIO scheduler dispatches instructions from these queues to the shared execution units.

The LSU instruction queue is a FIFO of instructions sent to the Load Store Unit.

The LSU executes global/local/shared loads/stores/atomics and several warp-level operations (e.g. VOTE, SHUFFLE).

The MIO has separate instruction queues for other execution units including TEX and MUFU (XU in most tools).

The instruction queue for global/local and shared has a set watermark for global/local operations. If the watermark is reached, any warp whose next instruction is a global/local LSU instruction is stalled on LG throttle. The watermark is set to ensure that shared memory operations can still issue when the L1 global/local (tagged accesses) pipeline is backed up. This can happen if the SM issues a lot of memory loads that miss in L2.

If the next instruction of a warp is to any MIO instruction queue other than TEX (e.g. XU/MUFU, ADU) and that queue is full, then the warp will report stalled on MIO throttle. The TEX instruction queue has a separate stall reason called TEX throttle. The other queues do not have separate reasons because the queues vary between chips.
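The stall-reason rules above can be restated as a small toy model. This is only a sketch of the decision logic as described in this post, not of the hardware: the class names, the queue capacities, and the watermark value are made up for illustration, and treating a shared-memory instruction that meets a completely full LSU queue as MIO throttle is my reading of the "any MIO instruction queue other than TEX" rule.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Queue:
    """A bounded instruction queue; capacity values here are illustrative only."""
    capacity: int
    occupancy: int = 0

    def full(self) -> bool:
        return self.occupancy >= self.capacity


@dataclass
class MioModel:
    """Hypothetical model of the MIO instruction queues of one sub-partition."""
    lsu: Queue                # shared by global/local and shared-memory ops
    lg_watermark: int         # occupancy at which global/local ops stop issuing
    tex: Queue
    other: Dict[str, Queue]   # e.g. {"XU": Queue(...), "ADU": Queue(...)}

    def stall_reason(self, kind: str) -> Optional[str]:
        """Return the stall reason for a warp whose next instruction is `kind`,
        or None if the instruction can be issued to its queue."""
        if kind == "global_local":
            # Global/local ops are cut off at the watermark, before the queue
            # is full, so the remaining slots stay free for shared-memory ops.
            if self.lsu.occupancy >= self.lg_watermark:
                return "lg_throttle"
            return None
        if kind == "shared":
            # Assumption: shared-memory ops only stall once the queue is full.
            return "mio_throttle" if self.lsu.full() else None
        if kind == "tex":
            return "tex_throttle" if self.tex.full() else None
        # Any other MIO queue (XU/MUFU, ADU, ...) reports the generic reason.
        return "mio_throttle" if self.other[kind].full() else None


# Example state: LSU queue at its global/local watermark but not full,
# TEX queue full, XU queue with room left.
model = MioModel(
    lsu=Queue(capacity=8, occupancy=6),
    lg_watermark=6,
    tex=Queue(capacity=4, occupancy=4),
    other={"XU": Queue(capacity=2, occupancy=1)},
)

for kind in ("global_local", "shared", "tex", "XU"):
    print(kind, "->", model.stall_reason(kind))
```

In this example state the global/local instruction is held back while the shared-memory instruction can still issue, which is the situation the watermark exists to allow.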
