I’m trying to gain a deeper understanding of the instruction and data flow within Ampere and later architecture SMs, but I’m a bit confused about the role of the MIO (Memory I/O) unit and its relationship with other components. I’d appreciate some clarification on the following points.
Prerequisites: My questions are focused on Ampere and subsequent architectures.
After reading this forum post, I noticed that the description of the MIO unit doesn’t seem to include its interaction with L1 Cache / Shared Memory (SMEM). This confused me. Could you clarify the specific relationship between the MIO unit and the L1/SMEM? What is the relationship between the LSU (Load/Store Unit) pipes and the L1/SMEM?
My confusion about the MIO’s role was amplified after reading this second post, which states, “The MIO unit is responsible for queuing and dispatching instructions to SM shared instruction pipelines.” I also understand there is a “shared pipe” within the SM (which might include the FP64 and Tensor Core pipes). Does this mean the MIO unit is responsible for dispatching instructions to the shared pipe?
I’m trying to understand the instruction dispatch path. Does the warp scheduler in an SM sub-partition dispatch instructions directly to the various execution pipelines (like the Tensor Core pipe, FP64 pipe, etc.)? Or, does the warp scheduler first dispatch instructions to the MIO unit, which then dispatches them to the respective pipelines? Under what conditions would each of these dispatch paths occur?
After reading this forum post, I noticed that the description of the MIO unit doesn’t seem to include its interaction with L1 Cache / Shared Memory (SMEM). This confused me. Could you clarify the specific relationship between the MIO unit and the L1/SMEM? What is the relationship between the LSU (Load/Store Unit) pipes and the L1/SMEM?
The unified L1 has two primary command input streams:
Load Store Unit - Pipe for generic, shared, global, local, and dsmem operations.
TEX - Pipe for texture and surface operations (and, while not part of L1/SHMEM, some SM-level math units)
MIO is a generic dispatch unit feeding data from SM sub-partitions into arbiters for shared pipelines for L1TEX (LSU and TEX) as well as other shared units such as IDC (Index Constant Cache) and CBU (Control Branch Unit).
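As a concrete (hypothetical) illustration of the two command streams, a kernel that mixes global/shared accesses with texture fetches exercises both input pipes of the unified L1. The SASS mnemonics in the comments are the instructions these lines typically compile to; the exact mapping depends on architecture and compiler version:

```cuda
// Sketch: source-level operations and the L1TEX input pipe they typically
// reach (pipe assignments are approximate and architecture-dependent).
__global__ void l1_pipes(cudaTextureObject_t tex, const float *g, float *out)
{
    __shared__ float tile[256];

    int i = threadIdx.x;
    tile[i] = g[i];                       // LDG + STS -> LSU pipe (global, shared)
    __syncthreads();

    float s = tile[255 - i];              // LDS       -> LSU pipe (shared memory)
    float t = tex1Dfetch<float>(tex, i);  // TEX       -> TEX pipe (texture)

    out[i] = s + t;                       // STG       -> LSU pipe (global)
}
```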
My confusion about the MIO’s role was amplified after reading this second post, which states, “The MIO unit is responsible for queuing and dispatching instructions to SM shared instruction pipelines.” I also understand there is a “shared pipe” within the SM (which might include the FP64 and Tensor Core pipes). Does this mean the MIO unit is responsible for dispatching instructions to the shared pipe?
One of the joys of naming conventions is that when architectures change the names remain. The “shared pipe” is the per SM sub-partition dispatch port for units such as Tensor (IMMA, HMMA, DMMA, etc.) and FP64 that take in more registers than the FMA and ALU pipes.
The SM may also have SM-level shared units, for which all SM sub-partitions arbitrate. These “shared” units are in MIO and include LSU, TEX, IDC, CBU, and FP64 (on consumer SMs).
I’m trying to understand the instruction dispatch path. Does the warp scheduler in an SM sub-partition dispatch instructions directly to the various execution pipelines (like the Tensor Core pipe, FP64 pipe, etc.)? Or, does the warp scheduler first dispatch instructions to the MIO unit, which then dispatches them to the respective pipelines? Under what conditions would each of these dispatch paths occur?
Great question.
The SM sub-partition can dispatch to multiple types of pipelines:
1. SMSP fixed-latency pipelines, such as:
   - ALU
   - FMA (heavy and lite on GA10x+)
   - uniform
   - shared - Tensor (and fast FP64)
2. MIO instruction queues for SM sub-partition units:
   - XU
3. MIO instruction queues for SM shared units (shared by all SM sub-partitions):
   - LSU
   - TEX
   - IDC
   - …
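To make the taxonomy concrete, here is a hypothetical kernel annotated with the pipeline each operation would typically dispatch to. The SASS mnemonics and pipe assignments are my best approximation and vary by architecture and compiler:

```cuda
// Sketch: which dispatch path each operation typically takes
// (approximate; verify against SASS for your target architecture).
__global__ void dispatch_paths(const double *d, const float *g, float *out)
{
    int i = threadIdx.x;             // S2R/IMAD  -> ALU / FMA (fixed latency, 1)
    float  f = g[i] * 2.0f + 1.0f;   // FFMA      -> FMA pipe (fixed latency, 1)
    double x = d[i] * 3.0;           // DMUL      -> "shared" dispatch port (1),
                                     //              or MIO shared FP64 on consumer SMs (3)
    float  r = __sinf(f);            // MUFU.SIN  -> XU via MIO sub-partition queue (2)
    out[i] = r + (float)x;           // STG       -> LSU via MIO shared unit queue (3)
}
```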
In the case of (3), a separate unit called the MIOC (Memory Input/Output Controller) dispatches instructions from the per-SMSP instruction queues to the shared instruction pipes in MIO; MIO thus has a two-level instruction issue.
Warps will report the following stall reasons for (2) and (3) above when the instruction queue is full:
TEX instructions will report tex_throttle
LSU to L1 instructions will report lg_throttle then mio_throttle
LSU to SHMEM instructions will report mio_throttle
All other MIO instructions will report mio_throttle
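These stall reasons can be inspected per kernel with Nsight Compute. A possible invocation is sketched below; the metric names follow recent Nsight Compute naming conventions, so verify them with `ncu --query-metrics` on your version, and `./my_app` is a placeholder for your binary:

```shell
# Warp stall-reason metrics for tex_throttle, lg_throttle, and mio_throttle
# (names assumed from recent Nsight Compute versions; check --query-metrics).
ncu --metrics smsp__warp_issue_stalled_tex_throttle_per_warp_active.pct,\
smsp__warp_issue_stalled_lg_throttle_per_warp_active.pct,\
smsp__warp_issue_stalled_mio_throttle_per_warp_active.pct ./my_app
```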
Based on my understanding, I’d like to propose a description and ask if it’s accurate:
The complete MIO (Memory I/O) subsystem is composed of the LSU (Load/Store Unit) pipes, the TEX (Texture) pipes, the L1TEX cache as a storage unit, an MIO queue, and an MIO scheduler. Is this an accurate description?
What units are responsible for scheduling instructions out of these respective queues and onto their corresponding pipelines?
For example, let’s say an SMSP warp scheduler dispatches an XU instruction to the MIO instruction queue. What is the next step? How is this XU instruction dispatched from the MIO queue to the XU pipeline in the SMSP? And what about the instruction queues for SM shared units?