I’m trying to gain a deeper understanding of the instruction and data flow within Ampere and later architecture SMs, but I’m a bit confused about the role of the MIO (Memory I/O) unit and its relationship with other components. I’d appreciate some clarification on the following points.
Prerequisites: My questions are focused on Ampere and subsequent architectures.
After reading this forum post, I noticed that the description of the MIO unit doesn’t seem to include its interaction with L1 Cache / Shared Memory (SMEM). This confused me. Could you clarify the specific relationship between the MIO unit and the L1/SMEM? What is the relationship between the LSU (Load/Store Unit) pipes and the L1/SMEM?
My confusion about the MIO’s role was amplified after reading this second post, which states, “The MIO unit is responsible for queuing and dispatching instructions to SM shared instruction pipelines.” I also understand there is a “shared pipe” within the SM (which might include the FP64 and Tensor Core pipes). Does this mean the MIO unit is responsible for dispatching instructions to the shared pipe?
I’m trying to understand the instruction dispatch path. Does the warp scheduler in an SM sub-partition dispatch instructions directly to the various execution pipelines (like the Tensor Core pipe, FP64 pipe, etc.)? Or, does the warp scheduler first dispatch instructions to the MIO unit, which then dispatches them to the respective pipelines? Under what conditions would each of these dispatch paths occur?
After reading this forum post, I noticed that the description of the MIO unit doesn’t seem to include its interaction with L1 Cache / Shared Memory (SMEM). This confused me. Could you clarify the specific relationship between the MIO unit and the L1/SMEM? What is the relationship between the LSU (Load/Store Unit) pipes and the L1/SMEM?
The unified L1 has two primary command input streams:
Load Store Unit - Pipe for generic, shared, global, local, and dsmem operations.
TEX - Pipe for texture and surface operations (and, while not part of L1/SHMEM, some SM-level math units)
MIO is a generic dispatch unit feeding data from SM sub-partitions into arbiters for shared pipelines for L1TEX (LSU and TEX) as well as other shared units such as IDC (Index Constant Cache) and CBU (Control Branch Unit).
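As a concrete (hypothetical) illustration of the two command streams, a kernel that mixes global/shared accesses with texture fetches exercises both input pipes of the unified L1. The SASS mnemonics in the comments are the instructions these lines typically compile to; the exact mapping depends on architecture and compiler version:

```cuda
// Sketch: source-level operations and the L1TEX input pipe they typically
// reach (pipe assignments are approximate and architecture-dependent).
__global__ void l1_pipes(cudaTextureObject_t tex, const float *g, float *out)
{
    __shared__ float tile[256];

    int i = threadIdx.x;
    tile[i] = g[i];                       // LDG + STS -> LSU pipe (global, shared)
    __syncthreads();

    float s = tile[255 - i];              // LDS       -> LSU pipe (shared memory)
    float t = tex1Dfetch<float>(tex, i);  // TEX       -> TEX pipe (texture)

    out[i] = s + t;                       // STG       -> LSU pipe (global)
}
```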
My confusion about the MIO’s role was amplified after reading this second post, which states, “The MIO unit is responsible for queuing and dispatching instructions to SM shared instruction pipelines.” I also understand there is a “shared pipe” within the SM (which might include the FP64 and Tensor Core pipes). Does this mean the MIO unit is responsible for dispatching instructions to the shared pipe?
One of the joys of naming conventions is that when architectures change the names remain. The “shared pipe” is the per SM sub-partition dispatch port for units such as Tensor (IMMA, HMMA, DMMA, etc.) and FP64 that take in more registers than the FMA and ALU pipes.
The SM may also have SM-level shared units, for which all SM sub-partitions arbitrate. These “shared” units are in MIO and include LSU, TEX, IDC, CBU, and FP64 (on consumer SMs).
I’m trying to understand the instruction dispatch path. Does the warp scheduler in an SM sub-partition dispatch instructions directly to the various execution pipelines (like the Tensor Core pipe, FP64 pipe, etc.)? Or, does the warp scheduler first dispatch instructions to the MIO unit, which then dispatches them to the respective pipelines? Under what conditions would each of these dispatch paths occur?
Great question.
The SM sub-partition can dispatch to multiple types of pipelines:
1. SMSP fixed-latency pipelines, such as:
   - ALU
   - FMA (heavy and lite on GA10x+)
   - uniform
   - shared - Tensor (and fast FP64)
2. MIO instruction queues for SM sub-partition units:
   - XU
3. MIO instruction queues for SM shared units (shared by all SM sub-partitions):
   - LSU
   - TEX
   - IDC
   - …
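To make the taxonomy concrete, here is a hypothetical kernel annotated with the pipeline each operation would typically dispatch to. The SASS mnemonics and pipe assignments are my best approximation and vary by architecture and compiler:

```cuda
// Sketch: which dispatch path each operation typically takes
// (approximate; verify against SASS for your target architecture).
__global__ void dispatch_paths(const double *d, const float *g, float *out)
{
    int i = threadIdx.x;             // S2R/IMAD  -> ALU / FMA (fixed latency, 1)
    float  f = g[i] * 2.0f + 1.0f;   // FFMA      -> FMA pipe (fixed latency, 1)
    double x = d[i] * 3.0;           // DMUL      -> "shared" dispatch port (1),
                                     //              or MIO shared FP64 on consumer SMs (3)
    float  r = __sinf(f);            // MUFU.SIN  -> XU via MIO sub-partition queue (2)
    out[i] = r + (float)x;           // STG       -> LSU via MIO shared unit queue (3)
}
```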
In the case of (3), a separate unit called the MIOC (Memory Input/Output Controller) dispatches instructions from the per-SMSP instruction queues to the shared instruction pipes in MIO; MIO thus has a two-level instruction issue.
Warps will report the following stall reasons for (2) and (3) above when the instruction queue is full:
TEX instructions will report tex_throttle
LSU to L1 instructions will report lg_throttle then mio_throttle
LSU to SHMEM instructions will report mio_throttle
All other MIO instructions will report mio_throttle
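These stall reasons can be inspected per kernel with Nsight Compute. A possible invocation is sketched below; the metric names follow recent Nsight Compute naming conventions, so verify them with `ncu --query-metrics` on your version, and `./my_app` is a placeholder for your binary:

```shell
# Warp stall-reason metrics for tex_throttle, lg_throttle, and mio_throttle
# (names assumed from recent Nsight Compute versions; check --query-metrics).
ncu --metrics smsp__warp_issue_stalled_tex_throttle_per_warp_active.pct,\
smsp__warp_issue_stalled_lg_throttle_per_warp_active.pct,\
smsp__warp_issue_stalled_mio_throttle_per_warp_active.pct ./my_app
```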
Based on my understanding, I’d like to propose a description and ask if it’s accurate:
The complete MIO (Memory I/O) subsystem is composed of the LSU (Load/Store Unit) pipes, the TEX (Texture) pipes, the L1TEX cache as a storage unit, an MIO queue, and an MIO scheduler. Is this an accurate description?
What units are responsible for scheduling instructions out of these respective queues and onto their corresponding pipelines?
For example, let’s say an SMSP warp scheduler dispatches an XU instruction to the MIO instruction queue. What is the next step? How is this XU instruction dispatched from the MIO queue to the XU pipeline in the SMSP? And what about the instruction queues for SM shared units?