Does the STG.E instruction on Ampere occupy two clock cycles of the FMAHeavy pipeline?

When analyzing the warp state sampling data from Nsight Compute, two issues have left me puzzled. Below is the sequence of SASS instructions along with their corresponding sampling data (Ampere architecture, one warp per SMSP, sampling interval of 32 clock cycles):

(screenshot of the SASS listing with per-instruction sampling data omitted)

The body of a for loop begins at instruction 20.
The first question: Why is half of the sampling data wait_stall and the other half selected for the instruction at line 22?
The mov instruction at line 21 incurs a significant amount of long scoreboard stall because register R5 is used as input for the STG.E instruction at the end of the loop body.

From the sampling data of instruction 21, the proportion of the selected state (4.06%, i.e. selected roughly once every 25 samples) indicates that this instruction takes an average of 25 clock cycles to issue. Given that the IADD3 at line 22 depends only on the output of the IADD3 at line 20, why does it still need to wait one more clock cycle (evident from the sampling data of instruction 22, where the selected state accounts for 49.37%) after so many clock cycles have already passed?

The second question: The IMAD.WIDE instruction at line 30 also experiences a wait_stall. Is this due to the preceding STG.E instruction occupying the FMAHeavy pipeline?

I acknowledge this query might seem unusual, but I can’t think of another explanation for why the IMAD.WIDE instruction at line 30 would experience a wait_stall. The last IMAD.WIDE before it is at line 27. On the Ampere architecture, if two IMAD.WIDE instructions are independent of each other, the second can be issued after an interval of 4 clock cycles. When the IMAD.WIDE at line 27 is issued, the sampling data at line 28 (where the selected state accounts for 26.18% and dispatch stall for 73.82%) shows that it takes 4 clock cycles to issue the subsequent IADD3. This should leave the FMAHeavy pipeline idle and ready to accept the next instruction. However, the IMAD.WIDE at line 30 still requires 2 clock cycles to issue, with its state split half wait and half selected.

The only explanation I can see is that the STG.E instruction at line 29 also needs the FMAHeavy pipeline (for instance, if the addresses in its input registers are not yet actual memory addresses but belong to the address space of the current CUDA context, a multiplication might be required to convert them to physical addresses), and that the throughput of this computation in the FMAHeavy pipeline is one instruction every 2 clock cycles (like multiplying two 32-bit integers to get another 32-bit integer). This could explain why all the sampling results for instruction 29 are selected and why the subsequent IMAD.WIDE requires an additional clock cycle.

I’m uncertain if there are any flaws in my analysis. If there are, please point them out.

See here. stall_selected means the instruction was selected (it’s not actually a stall). stall_wait is an intentional act by the machine that acknowledges that a pipeline is not available. The approximately equal weighting of those two suggests that there may be a pipeline that can only accept a new instruction every 2 clocks.

To a first-order approximation, I don’t think STG.E has anything to do with the FMAHeavy pipe.

I don’t claim any specific knowledge about what is going on here.

Generally speaking, integer multiply-type instructions producing the full product (i.e. a wide result) require some trickery, as they need to write two registers’ worth of results. Typically each functional unit can write one, and only one, register per cycle (there is a limited, fixed number of write ports on the register file).

One way of dealing with this is to treat a MUL.WIDE as a MUL.LO followed by a MUL.HI. Such op-splitting or double-pumping of the multiplier can be worked into the hardware without resorting to microcode. A variant of this might make a single pass through the multiplier, store the full product into a double-wide internal temp register, and then pump its content out to the register file in two consecutive cycles. In either case this means the front of the execution pipeline needs to be stalled for a cycle to prevent a result bus conflict.
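
As a rough software-level analogy (the real mechanism is internal to the execution pipeline), the split can be sketched with CUDA’s 32-bit multiply intrinsics, assembling the full product from separately produced low and high halves:

__device__ unsigned long long mul_wide_split(unsigned int a, unsigned int b) {
    unsigned int lo = a * b;           // low 32 bits of the product (MUL.LO)
    unsigned int hi = __umulhi(a, b);  // high 32 bits of the product (MUL.HI)
    // each half is a single-register result; the wide product needs both
    return ((unsigned long long)hi << 32) | lo;
}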

An alternative is to steal / borrow a result slot from a second functional unit, allowing both halves of the full product to be delivered simultaneously: this complicates the scheduling of MUL.WIDE type instructions.

The observed behavior seems to be consistent with the former approach.

I’m thankful that we have genius wizards like njuffa, greg, striker159, rs277, and others as part of this community.

Your explanation is confusing me. The STG.E instruction seems to only read data from the register file rather than write to it, so the IMAD.WIDE instruction shouldn’t need to wait an additional clock cycle to avoid write bus conflicts, right?

I think the suggestion was that IMAD.WIDE needs to write a 64-bit result, and so may require two (32-bit) write accesses to the register file, and therefore would require 2 clocks. This could be an explanation for the profiler observation that with IMAD.WIDE (line 30) there appear to be roughly equal samplings of stall_selected and stall_wait. That particular instruction, in that particular pipe, can only issue every other clock.
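
For context, IMAD.WIDE.U32 typically shows up in address arithmetic, where a 32-bit index is scaled by the element size and widened to a 64-bit address. A minimal sketch of that source pattern (the exact SASS emitted depends on the compiler version):

__global__ void store_at(unsigned int *out, unsigned int idx) {
    // the 32x32+64-bit address computation out + (uint64_t)idx * 4
    // typically compiles to a single IMAD.WIDE.U32 on sm_8x, whose
    // 64-bit result occupies a register pair (two 32-bit writes)
    out[idx] = 0u;
}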

Most of this is just focused on answering this question:

I did not want to redundantly state what had already been stated by Robert Crovella:

In other words, STG.E is not in the picture (a red herring). The stalls are plausibly explained by IMAD.WIDE.U32 forcing a stall cycle in order to utilize two back-to-back result slots. Despite its once-every-two-cycles throughput, IMAD.WIDE.U32 presumably exists because it is nonetheless more efficient than a hypothetical replacement composed strictly of 32-bit operations (the 64-bit summation would require use of a carry, extend, or condition flag, something which is best avoided).
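
To make that concrete, here is a minimal sketch (the function name and the register-pair representation via two unsigned ints are illustrative only, not the actual hardware mechanism) of what such a strictly 32-bit replacement would have to do:

// accumulate the full 32x32->64-bit product of a and b into the
// pair (hi:lo) using only 32-bit operations plus an explicit carry
__device__ void mad_wide_u32_emulated(unsigned int a, unsigned int b,
                                      unsigned int &lo, unsigned int &hi) {
    unsigned int p_lo = a * b;           // low half of the product
    unsigned int p_hi = __umulhi(a, b);  // high half of the product
    lo += p_lo;
    hi += p_hi + (lo < p_lo ? 1u : 0u); // carry out of the low-half addition
}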

The first question: Why is half of the sampling data wait_stall and the other half selected for the instruction at line 22?

The ALU pipe issue rate is 0.5 warp instructions/cycle. The per-cycle sequence for 1 warp per SMSP would be:

CYCLE   LINE    REASON
0       21      SELECTED
1       22      WAIT
2       22      SELECTED

The mov instruction at line 21 incurs a significant amount of long scoreboard stall because register R5 is used as input for the STG.E instruction at the end of the loop body.

To verify I would need a minimal reproducible example. A screenshot and/or NCU report does not have sufficient information. The compiler may be waiting for all scoreboards at the start of the loop as opposed to just the scoreboard on R5 from the STG.E at the end of the loop.

From the sampling data of instruction 21, the proportion of the selected state (4.06%, i.e. selected roughly once every 25 samples) indicates that this instruction takes an average of 25 clock cycles to issue. Given that the IADD3 at line 22 depends only on the output of the IADD3 at line 20, why does it still need to wait one more clock cycle (evident from the sampling data of instruction 22, where the selected state accounts for 49.37%) after so many clock cycles have already passed?

The 25-cycle average delay is incurred before issuing line 21. There is an additional 1-cycle wait between the MOV and the IADD3, as both dispatch to the ALU pipe and the ALU pipe issue rate is 0.5 warp instructions/cycle, so I would expect ~50% wait and ~50% selected.

The second question: The IMAD.WIDE instruction at line 30 also experiences a wait_stall. Is this due to the preceding STG.E instruction occupying the FMAHeavy pipeline?

IMAD* uses the FMAheavy pipe. STG.E does not use the FMA* pipe. The compiler may have added a wait after the STG.E. The compiler often assumes more than 1 warp. I would need a minimal reproducible example to confirm.

I acknowledge this query might seem unusual, but I can’t think of another explanation for why the IMAD.WIDE instruction at line 30 would experience a wait_stall.

Wait stalls should always be due to the previous instruction.

Thank you for your reply. This is my test code:

#include <stdexcept>
#include <cuda_runtime.h>
#include <cstdint>
#include <iostream>
__global__
void kernel(uint32_t *out, const uint32_t interval, const uint32_t step) {
    uint32_t offset = blockIdx.x * interval + threadIdx.x % warpSize;
    for (uint32_t i = 0; i < 100000; i++) {
        out[offset] = threadIdx.x;
        offset += step;
    }
}

void cudaCheckError(cudaError_t result){
    if (result != cudaSuccess) {
        throw std::runtime_error(cudaGetErrorString(result));
    }
}

int main() {
    cudaDeviceProp prop;
    cudaCheckError(cudaGetDeviceProperties(&prop, 0));
    uint32_t block_num = prop.multiProcessorCount;           // one block per SM
    const uint32_t SMSP_NUM_ON_AMPERE = 4;
    uint32_t block_dim = SMSP_NUM_ON_AMPERE * prop.warpSize; // 4 warps per block -> 1 warp per SMSP
    uint32_t *d_out;
    cudaCheckError(cudaMalloc(&d_out, sizeof(uint32_t) * prop.warpSize));
    // interval = 0 and step = 0 keep every store inside the 32-element buffer
    kernel<<<block_num, block_dim>>>(d_out, 0, 0);
    cudaCheckError(cudaDeviceSynchronize());
    cudaCheckError(cudaFree(d_out));
    return 0;
}

Or you can download it from here.
The complete compilation command is:

nvcc -arch=compute_86 -code=sm_86 -O3 -o main main.cu
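
For anyone reproducing the analysis, the SASS listing and the sampling data can be collected with something along these lines (CUDA 11.x tool spellings; exact flags may vary by version):

cuobjdump --dump-sass main
ncu --set full -o profile ./main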

CUDA: Cuda compilation tools, release 11.8, V11.8.89
OS: Ubuntu 20.04.6 LTS

I reinstalled different versions of CUDA and also rewrote the source code, so the specific values obtained from this test may differ slightly from the earlier ones.

Based on your response, I’ve grasped why the first issue might occur. Looking forward to your further explanation regarding the second problem.

The mov instruction at line 21 incurs a significant amount of long scoreboard stall because register R5 is used as input for the STG.E instruction at the end of the loop body.

To verify I would need a minimal reproducible example. A screenshot and/or NCU report does not have sufficient information. The compiler may be waiting for all scoreboards at the start of the loop as opposed to just the scoreboard on R5 from the STG.E at the end of the loop.

There is a scoreboard dependency on the STG.E instructions at the end of the loop, requiring that the last 7 STG.E have completed reading their registers. I think the latency is high due to the interleaving of IMAD.WIDE, which is stalling the STG MIO register reads.

The second question: The IMAD.WIDE instruction at line 30 also experiences a wait_stall. Is this due to the preceding STG.E instruction occupying the FMAHeavy pipeline?

IMAD* uses the FMAheavy pipe. STG.E does not use the FMA* pipe. The compiler may have added a wait after the STG.E. The compiler often assumes more than 1 warp. I would need a minimal reproducible example to confirm.

I confirmed there is an additional wait cycle specified, as you observed. I will have to discuss internally why this is required. My guess is the compiler is using statistical analysis of pipe throughput, assuming one warp per sub-partition as the baseline.

Thanks a lot. Looking forward to your further explanation regarding the second problem.

I think I now understand the reason behind my earlier confusion about the experimental observations. The crux of the issue is the wait_stall. I used to believe that a wait_stall occurred dynamically during execution, as a result of an instruction needing to wait for the execution result of a previous instruction, and hence that the number of stall cycles varied at runtime. For instance, with the IMAD instruction, I assumed a subsequent instruction awaiting its result had to wait for a signal before it could start executing, causing the wait_stall.

In reality, however, the wait_stall is determined statically by the compiler at compile time, possibly due to dependencies between instructions or other reasons. Regardless of how long the instructions between two dependent instructions actually take at runtime, the second instruction waits for the fixed duration specified at compile time. Looking at the instruction at line 30: since the throughput of IMAD.WIDE is one instruction every 4 clock cycles, one more clock cycle of waiting is still required after the IMAD.WIDE at line 27, and this required 1-cycle wait is encoded onto the STG instruction at line 29. Regardless of how the IADD3 at line 28 behaves at runtime, it does not remove the need for the 1-clock-cycle wait_stall when the instruction at line 30 issues.

Thank you very much to all of you for your help. :)