Pipeline forwarding means the output of the execution pipe is delivered to the next dependent instruction’s input to the pipe reducing the latency to writeback the output to the register file and read the value from the register file. This could be as small as 1 cycle (no delay) if the execution pipeline is only 1 cycle or it could simply reduce dependent latency by a few cycles.
The code has 4 dependent instructions.
1 ALU SHF.R.U32.HI R5, RZ, 0x4, R4
2 ALU LOP3.LUT R5, R5, c[0x0][0x160], R4, 0x48, !PT
3 FMA IMAD.SHL.U32 R6, R5, 0x10, RZ
4 ALU LOP3.LUT R6, R6, R5, R4, 0x96, !PT
3 instructions are executed by the ALU pipe.
1 instruction is executed by the FMA pipe.
The ALU pipe has a maximum throughput of 0.5 warp instructions/cycle per SMSP.
The FMA (IMAD) pipe has a maximum throughput of 0.5 warp instructions/cycle per SMSP.
The test by @gzz_2000 can be slightly modified to perform optimal launches of 1, 2, 3, … 8 warps per SMSP.
int main()
{
cudaDeviceProp props;
cudaGetDeviceProperties(&props, 0);
unsigned int* x;
cudaMalloc(&x, props.multiProcessorCount * 1024 * sizeof(unsigned int));
kernel<<<props.multiProcessorCount, 128 * 1 >>> (0xfacebabe, x);
kernel<<<props.multiProcessorCount, 128 * 2 >>> (0xfacebabe, x);
kernel<<<props.multiProcessorCount, 128 * 3 >>> (0xfacebabe, x);
kernel<<<props.multiProcessorCount, 128 * 4 >>> (0xfacebabe, x);
kernel<<<props.multiProcessorCount, 128 * 5 >>> (0xfacebabe, x);
kernel<<<props.multiProcessorCount, 128 * 6 >>> (0xfacebabe, x);
kernel<<<props.multiProcessorCount, 128 * 7 >>> (0xfacebabe, x);
kernel<<<props.multiProcessorCount, 128 * 8 >>> (0xfacebabe, x);
cudaDeviceSynchronize();
return 0;
}
The 1 warp per SMSP will show the full latency introduced by dependent instructions. As the number of warps per SMSP increases the SM will be limited by the instruction mix (ALU pipe).
ALU pipe can issue 0.5 instructions/cycle per SMSP.
FMA pipe can issue 0.5 instructions/cycle per SMSP.
Given the instruction mix below the maximum sustained throughput is 4 instructions / 6 cycles per SMSP.
1 ALU SHF.R.U32.HI R5, RZ, 0x4, R4
2 ALU LOP3.LUT R5, R5, c[0x0][0x160], R4, 0x48, !PT
3 FMA IMAD.SHL.U32 R6, R5, 0x10, RZ
4 ALU LOP3.LUT R6, R6, R5, R4, 0x96, !PT
time in cycles -->
1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8
ALU A - A - A - A - A - A - A - A - A -
FMA F - F - F -
A/F instruction issued
- pipe is busy
Comparing 1-8 warps per SMSP it is possible to determine when dependent latency is hidden by additional warps and when ALU pipe is saturated.
ID Grid Block Compute ALU IPC
Size Size Throughput
0 46, 1, 1 128, 1, 1 33.10 33.55 0.90
1 46, 1, 1 256, 1, 1 64.44 65.74 1.76
2 46, 1, 1 384, 1, 1 90.51 93.48 2.50
3 46, 1, 1 512, 1, 1 93.58 94.67 2.54
4 46, 1, 1 640, 1, 1 96.53 97.48 2.61
5 46, 1, 1 768, 1, 1 97.32 98.17 2.63
6 46, 1, 1 896, 1, 1 97.72 98.41 2.64
7 46, 1, 1 1024, 1, 1 97.83 98.62 2.64
The metrics used for ALU and IPC are:
ALU sm__inst_executed_pipe_alu.avg.pct_of_peak_sustained_active
IPC sm__inst_executed.avg.pct_of_peak_sustained_active
The maximum IPC is 2.64. SM maximum IPC for CC 7.0 - 9.0 is 4.0.
2.64 / 4.0 * 100. = 66%
which is the value predicted by the ALU and FMA instruction mix.
Depednent latency can be analyzed by reviewing the metric smsp__average_warps_issue_stalled_wait_per_issue_active.ratio = 3.35
cycles for all launch dimensions.
Grid ID > 1 are hiding the 3.35 cycles per instruction latency through additional FMA instructions (and startup instructions) until the ALU pipe is saturated. Saturation occurs approximately at 4 warps per SMSP.