Understanding IPC and Issue Slot Utilization when using Tensor Cores

I’m seeking a conceptual understanding of the expected IPC/Issue Slot Utilization when using Volta Tensor Cores for large GEMMs to achieve near peak performance.

I have been benchmarking FP16 GEMMs with cuBLAS and cuBLASLt on a Tesla V100 PCIe. A GEMM of size M = N = K = 8192 achieves ~101 TFLOPS, which is ~90% of the theoretical peak.
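For context, the benchmark calls are along these lines (a minimal sketch, not my exact code; allocation, initialization, timing, and error checking are omitted, and the buffer names are placeholders):

#include <cublas_v2.h>
#include <cuda_fp16.h>

// Minimal FP16 GEMM sketch (CUDA 10-era cuBLAS). Enabling
// CUBLAS_TENSOR_OP_MATH is what allows Tensor Core kernels such as
// volta_h884gemm_* to be selected.
void fp16_gemm(cublasHandle_t handle,
               const __half* dA, const __half* dB, __half* dC,
               int m, int n, int k)
{
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

    const __half alpha = __float2half(1.0f);
    const __half beta  = __float2half(0.0f);

    // C = alpha * A * B + beta * C, all operands FP16, "nn" layout,
    // column-major leading dimensions.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 dA, CUDA_R_16F, m,
                 dB, CUDA_R_16F, k,
                 &beta,
                 dC, CUDA_R_16F, m,
                 CUDA_R_16F,                     // compute type
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP); // allow Tensor Core algorithms
}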

My understanding is that, in order to achieve peak Tensor Core performance, one should utilize all available Tensor Cores on all warp schedulers on all SMs. This is reflected in the calculation for theoretical FLOPS below:

theoretical Tensor Core FLOPS = (# Tensor Cores) * (FLOPs / Tensor Core / cycle) * (cycles / second)
= (2 Tensor Cores/Warp Scheduler * 4 Warp Schedulers/SM * 80 SMs) * (128 FLOPs / Tensor Core / cycle) * (1.38 GHz) = ~112 TFLOPS
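The same arithmetic written out as a quick check (plain C++ sketch; it prints ~113 TFLOPS, which matches the ~112 TFLOPS figure quoted for the V100 PCIe within rounding):

#include <cstdio>

int main()
{
    const double tensor_cores     = 2.0 * 4.0 * 80.0; // per scheduler * schedulers/SM * SMs = 640
    const double flops_per_tc_clk = 128.0;            // 64 FMAs = 128 FLOPs per Tensor Core per cycle
    const double clock_hz         = 1.38e9;           // V100 PCIe boost clock

    const double peak_tflops = tensor_cores * flops_per_tc_clk * clock_hz / 1e12;
    printf("theoretical Tensor Core peak: %.0f TFLOPS\n", peak_tflops); // ~113
    return 0;
}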

My expectation is that, for the GEMM described above (which achieves ~90% of peak), I should see just under 4 instructions executed per cycle per SM (one for each warp scheduler) and, similarly, just under 100% issue slot utilization.

However, when I run the GEMM described above under nvprof to collect the ipc and issue_slot_utilization metrics, I see the following output:

Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Kernel: volta_h884gemm_128x128_ldg8_nn
        100                                       ipc                              Executed IPC    1.194799    1.200118    1.196520
        100                    issue_slot_utilization                    Issue Slot Utilization      29.87%      30.00%      29.91%
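For reference, metrics like these come from an nvprof invocation along these lines (the binary name is just a placeholder):

nvprof --metrics ipc,issue_slot_utilization ./gemm_benchmark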

As this IPC is much lower than my expectation above, there is clearly something incorrect in my understanding of IPC and Tensor Core operations.

Could someone walk me through the expected IPC and issue utilization for Tensor Cores or explain the IPC I report above? I can provide a code sample if this would be helpful.

Thanks!

In the case above the tensor instruction has a throughput of 1 instruction per 4 cycles per SM sub-partition (warp scheduler). The maximum SM IPC in this case is therefore 1.0 and the maximum issue slot utilization is 25%. The kernel interleaves instructions to the other instruction pipelines (FP32, ALU, load/store unit) between tensor instructions, so the measured IPC is higher than the maximum tensor-pipe IPC (for the given instruction mix).

The Volta SM can issue 1 FP32 instruction every 2 cycles per SM sub-partition, so the maximum SM IPC for FP32 is 2.0 and the maximum issue slot utilization is 50%. The other 50% of cycles can be used to issue to other instruction pipelines.

Issue slot utilization (called issue_active in newer tools) is the percentage of cycles on which an SM sub-partition issued at least 1 instruction. Kepler through Pascal supported dual-issue, but a dual-issue cycle still counts as 1 cycle. On the Volta and Turing architectures, issue slot utilization is IPC / MaximumIPC, i.e. IPC / 4.0 per SM. The one difference is that issue slot utilization is based upon instructions issued, not instructions retired. There are several cases where issued > retired, but in almost all cases the difference should be insignificant.

On the Kepler architecture, reducing instruction replays was a key item to look at for code optimization, as replays for memory address divergence stole issue cycles from the math pipes. On Maxwell and above, replays are handled in the memory units (shared memory, L1, TEX, …).
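To make the arithmetic explicit, here is a quick worked check in plain C++: the tensor-pipe ceiling from the issue rate above, and the IPC / 4.0 relation applied to the measured numbers (the IPC value is the nvprof average quoted earlier):

#include <cstdio>

int main()
{
    // Ceiling for a pure tensor-instruction stream on Volta:
    // 1 HMMA per 4 cycles per sub-partition, 4 sub-partitions per SM.
    const double max_tensor_ipc = 4.0 * (1.0 / 4.0); // 1.0
    printf("max tensor-pipe IPC per SM: %.2f (=> %.0f%% issue slot utilization)\n",
           max_tensor_ipc, 100.0 * max_tensor_ipc / 4.0);

    // Measured relation: issue slot utilization = IPC / 4.0 on Volta.
    const double measured_ipc = 1.196520; // avg "Executed IPC" from the nvprof output above
    printf("predicted issue slot utilization: %.2f%%\n", 100.0 * measured_ipc / 4.0);
    // Prints 29.91%, matching the reported average of 29.91%.
    return 0;
}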

Thank you for the detailed explanation, Greg! This is exactly what I was looking for.

For my own future reference, is there an official document you could point me to that lists the throughputs of various instructions for a particular architecture (e.g., Volta)? I’ve found some of this information sprinkled across blog posts, whitepapers, etc., but have not come across anything that lists it per instruction.

Thanks again!

Some instruction throughputs are here:
Programming Guide :: CUDA Toolkit Documentation
It’s not exhaustive, however.

Thank you, Robert!

After revisiting this, I think that there is still a gap in my understanding.

Does “tensor instruction” here refer to a primitive like wmma::mma_sync or a single HMMA.884.* instruction?
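For concreteness, by wmma::mma_sync I mean the warp-level WMMA primitive, as in this minimal single-tile sketch (FP16 inputs and FP16 accumulators, executed by one warp; purely illustrative):

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile: C = A * B.
// Launch as wmma_tile<<<1, 32>>>(dA, dB, dC) with 16x16 device buffers.
__global__ void wmma_tile(const half* A, const half* B, half* C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;

    wmma::fill_fragment(c_frag, __float2half(0.0f));
    wmma::load_matrix_sync(a_frag, A, 16);           // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // the Tensor Core operation
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}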

I have read elsewhere [1] that the wmma::mma_sync primitive used for performing a 16x16x16 matrix multiplication on Tensor Cores is decomposed into components that each compute a 4x8 chunk of the overall 16x16 output. Computing a single 4x8 output chunk requires issuing 4 HMMA.884.* instructions (walking the K dimension of the 16x16x16 GEMM). From my understanding (extrapolating from [1]), each of these HMMA.884.* instructions uses the 2 Tensor Cores available on an SM sub-partition to compute a partial accumulation of the 4x8 output chunk.

Based on this, my expectation is that “tensor instruction” above refers to the “higher level” primitive that computes a 4x8 output chunk of wmma::mma_sync, and that the 1 instruction per 4 cycles can be explained by needing to issue 4 HMMA.884.* instructions, one per cycle.

If this is not the case (i.e., if “tensor instruction” refers to a single HMMA.884.* instruction), then I am a bit lost. If each individual HMMA.884.* instruction requires 4 cycles to complete, would this not mean that a Tensor Core is performing 64 FMAs per 4 cycles rather than 64 FMAs per cycle?

Please let me know if the description above is unclear and I will try to better explain my misunderstanding. I apologize if any of the terminology I am using is incorrect or inconsistent.

[1] Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking, arXiv:1804.06826, Section 4.3 (p. 41)