Understanding IPC and Issue Slot Utilization when using Tensor Cores

I’m seeking a conceptual understanding of the expected IPC/Issue Slot Utilization when using Volta Tensor Cores for large GEMMs to achieve near peak performance.

I have been benchmarking FP16 GEMMs with cuBLAS and cuBLASLt on a Tesla V100 PCIe. A GEMM of size M = N = K = 8192 achieves ~101 TFLOPS, which is ~90% of the theoretical peak.
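For context, the benchmark calls are along these lines (a minimal sketch, not my exact code; allocation, initialization, timing, and error checking are omitted, and the buffer names are placeholders):

#include <cublas_v2.h>
#include <cuda_fp16.h>

// Minimal FP16 GEMM sketch (CUDA 10-era cuBLAS). Enabling
// CUBLAS_TENSOR_OP_MATH is what allows Tensor Core kernels such as
// volta_h884gemm_* to be selected.
void fp16_gemm(cublasHandle_t handle,
               const __half* dA, const __half* dB, __half* dC,
               int m, int n, int k)
{
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

    const __half alpha = __float2half(1.0f);
    const __half beta  = __float2half(0.0f);

    // C = alpha * A * B + beta * C, all operands FP16, "nn" layout,
    // column-major leading dimensions.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 dA, CUDA_R_16F, m,
                 dB, CUDA_R_16F, k,
                 &beta,
                 dC, CUDA_R_16F, m,
                 CUDA_R_16F,                     // compute type
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP); // allow Tensor Core algorithms
}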

My understanding is that, in order to achieve peak Tensor Core performance, one should utilize all available Tensor Cores on all warp schedulers on all SMs. This is reflected in the calculation for theoretical FLOPS below:

theoretical Tensor Core FLOPS = (# Tensor Cores) * (FLOPs / Tensor Core / cycle) * (cycles / second)
= (2 Tensor Cores/Warp Scheduler * 4 Warp Schedulers/SM * 80 SMs) * (128 FLOPs / Tensor Core / cycle) * (1.38 GHz) = ~112 TFLOPS
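The same arithmetic written out as a quick check (plain C++ sketch; it prints ~113 TFLOPS, which matches the ~112 TFLOPS figure quoted for the V100 PCIe within rounding):

#include <cstdio>

int main()
{
    const double tensor_cores     = 2.0 * 4.0 * 80.0; // per scheduler * schedulers/SM * SMs = 640
    const double flops_per_tc_clk = 128.0;            // 64 FMAs = 128 FLOPs per Tensor Core per cycle
    const double clock_hz         = 1.38e9;           // V100 PCIe boost clock

    const double peak_tflops = tensor_cores * flops_per_tc_clk * clock_hz / 1e12;
    printf("theoretical Tensor Core peak: %.0f TFLOPS\n", peak_tflops); // ~113
    return 0;
}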

My expectation is that, for the GEMM described above (which achieves ~90% of peak), I should see just under 4 instructions executed per cycle per SM (one for each warp scheduler) and, similarly, just under 100% issue slot utilization.

However, when I run the GEMM described above under nvprof to collect the ipc and issue_slot_utilization metrics, I see the following output:

Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Kernel: volta_h884gemm_128x128_ldg8_nn
        100                                       ipc                              Executed IPC    1.194799    1.200118    1.196520
        100                    issue_slot_utilization                    Issue Slot Utilization      29.87%      30.00%      29.91%
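For reference, metrics like these come from an nvprof invocation along these lines (the binary name is just a placeholder):

nvprof --metrics ipc,issue_slot_utilization ./gemm_benchmark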

As this IPC is much lower than my expectation above, there is clearly something incorrect in my understanding of IPC and Tensor Core operations.

Could someone walk me through the expected IPC and issue utilization for Tensor Cores or explain the IPC I report above? I can provide a code sample if this would be helpful.

Thanks!

In the case above the tensor instruction has a throughput of 1 instruction per 4 cycles per SM sub-partition (warp scheduler). The maximum SM IPC in this case is therefore 1.0 and the maximum issue slot utilization is 25%. The kernel interleaves instructions to the other instruction pipelines (FP32, ALU, load/store unit) between tensor instructions, so the measured IPC is higher than the maximum tensor-pipe IPC (for the given instruction mix).

The Volta SM can issue 1 FP32 instruction every 2 cycles per SM sub-partition, so the maximum SM IPC for FP32 is 2.0 and the maximum issue slot utilization is 50%. The other 50% of cycles can be used to issue to other instruction pipelines.

Issue slot utilization (called issue_active in newer tools) is the percentage of cycles on which an SM sub-partition issued at least 1 instruction. Kepler through Pascal supported dual-issue, but a dual-issue cycle still counts as 1 cycle. On the Volta and Turing architectures, issue slot utilization is IPC / MaximumIPC, i.e. IPC / 4.0 per SM. The one difference is that issue slot utilization is based upon instructions issued, not instructions retired. There are several cases where issued > retired, but in almost all cases the difference should be insignificant.

On the Kepler architecture, reducing instruction replays was a key item to look at for code optimization, as replays for memory address divergence stole issue cycles from the math pipes. On Maxwell and above, replays are handled in the memory units (shared memory, L1, TEX, …).
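To make the arithmetic explicit, here is a quick worked check in plain C++: the tensor-pipe ceiling from the issue rate above, and the IPC / 4.0 relation applied to the measured numbers (the IPC value is the nvprof average quoted earlier):

#include <cstdio>

int main()
{
    // Ceiling for a pure tensor-instruction stream on Volta:
    // 1 HMMA per 4 cycles per sub-partition, 4 sub-partitions per SM.
    const double max_tensor_ipc = 4.0 * (1.0 / 4.0); // 1.0
    printf("max tensor-pipe IPC per SM: %.2f (=> %.0f%% issue slot utilization)\n",
           max_tensor_ipc, 100.0 * max_tensor_ipc / 4.0);

    // Measured relation: issue slot utilization = IPC / 4.0 on Volta.
    const double measured_ipc = 1.196520; // avg "Executed IPC" from the nvprof output above
    printf("predicted issue slot utilization: %.2f%%\n", 100.0 * measured_ipc / 4.0);
    // Prints 29.91%, matching the reported average of 29.91%.
    return 0;
}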

Thank you for the detailed explanation, Greg! This is exactly what I was looking for.

For my own future reference, is there an official document you could point me to that lists the throughputs of various instructions for a particular architecture (e.g., Volta)? I’ve found some of this information sprinkled across blog posts, whitepapers, etc., but have not come across anything that lists it per instruction.

Thanks again!

Some instruction throughputs are here:
Programming Guide :: CUDA Toolkit Documentation
It’s not exhaustive, however.

Thank you, Robert!

After revisiting this, I think that there is still a gap in my understanding.

Does “tensor instruction” here refer to a primitive like wmma::mma_sync or a single HMMA.884.* instruction?
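For concreteness, by wmma::mma_sync I mean the warp-level WMMA primitive, as in this minimal single-tile sketch (FP16 inputs and FP16 accumulators, executed by one warp; purely illustrative):

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile: C = A * B.
// Launch as wmma_tile<<<1, 32>>>(dA, dB, dC) with 16x16 device buffers.
__global__ void wmma_tile(const half* A, const half* B, half* C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;

    wmma::fill_fragment(c_frag, __float2half(0.0f));
    wmma::load_matrix_sync(a_frag, A, 16);           // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // the Tensor Core operation
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}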

I have read elsewhere [1] that the wmma::mma_sync primitive used for performing a 16x16x16 matrix multiplication on Tensor Cores is decomposed into components that each compute a 4x8 chunk of the overall 16x16 output. Computing a single 4x8 output chunk requires issuing 4 HMMA.884.* instructions (walking the K dimension of the 16x16x16 GEMM). From my understanding (extrapolating from [1]), each of these HMMA.884.* instructions uses the 2 Tensor Cores available on an SM sub-partition to compute a partial accumulation of the 4x8 output chunk.

Based on this, my expectation is that “tensor instruction” above refers to the “higher level” primitive that computes a 4x8 output chunk of wmma::mma_sync, and that the 1 instruction per 4 cycles can be explained by needing to issue 4 HMMA.884.* instructions, one per cycle.

If this is not the case (i.e., if “tensor instruction” refers to a single HMMA.884.* instruction), then I am a bit lost. If each individual HMMA.884.* instruction requires 4 cycles to complete, would this not mean that a Tensor Core is performing 64 FMAs per 4 cycles rather than 64 FMAs per cycle?

Please let me know if the description above is unclear and I will try to better explain my misunderstanding. I apologize if any of the terminology I am using is incorrect or inconsistent.

[1] Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking, arXiv:1804.06826, Section 4.3 (p. 41)