I’m seeking a conceptual understanding of the expected IPC/Issue Slot Utilization when using Volta Tensor Cores for large GEMMs to achieve near peak performance.
I have been benchmarking FP16 GEMMs with cuBLAS and cuBLASLt on a Tesla V100 PCIe. A GEMM of size M = N = K = 8192 achieves ~101 TFLOPS, which is ~90% of the theoretical peak.
My understanding is that, in order to achieve peak Tensor Core performance, one should utilize all available Tensor Cores on all warp schedulers on all SMs. This is reflected in the calculation for theoretical FLOPS below:
theoretical Tensor Core FLOPS = (# Tensor Cores) * (FLOPs / Tensor Core / cycle) * (cycles / second)
                              = (2 Tensor Cores/Warp Scheduler * 4 Warp Schedulers/SM * 80 SMs) * (128 FLOPs / Tensor Core / cycle) * (1.38 GHz)
                              = 640 * 128 * 1.38e9 FLOPs/s ≈ 112 TFLOPS
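For concreteness, the arithmetic above can be checked with a short script (the SM count, per-cycle FLOP count, and 1.38 GHz boost clock are the V100 PCIe numbers already used above):

```python
# Theoretical Volta Tensor Core throughput for a V100 PCIe (values from above).
tensor_cores_per_scheduler = 2
schedulers_per_sm = 4
sms = 80
flops_per_tc_per_cycle = 128   # FP16 FMA throughput of one Tensor Core
clock_hz = 1.38e9              # boost clock

tensor_cores = tensor_cores_per_scheduler * schedulers_per_sm * sms  # 640
peak_tflops = tensor_cores * flops_per_tc_per_cycle * clock_hz / 1e12
print(f"{tensor_cores} Tensor Cores -> {peak_tflops:.1f} TFLOPS peak")  # 113.0

measured_tflops = 101.0  # best cuBLAS result above
print(f"achieved fraction of peak: {measured_tflops / peak_tflops:.1%}")  # 89.3%
```

(The exact product is ~113 TFLOPS; NVIDIA's quoted ~112 TFLOPS figure rounds this slightly differently, which is why 101 TFLOPS reads as "~90% of peak".)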
My expectation is that, for the GEMM described above which achieves 90% peak performance, I should thus see just under 4 instructions being executed per cycle (one for each warp scheduler), and similarly just under 100% issue slot utilization.
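Quantifying that expectation (one Tensor Core instruction issued per warp scheduler per cycle, scaled by the ~90% of peak achieved), the counters I would naively predict are:

```python
# Naive prediction: every warp scheduler issues one Tensor Core
# instruction per cycle, scaled by the ~90% of peak achieved above.
schedulers_per_sm = 4       # issue slots per SM on Volta
fraction_of_peak = 0.90     # measured ~101 of ~112 TFLOPS

expected_ipc = schedulers_per_sm * fraction_of_peak
expected_issue_slot_utilization = fraction_of_peak

print(f"expected IPC per SM: {expected_ipc:.1f}")  # 3.6
print(f"expected issue slot utilization: {expected_issue_slot_utilization:.0%}")  # 90%
```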
However, when I run the GEMM described above under nvprof to collect the ipc and issue_slot_utilization metrics, I see the following output:
Kernel: volta_h884gemm_128x128_ldg8_nn
    100  ipc                     Executed IPC            Min: 1.194799  Max: 1.200118  Avg: 1.196520
    100  issue_slot_utilization  Issue Slot Utilization  Min: 29.87%    Max: 30.00%    Avg: 29.91%
As this IPC is far below my expectation, there is clearly something wrong with my understanding of how IPC relates to Tensor Core operations.
Could someone walk me through the expected IPC and issue utilization for Tensor Cores or explain the IPC I report above? I can provide a code sample if this would be helpful.