Wrong pipe utilization for Tensor (FP)?

Curefab · November 6, 2021, 3:51pm

On my RTX 2060, when I call “mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16” 3,200,000 times (32 threads per warp) in a loop on one SM, Nsight Compute 2021.3.0.0 reports a utilization of 49.56%. The program runs for 6,467,014 cycles.

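As I understand it, this is the maximum speed: 2 Tensor Cores/partition x 4 partitions x 64 F16 FMA operations/cycle = 512 FMA operations/cycle per SM. One m16n8k8 mma needs 16 x 8 x 8 = 1024 multiplications, so 1024 / 512 = 2 cycles per SM per warp-wide mma instruction. My run takes about 2.02 cycles per instruction (6,467,014 / 3,200,000), so I would expect a utilization close to 100%, not about 50%.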
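
The arithmetic above can be checked with a short script. This is only a sketch using the figures quoted in this post. It assumes that 3,200,000 is the total number of warp-wide mma instructions issued on the SM, and all variable names are made up for illustration. Nothing here is measured by the script itself.

```python
# Back-of-the-envelope check of the numbers quoted in the post.
# Assumption: 3,200,000 is the total count of warp-wide mma instructions on the SM.

# Peak rate as stated in the post (Turing SM)
tensor_cores_per_partition = 2
partitions_per_sm = 4
fma_per_tensor_core_per_cycle = 64
peak_fma_per_cycle = (tensor_cores_per_partition
                      * partitions_per_sm
                      * fma_per_tensor_core_per_cycle)

# Work in one m16n8k8 mma instruction: M * N * K multiply-accumulates
m, n, k = 16, 8, 8
fma_per_mma = m * n * k

# Ideal cost of one mma if the whole SM runs at the stated peak
ideal_cycles_per_mma = fma_per_mma / peak_fma_per_cycle

# Figures reported by Nsight Compute in the post
mma_count = 3_200_000
measured_cycles = 6_467_014
reported_utilization = 0.4956

measured_cycles_per_mma = measured_cycles / mma_count
fraction_of_stated_peak = ideal_cycles_per_mma / measured_cycles_per_mma

# How far the hand-computed fraction is from the tool's percentage
ratio_to_reported = fraction_of_stated_peak / reported_utilization

print(f"peak FMA/cycle per SM    : {peak_fma_per_cycle}")
print(f"FMA per mma              : {fma_per_mma}")
print(f"ideal cycles per mma     : {ideal_cycles_per_mma:.2f}")
print(f"measured cycles per mma  : {measured_cycles_per_mma:.2f}")
print(f"fraction of stated peak  : {fraction_of_stated_peak:.3f}")
print(f"ratio to reported value  : {ratio_to_reported:.2f}")
```

The hand-computed fraction of peak comes out roughly twice the percentage that Nsight Compute reports, which is the discrepancy this question is about.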

Could it be that Nsight Compute defines the maximum pipeline utilization incorrectly for Turing?
The Tensor (Int) pipeline does go up to 100%.

Or have I chosen an operation that is too slow, and there are floating-point ones that reach 1 op/cycle?
