Why Is Tensor Pipe Utilization So Low?

Hi, I use NCU to profile a simple WMMA code example. However, NCU shows the tensor pipe utilization is around 25%, even when I run more loop iterations of mma_sync and disable data loading and storing. There should be enough blocks to occupy the SMs. I wonder why the tensor pipe utilization is so low, and how to improve it.

Which metric are you looking at for the tensor pipe utilization?

sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active

or one of

sm__inst_executed_pipe_tensor_op_*

? If you get the value from a table or chart in the UI, you may get the underlying metric name from a tooltip. Please also indicate the Nsight Compute version you are using.

Hi, I am using CUDA 11.6 and NCU 2022.1.1.
I am looking at “sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active”.

It was compiled with “nvcc wmma.cu --expt-relaxed-constexpr -gencode=arch=compute_75,code="sm_75,compute_75" -o wmma” and run on a 2080 Ti. The launch configuration is <<<10000, 256>>>.

I got 24.6% for

#pragma unroll
  for (int i = 0; i < 200; i++) {
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
  }

I got 50% for

  wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
  wmma::load_matrix_sync(b_frag, B, 16);
#pragma unroll
  for (int i = 0; i < 200; i++) {
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
  }
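
For reference, here is roughly how these snippets fit into a complete kernel. This is just a minimal sketch: the kernel name, the A/B/C pointers, the float accumulator type, and the surrounding declarations of a_frag and acc_frag are assumptions, not the exact code from the sample.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Minimal sketch (names and setup assumed). Every warp multiplies the same
// 16x16x16 tiles repeatedly, so the tensor pipe stays busy while global
// memory traffic stays negligible. All warps write the same output tile,
// which is acceptable for this microbenchmark.
__global__ void wmma_kernel(const __half *A, const __half *B, float *C) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
  wmma::fill_fragment(acc_frag, 0.0f);
  wmma::load_matrix_sync(a_frag, A, 16);

  // One of the two loop variants from above goes here, e.g. the second one:
  wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
  wmma::load_matrix_sync(b_frag, B, 16);
#pragma unroll
  for (int i = 0; i < 200; i++) {
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
  }

  // Store the result so the compiler cannot optimize the loop away.
  wmma::store_matrix_sync(C, acc_frag, 16, wmma::mem_row_major);
}

// Launched as described above: wmma_kernel<<<10000, 256>>>(dA, dB, dC);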

I am looking at “sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active”.

That would explain it, I think. Nsight Compute has two types of pipe utilization metrics (the ones I mentioned in the first reply). In current versions they are mixed in the Compute Workload Analysis chart, which is not ideal. In the next version, the tool will have two separate charts (one for the inst_executed and one for the cycles_active metrics), as these measure different things and should not be compared 1:1.

Cycles Active
Pipeline utilization based on the number of cycles the pipeline was active. This takes the rates of the different instructions executing on the pipeline into account. For an instruction that requires 4 cycles to complete execution, the counter is increased by 1 on each of those 4 cycles.

Inst Executed
Pipeline utilization based on the number of executed instructions. This does not account for any variation in instruction latencies on this pipeline. For an instruction that requires 4 cycles to complete execution, the counter is increased by 1 only once.

As you can maybe see from the descriptions, inst_executed only looks at how many instructions are issued, but not at how long each one occupies the pipeline. If each instruction occupies the pipeline for more than one cycle, a new instruction cannot be issued every cycle, so this metric will never reach 100% even when the pipeline is completely busy. Cycles active, on the other hand, takes this into account: for example, if each instruction keeps the pipeline busy for 2 cycles and they are issued back-to-back, cycles_active reads roughly 100% while inst_executed reads only about 50% of peak. Seeing both side-by-side is ideal, as it indicates not only how much the pipeline is utilized, but also whether it is utilized by many short or by few long instructions.

I would therefore recommend that you also collect

sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active

to get a better idea. Even though it’s not yet in the chart, the metric can still be collected in your version of Nsight Compute.
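
For example (assuming the binary is named wmma as in your compile command), something along these lines should report both metrics side by side from the command line:

ncu --metrics sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active,sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active ./wmma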

Thanks! This answers my question.

I have new questions. I notice that using a float-format accumulator results in half the bandwidth of a half-format accumulator. Is this correct? It also halves the “sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active” metric. It is quite confusing. Why would this happen?

Another question: the TU102 SM has 2 tensor cores per SMSP, yet I noticed in another experiment that 1 warp per SMSP (i.e., 4 warps per SM) can achieve 100% sm__pipe_tensor_cycles_active. Why would this happen?