“SMs Active” doesn't hit 100% in nsys even though there are enough blocks

I’m profiling my CUDA workload on an H100 GPU using Nsight Systems, and I’ve encountered a puzzling issue with SM utilization.

In the highlighted section, my GEMM kernel launches 132 blocks and an NCCL kernel launches 24 blocks (156 blocks in total, more than the 132 SMs on the H100). However, the “SMs Active” metric remains quite low during this period.

The GEMM in question is relatively small, with dimensions [2500, 5120] x [5120, 5120]. My suspicion is that each GEMM block does very little work, so CUDA may be scheduling multiple blocks onto a single SM, leaving some SMs idle and underutilizing the GPU.
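As a rough sanity check of that suspicion, here is a toy model of how co-residency would shrink the number of active SMs. The `blocks_per_sm` value is an assumption for illustration, not something measured from the actual kernel:

```python
import math

def active_sms(num_blocks: int, blocks_per_sm: int, total_sms: int = 132) -> int:
    """SMs touched if the scheduler packs `blocks_per_sm` resident blocks per SM."""
    return min(total_sms, math.ceil(num_blocks / blocks_per_sm))

# One block per SM: 132 GEMM blocks would spread across all 132 SMs.
print(active_sms(132, 1))   # 132
# But if two small blocks fit on one SM, the same launch can land on only 66 SMs.
print(active_sms(132, 2))   # 66
```

Whether this actually happens depends on the kernel's occupancy (registers, shared memory, threads per block), which Nsight Compute can report per kernel.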

Furthermore, in a previous post, I mentioned that when NCCL and GEMM kernels overlap, the GEMM kernel can suffer from a “tail effect,” causing an extra wave of blocks and nearly doubling the execution time. To mitigate this, I tried to control the number of SMs allocated to each kernel: I set cuBLAS (for GEMM) to use 108 SMs and NCCL to use 24 SMs. In theory, since the H100 has 132 SMs, this partitioning (108 + 24 = 132) should allow for 100% SM utilization.
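For concreteness, the wave arithmetic behind that partitioning can be sketched as follows. This is a toy model that assumes one resident block per SM at a time; the 108/24 split matches the experiment above:

```python
import math

def waves(num_blocks: int, sms_for_kernel: int) -> int:
    """Scheduling waves needed, assuming one resident block per SM."""
    return math.ceil(num_blocks / sms_for_kernel)

GEMM_SMS, NCCL_SMS = 108, 24   # the 108 + 24 = 132 partition from the text

# A launch of exactly 108 GEMM blocks fits in one wave on 108 SMs...
print(waves(108, GEMM_SMS))    # 1
# ...but a single extra block forces a second, mostly-empty tail wave,
# which is what nearly doubles the runtime of a small GEMM.
print(waves(109, GEMM_SMS))    # 2
# The 24 NCCL blocks similarly fill their 24 SMs in one wave.
print(waves(24, NCCL_SMS))     # 1
```

(For reference, the SM count for cuBLAS is typically set with `cublasSetSmCountTarget`, and the NCCL CTA count with `NCCL_MAX_CTAS` or the communicator config; I'm assuming one of those mechanisms was used here.)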

Despite launching enough blocks to theoretically occupy all SMs, the “SMs Active” metric plateaus at only about 91%, meaning roughly 11 SMs remain idle. This, in turn, produces another tail effect: some blocks have to wait for resources to free up rather than all blocks running in a single wave, which shows up as a visible long tail in the kernel execution profile. For reference, the GEMM problem size in this experiment is [2560, 5120] x [5120, 27648].

My main questions are:

  1. How does CUDA actually schedule blocks to SMs when I set the SM count for cuBLAS and NCCL kernels?
    Is there a reason why not all SMs are utilized, even though the sum of assigned SMs matches the hardware total?
  2. Is there any way to guarantee that all SMs are utilized when partitioning them between kernels?
    Or are there architectural or scheduler limitations that prevent perfect partitioning?

Any insights into how CUDA schedules blocks when using SM partitioning, and why I might see less than full utilization, would be greatly appreciated!