Visual Profiler doesn't display the true SM utilization


I’m trying to learn how the GPU splits the computation between the multiprocessors.

I wrote a program with two kernels which run concurrently and the Visual Profiler showed me the following results:

  • the kernels do indeed run concurrently (figure 1)

  • the distribution of kernel ker_1 across the SMs, in figure 2:

  • the distribution of kernel ker_2 across the SMs, in figure 3:

This means that together the kernels occupy more than 100% of the GPU's SMs, which is impossible because the kernels run concurrently.

Furthermore, ker_2 has 32 blocks of 265 threads per block. According to the profiler, every thread uses 38 registers, so ker_2 should occupy about 50% of the hardware resources, not 100% as figure 3 claims (see figure 4).

So how can I see the real distribution between the SMs?


Nsight Compute, Visual Profiler, and nvprof serialize kernel launches when collecting performance counters, so the per-SM figures reported for each kernel describe that kernel running on its own rather than alongside the other one. The GPU metrics feature in Nsight Systems 2021.2 samples counters without serializing the kernels. The downside is that the counters are not attributed to the individual concurrently executing kernels.
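
If you only need to know which SMs the blocks of each kernel land on, one workaround outside the profilers is to have every block record the PTX special register %smid and count the results on the host. The following is a minimal sketch, not your actual code: record_sm is a hypothetical stand-in for ker_1 and ker_2, and the grid sizes, block size, and spin length are assumptions chosen only for illustration.

```cuda
#include <cstdio>
#include <map>
#include <vector>
#include <cuda_runtime.h>

// Read the PTX special register %smid: the ID of the SM running this thread.
__device__ unsigned int current_smid() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Hypothetical stand-in for ker_1 / ker_2: thread 0 of each block records the
// block's SM, then every thread spins so the two launches have time to overlap.
__global__ void record_sm(unsigned int *sm_of_block, long long spin_cycles) {
    if (threadIdx.x == 0) sm_of_block[blockIdx.x] = current_smid();
    long long start = clock64();
    while (clock64() - start < spin_cycles) { }
}

// Host helper: count how many blocks ran on each SM ID.
// A map is used because SM IDs are not guaranteed to be contiguous.
std::map<unsigned int, int> blocks_per_sm(const std::vector<unsigned int> &sm_of_block) {
    std::map<unsigned int, int> counts;
    for (unsigned int id : sm_of_block) ++counts[id];
    return counts;
}

int main() {
    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess || device_count == 0) {
        printf("No CUDA device found\n");
        return 1;
    }

    // Assumed launch configurations, for illustration only.
    const int blocks1 = 16, blocks2 = 32, threads = 256;
    const long long spin = 1000000;  // clock cycles each block stays resident

    unsigned int *d1 = nullptr, *d2 = nullptr;
    cudaMalloc(&d1, blocks1 * sizeof(unsigned int));
    cudaMalloc(&d2, blocks2 * sizeof(unsigned int));

    // Separate non-default streams allow the two kernels to run concurrently.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    record_sm<<<blocks1, threads, 0, s1>>>(d1, spin);
    record_sm<<<blocks2, threads, 0, s2>>>(d2, spin);
    cudaDeviceSynchronize();

    std::vector<unsigned int> h1(blocks1), h2(blocks2);
    cudaMemcpy(h1.data(), d1, blocks1 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h2.data(), d2, blocks2 * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    for (const auto &kv : blocks_per_sm(h1))
        printf("kernel 1: SM %u ran %d block(s)\n", kv.first, kv.second);
    for (const auto &kv : blocks_per_sm(h2))
        printf("kernel 2: SM %u ran %d block(s)\n", kv.first, kv.second);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d1);
    cudaFree(d2);
    return 0;
}
```

Run it without a profiler attached so that nothing serializes the launches. Keep in mind that this only shows block placement: it does not prove that the two kernels overlapped in time, and it says nothing about how busy each SM actually was.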