Question about NVVP results: GPU's SMs and cores during CUDA kernel execution

Hi,
I am doing some experiments with CUDA and analysing them with the NVIDIA Visual Profiler (nvvp) in order to understand SM and core utilization and scheduling during the execution of an application.

The test case I am using is a dummy kernel that just takes an integer and adds an offset to it.
I execute this kernel with various numbers of threads and blocks just to see how they are mapped onto the GPU hardware.

Here are some result figures I extracted from Unguided Analysis->Kernel Latency:

  1. the first image shows the SMs used by the kernel with 1 block and 1 thread
    https://postimg.cc/hhvk60mb
  2. the second image shows the SMs used by the kernel with 1 block and 1024 threads
    https://postimg.cc/94ZJ9rsk
  3. the third image shows the SMs used by the kernel with 70 blocks and 1024 threads
    https://postimg.cc/23b37h03

My conclusion from these results is that the number of SMs used equals the number of blocks (if blocks > SMs, all the SMs will be used; in my case I have 80 SMs). But I have some questions:

a) I cannot understand the utilization column. Is it the percentage of the CUDA cores occupied? If yes, why do I get 30% utilization when I launch 1 thread per block (case 1) and the same value when I launch 1024 threads per block (case 2)? It does not make sense. If not, what does it refer to?
b) As far as SM occupation is concerned, when will 2 blocks occupy the same SM? Only when there are no more available SMs? Generally I thought that even if I launched 70 blocks, they could use only one SM. Is that case possible? If yes, which factors can cause it?

I know that there are not many details about the scheduling policies available, but at first I would strongly like to acquire a general idea.

Thank you in advance!

The grid rasterization order and thread block work distribution algorithm are not specified by the CUDA programming model. A thread block will reside on a single SM.

It is very easy to write a program that reads the PTX variables %smid, %warpid, and %globaltimer and creates a plot of the work distribution. On most GPUs the thread block distribution can be observed to be breadth first across SMs.

CUDA cores are FP32 datapaths/pipelines. CUDA core utilization is reported on a scale of 0 to 10 using the metric single_precision_fu_utilization.

A. I am not familiar with the charts in your post. I believe utilization is defined as active_cycles_sm / elapsed_cycles_sm per SM. If each thread is only adding an offset to an integer, the kernel is effectively a NOP (no operation) kernel, and the launch overhead and work distribution will far exceed the SM active cycles. If you increase the workload by adding memory loads, utilization will likely converge to 100%, independent of whether FP32 instructions are executed.

B1. This is not defined by the CUDA programming model. It can be observed that the work distribution is breadth first and that an SM will receive a second block once all other SMs have 1 block. After all SMs have work the work distribution is dynamic.

B2. The CUDA Programming Guide section Hardware Multithreading provides an introduction to SM occupancy.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-multithreading
The maximum number of thread blocks resident on an SM is limited by how many resources each thread block uses. In the case specified, a blockDim of 1024 threads will limit most GPUs to 1-2 resident blocks per SM.
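As a rough way to check the resident-block limit on your own GPU, the runtime occupancy API can be queried for a few block sizes. This is only a sketch: the kernel addOffset is a hypothetical stand-in for the dummy kernel in the question, device 0 is assumed, and ceilDiv is a helper introduced here for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the dummy kernel from the question.
__global__ void addOffset(int* data, int offset) {
    data[blockIdx.x * blockDim.x + threadIdx.x] += offset;
}

// Illustrative helper: integer division rounded up.
inline int ceilDiv(int a, int b) { return (a + b - 1) / b; }

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {  // assumption: device 0
        fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }
    printf("%s: %d SMs, max %d resident threads per SM\n",
           prop.name, prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor);

    const int blockSizes[] = {1, 32, 256, 1024};
    for (int blockSize : blockSizes) {
        int blocksPerSm = 0;
        // Asks the runtime how many blocks of this kernel fit on one SM
        // at the given block size with no dynamic shared memory.
        cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSm, addOffset, blockSize, 0);
        if (err != cudaSuccess) {
            fprintf(stderr, "occupancy query failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("blockDim=%4d  warps/block=%2d  max resident blocks/SM=%d\n",
               blockSize, ceilDiv(blockSize, prop.warpSize), blocksPerSm);
    }
    return 0;
}
```

The reported value is an upper bound on residency derived from resource usage; it says nothing about the order in which the hardware actually distributes blocks.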

The best method to learn how thread block scheduling works is to author a set of kernels that

read %globaltimer // start
read %smid
read %warpid

read %globaltimer // end
outputOffset = atomicAdd(pOutputCounter, 1)
write pOutputBuffer[outputOffset] = {start, end, smid, warpid}

and graph the data.
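A minimal sketch of such a kernel follows. The names BlockRecord, traceKernel and the launch shape (160 blocks of 1024 threads) are assumptions made for illustration; only thread 0 of each block writes a record, so the warpid is that of the block's first warp. The special registers are read through inline PTX, and the update interval of %globaltimer is implementation specific, so very short intervals may read as zero.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical record layout: one entry per thread block.
struct BlockRecord {
    unsigned long long start;  // %globaltimer at block start (ns)
    unsigned long long end;    // %globaltimer at block end (ns)
    unsigned int smid;         // SM the block ran on
    unsigned int warpid;       // warp slot of the recording thread
    unsigned int block;        // blockIdx.x
};

__device__ __forceinline__ unsigned long long readGlobalTimer() {
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}
__device__ __forceinline__ unsigned int readSmid() {
    unsigned int v;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(v));
    return v;
}
__device__ __forceinline__ unsigned int readWarpid() {
    unsigned int v;
    asm volatile("mov.u32 %0, %%warpid;" : "=r"(v));
    return v;
}

__global__ void traceKernel(BlockRecord* records, unsigned int* counter) {
    unsigned long long start = readGlobalTimer();
    unsigned int sm = readSmid();
    unsigned int warp = readWarpid();
    // (the workload under test would go here)
    unsigned long long end = readGlobalTimer();
    if (threadIdx.x == 0) {
        // atomicAdd hands out unique slots in completion order.
        unsigned int slot = atomicAdd(counter, 1u);
        records[slot].start = start;
        records[slot].end = end;
        records[slot].smid = sm;
        records[slot].warpid = warp;
        records[slot].block = blockIdx.x;
    }
}

int main() {
    const int numBlocks = 160;         // assumption: more blocks than SMs
    const int threadsPerBlock = 1024;  // as in the question

    BlockRecord* dRecords = nullptr;
    unsigned int* dCounter = nullptr;
    cudaMalloc(&dRecords, numBlocks * sizeof(BlockRecord));
    cudaMalloc(&dCounter, sizeof(unsigned int));
    cudaMemset(dCounter, 0, sizeof(unsigned int));

    traceKernel<<<numBlocks, threadsPerBlock>>>(dRecords, dCounter);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::vector<BlockRecord> records(numBlocks);
    cudaMemcpy(records.data(), dRecords, numBlocks * sizeof(BlockRecord),
               cudaMemcpyDeviceToHost);

    // CSV output, ready to plot as start/end time per SM.
    printf("block,smid,warpid,start_ns,end_ns\n");
    for (const BlockRecord& r : records)
        printf("%u,%u,%u,%llu,%llu\n", r.block, r.smid, r.warpid, r.start, r.end);

    cudaFree(dRecords);
    cudaFree(dCounter);
    return 0;
}
```

Plotting start_ns against smid from the CSV should make the breadth-first pattern visible, provided each block runs long enough for the timer to resolve it.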

For this I would recommend using a wait kernel similar to the CUDA sample concurrentKernels, but using %globaltimer instead of clock()/clock64(). By modifying the kernel or passing a variable you can vary the duration of all blocks or have each block execute for a different duration. Please note that on Linux and Windows 10 with Pascal and above, Compute Instruction Level Pre-emption (CILP) is enabled. If there is another process trying to use the GPU, the kernel will context switch approximately every 2 ms.
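A wait kernel along those lines could look like the sketch below. It is not the concurrentKernels sample itself; blockWaitNs, waitKernel and the chosen durations are hypothetical. Each block spins on %globaltimer for a base duration plus a per-block step, so the blocks finish at different times.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative helper: wait time in ns for a given block index.
__host__ __device__ inline unsigned long long blockWaitNs(
    unsigned long long baseNs, unsigned long long stepNs, unsigned int block) {
    return baseNs + stepNs * block;
}

__device__ __forceinline__ unsigned long long readGlobalTimer() {
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

// Each block busy-waits for its own duration measured in wall-clock ns.
__global__ void waitKernel(unsigned long long baseNs, unsigned long long stepNs) {
    unsigned long long start = readGlobalTimer();
    unsigned long long waitNs = blockWaitNs(baseNs, stepNs, blockIdx.x);
    while (readGlobalTimer() - start < waitNs) {
        // spin; the volatile asm read keeps the loop from being optimized away
    }
}

int main() {
    const int numBlocks = 8;                        // assumption: small test grid
    const unsigned long long baseNs = 1000000ull;   // 1 ms for block 0
    const unsigned long long stepNs = 250000ull;    // +0.25 ms per block index

    cudaEvent_t begin, finish;
    cudaEventCreate(&begin);
    cudaEventCreate(&finish);

    cudaEventRecord(begin);
    waitKernel<<<numBlocks, 32>>>(baseNs, stepNs);
    cudaEventRecord(finish);
    cudaError_t err = cudaEventSynchronize(finish);
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float elapsedMs = 0.0f;
    cudaEventElapsedTime(&elapsedMs, begin, finish);
    printf("grid of %d blocks took %.3f ms\n", numBlocks, elapsedMs);

    cudaEventDestroy(begin);
    cudaEventDestroy(finish);
    return 0;
}
```

Combining this wait loop with the start/end/smid recording gives blocks that stay resident long enough for the distribution across SMs to show up clearly in the plot.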