I am doing some experiments with CUDA and analysing them with the NVIDIA Visual Profiler (nvvp) in order to understand how the SMs and cores are utilized and scheduled during the execution of an application.
The test case I am using is a dummy kernel that just takes an integer and adds an offset to it.
I execute this kernel with various numbers of threads and blocks just to see how they are mapped onto the GPU hardware.
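For reference, the kernel and launch look roughly like this (a minimal sketch, not my exact code; the name `add_offset` and the offset value are placeholders):

```cuda
#include <cuda_runtime.h>

// Dummy kernel: each thread adds a fixed offset to one integer.
__global__ void add_offset(int *data, int offset, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += offset;
}

int main()
{
    const int blocks = 70, threads = 1024;  // varied per experiment: 1x1, 1x1024, 70x1024
    const int n = blocks * threads;

    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));

    add_offset<<<blocks, threads>>>(d_data, 7, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```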
Here are some result figures I extracted from Unguided Analysis -> Kernel Latency:
- the first image shows the SMs used by the kernel with 1 block and 1 thread
- the second image shows the SMs used by the kernel with 1 block and 1024 threads
- the third image shows the SMs used by the kernel with 70 blocks and 1024 threads
My conclusion from those results is that the number of SMs used equals the number of blocks (and if blocks > SMs, all the SMs are used; in my case I have 80 SMs). A quick way I could double-check this is sketched below.
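One idea (a rough sketch I have not verified; the `%smid` special register is documented in the PTX ISA, though the block-to-SM assignment is not guaranteed to be stable) is to record which SM each block lands on:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per block records the ID of the SM the block runs on.
__global__ void record_smid(unsigned int *smids)
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));  // read the SM ID special register
    if (threadIdx.x == 0)
        smids[blockIdx.x] = smid;
}

int main()
{
    const int blocks = 70, threads = 1024;
    unsigned int *d_smids, h_smids[blocks];

    cudaMalloc(&d_smids, blocks * sizeof(unsigned int));
    record_smid<<<blocks, threads>>>(d_smids);
    cudaMemcpy(h_smids, d_smids, blocks * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    for (int b = 0; b < blocks; ++b)
        printf("block %d ran on SM %u\n", b, h_smids[b]);

    cudaFree(d_smids);
    return 0;
}
```

But I have some questions: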
a) I cannot understand the utilization column. Is it the percentage of CUDA cores occupied? If so, why do I get 30% utilization both when I launch 1 thread per block (case 1) and when I launch 1024 threads per block (case 2)? That does not make sense. If not, what does it refer to?
b) As far as SM occupation is concerned, when will 2 blocks occupy the same SM? Only when there are no available SMs left? Generally, I thought that even if I launched 70 blocks, they could all end up on a single SM. Is that case possible? If so, which factors can cause it?
I know that not many details about the scheduling policies are available, but for a start I would really like to get a general idea.
Thank you in advance!