Some questions on GPU utilization

Hi,

I want to measure the overall utilization of a GPU device, something equivalent to the overall CPU utilization reported by Windows Task Manager in the CPU world.

If Task Manager reports 100% utilization, it means all the CPU cores are busy running computations, and a task running on the CPU won't gain further performance simply by increasing its parallelism. So I want my measurement of GPU utilization to be 100% when all SMs inside the GPU are fully occupied, i.e., when it is hard to improve the performance of a kernel further by increasing its grid and block dimensions.

Question 1: What does the GPU utilization reported by nvidia-smi mean?
I first tried to use nvidia-smi. To check its applicability, I ran a simple kernel with gridDim = 1 and blockDim = 1 (see below).

// A kernel that spins forever, so exactly one block (with one thread) stays resident.
__global__ void simple_kernel() {
    while (true) {}
}

int main() {
    // Launch with gridDim = 1 and blockDim = 1: a single thread on a single SM.
    simple_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();  // blocks forever, keeping the kernel active
}

Since simple_kernel was launched with only one block, I expected that only one SM would be dedicated to run it. Therefore, the GPU utilization should be 1 / number_of_SMs_in_the_GPU.

However, nvidia-smi reported that the GPU utilization was 100%, which violates my expectation. I wonder what exactly the GPU utilization reported by nvidia-smi means.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 470.42.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   42C    P0    67W / 300W |   2602MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Question 2: How do I determine the gridDim and blockDim that achieve 100% GPU utilization?
Since a CUDA block is mapped to an SM for execution, can I assume that launching a kernel with gridDim = number_of_SMs_in_the_GPU achieves 100% GPU utilization if the kernel only accesses GPU memory to do its computation?
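For reference, this is how I would query the SM count and launch one block per SM (a minimal sketch; busy_kernel is just a placeholder for the real kernel):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy_kernel() { /* real computation goes here */ }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // device 0
    int num_sms = prop.multiProcessorCount;   // number of SMs on this GPU
    printf("SMs: %d\n", num_sms);

    // One block per SM, as proposed above; the block size of 256 is arbitrary here.
    busy_kernel<<<num_sms, 256>>>();
    cudaDeviceSynchronize();
}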

If the kernel accesses CPU memory (pinned host memory), would the SM be idle during the data transfer, so that another block could be scheduled to run on it? If so, the gridDim should be greater than number_of_SMs_in_the_GPU. How should I choose an appropriate gridDim then?
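By "accesses pinned host memory" I mean the zero-copy pattern sketched below (a minimal sketch; the kernel name and sizes are made up for illustration):

#include <cuda_runtime.h>

// Kernel that reads directly from mapped pinned host memory (zero-copy).
__global__ void read_host_memory(const float* host_data, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = host_data[i] * 2.0f;   // each load goes over the PCIe/NVLink bus
}

int main() {
    const int n = 1 << 20;
    float *h_data, *d_ptr, *d_out;

    // Pinned host memory that is mapped into the device address space.
    cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_ptr, h_data, 0);
    cudaMalloc(&d_out, n * sizeof(float));

    read_host_memory<<<(n + 255) / 256, 256>>>(d_ptr, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    cudaFreeHost(h_data);
}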

Regarding Question 1: that utilization figure appears to be a short-term average of the fraction of time during which the GPU was active (i.e., at least one kernel was executing). It does not tell you anything about the occupancy of the hardware units. That kind of information can be extracted with profiling tools such as the NVIDIA Visual Profiler or NVIDIA Nsight.
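If you want to read the same number programmatically, it is exposed through NVML as nvmlDeviceGetUtilizationRates. A minimal sketch (link against libnvidia-ml; error handling omitted):

#include <cstdio>
#include <nvml.h>

int main() {
    nvmlInit();

    nvmlDevice_t device;
    nvmlDeviceGetHandleByIndex(0, &device);            // GPU 0

    // Same "GPU-Util" that nvidia-smi prints: the percentage of the last sample
    // period during which one or more kernels were executing on the GPU.
    nvmlUtilization_t util;
    nvmlDeviceGetUtilizationRates(device, &util);
    printf("GPU util: %u%%, memory util: %u%%\n", util.gpu, util.memory);

    nvmlShutdown();
}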

For Question 2, have a look at the CUDA Occupancy Calculator spreadsheet. It lets you predict the expected occupancy for a given block size, register count, and shared memory usage.
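The same prediction is also available at runtime through the occupancy API. A minimal sketch, using simple_kernel as a stand-in for whatever kernel you are tuning:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void simple_kernel() { /* ... */ }

int main() {
    // Block size that maximizes occupancy for this kernel (no dynamic shared memory).
    int min_grid_size = 0, block_size = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, simple_kernel, 0, 0);

    // How many blocks of that size can be resident per SM.
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, simple_kernel, block_size, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = (float)(blocks_per_sm * block_size) / prop.maxThreadsPerMultiProcessor;

    printf("block size %d, %d blocks/SM, theoretical occupancy %.0f%%\n",
           block_size, blocks_per_sm, occupancy * 100.0f);
}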


Re. Q2, the following may be helpful:


@cbuchner1 @rs277

Hi, thanks for your answers! After reading the materials on CUDA occupancy, I have a new question.

The CUDA C Programming Guide says:

Higher occupancy does not always equate to higher performance - there is a point above which additional occupancy does not improve performance. However, low occupancy always interferes with the ability to hide memory latency, resulting in performance degradation.

Does that mean I can always try to maximize occupancy to ensure high performance? Or are there situations where additional occupancy may degrade performance because of additional overheads, such as scheduling?

I would say that high occupancy is a necessary, but not a sufficient, requirement for achieving high performance. There can be other factors that hurt performance, such as highly divergent code, unfavorable memory access patterns, shared memory bank conflicts, spilling to local memory due to register pressure, and other reasons…
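To illustrate one of those factors, here is a minimal sketch of the classic shared memory bank conflict case: a column-wise read of a 32x32 tile would hit the same bank from every thread in a warp, and padding each row by one element avoids it (the transpose kernel is made up for illustration, data is left uninitialized since only the access pattern matters):

#include <cuda_runtime.h>

#define TILE 32

// Transposes one 32x32 tile through shared memory.
__global__ void transpose_tile(const float* in, float* out) {
    // A plain TILE x TILE array would give 32-way bank conflicts on the column
    // read below; the extra column of padding puts consecutive rows in different banks.
    __shared__ float tile[TILE][TILE + 1];

    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * TILE + x];      // coalesced, conflict-free write
    __syncthreads();
    out[y * TILE + x] = tile[x][y];     // column read: conflict-free thanks to padding
}

int main() {
    float *d_in, *d_out;
    cudaMalloc(&d_in,  TILE * TILE * sizeof(float));
    cudaMalloc(&d_out, TILE * TILE * sizeof(float));

    transpose_tile<<<1, dim3(TILE, TILE)>>>(d_in, d_out);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
}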


Not necessarily. I have a kernel that is highly performant but has only 18% occupancy. I would not overemphasise occupancy - once stalls due to latencies are adequately covered, its importance decreases. Use the profiler to identify where the issues are.

This may be useful:
