Some questions on GPU utilization

Hi,

I want to measure the overall utilization of a GPU device, something equivalent to the overall CPU utilization reported by Windows Task Manager in the CPU world.

If Task Manager reports 100% utilization, it means all the CPU cores are busy running computations, and a task running on the CPU won't gain further performance simply by increasing its parallelism. So I want my measurement of GPU utilization to be 100% when all SMs inside the GPU are fully occupied, i.e., when it is hard to improve the performance of a kernel further by increasing its grid and block dimensions.

Question 1: What does the GPU utilization reported by nvidia-smi mean?
I first tried to use nvidia-smi. To check its applicability, I ran a simple kernel with gridDim = 1 and blockDim = 1 (see below).

// A kernel that spins forever, so exactly one block (with one thread) stays resident.
__global__ void simple_kernel() {
    while (true) {}
}

int main() {
    // Launch with gridDim = 1 and blockDim = 1: a single thread on a single SM.
    simple_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();  // blocks forever, keeping the kernel active
}

Since simple_kernel was launched with only one block, I expected that only one SM would be dedicated to run it. Therefore, the GPU utilization should be 1 / number_of_SMs_in_the_GPU.

However, nvidia-smi reported that the GPU utilization was 100%, which violates my expectation. I wonder what exactly the GPU utilization reported by nvidia-smi means.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 470.42.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   42C    P0    67W / 300W |   2602MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Question 2: How do I determine the gridDim and blockDim that achieve 100% GPU utilization?
Since a CUDA block is mapped to an SM for execution, can I assume that launching a kernel with gridDim = number_of_SMs_in_the_GPU achieves 100% GPU utilization if the kernel only accesses GPU memory to do its computation?
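For reference, this is how I would query the SM count and launch one block per SM (a minimal sketch; busy_kernel is just a placeholder for the real kernel):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy_kernel() { /* real computation goes here */ }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // device 0
    int num_sms = prop.multiProcessorCount;   // number of SMs on this GPU
    printf("SMs: %d\n", num_sms);

    // One block per SM, as proposed above; the block size of 256 is arbitrary here.
    busy_kernel<<<num_sms, 256>>>();
    cudaDeviceSynchronize();
}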

If the kernel accesses CPU memory (pinned host memory), would the SM be idle during the data transfer, so that another block could be scheduled to run on it? If so, the gridDim should be greater than number_of_SMs_in_the_GPU. How should I choose an appropriate gridDim then?
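By "accesses pinned host memory" I mean the zero-copy pattern sketched below (a minimal sketch; the kernel name and sizes are made up for illustration):

#include <cuda_runtime.h>

// Kernel that reads directly from mapped pinned host memory (zero-copy).
__global__ void read_host_memory(const float* host_data, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = host_data[i] * 2.0f;   // each load goes over the PCIe/NVLink bus
}

int main() {
    const int n = 1 << 20;
    float *h_data, *d_ptr, *d_out;

    // Pinned host memory that is mapped into the device address space.
    cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_ptr, h_data, 0);
    cudaMalloc(&d_out, n * sizeof(float));

    read_host_memory<<<(n + 255) / 256, 256>>>(d_ptr, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    cudaFreeHost(h_data);
}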

Regarding Question 1: that utilization figure appears to be a short-term average of the fraction of time during which the GPU was active (i.e., at least one kernel was executing). It does not tell you anything about the occupancy of the hardware units. That kind of information can be extracted with profiling tools such as the NVIDIA Visual Profiler or NVIDIA Nsight.
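If you want to read the same number programmatically, it is exposed through NVML as nvmlDeviceGetUtilizationRates. A minimal sketch (link against libnvidia-ml; error handling omitted):

#include <cstdio>
#include <nvml.h>

int main() {
    nvmlInit();

    nvmlDevice_t device;
    nvmlDeviceGetHandleByIndex(0, &device);            // GPU 0

    // Same "GPU-Util" that nvidia-smi prints: the percentage of the last sample
    // period during which one or more kernels were executing on the GPU.
    nvmlUtilization_t util;
    nvmlDeviceGetUtilizationRates(device, &util);
    printf("GPU util: %u%%, memory util: %u%%\n", util.gpu, util.memory);

    nvmlShutdown();
}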

For Question 2, have a look at the CUDA Occupancy Calculator spreadsheet. It lets you predict the expected occupancy for a given block size, register count, and shared memory usage.
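The same prediction is also available at runtime through the occupancy API. A minimal sketch, using simple_kernel as a stand-in for whatever kernel you are tuning:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void simple_kernel() { /* ... */ }

int main() {
    // Block size that maximizes occupancy for this kernel (no dynamic shared memory).
    int min_grid_size = 0, block_size = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, simple_kernel, 0, 0);

    // How many blocks of that size can be resident per SM.
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, simple_kernel, block_size, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = (float)(blocks_per_sm * block_size) / prop.maxThreadsPerMultiProcessor;

    printf("block size %d, %d blocks/SM, theoretical occupancy %.0f%%\n",
           block_size, blocks_per_sm, occupancy * 100.0f);
}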


Re. Q2, the following may be helpful:


@cbuchner1 @rs277

Hi, thanks for your answers! After reading the materials on CUDA occupancy, I have a new question.

The CUDA C Programming Guide says:

Higher occupancy does not always equate to higher performance - there is a point above which additional occupancy does not improve performance. However, low occupancy always interferes with the ability to hide memory latency, resulting in performance degradation.

Does that mean I can always try to maximize occupancy to ensure high performance? Or are there situations where additional occupancy may degrade performance because of additional overheads, such as scheduling?

I would say that high occupancy is a necessary, but not a sufficient, requirement for achieving high performance. There can be other factors that hurt performance, such as highly divergent code, unfavorable memory access patterns, shared memory bank conflicts, spilling to local memory due to register pressure, and other reasons…
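To illustrate one of those factors, here is a minimal sketch of the classic shared memory bank conflict case: a column-wise read of a 32x32 tile would hit the same bank from every thread in a warp, and padding each row by one element avoids it (the transpose kernel is made up for illustration, data is left uninitialized since only the access pattern matters):

#include <cuda_runtime.h>

#define TILE 32

// Transposes one 32x32 tile through shared memory.
__global__ void transpose_tile(const float* in, float* out) {
    // A plain TILE x TILE array would give 32-way bank conflicts on the column
    // read below; the extra column of padding puts consecutive rows in different banks.
    __shared__ float tile[TILE][TILE + 1];

    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * TILE + x];      // coalesced, conflict-free write
    __syncthreads();
    out[y * TILE + x] = tile[x][y];     // column read: conflict-free thanks to padding
}

int main() {
    float *d_in, *d_out;
    cudaMalloc(&d_in,  TILE * TILE * sizeof(float));
    cudaMalloc(&d_out, TILE * TILE * sizeof(float));

    transpose_tile<<<1, dim3(TILE, TILE)>>>(d_in, d_out);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
}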


Not necessarily. I have a kernel that is highly performant but has only 18% occupancy. I would not overemphasise occupancy - once stalls due to latencies are adequately covered, its importance decreases. Use the profiler to identify where the issues are.

This may be useful:
