Hi,
I want to measure the overall utilization of a GPU device, in the same sense as the overall CPU utilization reported by the Windows Task Manager in the CPU world.
When Task Manager reports 100% CPU utilization, it means all CPU cores are busy running computations, and a task running on the CPU would not gain performance just by increasing its parallelism. Analogously, I want my measurement of GPU utilization to be 100% when all SMs inside a GPU are fully occupied, i.e. when it is hard to improve a kernel's performance by increasing its grid dim and block dim.
Question 1: What does the GPU utilization reported by nvidia-smi mean?
I first tried to use nvidia-smi. To check its applicability, I ran a simple kernel with gridDim = 1 and blockDim = 1 (see below).
__global__ void simple_kernel() {
    // Spin forever so exactly one block keeps one SM busy indefinitely.
    while (true) {}
}

int main() {
    simple_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();  // never returns; the kernel runs until killed
}
Since simple_kernel was launched with only one block, I expected that only one SM would be dedicated to running it. Therefore, the GPU utilization should be 1 / number_of_SMs_in_the_GPU.
However, nvidia-smi reported that the GPU utilization was 100%, which violates my expectation. I wonder what the GPU utilization reported by nvidia-smi actually means?
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01 Driver Version: 470.42.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:1A:00.0 Off | 0 |
| N/A 42C P0 67W / 300W | 2602MiB / 32510MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Question 2: How to determine the gridDim that achieves 100% GPU utilization?
Since a CUDA block is mapped to an SM for execution, can I assume that launching a kernel with gridDim = number_of_SMs_in_the_GPU achieves 100% GPU utilization, provided the kernel only accesses GPU memory for its computation?
If the kernel accesses CPU memory (pinned host memory), would the SM be free to schedule another block during the data transfer? If so, gridDim should be greater than number_of_SMs_in_the_GPU. In that case, how should I choose an appropriate gridDim?