My project runs a TensorRT neural network, and its resource utilization is unexpectedly high. Checking GPU utilization with nvidia-smi and CPU utilization with top shows that both are high while the model is running.
The models were created and trained in PyTorch, exported to the ONNX format, and converted to TensorRT engines using trtexec.
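A minimal sketch of this kind of PyTorch to ONNX to TensorRT pipeline is shown below for context. It is not the exact set of commands used in my project: the file names, the tensor names `input` and `output`, the opset version, and the helper names `export_to_onnx` and `build_trtexec_cmd` are all placeholders chosen for illustration.

```python
import subprocess


def export_to_onnx(model, example_input, onnx_path="model.onnx"):
    """Export a PyTorch module to ONNX.

    torch is imported lazily so the command-building helper below can be
    used on machines without PyTorch installed.
    """
    import torch

    model.eval()
    torch.onnx.export(
        model,
        example_input,
        onnx_path,
        opset_version=13,          # placeholder opset, adjust as needed
        input_names=["input"],     # placeholder tensor names
        output_names=["output"],
    )
    return onnx_path


def build_trtexec_cmd(onnx_path, engine_path, fp16=False):
    """Build the trtexec command line that converts an ONNX file to an engine."""
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        # Optional: allow FP16 kernels when building the engine.
        cmd.append("--fp16")
    return cmd


def build_engine(onnx_path, engine_path, fp16=False):
    """Run trtexec; requires trtexec to be on PATH."""
    subprocess.run(build_trtexec_cmd(onnx_path, engine_path, fp16), check=True)
```

Calling `build_trtexec_cmd("model.onnx", "model.engine")` only assembles the argument list, which makes it easy to log the exact command before running it with `build_engine`.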
The process that runs the model also launches CUDA kernels for other processing. A side-by-side comparison of the process with and without model inference shows the following utilization:
With model inference,
top shows CPU utilization of ~44% for the process, and nvidia-smi shows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:61:00.0 Off | 0 |
| N/A 36C P0 84W / 300W | 9324MiB / 32768MiB | 52% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Without model inference,
top shows a CPU utilization of ~5% and nvidia-smi shows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:61:00.0 Off | 0 |
| N/A 34C P0 53W / 300W | 9312MiB / 32768MiB | 4% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
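To make the comparison above easier to reproduce, here is a rough sketch of how the same counters could be sampled programmatically instead of reading the table by eye. It assumes `nvidia-smi` is on PATH and uses its `--query-gpu` CSV output; the helper names `parse_gpu_sample`, `sample_gpu`, and `utilization_delta` are hypothetical, and the sample values in the usage note are simply the numbers from the two tables above.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they appear in the CSV output.
QUERY_FIELDS = ["utilization.gpu", "memory.used", "power.draw"]


def parse_gpu_sample(csv_line):
    """Parse one 'csv,noheader,nounits' line into a dict of floats."""
    values = [float(v.strip()) for v in csv_line.split(",")]
    return dict(zip(QUERY_FIELDS, values))


def sample_gpu(gpu_index=0):
    """Query a single GPU once; requires nvidia-smi to be installed."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        universal_newlines=True,  # works on Python 3.6 as well
    )
    return parse_gpu_sample(out.strip().splitlines()[0])


def utilization_delta(with_inference, without_inference):
    """Per-field difference between two samples (with minus without)."""
    return {k: with_inference[k] - without_inference[k] for k in QUERY_FIELDS}
```

For example, `utilization_delta(parse_gpu_sample("52, 9324, 84"), parse_gpu_sample("4, 9312, 53"))` gives the extra GPU utilization (percentage points), memory (MiB), and power (W) attributable to inference. Note that `utilization.gpu` only reports the fraction of time at least one kernel was running, so it does not distinguish between core types.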
Not included here is a profile done with nsys. The profile seemed to contain CUDA kernel calls done by TensorRT. On that note I have a number of specific questions related to profiling TensorRT models:
- Why does nsys label function calls done by TensorRT as CUDA kernel calls? My understanding of TensorRT is that all execution is done on Tensor cores and not CUDA cores.
- Does nvidia-smi show CUDA core + Tensor core utilization %?
- Is there any tool that can show utilization differentiated by core type (Tensor core versus CUDA core utilization)?
- Most concerning is the very high CPU utilization. Is this normal for TensorRT models?
Environment
**TensorRT Version**: 8.2.3
**GPU Type**: Tesla V100
**Nvidia Driver Version**: 11.5.119
**CUDA Version**: 11.6
**Operating System + Version**: CentOS 7
**Python Version (if applicable)**: 3.6
**PyTorch Version (if applicable)**: 1.10.1+cu113
**Baremetal or Container (if container which image + tag)**: Baremetal