High CPU and low GPU utilization on Ubuntu 18.04, RTX2080. How to improve GPU utilization?

Please provide complete information as applicable to your setup.

• Hardware Platform (GPU RTX2080)
• DeepStream Version 5.0
• TensorRT Version
• NVIDIA GPU Driver Version (440.64.00, CUDA 10.2)

We have built the sample apps on an Ubuntu 18.04 x86_64 machine with an RTX2080.
When the samples run inside the NVIDIA-provided Docker container, performance is good (FPS, %CPU and %GPU are satisfactory). However, when they are built and executed on the host Ubuntu system, performance is poor: CPU utilization is high (up to 60-70%), frames render slowly in VLC, and overall system responsiveness is also poor.

Is there any specific set of environmental variables to be set to get better response?

The sample below is representative and not to be assumed as satisfactory performance:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208…    On   | 00000000:09:00.0 Off |                  N/A |
| 52%   71C    P2   120W / 250W |   2770MiB / 11018MiB |     12%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2220      G   /usr/lib/xorg/Xorg                            18MiB |
|    0      2358      G   /usr/bin/sddm-greeter                         33MiB |
|    0      6458      C   ./deepstream-app                            2576MiB |
+-----------------------------------------------------------------------------+

Does deepstream-app run with the same configuration file in both cases?

Hi,

I ran deepstream-app with the default config files on native Ubuntu 18.04 and again in Docker on the same machine, using two different config files:
source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt
source30_1080p_dec_infer-resnet_tiled_display_int8.txt

A performance difference between native Ubuntu and Docker is visible, but the difference in CPU and GPU utilization is marginal, not huge. The difference in GPU utilization between the source4 and source30 config files, however, is huge within the same environment (either native Ubuntu or Docker). I am not sure whether this is due to the number of sources, the type of inference these two config files perform, or both. Also, CPU utilization is high (as observed using 'top'). Is it correct to assume that this is because the CPU is busy reading the files and feeding the GPU, or is there another reason?
These may seem like basic questions, but we are trying to understand the possible scenarios so that we can configure the system optimally. Any input would be helpful.

deepstream-app built on native UBUNTU

Results of using source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208…    On   | 00000000:09:00.0 Off |                  N/A |
| 49%   69C    P2   126W / 250W |   1379MiB / 11018MiB |     39%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2220      G   /usr/lib/xorg/Xorg                            18MiB |
|    0      2358      G   /usr/bin/sddm-greeter                         33MiB |
|    0      3546      C   deepstream-app                              1184MiB |
+-----------------------------------------------------------------------------+

**PERF: 188.18 (166.42) 188.18 (166.42) 188.18 (166.42) 188.18 (166.42)
**PERF: 187.08 (166.17) 187.08 (166.17) 187.08 (166.17) 187.08 (166.17)
**PERF: 109.96 (164.84) 109.96 (164.84) 109.96 (164.84) 109.96 (164.84)
**PERF: 187.05 (165.56) 187.05 (165.56) 187.05 (165.56) 187.05 (165.56)
**PERF: 109.18 (163.78) 109.18 (163.78) 109.18 (163.78) 109.18 (163.78)
**PERF: 187.10 (164.49) 187.10 (164.49) 187.10 (164.49) 187.10 (164.49)
**PERF: 116.87 (163.08) 116.87 (163.08) 116.87 (163.08) 116.87 (163.08)
**PERF: 115.72 (161.72) 115.72 (161.72) 115.72 (161.72) 115.72 (161.72)
**PERF: 109.78 (160.26) 109.78 (160.26) 109.78 (160.26) 109.78 (160.26)
**PERF: 185.02 (160.94) 185.02 (160.94) 185.02 (160.94) 185.02 (160.94)
**PERF: 107.01 (159.50) 107.01 (159.50) 107.01 (159.50) 107.01 (159.50)
**PERF: 185.10 (160.16) 185.10 (160.16) 185.10 (160.16) 185.10 (160.16)
**PERF: 104.13 (158.75) 104.13 (158.75) 104.13 (158.75) 104.13 (158.75)

Results of using source30_1080p_dec_infer-resnet_tiled_display_int8.txt :
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208…    On   | 00000000:09:00.0 Off |                  N/A |
| 51%   70C    P2   123W / 250W |   2770MiB / 11018MiB |     12%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2220      G   /usr/lib/xorg/Xorg                            18MiB |
|    0      2358      G   /usr/bin/sddm-greeter                         33MiB |
|    0      3680      C   deepstream-app                              2576MiB |
+-----------------------------------------------------------------------------+
**PERF: 24.95 (21.71) 24.95 (21.76) 24.95 (21.74) 24.72 (21.70) 24.72 (21.74) 25.19 (21.74) 24.95 (21.69) 25.18 (21.67) 24.95 (21.74) 24.72 (21.73) 24.95 (21.74) 24.95 (21.76) 24.95 (21.71) 24.95 (21.72) 25.18 (21.71) 24.95 (21.74) 24.95 (21.75) 24.72 (21.68) 24.72 (21.70) 24.95 (21.71) 24.95 (21.71) 24.72 (21.68) 24.95 (21.74) 24.95 (21.76) 24.95 (21.72) 24.72 (21.74) 24.95 (21.76) 24.95 (21.72) 24.95 (21.74) 24.95 (21.75)

**PERF: 19.44 (21.66) 19.63 (21.72) 19.48 (21.70) 19.67 (21.67) 19.48 (21.70) 19.52 (21.71) 19.48 (21.65) 19.81 (21.64) 19.48 (21.70) 19.67 (21.70) 19.67 (21.72) 19.48 (21.72) 19.63 (21.67) 19.58 (21.67) 19.48 (21.68) 19.48 (21.70) 19.48 (21.71) 19.67 (21.65) 19.67 (21.67) 19.67 (21.68) 19.63 (21.67) 19.63 (21.64) 19.63 (21.70) 19.48 (21.72) 19.48 (21.68) 19.48 (21.70) 19.63 (21.72) 19.48 (21.68) 19.48 (21.70) 19.67 (21.72)

**PERF: 21.22 (21.69) 21.22 (21.74) 21.19 (21.71) 21.19 (21.69) 21.00 (21.70) 21.38 (21.74) 21.38 (21.68) 21.22 (21.67) 21.19 (21.71) 21.58 (21.73) 21.19 (21.73) 21.19 (21.73) 21.03 (21.69) 21.06 (21.70) 21.19 (21.69) 21.38 (21.73) 21.19 (21.72) 21.19 (21.67) 21.00 (21.68) 21.38 (21.70) 21.03 (21.69) 21.22 (21.67) 21.22 (21.72) 21.58 (21.75) 21.58 (21.71) 21.00 (21.70) 21.41 (21.75) 21.38 (21.70) 21.58 (21.73) 21.19 (21.73)

Running deepstream-app on NVIDIA 5.0 DOCKER (-devel)

Results for source30_1080p_dec_infer-resnet_tiled_display_int8.txt:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208…    On   | 00000000:09:00.0 Off |                  N/A |
| 37%   62C    P2   115W / 250W |   2770MiB / 11018MiB |     13%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2220      G   /usr/lib/xorg/Xorg                            18MiB |
|    0      2358      G   /usr/bin/sddm-greeter                         33MiB |
|    0      4721      C   deepstream-app                              2576MiB |
+-----------------------------------------------------------------------------+

**PERF: 24.35 (25.18) 24.55 (25.21) 24.35 (25.23) 24.75 (25.23) 24.55 (25.18) 24.55 (25.21) 24.75 (25.18) 24.55 (25.18) 24.55 (25.20) 24.55 (25.22) 24.35 (25.23) 24.75 (25.23) 24.75 (25.19) 24.55 (25.20) 24.55 (25.22) 24.55 (25.22) 24.75 (25.18) 24.95 (25.23) 24.55 (25.23) 24.55 (25.22) 24.55 (25.22) 24.55 (25.22) 24.75 (25.23) 24.55 (25.18) 24.75 (25.21) 24.75 (25.19) 24.75 (25.24) 24.55 (25.20) 24.55 (25.18) 24.55 (25.22)
**PERF: 25.43 (25.20) 25.43 (25.23) 25.43 (25.24) 25.43 (25.24) 25.43 (25.20) 25.43 (25.23) 25.43 (25.20) 25.43 (25.20) 25.43 (25.22) 25.63 (25.24) 25.43 (25.24) 25.43 (25.24) 25.43 (25.22) 25.43 (25.22) 25.63 (25.24) 25.63 (25.24) 25.43 (25.19) 25.64 (25.23) 25.43 (25.24) 25.63 (25.24) 25.43 (25.24) 25.63 (25.24) 25.43 (25.24) 25.43 (25.20) 25.43 (25.23) 25.43 (25.21) 25.43 (25.25) 25.43 (25.22) 25.63 (25.21) 25.63 (25.24)
**PERF: 25.52 (25.21) 25.52 (25.24) 25.52 (25.25) 25.32 (25.24) 25.32 (25.20) 25.52 (25.24) 25.32 (25.20) 25.52 (25.21) 25.32 (25.21) 25.12 (25.23) 25.52 (25.25) 25.32 (25.24) 25.32 (25.21) 25.52 (25.23) 25.12 (25.23) 25.32 (25.24) 25.32 (25.19) 24.91 (25.23) 25.32 (25.24) 25.32 (25.24) 25.32 (25.23) 25.32 (25.24) 25.32 (25.24) 25.52 (25.21) 25.32 (25.23) 25.32 (25.21) 25.32 (25.24) 25.32 (25.22) 25.32 (25.21) 25.12 (25.23)

Results for source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208…    On   | 00000000:09:00.0 Off |                  N/A |
| 39%   64C    P2   132W / 250W |   1381MiB / 11018MiB |     44%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2220      G   /usr/lib/xorg/Xorg                            18MiB |
|    0      2358      G   /usr/bin/sddm-greeter                         33MiB |
|    0      4608      C   deepstream-app                              1186MiB |
+-----------------------------------------------------------------------------+

**PERF: 188.96 (190.24) 188.96 (190.24) 188.96 (190.24) 188.96 (190.24)
**PERF: 188.77 (190.11) 188.77 (190.11) 188.77 (190.11) 188.77 (190.11)
**PERF: 189.63 (190.08) 189.63 (190.08) 189.63 (190.08) 189.63 (190.08)
**PERF: 188.70 (189.96) 188.70 (189.96) 188.70 (189.96) 188.70 (189.96)
**PERF: 189.11 (189.91) 189.11 (189.91) 189.11 (189.91) 189.11 (189.91)
**PERF: 189.75 (189.90) 189.75 (189.90) 189.75 (189.90) 189.75 (189.90)
**PERF: 188.26 (189.79) 188.26 (189.79) 188.26 (189.79) 188.26 (189.79)

Summary:
Results for source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt : GPU 39% to 44%
Results for source30_1080p_dec_infer-resnet_tiled_display_int8.txt : GPU 12%-13%
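To compare the two configs on more than GPU-Util, the PERF lines above can be reduced to a total frame rate across all streams. The following is a minimal sketch, assuming each pair in a PERF line is the current FPS followed by the average FPS (in parentheses) for one stream; the function names are made up for illustration and are not part of DeepStream.

```python
import re

# One "current (average)" FPS pair, e.g. "188.96 (190.24)".
# findall() does not need a separator, so pairs that were pasted
# without a space between them are still picked up.
PAIR = re.compile(r"(\d+(?:\.\d+)?)\s*\((\d+(?:\.\d+)?)\)")


def parse_perf_line(line):
    """Return a list of (current_fps, average_fps) tuples, one per stream."""
    if not line.startswith("**PERF:"):
        raise ValueError("not a PERF line: %r" % line[:20])
    return [(float(cur), float(avg)) for cur, avg in PAIR.findall(line)]


def aggregate_fps(line):
    """Sum the current FPS over all streams in one PERF line."""
    return sum(cur for cur, _ in parse_perf_line(line))


# Example with one source4 line from the Docker run above.
line = "**PERF: 188.96 (190.24) 188.96 (190.24) 188.96 (190.24) 188.96 (190.24)"
print(len(parse_perf_line(line)), "streams,", round(aggregate_fps(line), 2), "fps total")
```

Applied to the Docker runs, this gives roughly 4 x 189 fps for source4 and roughly 30 x 25 fps for source30, so the total number of frames decoded and passed through the primary detector per second is of the same order in both cases, even though the reported GPU-Util differs.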

Apart from questions in the above post, please let us know your thoughts on:

Does this mean the GPU is better utilized with a lower number of streams than with a higher number, or is the CPU not feeding the GPU fast enough? How do we find out?

Hi,

I have provided some information in the other replies. Would it be possible to comment on these observations/questions?

Thanks.

So, there is no difference in CPU and GPU utilization between native Ubuntu 18.04 and Docker, right?

I am not sure whether this is due to the number of sources, the type of inference these two config files perform, or both.

Both can affect the GPU utilization.

Is it correct to assume that this is because the CPU is busy reading the files and feeding the GPU, or is there another reason?

Since the CPU only reads the encoded file, which is small, I think reading the video file and feeding it to the GPU should consume few CPU resources.
I think the continuous CUDA launches may consume some CPU resources.
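One way to check where the CPU time actually goes is to look at per-thread CPU usage of the deepstream-app process (similar to `top -H -p <pid>`). Below is a minimal Linux-only sketch that reads /proc directly; the helper names are hypothetical, and it only shows which threads are busy, not why.

```python
import os


def parse_task_stat(stat_text):
    """Return (thread_name, cpu_ticks) from the text of /proc/<pid>/task/<tid>/stat."""
    # The name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the first '(' and the last ')'.
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    name = stat_text[open_idx + 1:close_idx]
    rest = stat_text[close_idx + 1:].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    utime, stime = int(rest[11]), int(rest[12])
    return name, utime + stime


def thread_cpu_ticks(pid):
    """Map thread id -> (name, cumulative CPU ticks) for every thread of pid."""
    result = {}
    task_dir = "/proc/%d/task" % pid
    for tid in os.listdir(task_dir):
        try:
            with open("%s/%s/stat" % (task_dir, tid)) as fh:
                result[int(tid)] = parse_task_stat(fh.read())
        except FileNotFoundError:
            continue  # thread exited between listdir() and open()
    return result


def busiest_threads(before, after, top=5):
    """Return up to `top` (delta_ticks, tid, name) tuples, largest delta first."""
    deltas = []
    for tid, (name, ticks) in after.items():
        if tid in before:
            deltas.append((ticks - before[tid][1], tid, name))
    return sorted(deltas, reverse=True)[:top]
```

To use it, call `thread_cpu_ticks(pid)` twice a few seconds apart with the PID shown by nvidia-smi and pass both snapshots to `busiest_threads`. The deltas are in clock ticks; `os.sysconf("SC_CLK_TCK")` gives the number of ticks per second.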

Does this mean the GPU is better utilized with lower number of streams than when the number of streams are higher

source30_1080p_dec_infer-resnet_tiled_display_int8.txt runs resnet10 with batch size == 30 @ INT8 precision.
source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt runs resnet10 with batch size == 4 @ INT8 precision + resnet18 with batch size == 16 x 3 @ INT8 precision.
So it seems source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt runs more inference work on the GPU, which causes the higher GPU load.
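A single nvidia-smi snapshot can be misleading, so it may help to sample utilization repeatedly while each config is running and compare the averages. This is a minimal sketch that wraps the `nvidia-smi --query-gpu` CSV output; the function names, the sample count, and the interval are arbitrary choices for illustration.

```python
import subprocess
import time

# Fields requested from nvidia-smi, in this order.
QUERY = "utilization.gpu,utilization.memory,memory.used"


def parse_sample(csv_line):
    """Parse one 'csv,noheader,nounits' line such as '12, 8, 2770'."""
    # Note: int() will fail if the driver reports a field as not supported.
    gpu, mem, used = (int(field.strip()) for field in csv_line.split(","))
    return {"gpu_util_pct": gpu, "mem_util_pct": mem, "mem_used_mib": used}


def sample_gpu(index=0):
    """Take one utilization sample for the GPU with the given index."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(index),
         "--query-gpu=" + QUERY, "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_sample(out.strip().splitlines()[0])


def average_gpu_util(count=30, interval_s=1.0, index=0):
    """Average the GPU utilization over `count` samples taken `interval_s` apart."""
    total = 0
    for _ in range(count):
        total += sample_gpu(index)["gpu_util_pct"]
        time.sleep(interval_s)
    return total / count
```

Running `average_gpu_util()` once per config file, in the same environment, gives figures that are easier to compare than individual snapshots.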